add new md files
596
README.md
@@ -1,430 +1,314 @@
# mitho AI Agent – Developer Deep Dive

# mitho AI Agent – Developer README
Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

---

# 1. System Overview

This system implements a deterministic, governance-stable Retrieval-Augmented Generation (RAG) architecture based on:

- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.
As of: February 2026
Status: Stable in production – job-based ingest architecture fully integrated

---

# 2. High-Level Architecture
# 1. System Overview

User Query
→ Hybrid Retrieval
→ Context Assembly
→ Prompt Builder
→ LLM
→ Streaming Response (SSE)
This system implements a deterministic, governance-stable
hybrid RAG architecture with:

Knowledge Flow:
Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval
- Symfony (PHP backend)
- NDJSON as the single source of truth
- FAISS as the vector index (always a full rebuild)
- Hybrid retrieval (keyword + vector)
- Versioned document model
- Job-based ingest pipeline
- Lock-protected reindex operations
- SSE streaming in the frontend

Core principle:
No incremental vector updates.
FAISS is always rebuilt in full from index.ndjson.

---

# 3. Directory Structure (Knowledge Layer)
# 2. Architecture Principles

```
var/knowledge/
├── uploads/
├── chunks/
├── index.ndjson
├── index_meta.json
├── vector.index
└── vector_meta.json
```
Determinism:
- Same documents + same configuration → identical index.ndjson
- Same index.ndjson → identical FAISS index
- Same query → identical retrieval result

Governance:
- One active version per document
- No implicit index changes
- Structural changes force a Global Reindex
- No self-modification by the AI

Scalability:
- NDJSON (streamable)
- No in-memory JSON arrays
- Target size > 200k chunks

---
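
The determinism guarantees listed above ("same documents + same configuration → identical index.ndjson") can be spot-checked by hashing two independent builds. The following is a minimal sketch, not part of the project: the function names `file_sha256` and `is_reproducible` are hypothetical, and it assumes that "identical" means byte-for-byte identical files.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read block by block so a large index.ndjson never sits in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_reproducible(first_build: Path, second_build: Path) -> bool:
    """True when two index builds are byte-for-byte identical (assumed criterion)."""
    return file_sha256(first_build) == file_sha256(second_build)
```

Running the ingest twice into separate directories and comparing the two `index.ndjson` files this way would surface any hidden source of nondeterminism, such as timestamps or unordered keys.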

# 4. NDJSON Index
# 3. Knowledge Store

## 4.1 index.ndjson
## 3.1 index.ndjson

- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks
Single Source of Truth.

Each line contains:
- 1 JSON object per line
- Streaming append
- Deterministic compaction by document_id

Example structure:

```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
  "chunk_id": "uuid",
  "document_id": "uuid",
  "document_version_id": "uuid",
  "text": "...",
  "meta": {...}
}
```

NDJSON enables:
- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
No JSON array file.
No mutation of individual chunks.
Only append + deterministic removal by document_id.

---
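
The "append + deterministic removal by document_id" rule described above can be sketched as follows. This is an illustration under assumptions, not the project's `ChunkManager`: the function names are hypothetical, and writing keys in sorted order is one possible way to keep the serialized lines stable across runs.

```python
import json
import os
from pathlib import Path
from typing import Iterable, Iterator


def iter_chunks(index_path: Path) -> Iterator[dict]:
    """Yield one chunk record per non-empty line, without loading the whole file."""
    with index_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def append_chunks(index_path: Path, chunks: Iterable[dict]) -> None:
    """Append records, one JSON object per line."""
    with index_path.open("a", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")


def remove_document(index_path: Path, document_id: str) -> int:
    """Rewrite the index without the given document; return the number of removed chunks."""
    tmp_path = index_path.with_name(index_path.name + ".tmp")
    removed = 0
    with tmp_path.open("w", encoding="utf-8") as out:
        for chunk in iter_chunks(index_path):
            if chunk.get("document_id") == document_id:
                removed += 1
                continue
            out.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")
    # Swap the compacted file into place in one step.
    os.replace(tmp_path, index_path)
    return removed
```

A re-ingest of one document would then be `remove_document(...)` followed by `append_chunks(...)`; no existing line is ever edited in place.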

# 5. Index Metadata
## 3.2 index_meta.json

## index_meta.json

Managed by:

- IndexMetaManager
- IndexConfiguration

Contains:
Contains structural parameters:

- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- chunk_overlap
- scoring_version
- index_format
- vector_backend

If configuration changes → Global Reindex required.

Guarded by:
`IndexStructureChangedException`
If any of these values changes:
→ Global Reindex is mandatory.

---
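
The structure guard described above amounts to comparing the stored `index_meta.json` with the current configuration. A minimal sketch of that comparison follows; it is written in Python for illustration only (the real guard is a PHP class), `IndexStructureChanged` is a hypothetical stand-in for `IndexStructureChangedException`, and excluding `index_version` from the comparison is an assumption of this sketch.

```python
import json
from pathlib import Path

# Keys taken from the index_meta.json list; index_version is left out here
# because this sketch treats it as a counter (an assumption, not confirmed).
STRUCTURAL_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "chunk_overlap",
    "scoring_version",
    "index_format",
    "vector_backend",
)


class IndexStructureChanged(Exception):
    """Hypothetical Python stand-in for IndexStructureChangedException."""


def changed_keys(stored: dict, current: dict) -> list[str]:
    """Return the structural keys whose values differ."""
    return [key for key in STRUCTURAL_KEYS if stored.get(key) != current.get(key)]


def assert_structure_unchanged(meta_path: Path, current: dict) -> None:
    """Raise if the stored metadata no longer matches the current configuration."""
    stored = json.loads(meta_path.read_text(encoding="utf-8"))
    drift = changed_keys(stored, current)
    if drift:
        raise IndexStructureChanged("Global Reindex required, changed: " + ", ".join(drift))
```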

# 6. Ingest Pipeline
## 3.3 FAISS

## 6.1 Core Services
Files:

| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |
- vector.index
- vector_meta.json (chunk ID mapping)

FAISS is ALWAYS built in full from index.ndjson.
No partial updates.

---

## 6.2 Local Ingest
# 4. Document & Version Model

Used when:
- A single document version changes
Document
→ contains multiple DocumentVersion entries
→ exactly one version is active

Process:
Rule:
Only one active version per document may exist at any time.

1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely

index_version does NOT change.
When a version is activated:
- All other versions become inactive
- IngestStatus → PENDING
- Re-ingest via job

---
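
The activation rule from the document and version model above (one active version, all others inactive, ingest status reset to PENDING) can be modeled in a few lines. This is a sketch under assumptions: the real entities are Symfony/Doctrine classes, and the field names `number`, `active`, and `ingest_status` as well as the initial status value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentVersion:
    number: int
    active: bool = False
    ingest_status: str = "NONE"  # placeholder initial value, not taken from the README


@dataclass
class Document:
    versions: list[DocumentVersion] = field(default_factory=list)

    def activate(self, number: int) -> DocumentVersion:
        """Make exactly one version active and mark it for re-ingest."""
        target = next((v for v in self.versions if v.number == number), None)
        if target is None:
            raise ValueError(f"unknown version {number}")
        # Deactivate everything first so the invariant holds afterwards.
        for version in self.versions:
            version.active = False
        target.active = True
        target.ingest_status = "PENDING"
        return target
```

In the described system the re-ingest itself would then be handed to a job rather than executed inline.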

## 6.3 Global Reindex
# 5. Ingest Architecture (fully job-based)

Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes
Ingest NEVER runs synchronously inside the HTTP request.

Process:
Every mutation of the index goes through:

1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
IngestJob → CLI Runner → IngestOrchestrator → IngestFlow

---

## 5.1 Job Types

DOCUMENT_VERSION_ACTIVATE
- Used for:
  - Activating a version
  - Uploading a new file (auto-ingest)

DOCUMENT
- Manual ingest of a version

GLOBAL_REINDEX
- Structural changes

---

## 5.2 Job Status

- QUEUED
- RUNNING
- COMPLETED
- FAILED
- ABORTED

Jobs are executed via the CLI:

php bin/console mto:agent:ingest:run <jobId>

The job is started asynchronously via exec() from the controller.

---
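
The job states listed above form a small state machine. The sketch below shows one plausible reading of it. Note the assumptions: the README names the states but not the allowed transitions, so the `ALLOWED` table is a guess, and `runner_command` is a hypothetical helper that only assembles the documented CLI call.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"


# Assumed transition table; the README lists the states but not the edges.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.ABORTED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.ABORTED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.ABORTED: set(),
}


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def runner_command(job_id: str) -> list[str]:
    """Argument vector for the documented CLI runner."""
    return ["php", "bin/console", "mto:agent:ingest:run", str(job_id)]
```

Passing the command as an argument list rather than a concatenated shell string avoids quoting problems when the runner is spawned in the background.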

# 6. Admin Flows (current state)

## 6.1 Uploading a New File (NEW: Auto-Ingest)

On upload:

1. Store the file
2. Create the Document + version 1
3. Mark version 1 as active
4. Create an IngestJob of type DOCUMENT_VERSION_ACTIVATE
5. Start the job asynchronously
6. Redirect to the job detail page

Result:
New documents are indexed automatically.

---

## 6.2 Activating a Version

1. Update the DB status
2. IngestStatus → PENDING
3. Create a DOCUMENT_VERSION_ACTIVATE job
4. Start the async runner
5. Redirect to the job page

---

## 6.3 Manual Ingest

1. Create a DOCUMENT job
2. Start the async runner
3. Redirect to the job page

---

## 6.4 Reset

Reset deletes:

- index.ndjson
- vector.index
- vector_meta.json
- Upload directory
- Tables:
  - document
  - document_version
  - ingest_job

Only possible when exec() is enabled.

---

# 7. Ingest Flow Details

Local Ingest (a single document):

1. Extract
2. Normalize
3. Chunk deterministically
4. Remove old chunks by document_id
5. Append new chunks
6. Full FAISS rebuild

Global Reindex:

1. Reprocess all active versions
2. Rewrite the complete index.ndjson
3. Rebuild FAISS
4. index_version++

---
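
Step 3 of the Local Ingest flow above ("chunk deterministically") depends on the `chunk_size` and `chunk_overlap` parameters. The sketch below shows a fixed-window splitter with overlap. It assumes character-based windows; how the project's `SimpleChunker` actually measures size (characters, words, or tokens) is not stated in the README, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Because the output depends only on the input text and the two parameters, the same document always yields the same chunk sequence, which is why changing either parameter forces a Global Reindex.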

# 7. Vector Architecture

## 7.1 vector_ingest.py

Responsibilities:

- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json

Execution:

```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```

Characteristics:

- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True

---
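
The combination of normalized embeddings and `IndexFlatIP` listed above matters because `IndexFlatIP` performs exact inner-product search, and for L2-normalized vectors the inner product equals cosine similarity. The sketch below reproduces that computation in plain Python on tiny hand-made vectors. It does not call FAISS, the function names are hypothetical, and the tie-break on `chunk_id` is an addition of this sketch.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(index: list[tuple[str, list[float]]], query: list[float], top_k: int) -> list[dict]:
    """Brute-force inner-product search over already normalized vectors."""
    q = l2_normalize(query)
    scored = [
        {"chunk_id": chunk_id, "score": inner_product(q, vec)}
        for chunk_id, vec in index
    ]
    # Highest score first; ties are broken by chunk_id to keep the order stable.
    scored.sort(key=lambda hit: (-hit["score"], hit["chunk_id"]))
    return scored[:top_k]
```

The returned shape mirrors the `chunk_id` / `score` pairs that `vector_search.py` is documented to emit.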

## 7.2 vector_search.py

Responsibilities:

- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON

Execution:

```bash
python vector_search.py "query" 5
```

Output:

```json
[
  { "chunk_id": "...", "score": 0.82 }
]
```

---

## 7.3 VectorSearchClient (PHP)

- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states

---
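
The four responsibilities of `VectorSearchClient` listed above (execute, parse, return, handle timeout and errors) follow a common bridge pattern. The real client is PHP; the sketch below shows the same pattern in Python for illustration only. `run_search` and `VectorSearchError` are hypothetical names, and the 10-second default timeout is an assumption.

```python
import json
import subprocess


class VectorSearchError(RuntimeError):
    """Raised when the search script fails, times out, or returns bad JSON."""


def run_search(command: list[str], timeout_seconds: float = 10.0) -> list[dict]:
    """Run the search script and return its parsed JSON hit list."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True,
            timeout=timeout_seconds, check=True,
        )
    except subprocess.TimeoutExpired as exc:
        raise VectorSearchError("vector search timed out") from exc
    except (subprocess.CalledProcessError, OSError) as exc:
        raise VectorSearchError(f"vector search failed: {exc}") from exc
    try:
        hits = json.loads(completed.stdout)
    except json.JSONDecodeError as exc:
        raise VectorSearchError("vector search returned invalid JSON") from exc
    if not isinstance(hits, list):
        raise VectorSearchError("expected a JSON array of hits")
    return hits
```

A call matching the documented invocation would be `run_search(["python", "vector_search.py", "query", "5"])`. Mapping every failure to one exception type lets the caller fall back to keyword-only retrieval in a single place.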

# 8. Hybrid Retrieval

## 8.1 Components
Flow:

| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |
User Query
→ Keyword Retrieval
→ FAISS Vector Retrieval
→ Score Fusion
→ NDJSON Chunk Lookup
→ Context Builder
→ LLM
→ SSE Streaming

Keyword is the primary signal.
Vector adds semantics.

---

## 8.2 Retrieval Flow
# 9. Locking & Concurrency

1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context
LockService prevents:

Keyword score remains primary signal.
Vector score augments semantic similarity.
- parallel ingest
- concurrent reindex operations
- NDJSON corruption

No concurrent mutations allowed.

---
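
The "score fusion" and "limit to N chunks" steps of the retrieval flow above can be illustrated with a weighted sum in which the keyword score carries the larger weight. The README does not specify the fusion formula, so everything here is an assumption: the function name `fuse_scores`, the 0.7 / 0.3 weights, and the premise that both score sets are already scaled to a comparable range.

```python
def fuse_scores(
    keyword: dict[str, float],
    vector: dict[str, float],
    keyword_weight: float = 0.7,  # assumed weight: keyword is the primary signal
    vector_weight: float = 0.3,   # assumed weight: vector adds semantic similarity
    limit: int = 5,
) -> list[tuple[str, float]]:
    """Combine per-chunk scores and return the top `limit` chunk IDs."""
    fused = {}
    for chunk_id in set(keyword) | set(vector):
        fused[chunk_id] = (
            keyword_weight * keyword.get(chunk_id, 0.0)
            + vector_weight * vector.get(chunk_id, 0.0)
        )
    # Sort by fused score descending; ties fall back to chunk_id so the
    # ranking stays identical for identical inputs.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The explicit tie-break is what keeps this step compatible with the "same query → identical retrieval result" guarantee.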

# 9. Document Extraction
# 10. CLI Commands

Supported via:
mto:agent:ingest:run <jobId>
mto:agent:vector:ingest
mto:agent:vector:search

- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader

Extraction must return clean UTF-8 text.
Chunking must remain deterministic.

---

# 10. Admin Layer (Symfony)

## Controllers

- DashboardController
- DocumentController
- IngestJobController
- SecurityController

## Entities

- Document
- DocumentVersion
- IngestJob
- User

## Repositories

- DocumentVersionRepository
- UserRepository

---

# 11. Concurrency & Locks

LockService ensures:

- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson

File-based or service-based locking.

---
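
For the file-based variant mentioned above, one common approach is to create a lock file with an exclusive-create flag, which fails atomically if the file already exists. The sketch below shows that approach. It is not the project's `LockService`; `exclusive_lock` and `LockHeldError` are hypothetical names, and stale-lock recovery after a crash is deliberately left out.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class LockHeldError(RuntimeError):
    """Raised when another ingest or reindex already holds the lock."""


@contextmanager
def exclusive_lock(lock_path: Path):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockHeldError(f"lock already held: {lock_path}") from exc
    try:
        # Record the owner's PID to make a leftover lock easier to diagnose.
        os.write(fd, str(os.getpid()).encode("ascii"))
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

Every code path that mutates `index.ndjson` or rebuilds the vector index would run inside `with exclusive_lock(...)`, so a second job fails fast instead of interleaving writes.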

# 12. Determinism Rules

The system guarantees:

- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results

No randomness.
No adaptive mutation.
No auto-learning.

---

# 13. LLM Integration

- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable

LLM never has direct access to full knowledge base.
Only retrieved chunks are injected.

---

# 14. Scalability

Designed for:

- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage

No full-array JSON loads.

---

# 15. Failure Modes

Handled via:

- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback

---

# 16. Non-Goals

This system intentionally does NOT include:

- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts

All structural changes require explicit reindex.

---

# 17. Design Philosophy

This is a governance-first RAG architecture:

- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled

It prioritizes correctness and control over dynamic mutation.

---

# 18. Development Guidelines

When extending the system:

- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility

---

# 19. CLI Commands (Symfony)

Example:

```bash
php bin/console mto:agent:vector:ingest
```

Custom commands follow namespace:

```
All commands live under:
mto:agent:*
```

---

# 20. Summary
# 11. Failure Modes

This system is a deterministic, enterprise-grade hybrid RAG engine with:
- Vector index missing → run vector ingest
- index_meta mismatch → Global Reindex
- exec disabled → async start fails
- Lock active → parallel ingest blocked

- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails

---

It is designed for controlled enterprise deployment, not experimental AI workflows.
# 12. Non-Goals

- No online learning
- No incremental FAISS updates
- No self-modifying prompts
- No automatic merging of chunks

Structural changes → explicit + reindex.

---

# 13. Summary

This system is:

- deterministic
- reproducible
- drift-safe
- governance-stable
- enterprise-ready
- job-based
- version-safe

Key change:
New documents now trigger an IngestJob automatically
(exactly the same mechanism as version activation).

No more HTTP ingest.
No inline rebuilds.
Everything runs through the job system.