add new md files

team 1
2026-02-16 09:38:40 +01:00
parent c0b90c9dfd
commit 413c76d710
3 changed files with 608 additions and 690 deletions

596
README.md

@@ -1,430 +1,314 @@
# mitho AI Agent Developer Deep Dive
# mitho AI Agent Developer README
Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)
---
# 1. System Overview
This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:
- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection
No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.
As of: February 2026
Status: Stable in production; job-based ingest architecture fully integrated
---
# 2. High-Level Architecture
# 1. System Overview
User Query
Hybrid Retrieval
→ Context Assembly
→ Prompt Builder
→ LLM
→ Streaming Response (SSE)
This system implements a deterministic, governance-stable
hybrid RAG architecture with:
Knowledge Flow:
Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval
- Symfony (PHP backend)
- NDJSON as the Single Source of Truth
- FAISS as the vector index (always a full rebuild)
- Hybrid retrieval (keyword + vector)
- Versioned document model
- Job-based ingest pipeline
- Lock-protected reindex operations
- SSE streaming in the frontend
Core principle:
No incremental vector updates.
FAISS is always rebuilt completely from index.ndjson.
---
# 3. Directory Structure (Knowledge Layer)
# 2. Architecture Principles
```
var/knowledge/
├── uploads/
├── chunks/
├── index.ndjson
├── index_meta.json
├── vector.index
└── vector_meta.json
```
Determinism:
- Same documents + same configuration → identical index.ndjson
- Same index.ndjson → identical FAISS index
- Same query → identical retrieval result
Governance:
- One active version per document
- No implicit index changes
- Structural changes force a Global Reindex
- No self-modification by the AI
Scalability:
- NDJSON (streamable)
- No in-memory JSON arrays
- Target size > 200k chunks
---
# 4. NDJSON Index
# 3. Knowledge Store
## 4.1 index.ndjson
## 3.1 index.ndjson
- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks
Single Source of Truth.
Each line contains:
- One JSON object per line
- Streaming append
- Deterministic compaction by document_id
Example structure:
```json
{
"chunk_id": "uuid",
"document_id": "uuid",
"version": 3,
"text": "...",
"meta": { ... }
"chunk_id": "uuid",
"document_id": "uuid",
"document_version_id": "uuid",
"text": "...",
"meta": {...}
}
```
NDJSON enables:
- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
Not a JSON array file.
No mutation of individual chunks.
Only append plus deterministic removal by document_id.
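
The append-plus-removal rule can be illustrated with a minimal sketch. It assumes the index is rewritten through a temporary file and swapped in atomically; the helper names `iter_chunks` and `replace_document_chunks` are hypothetical and do not come from the codebase.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator


def iter_chunks(index_path: str) -> Iterator[dict]:
    """Stream the index line by line; the file is never loaded as a whole."""
    with open(index_path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def replace_document_chunks(index_path: str, document_id: str,
                            new_chunks: Iterable[dict]) -> None:
    """Drop every chunk of document_id, then append new_chunks (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        if os.path.exists(index_path):
            # Compaction: copy all chunks that belong to other documents, in order.
            for chunk in iter_chunks(index_path):
                if chunk.get("document_id") != document_id:
                    out.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")
        # Append: the new chunks always land at the end of the file.
        for chunk in new_chunks:
            out.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")
    os.replace(tmp_path, index_path)  # atomic swap on the same filesystem
```

Writing with `sort_keys=True` is one way to keep the serialized lines byte-identical across runs; whether the real ChunkManager does this is not stated here.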
---
# 5. Index Metadata
## 3.2 index_meta.json
## index_meta.json
Managed by:
- IndexMetaManager
- IndexConfiguration
Contains:
Contains structural parameters:
- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- chunk_overlap
- scoring_version
- index_format
- vector_backend
If configuration changes → Global Reindex required.
Guarded by:
`IndexStructureChangedException`
If any of these values changes:
→ Global Reindex is mandatory.
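
A structure-drift check of this kind might look like the sketch below. It assumes that `index_version` is a counter rather than a compared parameter, and the Python names (`STRUCTURAL_KEYS`, `changed_keys`, `IndexStructureChanged`) are illustrative stand-ins, not the PHP implementation.

```python
# Parameters whose change invalidates the index (taken from the list above;
# index_version is assumed to be a counter and is therefore not compared).
STRUCTURAL_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "chunk_overlap",
    "scoring_version",
    "index_format",
    "vector_backend",
)


class IndexStructureChanged(Exception):
    """Raised when stored metadata and current configuration disagree."""


def changed_keys(stored_meta: dict, current_config: dict) -> list:
    """Return the structural keys whose values differ, in a fixed order."""
    return [k for k in STRUCTURAL_KEYS if stored_meta.get(k) != current_config.get(k)]


def assert_structure_unchanged(stored_meta: dict, current_config: dict) -> None:
    """Refuse to continue when a Global Reindex is required."""
    diff = changed_keys(stored_meta, current_config)
    if diff:
        raise IndexStructureChanged("Global Reindex required; changed: " + ", ".join(diff))
```

Calling `assert_structure_unchanged` before any local ingest would stop a mixed index from being written; the values used in practice are whatever index_meta.json and the configuration hold.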
---
# 6. Ingest Pipeline
## 3.3 FAISS
## 6.1 Core Services
Files:
| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |
- vector.index
- vector_meta.json (chunk ID mapping)
FAISS is ALWAYS built completely from index.ndjson.
No partial updates.
---
## 6.2 Local Ingest
# 4. Document & Version Model
Used when:
- A single document version changes
Document
→ contains multiple DocumentVersion entries
→ exactly one version is active
Process:
Rule:
Only one active version per document may exist at any time.
1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely
index_version does NOT change.
When a version is activated:
- All other versions become inactive
- IngestStatus → PENDING
- Re-ingest via job
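
The activation rule can be sketched as follows. This is an illustration only: the dataclasses, the `ingest_status` field on the version, and the shape of the returned job dictionary are assumptions, not the actual Doctrine entities.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentVersion:
    version_id: str
    active: bool = False
    ingest_status: str = "NONE"  # assumed initial value, not from the README


@dataclass
class Document:
    document_id: str
    versions: List[DocumentVersion] = field(default_factory=list)


def activate_version(document: Document, version_id: str) -> dict:
    """Make exactly one version active and describe the follow-up ingest job."""
    target = next((v for v in document.versions if v.version_id == version_id), None)
    if target is None:
        raise ValueError(f"unknown version: {version_id}")
    for version in document.versions:
        version.active = version is target  # every other version becomes inactive
    target.ingest_status = "PENDING"
    # Hypothetical job payload; the ingest itself never runs inline.
    return {
        "type": "DOCUMENT_VERSION_ACTIVATE",
        "document_id": document.document_id,
        "document_version_id": target.version_id,
        "status": "QUEUED",
    }
```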
---
## 6.3 Global Reindex
# 5. Ingest Architecture (fully job-based)
Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes
Ingest NEVER runs synchronously inside the HTTP request.
Process:
Every mutation of the index runs through:
1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
IngestJob → CLI Runner → IngestOrchestrator → IngestFlow
---
## 5.1 Job Types
DOCUMENT_VERSION_ACTIVATE
- Used for:
  - Activating a version
  - Uploading a new file (auto-ingest)
DOCUMENT
- Manual ingest of a version
GLOBAL_REINDEX
- Structural changes
---
## 5.2 Job Status
- QUEUED
- RUNNING
- COMPLETED
- FAILED
- ABORTED
Jobs are executed via the CLI:
php bin/console mto:agent:ingest:run <jobId>
The job is started asynchronously via exec() from the controller.
---
# 6. Admin Flows (current state)
## 6.1 Uploading a New File (NEW: auto-ingest)
On upload:
1. Store the file
2. Create Document + version 1
3. Set version 1 active
4. Create an IngestJob of type DOCUMENT_VERSION_ACTIVATE
5. Start the job asynchronously
6. Redirect to the job detail page
Result:
New documents are indexed automatically.
---
## 6.2 Activating a Version
1. Update the DB status
2. IngestStatus → PENDING
3. Create a DOCUMENT_VERSION_ACTIVATE job
4. Start the async runner
5. Redirect to the job page
---
## 6.3 Manual Ingest
1. Create a DOCUMENT job
2. Start the async runner
3. Redirect to the job page
---
## 6.4 Reset
Reset deletes:
- index.ndjson
- vector.index
- vector_meta.json
- Upload directory
- Tables:
  - document
  - document_version
  - ingest_job
Only possible if exec() is enabled.
---
# 7. Ingest Flow Details
Local Ingest (single document):
1. Extract
2. Normalize
3. Chunk deterministically
4. Remove old chunks by document_id
5. Append new chunks
6. Full FAISS rebuild
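
The deterministic chunking step in the list above could be sketched like this. The sketch assumes character-based windows driven by `chunk_size` and `chunk_overlap`; the real SimpleChunker may split on words or sentences instead, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows; the same input always yields the same chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Because the function depends only on its arguments, any change to `chunk_size` or `chunk_overlap` changes every chunk boundary, which is why those parameters force a Global Reindex.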
Global Reindex:
1. Reprocess all active versions
2. Rewrite index.ndjson completely
3. Rebuild FAISS
4. index_version++
---
# 7. Vector Architecture
## 7.1 vector_ingest.py
Responsibilities:
- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json
Execution:
```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```
Characteristics:
- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True
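
Normalizing embeddings before an inner-product index matters because the inner product of two unit-length vectors equals their cosine similarity. The sketch below shows that relationship with a pure-Python, exhaustive search standing in for FAISS `IndexFlatIP`; it does not call FAISS or an embedding model, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def build_flat_index(rows):
    """rows: list of (chunk_id, embedding). Always builds from scratch."""
    ids = [chunk_id for chunk_id, _ in rows]
    vectors = [l2_normalize(embedding) for _, embedding in rows]
    return vectors, ids


def search(vectors, ids, query, k):
    """Exact top-k by inner product, which is cosine similarity after normalization."""
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(vec, q)), chunk_id)
              for vec, chunk_id in zip(vectors, ids)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # tie-break by id for stable order
    return [{"chunk_id": chunk_id, "score": score} for score, chunk_id in scored[:k]]
```

The returned list has the same shape as the `vector_search.py` output documented in this README (`chunk_id` plus `score`).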
---
## 7.2 vector_search.py
Responsibilities:
- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON
Execution:
```bash
python vector_search.py "query" 5
```
Output:
```json
[
{ "chunk_id": "...", "score": 0.82 }
]
```
---
## 7.3 VectorSearchClient (PHP)
- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states
---
# 8. Hybrid Retrieval
## 8.1 Components
Flow:
| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |
User Query
→ Keyword Retrieval
→ FAISS Vector Retrieval
→ Score Fusion
NDJSON Chunk Lookup
→ Context Builder
→ LLM
→ SSE Streaming
Keyword score is the primary signal.
Vector score adds semantic similarity.
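
One way to express "keyword primary, vector supplementary" is a weighted sum in which the vector score gets a smaller weight. The formula, the default `vector_weight` of 0.3, and the function name `fuse_scores` are assumptions for illustration; the actual scoring logic is not documented here.

```python
def fuse_scores(keyword_scores: dict, vector_scores: dict,
                vector_weight: float = 0.3, limit: int = 5) -> list:
    """Combine per-chunk scores; returns [(chunk_id, fused_score), ...] best first."""
    max_kw = max(keyword_scores.values(), default=0.0)
    fused = {}
    for chunk_id in set(keyword_scores) | set(vector_scores):
        # Keyword scores are scaled to [0, 1] so both signals are comparable.
        kw = keyword_scores.get(chunk_id, 0.0) / max_kw if max_kw > 0 else 0.0
        vec = max(vector_scores.get(chunk_id, 0.0), 0.0)
        fused[chunk_id] = kw + vector_weight * vec
    # Sort by score, then by chunk_id, so equal scores never reorder between runs.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The explicit tie-break on `chunk_id` keeps the ranking reproducible for identical inputs, in line with the determinism rules of this system.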
---
## 8.2 Retrieval Flow
# 9. Locking & Concurrency
1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context
LockService prevents:
Keyword score remains primary signal.
Vector score augments semantic similarity.
- parallel ingest
- concurrent reindex runs
- NDJSON corruption
No concurrent mutations are allowed.
---
# 9. Document Extraction
# 10. CLI Commands
Supported via:
mto:agent:ingest:run <jobId>
mto:agent:vector:ingest
mto:agent:vector:search
- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader
Extraction must return clean UTF-8 text.
Chunking must remain deterministic.
---
# 10. Admin Layer (Symfony)
## Controllers
- DashboardController
- DocumentController
- IngestJobController
- SecurityController
## Entities
- Document
- DocumentVersion
- IngestJob
- User
## Repositories
- DocumentVersionRepository
- UserRepository
---
# 11. Concurrency & Locks
LockService ensures:
- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson
File-based or service-based locking.
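
A file-based variant can be sketched with an exclusive lock file. This is a minimal illustration under the assumption that one lock file guards all index mutations; `index_lock` and `LockHeldError` are hypothetical names, and the sketch does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another ingest or reindex already holds the lock."""


@contextmanager
def index_lock(lock_path: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeldError(f"lock already held: {lock_path}") from None
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A job runner would wrap the whole "remove, append, rebuild" sequence in `with index_lock(...)` so that a second job fails fast instead of interleaving writes.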
---
# 12. Determinism Rules
The system guarantees:
- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results
No randomness.
No adaptive mutation.
No auto-learning.
---
# 13. LLM Integration
- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable
LLM never has direct access to full knowledge base.
Only retrieved chunks are injected.
---
# 14. Scalability
Designed for:
- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage
No full-array JSON loads.
---
# 15. Failure Modes
Handled via:
- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback
---
# 16. Non-Goals
This system intentionally does NOT include:
- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts
All structural changes require explicit reindex.
---
# 17. Design Philosophy
This is a governance-first RAG architecture:
- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled
It prioritizes correctness and control over dynamic mutation.
---
# 18. Development Guidelines
When extending the system:
- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility
---
# 19. CLI Commands (Symfony)
Example:
```bash
php bin/console mto:agent:vector:ingest
```
Custom commands follow namespace:
```
All commands live under:
mto:agent:*
```
---
# 20. Summary
# 11. Failure Modes
This system is a deterministic, enterprise-grade hybrid RAG engine with:
- Vector index missing → run vector ingest
- index_meta mismatch → Global Reindex
- exec disabled → async start fails
- Lock active → parallel ingest is blocked
- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails
---
It is designed for controlled enterprise deployment, not experimental AI workflows.
# 12. Non-Goals
- No online learning
- No incremental FAISS updates
- No self-modifying prompts
- No auto-merging of chunks
Structural changes → explicit + reindex.
---
# 13. Summary
This system is:
- deterministic
- reproducible
- drift-safe
- governance-stable
- enterprise-ready
- job-based
- version-safe
Key change:
New documents now trigger an IngestJob automatically
(exactly the same mechanism as version activation).
No more HTTP ingest.
No inline rebuilds.
Everything runs through the job system.