add new md files

2026-02-16 09:38:40 +01:00
parent c0b90c9dfd
commit 413c76d710
3 changed files with 608 additions and 690 deletions
--- a/README.md
+++ b/README.md
@@ -1,430 +1,314 @@
-# mitho AI Agent – Developer Deep Dive
-
+# mitho AI Agent – Developer README
 Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

---
-
-# 1. System Overview
-
-This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:
-
- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection
-
-No incremental vector mutation is allowed.  
-FAISS is always rebuilt from `index.ndjson`.
+As of: February 2026  
+Status: Stable in production; job-based ingest architecture fully integrated

 ---

-# 2. High-Level Architecture
+# 1. System Overview

-User Query
-→ Hybrid Retrieval
-→ Context Assembly
-→ Prompt Builder
-→ LLM
-→ Streaming Response (SSE)
+This system implements a deterministic, governance-stable
+hybrid RAG architecture with:

-Knowledge Flow:
-Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval
+- Symfony (PHP backend)
+- NDJSON as the single source of truth
+- FAISS as the vector index (always a full rebuild)
+- Hybrid retrieval (keyword + vector)
+- Versioned document model
+- Job-based ingest pipeline
+- Lock-protected reindex operations
+- SSE streaming in the frontend
+
+Core principle:
+No incremental vector updates.
+FAISS is always rebuilt completely from index.ndjson.

 ---

-# 3. Directory Structure (Knowledge Layer)
+# 2. Architecture Principles

-```
-var/knowledge/
- ├── uploads/
- ├── chunks/
- ├── index.ndjson
- ├── index_meta.json
- ├── vector.index
- └── vector_meta.json
-```
+Determinism:
+- Same documents + same configuration → identical index.ndjson
+- Same index.ndjson → identical FAISS index
+- Same query + same index → identical retrieval result
+
+Governance:
+- One active version per document
+- No implicit index changes
+- Structural changes force a Global Reindex
+- No self-modification by the AI
+
+Scalability:
+- NDJSON (streamable)
+- No in-memory JSON arrays
+- Target size > 200k chunks

 ---

-# 4. NDJSON Index
+# 3. Knowledge Store

-## 4.1 index.ndjson
+## 3.1 index.ndjson

- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks
+Single Source of Truth.

-Each line contains:
+- One JSON object per line
+- Streaming append
+- Deterministic compaction by document_id
+
+Example structure:

-```json
 {
-  "chunk_id": "uuid",
-  "document_id": "uuid",
-  "version": 3,
-  "text": "...",
-  "meta": { ... }
+"chunk_id": "uuid",
+"document_id": "uuid",
+"document_version_id": "uuid",
+"text": "...",
+"meta": {...}
 }
-```

-NDJSON enables:
- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
+Not a JSON array file.
+No mutation of individual chunks.
+Only append + deterministic removal by document_id.
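
The "append + deterministic removal by document_id" rule can be sketched in a few lines of Python. This is only an illustration under assumptions, not the project's ChunkManager: the function name `replace_document_chunks` and the temp-file strategy are hypothetical. The sketch streams the existing file line by line, skips every line belonging to the given document, appends the new chunks, and then swaps the file into place.

```python
import json
import os
import tempfile
from typing import Iterable


def replace_document_chunks(index_path: str, document_id: str,
                            new_chunks: Iterable[dict]) -> int:
    """Drop all lines of document_id, append new_chunks, return lines written.

    Hypothetical helper for illustration; not part of the documented system.
    """
    directory = os.path.dirname(os.path.abspath(index_path))
    written = 0
    # The temp file lives in the same directory so os.replace is a plain rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".ndjson.tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, "r", encoding="utf-8") as src:
                    for line in src:  # streamed, never loaded as one array
                        if not line.strip():
                            continue
                        if json.loads(line)["document_id"] == document_id:
                            continue  # remove old chunks of this document
                        out.write(line if line.endswith("\n") else line + "\n")
                        written += 1
            for chunk in new_chunks:
                # sort_keys keeps the serialized line stable across runs
                out.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")
                written += 1
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return written
```

Existing lines are copied unchanged, so individual chunks are never mutated; only whole documents are removed and re-appended.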

 ---

-# 5. Index Metadata
+## 3.2 index_meta.json

-## index_meta.json
-
-Managed by:
-
- IndexMetaManager
- IndexConfiguration
-
-Contains:
+Contains the structural parameters:

 - index_version
 - embedding_model
 - embedding_dimension
 - chunk_size
- overlap
+- chunk_overlap
 - scoring_version
 - index_format
+- vector_backend

-If configuration changes → Global Reindex required.
-
-Guarded by:
-`IndexStructureChangedException`
+If any of these values changes:
+→ Global Reindex is mandatory.
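
A minimal sketch of such a structure check, assuming index_meta.json is a flat JSON object holding the keys listed above. The function names are hypothetical and the values used in the usage note are placeholders; how the real system detects drift is not shown in this README.

```python
import json
from typing import List, Mapping

# Keys taken from the list above; a change in any of them means drift.
STRUCTURAL_KEYS = (
    "index_version",
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "chunk_overlap",
    "scoring_version",
    "index_format",
    "vector_backend",
)


def changed_keys(stored: Mapping, configured: Mapping) -> List[str]:
    """Return the structural keys whose stored and configured values differ."""
    return [k for k in STRUCTURAL_KEYS if stored.get(k) != configured.get(k)]


def requires_global_reindex(meta_path: str, configured: Mapping) -> bool:
    """True if index_meta.json no longer matches the configured structure."""
    with open(meta_path, "r", encoding="utf-8") as fh:
        stored = json.load(fh)
    return bool(changed_keys(stored, configured))
```

For example, `changed_keys({"chunk_size": 500}, {"chunk_size": 800})` reports `chunk_size`, which under the rule above would force a Global Reindex.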

 ---

-# 6. Ingest Pipeline
+## 3.3 FAISS

-## 6.1 Core Services
+Files:

-| Service | Responsibility |
-|----------|----------------|
-| DocumentService | Document lifecycle |
-| DocumentVersionRepository | Version persistence |
-| KnowledgeIngestService | Chunk generation |
-| SimpleChunker | Deterministic splitting |
-| TextNormalizer | Text cleanup |
-| StopWords | Keyword filtering |
-| ChunkManager | NDJSON append + compaction |
-| ChunkWriter | Chunk persistence |
-| IngestFlow | Step orchestration |
-| IngestOrchestrator | Full ingest coordination |
-| IngestJobService | Job tracking |
-| LockService | Concurrency guard |
+- vector.index
+- vector_meta.json (chunk ID mapping)
+
+FAISS is ALWAYS built completely from index.ndjson.
+No partial updates.

 ---

-## 6.2 Local Ingest
+# 4. Document & Version Model

-Used when:
- A single document version changes
+Document
+→ contains multiple DocumentVersion entries
+→ exactly one version is active

-Process:
+Rule:
+Only one active version per document may exist at any time.

-1. Extract document
-2. Normalize text
-3. Chunk deterministically
-4. Remove previous chunks of document_id
-5. Append new chunks to index.ndjson
-6. Rebuild FAISS completely
-
-index_version does NOT change.
+When a version is activated:
+- All other versions become inactive
+- IngestStatus → PENDING
+- Re-ingest via job
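
The activation rule can be illustrated with a small in-memory model. This is a sketch under assumptions: the dataclass fields and the function `activate_version` are hypothetical stand-ins for the Doctrine entities, and it assumes that only the newly activated version is set to PENDING.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocumentVersion:
    # Hypothetical fields; the real entity is a Symfony/Doctrine class.
    version_id: str
    active: bool
    ingest_status: str


def activate_version(versions: List[DocumentVersion], version_id: str) -> DocumentVersion:
    """Make exactly one version of a document active and mark it for re-ingest."""
    target = next((v for v in versions if v.version_id == version_id), None)
    if target is None:
        raise ValueError(f"unknown version: {version_id}")
    for version in versions:
        version.active = version is target  # all other versions become inactive
    target.ingest_status = "PENDING"  # the job runner picks it up later
    return target
```

The lookup happens before any mutation, so an unknown version id leaves the existing active version untouched.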

 ---

-## 6.3 Global Reindex
+# 5. Ingest Architecture (fully job-based)

-Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes
+Ingest NEVER runs synchronously within the HTTP request.

-Process:
+Every mutation of the index goes through:

-1. Re-extract all active document versions
-2. Recreate full index.ndjson
-3. Rebuild FAISS
+IngestJob → CLI Runner → IngestOrchestrator → IngestFlow
+
+---
+
+## 5.1 Job Types
+
+DOCUMENT_VERSION_ACTIVATE
+- Used for:
+    - Activating a version
+    - Uploading a new file (auto-ingest)
+
+DOCUMENT
+- Manual ingest of a version
+
+GLOBAL_REINDEX
+- Structural changes
+
+---
+
+## 5.2 Job Status
+
+- QUEUED
+- RUNNING
+- COMPLETED
+- FAILED
+- ABORTED
+
+Jobs are executed via the CLI:
+
+php bin/console mto:agent:ingest:run <jobId>
+
+The job is started asynchronously via exec() from the controller.
+
+---
+
+# 6. Admin Flows (current state)
+
+## 6.1 Uploading a New File (NEW: auto-ingest)
+
+On upload:
+
+1. Save the file
+2. Create Document + version 1
+3. Set version 1 active
+4. Create an IngestJob of type DOCUMENT_VERSION_ACTIVATE
+5. Start the job asynchronously
+6. Redirect to the job detail page
+
+Result:
+New documents are indexed automatically.
+
+---
+
+## 6.2 Activating a Version
+
+1. Update the DB status
+2. IngestStatus → PENDING
+3. Create a DOCUMENT_VERSION_ACTIVATE job
+4. Start the async runner
+5. Redirect to the job page
+
+---
+
+## 6.3 Manual Ingest
+
+1. Create a DOCUMENT job
+2. Start the async runner
+3. Redirect to the job page
+
+---
+
+## 6.4 Reset
+
+Reset deletes:
+
+- index.ndjson
+- vector.index
+- vector_meta.json
+- Upload directory
+- Tables:
+    - document
+    - document_version
+    - ingest_job
+
+Only possible when exec() is enabled.
+
+---
+
+# 7. Ingest Flow Details
+
+Local ingest (single document):
+
+1. Extract
+2. Normalize
+3. Chunk deterministically
+4. Remove old chunks by document_id
+5. Append new chunks
+6. Full FAISS rebuild
+
+Global Reindex:
+
+1. Reprocess all active versions
+2. Rewrite the complete index.ndjson
+3. Rebuild FAISS
 4. index_version++

 ---

-# 7. Vector Architecture
-
-## 7.1 vector_ingest.py
-
-Responsibilities:
-
- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector.meta.json
-
-Execution:
-
-```bash
-python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
-```
-
-Characteristics:
-
- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True
-
---
-
-## 7.2 vector_search.py
-
-Responsibilities:
-
- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON
-
-Execution:
-
-```bash
-python vector_search.py "query" 5
-```
-
-Output:
-
-```json
-[
-  { "chunk_id": "...", "score": 0.82 }
-]
-```
-
---
-
-## 7.3 VectorSearchClient (PHP)
-
- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states
-
---
-
 # 8. Hybrid Retrieval

-## 8.1 Components
+Flow:

-| Class | Role |
-|--------|------|
-| NdjsonHybridRetriever | Orchestrator |
-| NdjsonKeywordSearch | Keyword scoring |
-| NdjsonChunkLookup | Chunk resolution |
-| VectorSearchClient | Vector bridge |
-| CachedRetriever | Cache layer |
+User Query
+→ Keyword Retrieval
+→ FAISS Vector Retrieval
+→ Score Fusion
+→ NDJSON Chunk Lookup
+→ Context Builder
+→ LLM
+→ SSE Streaming
+
+Keyword is the primary signal.
+Vector adds semantic similarity.
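
One way to read "keyword is primary, vector augments" is a weighted sum in which the vector score gets a weight below 1. The sketch below is an assumption for illustration only: the README does not state the actual fusion formula, and `fuse_scores`, `vector_weight`, and `limit` are hypothetical names. Ties are broken by chunk_id so the same inputs always produce the same ranking.

```python
from typing import Dict, List, Tuple


def fuse_scores(keyword: Dict[str, float], vector: Dict[str, float],
                vector_weight: float = 0.3, limit: int = 5) -> List[Tuple[str, float]]:
    """Combine per-chunk keyword and vector scores into one ranked list.

    Assumed formula: fused = keyword + vector_weight * vector.
    """
    fused = {}
    for chunk_id in set(keyword) | set(vector):
        fused[chunk_id] = (keyword.get(chunk_id, 0.0)
                           + vector_weight * vector.get(chunk_id, 0.0))
    # Sort by descending score, then by chunk_id for a deterministic order.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

With `vector_weight` below 1, a chunk found only by vector search can still surface, but it ranks below a chunk with an equal keyword score.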

 ---

-## 8.2 Retrieval Flow
+# 9. Locking & Concurrency

-1. Extract terms (StopWords + normalization)
-2. Keyword scoring
-3. Vector search
-4. Score fusion
-5. Limit to N chunks
-6. Resolve chunk text
-7. Build LLM context
+LockService prevents:

-Keyword score remains primary signal.
-Vector score augments semantic similarity.
+- parallel ingest
+- concurrent reindex operations
+- NDJSON corruption
+
+No concurrent mutations are allowed.
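
A file-based lock is one common way to get this guarantee. The sketch below assumes an exclusive lock file created with `O_CREAT | O_EXCL`; the names `index_lock` and `LockHeldError` are hypothetical, and the actual LockService may work differently.

```python
import os
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another ingest or reindex already holds the lock."""


@contextmanager
def index_lock(lock_path: str):
    """Hold an exclusive lock file for the duration of an index mutation."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeldError(f"lock already held: {lock_path}") from None
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap every mutation, for example `with index_lock("var/knowledge/index.lock"): ...` (the path is a placeholder). A crashed process leaves the lock file behind, so a real implementation would also need stale-lock handling.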

 ---

-# 9. Document Extraction
+# 10. CLI Commands

-Supported via:
+mto:agent:ingest:run <jobId>
+mto:agent:vector:ingest
+mto:agent:vector:search

- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader
-
-Extraction must return clean UTF-8 text.
-Chunking must remain deterministic.
-
---
-
-# 10. Admin Layer (Symfony)
-
-## Controllers
-
- DashboardController
- DocumentController
- IngestJobController
- SecurityController
-
-## Entities
-
- Document
- DocumentVersion
- IngestJob
- User
-
-## Repositories
-
- DocumentVersionRepository
- UserRepository
-
---
-
-# 11. Concurrency & Locks
-
-LockService ensures:
-
- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson
-
-File-based or service-based locking.
-
---
-
-# 12. Determinism Rules
-
-The system guarantees:
-
- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results
-
-No randomness.
-No adaptive mutation.
-No auto-learning.
-
---
-
-# 13. LLM Integration
-
- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable
-
-LLM never has direct access to full knowledge base.
-Only retrieved chunks are injected.
-
---
-
-# 14. Scalability
-
-Designed for:
-
- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage
-
-No full-array JSON loads.
-
---
-
-# 15. Failure Modes
-
-Handled via:
-
- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback
-
---
-
-# 16. Non-Goals
-
-This system intentionally does NOT include:
-
- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts
-
-All structural changes require explicit reindex.
-
---
-
-# 17. Design Philosophy
-
-This is a governance-first RAG architecture:
-
- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled
-
-It prioritizes correctness and control over dynamic mutation.
-
---
-
-# 18. Development Guidelines
-
-When extending the system:
-
- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility
-
---
-
-# 19. CLI Commands (Symfony)
-
-Example:
-
-```bash
-php bin/console mto:agent:vector:ingest
-```
-
-Custom commands follow namespace:
-
-```
+All commands live under:
 mto:agent:*
-```

 ---

-# 20. Summary
+# 11. Failure Modes

-This system is a deterministic, enterprise-grade hybrid RAG engine with:
+- Vector index missing → run vector ingest
+- index_meta mismatch → Global Reindex
+- exec disabled → async start fails
+- Lock active → parallel ingest is blocked

- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails
+---

-It is designed for controlled enterprise deployment, not experimental AI workflows.
+# 12. Non-Goals
+
+- No online learning
+- No incremental FAISS updates
+- No self-modifying prompts
+- No auto-merging of chunks
+
+Structural changes → explicit + reindex.
+
+---
+
+# 13. Summary
+
+This system is:
+
+- deterministic
+- reproducible
+- drift-safe
+- governance-stable
+- enterprise-ready
+- job-based
+- version-safe
+
+Key change:
+New documents now automatically trigger an IngestJob
+(exactly the same mechanism as version activation).
+
+No more HTTP ingest.
+No inline rebuilds.
+Everything runs through the job system.