harden code and add messenger services and new README.md and SYSTEM.md

2026-02-15 14:36:04 +01:00
parent 993531b268
commit 5b100039e0
8 changed files with 865 additions and 215 deletions
--- a/README.md
+++ b/README.md
@@ -1,250 +1,430 @@
-# mitho AI Agent (Alpha Version)
-**Hybrid RAG System auf Symfony-Basis mit Vektor- & Keyword-Retrieval**
+# mitho AI Agent – Developer Deep Dive
+
+Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

 ---

-## Überblick
+# 1. System Overview

-Der **mitho AI Agent** ist ein produktionsreifes, Symfony-basiertes RAG-System (Retrieval Augmented Generation), das KI nicht frei „raten“ lässt, sondern Antworten strikt auf Basis eines kontrollierten Wissenspools erzeugt.
+This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:

-> **Leitsatz:**  
-> *„Wir nutzen KI nicht, um kreativ zu raten, sondern um verlässlich auf Basis Ihres Wissens zu antworten.“*
+- Symfony (PHP backend)
+- NDJSON-based knowledge index
+- Full FAISS vector rebuild strategy
+- Hybrid retrieval (keyword + vector)
+- Deterministic ingest pipeline
+- Strict versioning & guardrails
+- Lock-based reindex protection

-Das System kombiniert:
-
- Large Language Model (LLM, z. B. Qwen via Ollama)
- Keyword-basiertes Retrieval
- FAISS-Vektor-Suche
- Versionierte Wissensstruktur (Chunks + Index)
- Streaming-Ausgabe via Server-Sent Events (SSE)
- Persistente Chat-Historie pro Client
+No incremental vector mutation is allowed.  
+FAISS is always rebuilt from `index.ndjson`.

 ---

-# Architektur
+# 2. High-Level Architecture

-## 1. Backend
+User Query
+→ Hybrid Retrieval
+→ Context Assembly
+→ Prompt Builder
+→ LLM
+→ Streaming Response (SSE)

-**Technologie**
-
- PHP 8.2+
- Symfony 7.4
- Monolog Logging
- Symfony Cache
- Session Support
-
-### Zentrale Komponenten
-
-| Komponente | Aufgabe |
-|------------|----------|
-| `AgentRunner` | Orchestriert Prompt, Kontext & LLM |
-| `PromptBuilder` | Baut System- & User-Prompt |
-| `ContextService` | Historienverwaltung |
-| `ChunkKeywordRetriever` | Keyword-Scoring |
-| `VectorSearchClient` | Python-FAISS-Anbindung |
-| `KnowledgeIngestService` | Dokument → Chunks |
-| `ChunkIndexWriter` | index.json Verwaltung |
-| `CachedRetriever` | Performance-Optimierung |
+Knowledge Flow:
+Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval

 ---

-## 2. Hybrid Retrieval (Produktionsarchitektur)
-
-Das System nutzt eine **hybride Sucharchitektur**:
-
-### A) Keyword-Retrieval (führend)
-
- Stopword-Filter
- Lemma-Logik
- Score-Berechnung
- deterministische Gewichtung
-
-### B) Vektor-Retrieval (ergänzend)
-
- SentenceTransformer: `all-MiniLM-L6-v2`
- FAISS Index (Inner Product)
- Normalisierte Embeddings
- Top-K Suche
-
-### Retrieval-Flow
-
-1. User Prompt
-2. Keyword-Scoring
-3. FAISS-Suche
-4. Score-Fusion
-5. Top-N Chunks
-6. Kontextaufbau
-7. LLM-Antwort
-
---
-
-## 3. Wissensarchitektur
+# 3. Directory Structure (Knowledge Layer)

 ```
 var/knowledge/
 ├── uploads/
 ├── chunks/
- ├── manifest.json
- └── index.json
+ ├── index.ndjson
+ ├── index_meta.json
+ ├── vector.index
+ └── vector_meta.json
 ```

-### Prinzipien
-
- Dokumente sind Primärquelle
- Chunks sind abgeleitete Artefakte
- `index.json` ist Single Source of Truth
- Re-Ingest ist deterministisch
- Keine manuelle Chunk-Manipulation
-
 ---

-## 4. Vektor-Ingest
+# 4. NDJSON Index

-CLI Command:
+## 4.1 index.ndjson
+
+- Single Source of Truth
+- One JSON object per line
+- Streaming-readable
+- No JSON array wrapper
+- Scales beyond 200k chunks
+
+Each line contains:
+
+```json
+{
+  "chunk_id": "uuid",
+  "document_id": "uuid",
+  "version": 3,
+  "text": "...",
+  "meta": { ... }
+}
+```
+
+NDJSON enables:
+- Append-based writes
+- Compaction per document
+- Memory-safe streaming
+- Deterministic rebuilds
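
Editor's sketch (not part of the commit): the streaming property listed above could look roughly like this in Python. The function name `iter_chunks` is hypothetical; only the one-object-per-line layout is taken from the README text.

```python
import json
from typing import Iterator, TextIO


def iter_chunks(stream: TextIO) -> Iterator[dict]:
    """Yield one chunk record per non-empty line without loading the whole file."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a damaged index can be located.
            raise ValueError(f"invalid NDJSON on line {line_number}: {exc}") from exc
```

Because the function is a generator, memory use depends on the largest single record rather than on the number of chunks. A caller would typically pass an open file handle, for example one opened on `var/knowledge/index.ndjson`.
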
+
+---
+
+# 5. Index Metadata
+
+## index_meta.json
+
+Managed by:
+
+- IndexMetaManager
+- IndexConfiguration
+
+Contains:
+
+- index_version
+- embedding_model
+- embedding_dimension
+- chunk_size
+- overlap
+- scoring_version
+- index_format
+
+If configuration changes → Global Reindex required.
+
+Guarded by:
+`IndexStructureChangedException`
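
Editor's sketch (not part of the commit): one way the guard described above might compare stored metadata against the active configuration. The real check lives in PHP (`IndexMetaManager`, `IndexConfiguration`); the Python class below only borrows the exception name, and `changed_keys` and `assert_index_compatible` are hypothetical helpers. `index_version` is left out of the comparison on the assumption that it is a counter rather than a configuration value.

```python
class IndexStructureChangedException(Exception):
    """Python stand-in for the PHP exception of the same name."""


# Fields listed in index_meta.json that describe the index structure.
STRUCTURAL_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "overlap",
    "scoring_version",
    "index_format",
)


def changed_keys(stored_meta: dict, current_config: dict) -> list[str]:
    """Return the structural keys whose values differ, in a fixed order."""
    return [
        key
        for key in STRUCTURAL_KEYS
        if stored_meta.get(key) != current_config.get(key)
    ]


def assert_index_compatible(stored_meta: dict, current_config: dict) -> None:
    """Raise if the on-disk index was built with a different configuration."""
    drift = changed_keys(stored_meta, current_config)
    if drift:
        raise IndexStructureChangedException(
            "global reindex required, changed: " + ", ".join(drift)
        )
```
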
+
+---
+
+# 6. Ingest Pipeline
+
+## 6.1 Core Services
+
+| Service | Responsibility |
+|----------|----------------|
+| DocumentService | Document lifecycle |
+| DocumentVersionRepository | Version persistence |
+| KnowledgeIngestService | Chunk generation |
+| SimpleChunker | Deterministic splitting |
+| TextNormalizer | Text cleanup |
+| StopWords | Keyword filtering |
+| ChunkManager | NDJSON append + compaction |
+| ChunkWriter | Chunk persistence |
+| IngestFlow | Step orchestration |
+| IngestOrchestrator | Full ingest coordination |
+| IngestJobService | Job tracking |
+| LockService | Concurrency guard |
+
+---
+
+## 6.2 Local Ingest
+
+Used when:
+- A single document version changes
+
+Process:
+
+1. Extract document
+2. Normalize text
+3. Chunk deterministically
+4. Remove previous chunks of document_id
+5. Append new chunks to index.ndjson
+6. Rebuild FAISS completely
+
+index_version does NOT change.
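
Editor's sketch (not part of the commit): steps 4 and 5 of the local ingest, written as a single streaming pass. The function name `replace_document_chunks` is hypothetical, and writing to a temporary file followed by an atomic rename is an assumption about how compaction could be made crash-safe, not a description of `ChunkManager`.

```python
import json
import os
import tempfile


def replace_document_chunks(index_path: str, document_id: str, new_chunks: list[dict]) -> int:
    """Drop all records of document_id, append new_chunks, return the number removed."""
    removed = 0
    directory = os.path.dirname(os.path.abspath(index_path))
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, encoding="utf-8") as src:
                    for line in src:  # stream line by line, never load the full index
                        if not line.strip():
                            continue
                        if json.loads(line).get("document_id") == document_id:
                            removed += 1
                            continue
                        out.write(line if line.endswith("\n") else line + "\n")
            for chunk in new_chunks:
                out.write(json.dumps(chunk, ensure_ascii=False) + "\n")
        os.replace(tmp_path, index_path)  # swap in the compacted file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return removed
```

Records of other documents keep their original order, and the new chunks land at the end, which matches the "remove, then append" sequence in the process list.
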
+
+---
+
+## 6.3 Global Reindex
+
+Used when:
+- Embedding model changes
+- Chunk size changes
+- Overlap changes
+- Scoring logic changes
+- index_format changes
+
+Process:
+
+1. Re-extract all active document versions
+2. Recreate full index.ndjson
+3. Rebuild FAISS
+4. index_version++
+
+---
+
+# 7. Vector Architecture
+
+## 7.1 vector_ingest.py
+
+Responsibilities:
+
+- Stream-read index.ndjson
+- Extract text + chunk_id
+- Build embeddings
+- Normalize embeddings
+- Build FAISS IndexFlatIP
+- Write vector.index
+- Write vector_meta.json
+
+Execution:
+
+```bash
+python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
+```
+
+Characteristics:
+
+- No partial updates
+- No incremental mutation
+- Always full rebuild
+- Batch size = 64
+- normalize_embeddings=True
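
Editor's sketch (not part of the commit): why normalized embeddings pair naturally with an inner-product index. For unit-length vectors the inner product equals cosine similarity, so the score returned by an inner-product search can be read as a cosine. The helper names are hypothetical, and the example uses plain Python lists instead of FAISS or sentence-transformers to stay dependency-free.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def inner_product(a: list[float], b: list[float]) -> float:
    """Plain dot product, the metric an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Reference cosine similarity computed on the raw vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (norm_a * norm_b)
```
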
+
+---
+
+## 7.2 vector_search.py
+
+Responsibilities:
+
+- Load vector.index
+- Load vector_meta.json
+- Encode query
+- Search top-K
+- Return JSON
+
+Execution:
+
+```bash
+python vector_search.py "query" 5
+```
+
+Output:
+
+```json
+[
+  { "chunk_id": "...", "score": 0.82 }
+]
+```
+
+---
+
+## 7.3 VectorSearchClient (PHP)
+
+- Executes Python search script
+- Parses JSON response
+- Returns structured results
+- Handles timeout + error states
+
+---
+
+# 8. Hybrid Retrieval
+
+## 8.1 Components
+
+| Class | Role |
+|--------|------|
+| NdjsonHybridRetriever | Orchestrator |
+| NdjsonKeywordSearch | Keyword scoring |
+| NdjsonChunkLookup | Chunk resolution |
+| VectorSearchClient | Vector bridge |
+| CachedRetriever | Cache layer |
+
+---
+
+## 8.2 Retrieval Flow
+
+1. Extract terms (StopWords + normalization)
+2. Keyword scoring
+3. Vector search
+4. Score fusion
+5. Limit to N chunks
+6. Resolve chunk text
+7. Build LLM context
+
+Keyword score remains primary signal.
+Vector score augments semantic similarity.
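
Editor's sketch (not part of the commit): steps 4 and 5 of the flow above under an assumed fusion rule. The README does not state the formula; here the keyword score is kept at full weight and the vector score is added with a smaller weight, so keyword matches stay the primary signal. `fuse_scores`, `vector_weight`, and the default values are hypothetical.

```python
def fuse_scores(
    keyword_scores: dict[str, float],
    vector_scores: dict[str, float],
    vector_weight: float = 0.3,  # assumed value; below 1 so keywords dominate
    limit: int = 5,
) -> list[tuple[str, float]]:
    """Combine both score maps and return the top `limit` (chunk_id, score) pairs."""
    chunk_ids = set(keyword_scores) | set(vector_scores)
    fused = {
        chunk_id: keyword_scores.get(chunk_id, 0.0)
        + vector_weight * vector_scores.get(chunk_id, 0.0)
        for chunk_id in chunk_ids
    }
    # Sort by score descending, then by chunk_id, so ties resolve deterministically.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The secondary sort key matters for the determinism rules: without it, chunks with equal scores could come back in a different order between runs.
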
+
+---
+
+# 9. Document Extraction
+
+Supported via:
+
+- DocumentExtractorInterface
+- ExtractorResolver
+- PdfExtractor
+- DocumentLoader
+
+Extraction must return clean UTF-8 text.
+Chunking must remain deterministic.
+
+---
+
+# 10. Admin Layer (Symfony)
+
+## Controllers
+
+- DashboardController
+- DocumentController
+- IngestJobController
+- SecurityController
+
+## Entities
+
+- Document
+- DocumentVersion
+- IngestJob
+- User
+
+## Repositories
+
+- DocumentVersionRepository
+- UserRepository
+
+---
+
+# 11. Concurrency & Locks
+
+LockService ensures:
+
+- No parallel reindex
+- No parallel ingest conflict
+- Controlled mutation of index.ndjson
+
+File-based or service-based locking.
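
Editor's sketch (not part of the commit): a file-based variant of such a lock. It relies on exclusive file creation, which fails if the lock file already exists. `exclusive_lock` and `LockCollision` are hypothetical names and say nothing about how `LockService` is implemented.

```python
import os
from contextlib import contextmanager


class LockCollision(RuntimeError):
    """Raised when another process already holds the lock."""


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold lock_path for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file exists, so only one holder wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockCollision(f"lock already held: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A crashed process leaves the lock file behind in this scheme, so a production version would need some form of stale-lock detection.
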
+
+---
+
+# 12. Determinism Rules
+
+The system guarantees:
+
+- Same documents + same config = identical index.ndjson
+- Same index.ndjson = identical FAISS
+- Same query + same index = identical results
+
+No randomness.
+No adaptive mutation.
+No auto-learning.
+
+---
+
+# 13. LLM Integration
+
+- Context strictly limited to retrieved chunks
+- PromptBuilder constructs deterministic system prompt
+- ContextService manages history
+- SSE streaming enabled
+- Model endpoint configurable
+
+LLM never has direct access to full knowledge base.
+Only retrieved chunks are injected.
+
+---
+
+# 14. Scalability
+
+Designed for:
+
+- >200k chunks
+- Streaming NDJSON reads
+- Full FAISS rebuild
+- Cache layer for retrieval
+- Controlled memory usage
+
+No full-array JSON loads.
+
+---
+
+# 15. Failure Modes
+
+Handled via:
+
+- Missing vector index detection
+- Structure drift detection
+- Lock collision detection
+- Embedding dependency checks
+- Python execution errors
+- Empty chunk fallback
+
+---
+
+# 16. Non-Goals
+
+This system intentionally does NOT include:
+
+- Online learning
+- Embedding mutation
+- Incremental FAISS update
+- Auto chunk merging
+- Self-modifying prompts
+
+All structural changes require explicit reindex.
+
+---
+
+# 17. Design Philosophy
+
+This is a governance-first RAG architecture:
+
+- Deterministic
+- Reproducible
+- Drift-safe
+- Audit-friendly
+- Version-controlled
+
+It prioritizes correctness and control over dynamic mutation.
+
+---
+
+# 18. Development Guidelines
+
+When extending the system:
+
+- Never mutate FAISS directly
+- Never edit index.ndjson manually
+- Always preserve determinism
+- Increment index_version only via Global Reindex
+- Guard all structural changes
+- Maintain streaming compatibility
+
+---
+
+# 19. CLI Commands (Symfony)
+
+Example:

 ```bash
 php bin/console mto:agent:vector:ingest
 ```

-Ablauf:
-
-1. index.json lesen
-2. Chunk-Texte laden
-3. Embeddings erzeugen
-4. FAISS Index erstellen
-5. vector.index speichern
-6. vector_meta.json schreiben
-
---
-
-## 5. LLM-Anbindung
-
-Standardmäßig via Ollama.
-
-Konfiguration über ENV:
+Custom commands follow namespace:

 ```
-AI_LLM_API_URL=
-AI_LLM_MODEL=
-AI_LLM_TIMEOUT=
-AI_DEBUG=
-AI_LOG_PROMPT=
-AI_LOG_CONTEXT=
-AI_HISTORY_DIR=
+mto:agent:*
 ```

-Features:
-
- Streaming-fähig
- Konfigurierbarer Timeout
- Denkmodus unterdrückbar
- Historienintegration
-
 ---

-## 6. Frontend
+# 20. Summary

-Technologie:
+This system is a deterministic, enterprise-grade hybrid RAG engine with:

- Bootstrap
- Marked (Markdown)
- DOMPurify
- SSE Streaming
+- NDJSON-based streaming index
+- Full FAISS rebuild strategy
+- Structured ingest pipeline
+- Hybrid retrieval
+- Admin governance layer
+- Strict guardrails

-Features:
-
- Live-Streaming
- Markdown-Rendering
- Abbruch-Funktion
- Chat-Verlauf
- Client-ID per Cookie
- Verlaufslöschung
-
---
-
-## 7. Logging & Debug
-
-Log-Datei:
-
-```
-var/log/agent.log
-```
-
-Optional aktivierbar:
-
- Prompt Logging
- Kontext Logging
- Debug-Modus
-
---
-
-# Sicherheit & Governance
-
- Rollenmodell (Super Admin / Knowledge Admin / Redaktion)
- Versionierte Dokumente
- Versionierte Ingest-Profile
- Versionierte System-Prompts
- KI-Endpunkt abstrahiert
- Audit-Logs
- Lock-Mechanismen bei Reindex
-
---
-
-# Produktstatus
-
-Das System ist:
-
- Produktionsreif
- Framework-neutral
- Kundenfähig
- Skalierbar
- Erweiterbar (Adminbereich geplant)
-
-Nicht enthalten:
-
- Autonomes Fine-Tuning
- Live-Lernsystem
- Self-Modifying Knowledge
-
---
-
-# Unterschied zu generischen KI-Tools
-
-| Generische KI | mitho AI Agent |
-|---------------|----------------|
-| trainiert auf Internet | basiert auf Ihrem Wissen |
-| keine Governance | volle Kontrolle |
-| keine Versionierung | Dokument-Versionierung |
-| nicht nachvollziehbar | transparente Wissensbasis |
-| generisch | unternehmensspezifisch |
-
---
-
-# Mindestanforderungen
-
- PHP 8.2+
- Python 3.9+
- faiss
- sentence-transformers
- Ollama (oder kompatibles LLM)
-
---
-
-# Vision
-
-Dieses System bildet die Grundlage für:
-
- Agentic Commerce
- Interne Wissenssysteme
- Support-Automatisierung
- Vertriebsassistenz
- Technische Dokumentations-KI
- DSGVO-konforme Unternehmens-KI
-
---
-
-# Fazit
-
-Der mitho AI Agent ist kein Spielzeug-Chatbot.
-
-Er ist ein strukturiertes, kontrolliertes KI-System mit klarer Wissensbasis, deterministischem Retrieval und professioneller Architektur – gebaut für produktiven Unternehmenseinsatz.
+It is designed for controlled enterprise deployment, not experimental AI workflows.