# mitho AI Agent – Developer Deep Dive

Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

---

# 1. System Overview

This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:

- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.

---

# 2. High-Level Architecture

User Query → Hybrid Retrieval → Context Assembly → Prompt Builder → LLM → Streaming Response (SSE)

Knowledge Flow:

Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval

---

# 3. Directory Structure (Knowledge Layer)

```
var/knowledge/
├── uploads/
├── chunks/
├── index.ndjson
├── index_meta.json
├── vector.index
└── vector_meta.json
```

---

# 4. NDJSON Index

## 4.1 index.ndjson

- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks

Each line contains the following fields (shown here across several lines for readability; in the file, each object occupies a single line):

```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
}
```

NDJSON enables:

- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds

---

# 5. Index Metadata

## index_meta.json

Managed by:

- IndexMetaManager
- IndexConfiguration

Contains:

- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- scoring_version
- index_format

If the configuration changes → Global Reindex required.

Guarded by: `IndexStructureChangedException`

---

# 6. Ingest Pipeline

## 6.1 Core Services

| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |

---

## 6.2 Local Ingest

Used when:

- A single document version changes

Process:

1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely

index_version does NOT change.

---

## 6.3 Global Reindex

Used when:

- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes

Process:

1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
4. index_version++

---

# 7. Vector Architecture

## 7.1 vector_ingest.py

Responsibilities:

- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json

Execution:

```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```

Characteristics:

- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True

---

## 7.2 vector_search.py

Responsibilities:

- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON

Execution:

```bash
python vector_search.py "query" 5
```

Output:

```json
[
  { "chunk_id": "...", "score": 0.82 }
]
```

---

## 7.3 VectorSearchClient (PHP)

- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states

---

# 8. Hybrid Retrieval

## 8.1 Components

| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |

---

## 8.2 Retrieval Flow

1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context

The keyword score remains the primary signal.
The vector score augments it with semantic similarity.

---

# 9. Document Extraction

Supported via:

- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader

Extraction must return clean UTF-8 text.
Chunking must remain deterministic.

---

# 10. Admin Layer (Symfony)

## Controllers

- DashboardController
- DocumentController
- IngestJobController
- SecurityController

## Entities

- Document
- DocumentVersion
- IngestJob
- User

## Repositories

- DocumentVersionRepository
- UserRepository

---

# 11. Concurrency & Locks

LockService ensures:

- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson

Locking is file-based or service-based.

---

# 12. Determinism Rules

The system guarantees:

- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS index
- Same query + same index = identical results

No randomness.
No adaptive mutation.
No auto-learning.

---

# 13. LLM Integration

- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable

The LLM never has direct access to the full knowledge base.
Only retrieved chunks are injected.

---

# 14. Scalability

Designed for:

- More than 200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage

No full-array JSON loads.

---

# 15. Failure Modes

Handled via:

- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback

---

# 16. Non-Goals

This system intentionally does NOT include:

- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts

All structural changes require an explicit reindex.

---

# 17. Design Philosophy

This is a governance-first RAG architecture:

- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled

It prioritizes correctness and control over dynamic mutation.

---

# 18. Development Guidelines

When extending the system:

- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility

---

# 19. CLI Commands (Symfony)

Example:

```bash
php bin/console mto:agent:vector:ingest
```

Custom commands follow the namespace:

```
mto:agent:*
```

---

# 20. Summary

This system is a deterministic, enterprise-grade hybrid RAG engine with:

- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails

It is designed for controlled enterprise deployment, not experimental AI workflows.
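
To make the Local Ingest steps from section 6.2 concrete (remove the previous chunks of a `document_id`, then append the new chunks to `index.ndjson`), the following is a minimal Python sketch. It is not the project's `ChunkManager`, which is a PHP service whose code is not shown in this document. The function names `iter_chunks` and `replace_document_chunks`, and the strategy of writing to a temporary file and then renaming it, are assumptions chosen for illustration.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator


def iter_chunks(index_path: str) -> Iterator[dict]:
    """Yield one chunk record per non-empty line, never loading the whole file."""
    with open(index_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def replace_document_chunks(
    index_path: str, document_id: str, new_chunks: Iterable[dict]
) -> int:
    """Drop all chunks of document_id, append new_chunks, return the removed count.

    Hypothetical helper (an assumption, not the project's API): the rewritten
    index goes to a temp file in the same directory and is swapped in with
    os.replace, so a reader never sees a half-written index.
    """
    directory = os.path.dirname(os.path.abspath(index_path))
    removed = 0
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".ndjson.tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, "r", encoding="utf-8") as src:
                    for line in src:
                        if not line.strip():
                            continue  # skip blank lines
                        if json.loads(line).get("document_id") == document_id:
                            removed += 1  # compaction: old chunks are dropped
                            continue
                        # Retained records are copied verbatim, in their original order.
                        out.write(line if line.endswith("\n") else line + "\n")
            for chunk in new_chunks:
                # One JSON object per line, no array wrapper.
                out.write(json.dumps(chunk, ensure_ascii=False) + "\n")
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return removed
```

In the pipeline described above, a call like this would run while the ingest lock is held and would be followed by a full FAISS rebuild. `index_version` stays unchanged, since only one document's chunks were replaced.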