# mitho AI Agent Developer Deep Dive
Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)
---
# 1. System Overview
This system implements a deterministic, governance-stable Retrieval-Augmented Generation (RAG) architecture based on:
- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection
No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.
---
# 2. High-Level Architecture
User Query
→ Hybrid Retrieval
→ Context Assembly
→ Prompt Builder
→ LLM
→ Streaming Response (SSE)
Knowledge Flow:
Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval
---
# 3. Directory Structure (Knowledge Layer)
```
var/knowledge/
├── uploads/
├── chunks/
├── index.ndjson
├── index_meta.json
├── vector.index
└── vector_meta.json
```
---
# 4. NDJSON Index
## 4.1 index.ndjson
- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks
Each line contains:
```json
{
"chunk_id": "uuid",
"document_id": "uuid",
"version": 3,
"text": "...",
"meta": { ... }
}
```
NDJSON enables:
- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
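
The streaming property can be illustrated with a minimal sketch. It assumes the record layout shown above; `iter_chunks` is a hypothetical helper name, not part of the codebase, and the sample data is invented for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_chunks(stream: TextIO) -> Iterator[dict]:
    """Yield one chunk record per non-empty NDJSON line.

    Only one line is held in memory at a time, so the index
    never has to be loaded as a single JSON array.
    """
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_no}") from exc


# Illustrative in-memory stand-in for index.ndjson (invented data).
sample = io.StringIO(
    '{"chunk_id": "c1", "document_id": "d1", "version": 3, "text": "alpha", "meta": {}}\n'
    '{"chunk_id": "c2", "document_id": "d1", "version": 3, "text": "beta", "meta": {}}\n'
)
records = list(iter_chunks(sample))
```

In real use the stream would be an open file handle on `index.ndjson`, consumed lazily rather than collected into a list.
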
---
# 5. Index Metadata
## index_meta.json
Managed by:
- IndexMetaManager
- IndexConfiguration
Contains:
- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- scoring_version
- index_format
If any of these configuration values changes, a Global Reindex is required.
Guarded by:
`IndexStructureChangedException`
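
A minimal sketch of such a guard, written in Python for illustration only. The real guard lives in the PHP services named above; `TRACKED_KEYS`, `IndexStructureChanged`, and `assert_structure_unchanged` are hypothetical names, and the sample values are invented.

```python
# Keys assumed to trigger a Global Reindex when they change
# (taken from the index_meta.json field list, minus index_version).
TRACKED_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "overlap",
    "scoring_version",
    "index_format",
)


class IndexStructureChanged(Exception):
    """Hypothetical Python stand-in for IndexStructureChangedException."""


def assert_structure_unchanged(stored_meta: dict, current_config: dict) -> None:
    """Raise if any tracked key differs between stored meta and live config."""
    changed = [
        key for key in TRACKED_KEYS
        if stored_meta.get(key) != current_config.get(key)
    ]
    if changed:
        raise IndexStructureChanged(
            "global reindex required; changed: " + ", ".join(changed)
        )


# Invented example values.
stored = {
    "index_version": 7,
    "embedding_model": "example-model",
    "embedding_dimension": 384,
    "chunk_size": 800,
    "overlap": 100,
    "scoring_version": 1,
    "index_format": "ndjson",
}
current = dict(stored, chunk_size=1000)

try:
    assert_structure_unchanged(stored, current)
    drift_message = ""
except IndexStructureChanged as exc:
    drift_message = str(exc)
```

Note that `index_version` is deliberately excluded from the comparison, since it is a result of a reindex rather than an input to it.
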
---
# 6. Ingest Pipeline
## 6.1 Core Services
| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |
---
## 6.2 Local Ingest
Used when:
- A single document version changes
Process:
1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely
index_version does NOT change.
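
Steps 4 and 5 can be sketched as a single streaming pass. This is an illustration under assumptions, not the `ChunkManager` implementation: `compact_and_append` is a hypothetical name, and a real implementation would likely write to a temporary file and swap it in while holding the ingest lock.

```python
import json
from typing import Iterable, Iterator


def compact_and_append(
    lines: Iterable[str],
    document_id: str,
    new_chunks: Iterable[dict],
) -> Iterator[str]:
    """Drop all existing chunks of document_id, then emit the new ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if json.loads(line)["document_id"] == document_id:
            continue  # step 4: remove previous chunks of this document
        yield line
    for chunk in new_chunks:
        # step 5: append; sort_keys keeps the serialized form stable
        yield json.dumps(chunk, sort_keys=True)


# Invented example data.
existing = [
    '{"chunk_id": "a", "document_id": "d1", "version": 3, "text": "old", "meta": {}}',
    '{"chunk_id": "b", "document_id": "d2", "version": 1, "text": "keep", "meta": {}}',
]
fresh = [{"chunk_id": "c", "document_id": "d1", "version": 4, "text": "new", "meta": {}}]
rebuilt = list(compact_and_append(existing, "d1", fresh))
```
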
---
## 6.3 Global Reindex
Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes
Process:
1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
4. index_version++
---
# 7. Vector Architecture
## 7.1 vector_ingest.py
Responsibilities:
- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json
Execution:
```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```
Characteristics:
- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True
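
The pairing of normalized embeddings with an inner-product index matters because, for unit-length vectors, the inner product equals cosine similarity. The sketch below shows that relationship in plain Python without FAISS; the function names and toy vectors are invented for illustration and are not taken from `vector_ingest.py`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Exhaustive inner-product search over normalized vectors."""
    q = l2_normalize(query)
    scored = [
        (chunk_id, inner_product(q, l2_normalize(vec)))
        for chunk_id, vec in vectors.items()
    ]
    # Tie-break on chunk_id so equal scores always rank the same way.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 0.0]}
hits = top_k([3.0, 0.0], toy_vectors, k=2)
```

Here `a` and `c` point in the same direction as the query, so both score 1.0 after normalization despite their different magnitudes.
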
---
## 7.2 vector_search.py
Responsibilities:
- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON
Execution:
```bash
python vector_search.py "query" 5
```
Output:
```json
[
{ "chunk_id": "...", "score": 0.82 }
]
```
---
## 7.3 VectorSearchClient (PHP)
- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states
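
The bridge pattern (run the script, enforce a timeout, parse the JSON) can be sketched as follows. This is a Python illustration of the pattern only; the actual client is PHP, and `run_vector_search` plus the inline stand-in script are hypothetical.

```python
import json
import subprocess
import sys
from typing import Dict, List


def run_vector_search(command: List[str], timeout_seconds: float = 10.0) -> List[Dict]:
    """Run the search script and return its parsed, type-checked results."""
    try:
        proc = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError("vector search timed out") from exc
    if proc.returncode != 0:
        raise RuntimeError(f"vector search failed: {proc.stderr.strip()}")
    raw = json.loads(proc.stdout)
    return [
        {"chunk_id": str(item["chunk_id"]), "score": float(item["score"])}
        for item in raw
    ]


# Stand-in for: ["python", "vector_search.py", "query", "5"].
# An inline script is used here so the sketch runs without the real index.
fake_script = 'import json; print(json.dumps([{"chunk_id": "c1", "score": 0.82}]))'
results = run_vector_search([sys.executable, "-c", fake_script])
```
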
---
# 8. Hybrid Retrieval
## 8.1 Components
| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |
---
## 8.2 Retrieval Flow
1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context
The keyword score remains the primary signal.
The vector score adds semantic similarity as a secondary signal.
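
One way to express "keyword primary, vector secondary" is a weighted sum in which the keyword weight dominates. The sketch below is an assumption about how fusion could work, not the documented scoring logic: the weights 0.7 and 0.3, the function name, and the requirement that both scores lie in [0, 1] are all hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    keyword_weight: float = 0.7,  # assumed value, keyword stays primary
    vector_weight: float = 0.3,   # assumed value, vector augments
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Combine both signals and return the top `limit` chunk ids."""
    chunk_ids = set(keyword_scores) | set(vector_scores)
    fused = {
        cid: keyword_weight * keyword_scores.get(cid, 0.0)
        + vector_weight * vector_scores.get(cid, 0.0)
        for cid in chunk_ids
    }
    # Sort by score descending, then chunk_id, so ties are deterministic.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


ranked = fuse_scores(
    keyword_scores={"c1": 1.0, "c2": 0.5},
    vector_scores={"c2": 0.9, "c3": 0.8},
)
```

The explicit tie-break on `chunk_id` matters for the determinism rules: iteration order over a set is not a stable ranking criterion.
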
---
# 9. Document Extraction
Supported via:
- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader
Extraction must return clean UTF-8 text.
Chunking must remain deterministic.
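
A minimal sketch of deterministic splitting with a fixed size and overlap. It assumes character-based windows; the real `SimpleChunker` may split on different units, and `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    The output depends only on the three inputs, so the same text and
    configuration always produce the same chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```

Because `chunk_size` and `overlap` fully determine the boundaries, changing either one shifts every chunk, which is why both are tracked in `index_meta.json`.
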
---
# 10. Admin Layer (Symfony)
## Controllers
- DashboardController
- DocumentController
- IngestJobController
- SecurityController
## Entities
- Document
- DocumentVersion
- IngestJob
- User
## Repositories
- DocumentVersionRepository
- UserRepository
---
# 11. Concurrency & Locks
LockService ensures:
- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson
Locking can be file-based or service-based.
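
The file-based variant can be sketched with an exclusive-create lock file. This is an illustration, not the `LockService` implementation: `exclusive_lock`, `LockCollision`, and the `reindex.lock` file name are hypothetical, and the sketch does not handle stale locks left behind by a crashed process.

```python
import os
import tempfile
from contextlib import contextmanager


class LockCollision(Exception):
    """Raised when another holder already owns the lock file."""


@contextmanager
def exclusive_lock(lock_path: str):
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockCollision(f"lock already held: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "reindex.lock")
    with exclusive_lock(lock_path):
        try:
            with exclusive_lock(lock_path):
                collided = False
        except LockCollision:
            collided = True
    released = not os.path.exists(lock_path)
```
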
---
# 12. Determinism Rules
The system guarantees:
- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results
No randomness.
No adaptive mutation.
No auto-learning.
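
The first guarantee can be checked mechanically by hashing the index after each build and comparing digests. A minimal sketch, assuming SHA-256 over the raw bytes; `fingerprint` is a hypothetical helper and the sample bytes are invented.

```python
import hashlib
import io
from typing import BinaryIO


def fingerprint(stream: BinaryIO, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a stream, read in fixed-size blocks."""
    digest = hashlib.sha256()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest.update(block)
    return digest.hexdigest()


# Two builds from the same documents and config should match byte for byte.
build_a = io.BytesIO(b'{"chunk_id": "c1", "text": "alpha"}\n')
build_b = io.BytesIO(b'{"chunk_id": "c1", "text": "alpha"}\n')
build_c = io.BytesIO(b'{"chunk_id": "c1", "text": "alpha!"}\n')

same = fingerprint(build_a) == fingerprint(build_b)
different = fingerprint(build_b_copy := io.BytesIO(build_b.getvalue())) != fingerprint(build_c)
```

A byte-level comparison only holds if serialization is stable as well, for example fixed key order and no embedded timestamps in the records.
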
---
# 13. LLM Integration
- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable
LLM never has direct access to full knowledge base.
Only retrieved chunks are injected.
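
A minimal sketch of context assembly from retrieved chunks under a size budget. It is an assumption about one possible approach, not the `PromptBuilder` implementation: `build_context`, the bracketed id format, and the character budget are hypothetical.

```python
from typing import Dict, List


def build_context(chunks: List[Dict[str, str]], max_chars: int) -> str:
    """Concatenate retrieved chunks in rank order until the budget is used up.

    The budget counts the chunk blocks only, not the separators between them.
    """
    parts: List[str] = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['chunk_id']}]\n{chunk['text']}"
        if used + len(block) > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


retrieved = [
    {"chunk_id": "c1", "text": "alpha"},
    {"chunk_id": "c2", "text": "beta"},
]
small_context = build_context(retrieved, max_chars=12)
full_context = build_context(retrieved, max_chars=100)
```

Since the function reads nothing except its arguments, the same retrieved chunks always yield the same context string.
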
---
# 14. Scalability
Designed for:
- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage
No full-array JSON loads.
---
# 15. Failure Modes
Handled via:
- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback
---
# 16. Non-Goals
This system intentionally does NOT include:
- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts
All structural changes require explicit reindex.
---
# 17. Design Philosophy
This is a governance-first RAG architecture:
- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled
It prioritizes correctness and control over dynamic mutation.
---
# 18. Development Guidelines
When extending the system:
- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility
---
# 19. CLI Commands (Symfony)
Example:
```bash
php bin/console mto:agent:vector:ingest
```
Custom commands follow namespace:
```
mto:agent:*
```
---
# 20. Summary
This system is a deterministic, enterprise-grade hybrid RAG engine with:
- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails
It is designed for controlled enterprise deployment, not experimental AI workflows.