# mitho AI Agent – Developer Deep Dive

Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

---

# 1. System Overview

This system implements a deterministic, governance-stable Retrieval-Augmented Generation (RAG) architecture based on:

- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.

---

# 2. High-Level Architecture

User Query
→ Hybrid Retrieval
→ Context Assembly
→ Prompt Builder
→ LLM
→ Streaming Response (SSE)

Knowledge Flow:
Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval

---

# 3. Directory Structure (Knowledge Layer)

```
var/knowledge/
├── uploads/
├── chunks/
├── index.ndjson
├── index_meta.json
├── vector.index
└── vector_meta.json
```

---

# 4. NDJSON Index

## 4.1 index.ndjson

- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks

Each line contains one object of this shape (pretty-printed here for readability; stored on a single line):

```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
}
```

NDJSON enables:

- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
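
To illustrate the streaming property, the following Python sketch reads an NDJSON index one line at a time instead of loading a JSON array. The function name `iter_chunks` and the sample records are hypothetical; only the field names are taken from the record format above.

```python
import io
import json
from typing import Iterator, TextIO


def iter_chunks(stream: TextIO) -> Iterator[dict]:
    """Yield one chunk record per non-empty line, never holding the whole file."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Hypothetical sample data standing in for var/knowledge/index.ndjson.
sample = io.StringIO(
    '{"chunk_id": "c1", "document_id": "d1", "version": 3, "text": "alpha", "meta": {}}\n'
    '{"chunk_id": "c2", "document_id": "d2", "version": 1, "text": "beta", "meta": {}}\n'
)
chunk_ids = [record["chunk_id"] for record in iter_chunks(sample)]
print(chunk_ids)
```

Because the generator yields records lazily, memory use stays proportional to a single line rather than to the size of the index.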

---

# 5. Index Metadata

## index_meta.json

Managed by:

- IndexMetaManager
- IndexConfiguration

Contains:

- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- scoring_version
- index_format

If the index configuration changes, a Global Reindex is required.

Guarded by:
`IndexStructureChangedException`
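
The guard can be pictured as a field-by-field comparison between the stored metadata and the current configuration. The sketch below is a Python stand-in, not the PHP implementation: the class `IndexConfig`, the helper names, the field types, and all sample values are assumptions. `index_version` is left out of the comparison here because it is a counter rather than a setting.

```python
from dataclasses import asdict, dataclass


class IndexStructureChangedError(Exception):
    """Python stand-in for the PHP IndexStructureChangedException."""


@dataclass(frozen=True)
class IndexConfig:
    # Field names follow index_meta.json; the types are assumed.
    embedding_model: str
    embedding_dimension: int
    chunk_size: int
    overlap: int
    scoring_version: int
    index_format: str


def changed_fields(stored: IndexConfig, current: IndexConfig) -> list[str]:
    """Return the names of the fields that differ, in declaration order."""
    old, new = asdict(stored), asdict(current)
    return [name for name in old if old[name] != new[name]]


def assert_structure_unchanged(stored: IndexConfig, current: IndexConfig) -> None:
    """Raise if any structural setting drifted since the index was built."""
    drift = changed_fields(stored, current)
    if drift:
        raise IndexStructureChangedError(
            "Global Reindex required; changed: " + ", ".join(drift)
        )


# Hypothetical values for illustration only.
stored = IndexConfig("model-a", 384, 800, 100, 1, "ndjson-v1")
current = IndexConfig("model-a", 384, 1000, 100, 1, "ndjson-v1")
print(changed_fields(stored, current))
```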

---

# 6. Ingest Pipeline

## 6.1 Core Services

| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |

---

## 6.2 Local Ingest

Used when:
- A single document version changes

Process:

1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely

index_version does NOT change.
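
Steps 4 and 5 can be sketched as a single streaming pass that drops the old chunks of one document and appends the new ones. This is a Python illustration under stated assumptions, not the `ChunkManager` implementation: the function name `replace_document_chunks` and the temp-file-plus-rename approach are choices made for this example. The FAISS rebuild (step 6) is not shown.

```python
import json
import os
import tempfile


def replace_document_chunks(index_path: str, document_id: str, new_chunks: list[dict]) -> None:
    """Drop all chunks of document_id, append new_chunks, then swap the file in."""
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, encoding="utf-8") as src:
                    for line in src:  # stream; never load the whole index
                        if line.strip() and json.loads(line)["document_id"] != document_id:
                            out.write(line if line.endswith("\n") else line + "\n")
            for chunk in new_chunks:
                out.write(json.dumps(chunk, ensure_ascii=False, sort_keys=True) + "\n")
        os.replace(tmp_path, index_path)  # readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "index.ndjson")
    replace_document_chunks(path, "d1", [{"chunk_id": "c1", "document_id": "d1", "version": 1, "text": "old"}])
    replace_document_chunks(path, "d2", [{"chunk_id": "c2", "document_id": "d2", "version": 1, "text": "other"}])
    # A new version of d1 arrives: c1 is removed, c3 is appended.
    replace_document_chunks(path, "d1", [{"chunk_id": "c3", "document_id": "d1", "version": 2, "text": "new"}])
    with open(path, encoding="utf-8") as handle:
        remaining_ids = [json.loads(line)["chunk_id"] for line in handle]
print(remaining_ids)
```

Writing to a temporary file in the same directory and renaming it keeps the index readable at every moment, which fits the rule that `index.ndjson` is only mutated in a controlled way.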

---

## 6.3 Global Reindex

Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes

Process:

1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
4. Increment index_version

---

# 7. Vector Architecture

## 7.1 vector_ingest.py

Responsibilities:

- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json

Execution:

```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```

Characteristics:

- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True
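
FAISS `IndexFlatIP` performs an exact, exhaustive inner-product search, and on L2-normalized vectors the inner product equals cosine similarity. The sketch below reproduces that behaviour in pure Python so it runs without FAISS or an embedding model; it is a stand-in for illustration, not the content of `vector_ingest.py`. The function names and the toy two-dimensional vectors are assumptions.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(index: list[list[float]], query: list[float], k: int) -> list[tuple[int, float]]:
    """Exhaustive top-k by inner product; ties are broken by insertion order."""
    q = l2_normalize(query)
    scored = [(inner_product(q, vec), pos) for pos, vec in enumerate(index)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(pos, score) for score, pos in scored[:k]]


# Toy "embeddings"; a real build would embed the chunk text in batches.
index = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
results = flat_ip_search(index, [2.0, 0.0], k=2)
print(results)
```

The explicit tie-break on insertion position matters for the determinism goal: two chunks with equal scores always come back in the same order.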

---

## 7.2 vector_search.py

Responsibilities:

- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON

Execution:

```bash
python vector_search.py "query" 5
```

Output:

```json
[
  { "chunk_id": "...", "score": 0.82 }
]
```

---

## 7.3 VectorSearchClient (PHP)

- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states
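
The bridge pattern itself (run the script, enforce a timeout, validate the JSON) is sketched below in Python to stay consistent with the other examples; the real client is a PHP class and may differ in every detail. `run_vector_search`, `VectorSearchError`, the default timeout, and the inline fake script are all hypothetical.

```python
import json
import subprocess
import sys


class VectorSearchError(RuntimeError):
    """Raised when the search script fails, times out, or returns bad output."""


def run_vector_search(command: list[str], timeout_seconds: float = 10.0) -> list[dict]:
    """Run the search command and return validated {chunk_id, score} hits."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds, check=False
        )
    except subprocess.TimeoutExpired as exc:
        raise VectorSearchError(f"search timed out after {timeout_seconds}s") from exc
    if completed.returncode != 0:
        raise VectorSearchError(
            f"search exited with code {completed.returncode}: {completed.stderr.strip()}"
        )
    try:
        hits = json.loads(completed.stdout)
        return [{"chunk_id": str(h["chunk_id"]), "score": float(h["score"])} for h in hits]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise VectorSearchError("search returned malformed output") from exc


# Stand-in for `python vector_search.py "query" 5`, so the sketch runs anywhere.
fake_script = 'import json; print(json.dumps([{"chunk_id": "c1", "score": 0.82}]))'
hits = run_vector_search([sys.executable, "-c", fake_script])
print(hits)
```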

---

# 8. Hybrid Retrieval

## 8.1 Components

| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |

---

## 8.2 Retrieval Flow

1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context

The keyword score remains the primary signal.
The vector score augments it with semantic similarity.
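
One simple way to realize steps 4 and 5 is a weighted sum in which the keyword weight dominates. The source does not state the fusion formula, so the sketch below is an assumption throughout: the function name `fuse_scores`, the max-normalization, the 0.7/0.3 weights, and the premise that both score sets are non-negative are illustrative only.

```python
def fuse_scores(
    keyword_scores: dict[str, float],
    vector_scores: dict[str, float],
    limit: int,
    keyword_weight: float = 0.7,  # assumed; keeps keyword as the primary signal
    vector_weight: float = 0.3,   # assumed; vector only augments
) -> list[tuple[str, float]]:
    """Combine both signals per chunk_id and return the top `limit` chunks."""

    def normalized(scores: dict[str, float]) -> dict[str, float]:
        # Scale to [0, 1] by the best score so the two signals are comparable.
        top = max(scores.values(), default=0.0)
        return {cid: (s / top if top > 0 else 0.0) for cid, s in scores.items()}

    kw, vec = normalized(keyword_scores), normalized(vector_scores)
    fused = {
        cid: keyword_weight * kw.get(cid, 0.0) + vector_weight * vec.get(cid, 0.0)
        for cid in kw.keys() | vec.keys()
    }
    # Sort by score, then by chunk_id, so equal scores rank deterministically.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:limit]


ranked = fuse_scores(
    keyword_scores={"c1": 4.0, "c2": 2.0},
    vector_scores={"c2": 0.9, "c3": 0.6},
    limit=2,
)
print(ranked)
```

In this toy input, `c1` has the best keyword match and no vector hit, yet it still outranks `c2`, which is the behaviour a keyword-primary fusion is meant to produce.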

---

# 9. Document Extraction

Supported via:

- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader

Extraction must return clean UTF-8 text.
Chunking must remain deterministic.
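
Deterministic chunking means that the same text and the same `chunk_size` / `overlap` settings always yield the same chunks. The Python sketch below shows a fixed-window splitter with that property. It is not the `SimpleChunker` implementation: splitting by characters, the function names, and deriving chunk IDs with name-based UUIDs (`uuid5`) are all assumptions made for this example.

```python
import uuid


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def make_chunk_id(document_id: str, version: int, position: int) -> str:
    """Derive a stable UUID from the chunk's coordinates (assumed scheme)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{document_id}/{version}/{position}"))


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
print(make_chunk_id("d1", 3, 0) == make_chunk_id("d1", 3, 0))
```

A name-based UUID avoids the randomness of `uuid4`, so rebuilding the index from the same documents produces the same IDs.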

---

# 10. Admin Layer (Symfony)

## Controllers

- DashboardController
- DocumentController
- IngestJobController
- SecurityController

## Entities

- Document
- DocumentVersion
- IngestJob
- User

## Repositories

- DocumentVersionRepository
- UserRepository

---

# 11. Concurrency & Locks

LockService ensures:

- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson

Locking is either file-based or service-based.
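
A file-based lock can be built on exclusive file creation, which fails if the lock file already exists. The Python sketch below illustrates that idea only; it is not the `LockService` implementation. The names `exclusive_lock` and `LockCollisionError` and the lock file name are hypothetical, and stale-lock recovery after a crash is deliberately left out.

```python
import os
import tempfile
from contextlib import contextmanager


class LockCollisionError(RuntimeError):
    """Raised when another process already holds the lock."""


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive lock for the duration of the `with` block."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockCollisionError(f"lock already held: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


with tempfile.TemporaryDirectory() as workdir:
    lock_path = os.path.join(workdir, "reindex.lock")
    with exclusive_lock(lock_path):
        try:
            with exclusive_lock(lock_path):  # simulates a second, parallel reindex
                second_acquired = True
        except LockCollisionError:
            second_acquired = False
    released = not os.path.exists(lock_path)
print(second_acquired, released)
```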

---

# 12. Determinism Rules

The system guarantees:

- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results

No randomness.
No adaptive mutation.
No auto-learning.
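
One way to check the first guarantee in a test suite is to fingerprint the index content and compare two builds. The Python sketch below hashes a canonical serialization of the chunk records. The function name `index_fingerprint`, the choice of SHA-256, and sorting by `chunk_id` are assumptions for this example; the real index may define its own record order.

```python
import hashlib
import json


def index_fingerprint(chunks: list[dict]) -> str:
    """SHA-256 over a canonical NDJSON serialization of the chunk records."""
    digest = hashlib.sha256()
    for chunk in sorted(chunks, key=lambda c: c["chunk_id"]):
        # sort_keys and fixed separators remove incidental formatting differences.
        line = json.dumps(chunk, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


build_a = [
    {"chunk_id": "c1", "document_id": "d1", "version": 1, "text": "alpha"},
    {"chunk_id": "c2", "document_id": "d1", "version": 1, "text": "beta"},
]
# Same records, but in a different list order and key order.
build_b = [
    {"text": "beta", "version": 1, "document_id": "d1", "chunk_id": "c2"},
    {"text": "alpha", "version": 1, "document_id": "d1", "chunk_id": "c1"},
]
same = index_fingerprint(build_a) == index_fingerprint(build_b)
print(same)
```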

---

# 13. LLM Integration

- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable

The LLM never has direct access to the full knowledge base.
Only retrieved chunks are injected.
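
A deterministic prompt builder is a pure function of the question and the retrieved chunks: no timestamps, no random ordering, no hidden state. The Python sketch below shows the idea. It is not the PHP `PromptBuilder`; the function name, the prompt wording, and the `max_chunks` parameter are hypothetical.

```python
def build_prompt(question: str, chunks: list[dict], max_chunks: int = 5) -> str:
    """Assemble a prompt that contains only the retrieved chunks, in rank order."""
    selected = chunks[:max_chunks]  # chunks are assumed to arrive already ranked
    context = "\n\n".join(
        f"[{number}] {chunk['text'].strip()}"
        for number, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )


retrieved = [
    {"chunk_id": "c1", "text": "alpha"},
    {"chunk_id": "c2", "text": "beta"},
    {"chunk_id": "c3", "text": "gamma"},
]
prompt = build_prompt("What is alpha?", retrieved, max_chunks=2)
print(prompt)
```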

---

# 14. Scalability

Designed for:

- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage

No full-array JSON loads.

---

# 15. Failure Modes

Handled via:

- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback

---

# 16. Non-Goals

This system intentionally does NOT include:

- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts

All structural changes require an explicit reindex.

---

# 17. Design Philosophy

This is a governance-first RAG architecture:

- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled

It prioritizes correctness and control over dynamic mutation.

---

# 18. Development Guidelines

When extending the system:

- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility

---

# 19. CLI Commands (Symfony)

Example:

```bash
php bin/console mto:agent:vector:ingest
```

Custom commands follow the namespace:

```
mto:agent:*
```

---

# 20. Summary

This system is a deterministic, enterprise-grade hybrid RAG engine with:

- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails

It is designed for controlled enterprise deployment, not experimental AI workflows.