# mitho AI Agent – Developer Deep Dive

Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

---

# 1. System Overview

This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:

- Symfony (PHP backend)
- NDJSON-based knowledge index
- Full FAISS vector rebuild strategy
- Hybrid retrieval (keyword + vector)
- Deterministic ingest pipeline
- Strict versioning & guardrails
- Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from `index.ndjson`.

---

# 2. High-Level Architecture

Query Flow:
User Query → Hybrid Retrieval → Context Assembly → Prompt Builder → LLM → Streaming Response (SSE)

Knowledge Flow:
Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval

---

# 3. Directory Structure (Knowledge Layer)

```
var/knowledge/
 ├── uploads/
 ├── chunks/
 ├── index.ndjson
 ├── index_meta.json
 ├── vector.index
 └── vector_meta.json
```

---

# 4. NDJSON Index

## 4.1 index.ndjson

- Single Source of Truth
- One JSON object per line
- Streaming-readable
- No JSON array wrapper
- Scales beyond 200k chunks

Each line contains:

```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
}
```

NDJSON enables:
- Append-based writes
- Compaction per document
- Memory-safe streaming
- Deterministic rebuilds
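
As a rough illustration of the streaming property, the sketch below reads `index.ndjson` one line at a time instead of loading it as a single JSON document. The function name `iter_chunks` is hypothetical and not part of this codebase.

```python
import json
from typing import Iterator


def iter_chunks(path: str) -> Iterator[dict]:
    """Yield one chunk record per non-empty line without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc
```

Because the function is a generator, memory use stays bounded by the largest single line rather than by the size of the index.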

---

# 5. Index Metadata

## index_meta.json

Managed by:

- IndexMetaManager
- IndexConfiguration

Contains:

- index_version
- embedding_model
- embedding_dimension
- chunk_size
- overlap
- scoring_version
- index_format

If any of these configuration values changes, a Global Reindex is required.

Guarded by:
`IndexStructureChangedException`
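
The guard can be pictured roughly as follows. This is a Python sketch of the idea only: the real check lives in the PHP classes named above, and `IndexStructureChanged`, `STRUCTURAL_KEYS`, and `assert_structure_unchanged` are hypothetical names. Excluding `index_version` from the comparison is also an assumption, made because it is a counter rather than a configuration value.

```python
class IndexStructureChanged(Exception):
    """Hypothetical Python stand-in for IndexStructureChangedException."""


# Keys taken from the metadata list above. index_version is left out
# (assumption): it is bumped by a Global Reindex, not configured.
STRUCTURAL_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "overlap",
    "scoring_version",
    "index_format",
)


def assert_structure_unchanged(stored_meta: dict, current_config: dict) -> None:
    """Raise if any structural key differs between stored meta and live config."""
    drifted = [
        key for key in STRUCTURAL_KEYS
        if stored_meta.get(key) != current_config.get(key)
    ]
    if drifted:
        raise IndexStructureChanged(
            "Global Reindex required; changed keys: " + ", ".join(drifted)
        )
```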

---

# 6. Ingest Pipeline

## 6.1 Core Services

| Service | Responsibility |
|----------|----------------|
| DocumentService | Document lifecycle |
| DocumentVersionRepository | Version persistence |
| KnowledgeIngestService | Chunk generation |
| SimpleChunker | Deterministic splitting |
| TextNormalizer | Text cleanup |
| StopWords | Keyword filtering |
| ChunkManager | NDJSON append + compaction |
| ChunkWriter | Chunk persistence |
| IngestFlow | Step orchestration |
| IngestOrchestrator | Full ingest coordination |
| IngestJobService | Job tracking |
| LockService | Concurrency guard |

---

## 6.2 Local Ingest

Used when:
- A single document version changes

Process:

1. Extract document
2. Normalize text
3. Chunk deterministically
4. Remove previous chunks of document_id
5. Append new chunks to index.ndjson
6. Rebuild FAISS completely

index_version does NOT change.
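
Steps 4 and 5 might look roughly like the sketch below, which folds removal and append into a single streaming pass and swaps the result in atomically. It is an illustration in Python, not the `ChunkManager` implementation; `replace_document_chunks` is a hypothetical name, and the temp-file-plus-rename strategy is an assumption.

```python
import json
import os
import tempfile


def replace_document_chunks(index_path: str, document_id: str, new_chunks: list) -> None:
    """Rewrite index_path without document_id's old chunks, then append new ones."""
    directory = os.path.dirname(os.path.abspath(index_path))
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, "r", encoding="utf-8") as src:
                    for line in src:
                        if not line.strip():
                            continue
                        # Keep every chunk that belongs to a different document.
                        if json.loads(line).get("document_id") != document_id:
                            out.write(line if line.endswith("\n") else line + "\n")
            for chunk in new_chunks:
                out.write(json.dumps(chunk, ensure_ascii=False) + "\n")
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The FAISS rebuild in step 6 is a separate step and is not covered here.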

---

## 6.3 Global Reindex

Used when:
- Embedding model changes
- Chunk size changes
- Overlap changes
- Scoring logic changes
- index_format changes

Process:

1. Re-extract all active document versions
2. Recreate full index.ndjson
3. Rebuild FAISS
4. index_version++

---

# 7. Vector Architecture

## 7.1 vector_ingest.py

Responsibilities:

- Stream-read index.ndjson
- Extract text + chunk_id
- Build embeddings
- Normalize embeddings
- Build FAISS IndexFlatIP
- Write vector.index
- Write vector_meta.json

Execution:

```bash
python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index
```

Characteristics:

- No partial updates
- No incremental mutation
- Always full rebuild
- Batch size = 64
- normalize_embeddings=True
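
Normalization matters because an inner-product index over unit-length vectors ranks by cosine similarity. The pure-Python sketch below illustrates that relationship with a brute-force search; it is a stand-in for the idea behind `IndexFlatIP`, not FAISS itself, and all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, vectors, k):
    """Return (position, score) pairs for the k highest inner products."""
    q = l2_normalize(query)
    scored = [(i, inner_product(q, l2_normalize(v))) for i, v in enumerate(vectors)]
    # Sort by score descending, then by position so ties resolve deterministically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

With normalized inputs every score falls in the range -1 to 1, which is consistent with the `0.82` style scores shown in the search output.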

---

## 7.2 vector_search.py

Responsibilities:

- Load vector.index
- Load vector_meta.json
- Encode query
- Search top-K
- Return JSON

Execution:

```bash
python vector_search.py "query" 5
```

Output:

```json
[
  { "chunk_id": "...", "score": 0.82 }
]
```

---

## 7.3 VectorSearchClient (PHP)

- Executes Python search script
- Parses JSON response
- Returns structured results
- Handles timeout + error states
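
The bridge pattern can be sketched as below. The actual client is PHP; this Python version only illustrates the same steps (spawn the script, enforce a timeout, validate the JSON). `run_vector_search`, `VectorSearchError`, and the 10-second default timeout are assumptions, while the argument order follows the `vector_search.py "query" 5` invocation shown in 7.2.

```python
import json
import subprocess
import sys


class VectorSearchError(RuntimeError):
    """Hypothetical error type for any failure of the search subprocess."""


def run_vector_search(script_path, query, top_k, timeout_seconds=10.0):
    """Run the search script and return a list of {chunk_id, score} dicts."""
    command = [sys.executable, script_path, query, str(top_k)]
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        raise VectorSearchError(f"search timed out after {timeout_seconds}s") from exc
    if completed.returncode != 0:
        raise VectorSearchError(f"search failed: {completed.stderr.strip()}")
    try:
        results = json.loads(completed.stdout)
    except json.JSONDecodeError as exc:
        raise VectorSearchError("search returned invalid JSON") from exc
    if not isinstance(results, list):
        raise VectorSearchError("expected a JSON array of results")
    return [
        {"chunk_id": str(item["chunk_id"]), "score": float(item["score"])}
        for item in results
    ]
```

Passing the command as a list avoids shell quoting issues when the query contains spaces or quotes.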

---

# 8. Hybrid Retrieval

## 8.1 Components

| Class | Role |
|--------|------|
| NdjsonHybridRetriever | Orchestrator |
| NdjsonKeywordSearch | Keyword scoring |
| NdjsonChunkLookup | Chunk resolution |
| VectorSearchClient | Vector bridge |
| CachedRetriever | Cache layer |

---

## 8.2 Retrieval Flow

1. Extract terms (StopWords + normalization)
2. Keyword scoring
3. Vector search
4. Score fusion
5. Limit to N chunks
6. Resolve chunk text
7. Build LLM context

The keyword score remains the primary signal.
The vector score augments it with semantic similarity.
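
A minimal sketch of steps 1, 2, 4, and 5 follows. This README does not state the actual scoring or fusion formula, so everything here is an assumption for illustration: the stop-word list, the term-overlap keyword score, the weighted sum, and the 0.7 / 0.3 weights (chosen only so that the keyword score dominates).

```python
import re

# Hypothetical stop-word list; the real StopWords service is not shown here.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def extract_terms(query):
    """Lowercase the query, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keyword_score(terms, text):
    """Fraction of query terms that occur in the chunk text (0.0 to 1.0)."""
    if not terms:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    return sum(1 for t in terms if t in words) / len(terms)


def fuse(keyword_scores, vector_scores, keyword_weight=0.7, vector_weight=0.3, limit=5):
    """Combine per-chunk scores with a weighted sum and keep the top `limit`."""
    chunk_ids = set(keyword_scores) | set(vector_scores)
    fused = [
        (
            cid,
            keyword_weight * keyword_scores.get(cid, 0.0)
            + vector_weight * vector_scores.get(cid, 0.0),
        )
        for cid in chunk_ids
    ]
    # Sort by score descending, then by chunk_id so ties are deterministic.
    fused.sort(key=lambda pair: (-pair[1], pair[0]))
    return fused[:limit]
```

Sorting on `chunk_id` as a tie-breaker keeps the ordering stable across runs, which matters for the determinism rules in section 12.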

---

# 9. Document Extraction

Supported via:

- DocumentExtractorInterface
- ExtractorResolver
- PdfExtractor
- DocumentLoader

Extraction must return clean UTF-8 text.
Chunking must remain deterministic.
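
To make "deterministic" concrete, here is a minimal sketch of normalization followed by fixed-window chunking. It is not the `TextNormalizer` or `SimpleChunker` implementation: character-based windows, NFC normalization, and whitespace collapsing are all assumptions. The point is only that the output depends on nothing but the input text, `chunk_size`, and `overlap`.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC and collapse every whitespace run to a single space."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 1)` advances three characters per window and returns `["abcd", "defg", "ghij"]`.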

---

# 10. Admin Layer (Symfony)

## Controllers

- DashboardController
- DocumentController
- IngestJobController
- SecurityController

## Entities

- Document
- DocumentVersion
- IngestJob
- User

## Repositories

- DocumentVersionRepository
- UserRepository

---

# 11. Concurrency & Locks

LockService ensures:

- No parallel reindex
- No parallel ingest conflict
- Controlled mutation of index.ndjson

Locking is either file-based or service-based.
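
The file-based variant could work along the lines of the sketch below, which relies on exclusive file creation being atomic. This is a Python illustration, not the `LockService` code; `exclusive_lock` and `LockHeldError` are hypothetical names, and stale locks left behind by a crashed process are deliberately not handled.

```python
import os
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another process already holds the lock (hypothetical name)."""


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockHeldError(f"lock already held: {lock_path}") from exc
    try:
        # Record the owner PID to help with manual diagnosis.
        os.write(fd, str(os.getpid()).encode("ascii"))
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A reindex would then wrap its whole critical section in `with exclusive_lock(path):` and treat `LockHeldError` as a lock collision.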

---

# 12. Determinism Rules

The system guarantees:

- Same documents + same config = identical index.ndjson
- Same index.ndjson = identical FAISS
- Same query + same index = identical results

No randomness.
No adaptive mutation.
No auto-learning.
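
One way to check the first guarantee in practice is to hash the index after two rebuilds from the same inputs and compare the digests. The helper below is a sketch under that assumption; `file_sha256` is a hypothetical name. A byte-for-byte match also presumes that every field written to `index.ndjson`, including the chunk IDs, is derived deterministically.

```python
import hashlib


def file_sha256(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```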

---

# 13. LLM Integration

- Context strictly limited to retrieved chunks
- PromptBuilder constructs deterministic system prompt
- ContextService manages history
- SSE streaming enabled
- Model endpoint configurable

The LLM never has direct access to the full knowledge base.
Only retrieved chunks are injected.
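
Context assembly under this rule can be sketched as follows. The character budget, the `[chunk_id]` labelling, and the name `build_context` are assumptions for illustration; the actual `PromptBuilder` format is not documented here.

```python
def build_context(chunks, max_chars):
    """Join retrieved chunk texts, in order, until max_chars would be exceeded."""
    parts = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['chunk_id']}]\n{chunk['text']}"
        cost = len(block) + (2 if parts else 0)  # account for the separator
        if used + cost > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

Because chunks are taken strictly in ranked order and cut off at a fixed budget, the same retrieval result always yields the same context string.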

---

# 14. Scalability

Designed for:

- >200k chunks
- Streaming NDJSON reads
- Full FAISS rebuild
- Cache layer for retrieval
- Controlled memory usage

No full-array JSON loads.

---

# 15. Failure Modes

Handled via:

- Missing vector index detection
- Structure drift detection
- Lock collision detection
- Embedding dependency checks
- Python execution errors
- Empty chunk fallback

---

# 16. Non-Goals

This system intentionally does NOT include:

- Online learning
- Embedding mutation
- Incremental FAISS update
- Auto chunk merging
- Self-modifying prompts

All structural changes require explicit reindex.

---

# 17. Design Philosophy

This is a governance-first RAG architecture:

- Deterministic
- Reproducible
- Drift-safe
- Audit-friendly
- Version-controlled

It prioritizes correctness and control over dynamic mutation.

---

# 18. Development Guidelines

When extending the system:

- Never mutate FAISS directly
- Never edit index.ndjson manually
- Always preserve determinism
- Increment index_version only via Global Reindex
- Guard all structural changes
- Maintain streaming compatibility

---

# 19. CLI Commands (Symfony)

Example:

```bash
php bin/console mto:agent:vector:ingest
```

Custom commands follow namespace:

```
mto:agent:*
```

---

# 20. Summary

This system is a deterministic, enterprise-grade hybrid RAG engine with:

- NDJSON-based streaming index
- Full FAISS rebuild strategy
- Structured ingest pipeline
- Hybrid retrieval
- Admin governance layer
- Strict guardrails

It is designed for controlled enterprise deployment, not experimental AI workflows.