mitho AI Agent Developer Deep Dive

Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)


1. System Overview

This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:

  • Symfony (PHP backend)
  • NDJSON-based knowledge index
  • Full FAISS vector rebuild strategy
  • Hybrid retrieval (keyword + vector)
  • Deterministic ingest pipeline
  • Strict versioning & guardrails
  • Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from index.ndjson.


2. High-Level Architecture

User Query → Hybrid Retrieval → Context Assembly → Prompt Builder → LLM → Streaming Response (SSE)

Knowledge Flow: Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval


3. Directory Structure (Knowledge Layer)

var/knowledge/
 ├── uploads/
 ├── chunks/
 ├── index.ndjson
 ├── index_meta.json
 ├── vector.index
 └── vector_meta.json

4. NDJSON Index

4.1 index.ndjson

  • Single Source of Truth
  • One JSON object per line
  • Streaming-readable
  • No JSON array wrapper
  • Scales beyond 200k chunks

Each line contains:

{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
}

NDJSON enables:

  • Append-based writes
  • Compaction per document
  • Memory-safe streaming
  • Deterministic rebuilds

5. Index Metadata

index_meta.json

Managed by:

  • IndexMetaManager
  • IndexConfiguration

Contains:

  • index_version
  • embedding_model
  • embedding_dimension
  • chunk_size
  • overlap
  • scoring_version
  • index_format

If configuration changes → Global Reindex required.

Guarded by: IndexStructureChangedException


6. Ingest Pipeline

6.1 Core Services

Service Responsibility
DocumentService Document lifecycle
DocumentVersionRepository Version persistence
KnowledgeIngestService Chunk generation
SimpleChunker Deterministic splitting
TextNormalizer Text cleanup
StopWords Keyword filtering
ChunkManager NDJSON append + compaction
ChunkWriter Chunk persistence
IngestFlow Step orchestration
IngestOrchestrator Full ingest coordination
IngestJobService Job tracking
LockService Concurrency guard

6.2 Local Ingest

Used when:

  • A single document version changes

Process:

  1. Extract document
  2. Normalize text
  3. Chunk deterministically
  4. Remove previous chunks of document_id
  5. Append new chunks to index.ndjson
  6. Rebuild FAISS completely

index_version does NOT change.


6.3 Global Reindex

Used when:

  • Embedding model changes
  • Chunk size changes
  • Overlap changes
  • Scoring logic changes
  • index_format changes

Process:

  1. Re-extract all active document versions
  2. Recreate full index.ndjson
  3. Rebuild FAISS
  4. index_version++

7. Vector Architecture

7.1 vector_ingest.py

Responsibilities:

  • Stream-read index.ndjson
  • Extract text + chunk_id
  • Build embeddings
  • Normalize embeddings
  • Build FAISS IndexFlatIP
  • Write vector.index
  • Write vector.meta.json

Execution:

python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index

Characteristics:

  • No partial updates
  • No incremental mutation
  • Always full rebuild
  • Batch size = 64
  • normalize_embeddings=True

7.2 vector_search.py

Responsibilities:

  • Load vector.index
  • Load vector_meta.json
  • Encode query
  • Search top-K
  • Return JSON

Execution:

python vector_search.py "query" 5

Output:

[
  { "chunk_id": "...", "score": 0.82 }
]

7.3 VectorSearchClient (PHP)

  • Executes Python search script
  • Parses JSON response
  • Returns structured results
  • Handles timeout + error states

8. Hybrid Retrieval

8.1 Components

Class Role
NdjsonHybridRetriever Orchestrator
NdjsonKeywordSearch Keyword scoring
NdjsonChunkLookup Chunk resolution
VectorSearchClient Vector bridge
CachedRetriever Cache layer

8.2 Retrieval Flow

  1. Extract terms (StopWords + normalization)
  2. Keyword scoring
  3. Vector search
  4. Score fusion
  5. Limit to N chunks
  6. Resolve chunk text
  7. Build LLM context

Keyword score remains primary signal. Vector score augments semantic similarity.


9. Document Extraction

Supported via:

  • DocumentExtractorInterface
  • ExtractorResolver
  • PdfExtractor
  • DocumentLoader

Extraction must return clean UTF-8 text. Chunking must remain deterministic.


10. Admin Layer (Symfony)

Controllers

  • DashboardController
  • DocumentController
  • IngestJobController
  • SecurityController

Entities

  • Document
  • DocumentVersion
  • IngestJob
  • User

Repositories

  • DocumentVersionRepository
  • UserRepository

11. Concurrency & Locks

LockService ensures:

  • No parallel reindex
  • No parallel ingest conflict
  • Controlled mutation of index.ndjson

File-based or service-based locking.


12. Determinism Rules

The system guarantees:

  • Same documents + same config = identical index.ndjson
  • Same index.ndjson = identical FAISS
  • Same query + same index = identical results

No randomness. No adaptive mutation. No auto-learning.


13. LLM Integration

  • Context strictly limited to retrieved chunks
  • PromptBuilder constructs deterministic system prompt
  • ContextService manages history
  • SSE streaming enabled
  • Model endpoint configurable

LLM never has direct access to full knowledge base. Only retrieved chunks are injected.


14. Scalability

Designed for:

  • 200k chunks

  • Streaming NDJSON reads
  • Full FAISS rebuild
  • Cache layer for retrieval
  • Controlled memory usage

No full-array JSON loads.


15. Failure Modes

Handled via:

  • Missing vector index detection
  • Structure drift detection
  • Lock collision detection
  • Embedding dependency checks
  • Python execution errors
  • Empty chunk fallback

16. Non-Goals

This system intentionally does NOT include:

  • Online learning
  • Embedding mutation
  • Incremental FAISS update
  • Auto chunk merging
  • Self-modifying prompts

All structural changes require explicit reindex.


17. Design Philosophy

This is a governance-first RAG architecture:

  • Deterministic
  • Reproducible
  • Drift-safe
  • Audit-friendly
  • Version-controlled

It prioritizes correctness and control over dynamic mutation.


18. Development Guidelines

When extending the system:

  • Never mutate FAISS directly
  • Never edit index.ndjson manually
  • Always preserve determinism
  • Increment index_version only via Global Reindex
  • Guard all structural changes
  • Maintain streaming compatibility

19. CLI Commands (Symfony)

Example:

php bin/console mto:agent:vector:ingest

Custom commands follow namespace:

mto:agent:*

20. Summary

This system is a deterministic, enterprise-grade hybrid RAG engine with:

  • NDJSON-based streaming index
  • Full FAISS rebuild strategy
  • Structured ingest pipeline
  • Hybrid retrieval
  • Admin governance layer
  • Strict guardrails

It is designed for controlled enterprise deployment, not experimental AI workflows.

Description
No description provided
Readme 26 MiB
Languages
PHP 79%
Twig 13%
HTML 2.8%
Python 2.4%
JavaScript 1.8%
Other 1%