marek/MtoRagSystem

README.md

mitho AI Agent – Developer Deep Dive

Enterprise Hybrid RAG System (Symfony + NDJSON + FAISS)

1. System Overview

This system implements a deterministic, governance-stable Retrieval Augmented Generation (RAG) architecture based on:

Symfony (PHP backend)
NDJSON-based knowledge index
Full FAISS vector rebuild strategy
Hybrid retrieval (keyword + vector)
Deterministic ingest pipeline
Strict versioning & guardrails
Lock-based reindex protection

No incremental vector mutation is allowed.
FAISS is always rebuilt from index.ndjson.

2. High-Level Architecture

User Query → Hybrid Retrieval → Context Assembly → Prompt Builder → LLM → Streaming Response (SSE)

Knowledge Flow: Document → Version → Extract → Chunk → NDJSON → FAISS → Retrieval

3. Directory Structure (Knowledge Layer)

var/knowledge/
 ├── uploads/
 ├── chunks/
 ├── index.ndjson
 ├── index_meta.json
 ├── vector.index
 └── vector_meta.json

4. NDJSON Index

4.1 index.ndjson

Single Source of Truth
One JSON object per line
Streaming-readable
No JSON array wrapper
Scales beyond 200k chunks

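Each line contains one object with the following fields (pretty-printed here for readability; in the file each object occupies a single line):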

{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "version": 3,
  "text": "...",
  "meta": { ... }
}

NDJSON enables:

Append-based writes
Compaction per document
Memory-safe streaming
Deterministic rebuilds
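
The streaming and append properties listed above can be illustrated with a minimal Python sketch. The function names `iter_chunks` and `append_chunk` are hypothetical and are not part of the repository; the sketch only assumes the one-object-per-line layout described in this section.

```python
import json
from typing import Dict, Iterator


def iter_chunks(path: str) -> Iterator[Dict]:
    """Yield one chunk record per non-empty line without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON line") from exc


def append_chunk(path: str, record: Dict) -> None:
    """Append a single record as one line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the reader is a generator, memory use stays proportional to one line rather than to the size of the index.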

5. Index Metadata

index_meta.json

Managed by:

IndexMetaManager
IndexConfiguration

Contains:

index_version
embedding_model
embedding_dimension
chunk_size
overlap
scoring_version
index_format

If any of these configuration values changes, a Global Reindex is required.

Guarded by: IndexStructureChangedException
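
A drift guard of this kind could look roughly like the following Python sketch. It is not the PHP implementation: `IndexStructureChanged` is a stand-in for IndexStructureChangedException, and treating every field except index_version as structural is an assumption drawn from the field list above.

```python
from typing import Dict, List


class IndexStructureChanged(Exception):
    """Python stand-in for the PHP IndexStructureChangedException."""


# Assumption: every metadata field except index_version is structural.
STRUCTURAL_KEYS = (
    "embedding_model",
    "embedding_dimension",
    "chunk_size",
    "overlap",
    "scoring_version",
    "index_format",
)


def changed_keys(stored_meta: Dict, current_config: Dict) -> List[str]:
    """Return the structural keys whose stored and current values differ."""
    return [
        key
        for key in STRUCTURAL_KEYS
        if stored_meta.get(key) != current_config.get(key)
    ]


def assert_structure_unchanged(stored_meta: Dict, current_config: Dict) -> None:
    """Raise if the running configuration no longer matches index_meta.json."""
    drift = changed_keys(stored_meta, current_config)
    if drift:
        raise IndexStructureChanged(
            "Global Reindex required; changed: " + ", ".join(drift)
        )
```

Calling the guard before any ingest or retrieval keeps a changed setting from being applied silently to an index built under the old one.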

6. Ingest Pipeline

6.1 Core Services

Service	Responsibility
DocumentService	Document lifecycle
DocumentVersionRepository	Version persistence
KnowledgeIngestService	Chunk generation
SimpleChunker	Deterministic splitting
TextNormalizer	Text cleanup
StopWords	Keyword filtering
ChunkManager	NDJSON append + compaction
ChunkWriter	Chunk persistence
IngestFlow	Step orchestration
IngestOrchestrator	Full ingest coordination
IngestJobService	Job tracking
LockService	Concurrency guard

6.2 Local Ingest

Used when:

A single document version changes

Process:

Extract document
Normalize text
Chunk deterministically
Remove previous chunks of document_id
Append new chunks to index.ndjson
Rebuild FAISS completely

index_version does NOT change.
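
The remove-then-append steps of the process above can be sketched in Python as a single streaming rewrite. This is an illustration under assumptions, not the ChunkManager code: the function name `replace_document_chunks` is hypothetical, and the temp-file-plus-rename approach is one possible way to avoid a half-written index. The FAISS rebuild that follows is not shown.

```python
import json
import os
import tempfile
from typing import Dict, Iterable


def replace_document_chunks(
    index_path: str, document_id: str, new_chunks: Iterable[Dict]
) -> int:
    """Drop all chunks of document_id, append new_chunks, return the number dropped."""
    removed = 0
    directory = os.path.dirname(os.path.abspath(index_path))
    # Write to a temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(index_path):
                with open(index_path, "r", encoding="utf-8") as src:
                    for line in src:
                        if not line.strip():
                            continue
                        if json.loads(line).get("document_id") == document_id:
                            removed += 1  # previous chunk of this document
                            continue
                        out.write(line if line.endswith("\n") else line + "\n")
            for chunk in new_chunks:
                out.write(json.dumps(chunk, ensure_ascii=False) + "\n")
        os.replace(tmp_path, index_path)  # swap in the new index in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return removed
```

Chunks of other documents keep their relative order, which helps keep the resulting file reproducible.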

6.3 Global Reindex

Used when:

Embedding model changes
Chunk size changes
Overlap changes
Scoring logic changes
index_format changes

Process:

Re-extract all active document versions
Recreate full index.ndjson
Rebuild FAISS
index_version++

7. Vector Architecture

7.1 vector_ingest.py

Responsibilities:

Stream-read index.ndjson
Extract text + chunk_id
Build embeddings
Normalize embeddings
Build FAISS IndexFlatIP
Write vector.index
Write vector_meta.json

Execution:

python vector_ingest.py --index path/to/index.ndjson --out path/to/vector.index

Characteristics:

No partial updates
No incremental mutation
Always full rebuild
Batch size = 64
normalize_embeddings=True
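
Why normalization matters here: IndexFlatIP performs exact inner-product search, and for unit-length vectors the inner product equals cosine similarity. The pure-Python sketch below illustrates that relationship without FAISS; the function names are hypothetical and the code is a teaching stand-in, not what vector_ingest.py does internally.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def top_k_inner_product(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Brute-force inner-product search over already normalized vectors."""
    q = l2_normalize(query)
    scored = [
        (position, sum(a * b for a, b in zip(q, vec)))
        for position, vec in enumerate(vectors)
    ]
    # Highest score first; ties broken by position so the ranking is deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

The returned positions would map back to chunk_id values through a side table, which is presumably the role of vector_meta.json.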

7.2 vector_search.py

Responsibilities:

Load vector.index
Load vector_meta.json
Encode query
Search top-K
Return JSON

Execution:

python vector_search.py "query" 5

Output:

[
  { "chunk_id": "...", "score": 0.82 }
]

7.3 VectorSearchClient (PHP)

Executes Python search script
Parses JSON response
Returns structured results
Handles timeout + error states
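
The same bridge pattern, written in Python for illustration only (the real client is a PHP class): run the search script as a subprocess, enforce a timeout, and turn every failure into one error type. `run_vector_search`, `VectorSearchError` and the 10-second default are hypothetical names and values.

```python
import json
import subprocess
from typing import Dict, List, Sequence


class VectorSearchError(RuntimeError):
    """Raised when the search process times out, fails, or returns bad output."""


def run_vector_search(
    command: Sequence[str], timeout_seconds: float = 10.0
) -> List[Dict]:
    """Run the search command and return rows of {'chunk_id', 'score'}."""
    try:
        completed = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired as exc:
        raise VectorSearchError("vector search timed out") from exc
    if completed.returncode != 0:
        raise VectorSearchError(
            f"exit code {completed.returncode}: {completed.stderr.strip()}"
        )
    try:
        rows = json.loads(completed.stdout)
        return [
            {"chunk_id": str(row["chunk_id"]), "score": float(row["score"])}
            for row in rows
        ]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise VectorSearchError("vector search returned malformed output") from exc
```

A call such as `run_vector_search(["python", "vector_search.py", "query", "5"])` would mirror the command line shown in 7.2.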

8. Hybrid Retrieval

8.1 Components

Class	Role
NdjsonHybridRetriever	Orchestrator
NdjsonKeywordSearch	Keyword scoring
NdjsonChunkLookup	Chunk resolution
VectorSearchClient	Vector bridge
CachedRetriever	Cache layer

8.2 Retrieval Flow

Extract terms (StopWords + normalization)
Keyword scoring
Vector search
Score fusion
Limit to N chunks
Resolve chunk text
Build LLM context

The keyword score remains the primary signal; the vector score augments it with semantic similarity.
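
One way to express "keyword primary, vector augmenting" is a weighted sum in which the keyword weight dominates. The sketch below is an assumption about how such a fusion could look, not the scoring used by NdjsonHybridRetriever; the weights 0.7 and 0.3, the max-normalization of keyword scores, and the name `fuse_scores` are all hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    keyword_weight: float = 0.7,  # assumed weight, keyword signal dominates
    vector_weight: float = 0.3,   # assumed weight, vector signal augments
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Combine both signals per chunk_id and return the top `limit` chunks."""
    max_keyword = max(keyword_scores.values(), default=0.0)
    fused = {}
    for chunk_id in set(keyword_scores) | set(vector_scores):
        # Scale keyword scores to 0..1 so both signals share a range.
        keyword = (
            keyword_scores.get(chunk_id, 0.0) / max_keyword
            if max_keyword > 0
            else 0.0
        )
        vector = max(vector_scores.get(chunk_id, 0.0), 0.0)
        fused[chunk_id] = keyword_weight * keyword + vector_weight * vector
    # Sort by score, then by chunk_id, so equal scores always rank the same way.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The explicit tie-break on chunk_id keeps the ranking stable across runs, in line with the determinism rules of this system.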

9. Document Extraction

Supported via:

DocumentExtractorInterface
ExtractorResolver
PdfExtractor
DocumentLoader

Extraction must return clean UTF-8 text. Chunking must remain deterministic.
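
Deterministic chunking means the output depends only on the input text and on chunk_size and overlap. A minimal fixed-window sketch is shown below; it assumes sizes are measured in characters, which the source does not state, and `chunk_text` is a hypothetical name rather than the SimpleChunker API.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 1)` yields three windows, each starting with the last character of the previous one.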

10. Admin Layer (Symfony)

Controllers

DashboardController
DocumentController
IngestJobController
SecurityController

Entities

Document
DocumentVersion
IngestJob
User

Repositories

DocumentVersionRepository
UserRepository

11. Concurrency & Locks

LockService ensures:

No parallel reindex
No parallel ingest conflict
Controlled mutation of index.ndjson

Locking is either file-based or service-based.
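
A file-based variant can be sketched in Python with an exclusively created lock file. This is an illustration of the technique, not the LockService implementation; `exclusive_lock` and `LockCollision` are hypothetical names, and stale-lock recovery after a crash is deliberately left out.

```python
import os
from contextlib import contextmanager
from typing import Iterator


class LockCollision(RuntimeError):
    """Raised when another process already holds the lock."""


@contextmanager
def exclusive_lock(lock_path: str) -> Iterator[None]:
    """Hold an exclusive lock for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise LockCollision(f"lock already held: {lock_path}") from exc
    try:
        os.write(fd, str(os.getpid()).encode("ascii"))  # record the owner
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Wrapping both ingest and reindex in the same lock path would serialize every mutation of index.ndjson.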

12. Determinism Rules

The system guarantees:

Same documents + same config = identical index.ndjson
Same index.ndjson = identical FAISS
Same query + same index = identical results

No randomness. No adaptive mutation. No auto-learning.
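
One way to check the first rule in practice is to build the index twice from the same inputs and compare file digests. The helper below is a hypothetical sketch using only the Python standard library; it streams the file so that large indexes do not need to fit in memory.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Two builds of index.ndjson that satisfy the rule would produce equal digests; a mismatch points to a non-deterministic step in the ingest pipeline.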

13. LLM Integration

Context strictly limited to retrieved chunks
PromptBuilder constructs deterministic system prompt
ContextService manages history
SSE streaming enabled
Model endpoint configurable

LLM never has direct access to full knowledge base. Only retrieved chunks are injected.

14. Scalability

Designed for:

200k chunks
Streaming NDJSON reads
Full FAISS rebuild
Cache layer for retrieval
Controlled memory usage

No full-array JSON loads.

15. Failure Modes

Handled via:

Missing vector index detection
Structure drift detection
Lock collision detection
Embedding dependency checks
Python execution error handling
Empty chunk fallback

16. Non-Goals

This system intentionally does NOT include:

Online learning
Embedding mutation
Incremental FAISS update
Auto chunk merging
Self-modifying prompts

All structural changes require explicit reindex.

17. Design Philosophy

This is a governance-first RAG architecture:

Deterministic
Reproducible
Drift-safe
Audit-friendly
Version-controlled

It prioritizes correctness and control over dynamic mutation.

18. Development Guidelines

When extending the system:

Never mutate FAISS directly
Never edit index.ndjson manually
Always preserve determinism
Increment index_version only via Global Reindex
Guard all structural changes
Maintain streaming compatibility

19. CLI Commands (Symfony)

Example:

php bin/console mto:agent:vector:ingest

Custom commands follow the namespace:

mto:agent:*

20. Summary

This system is a deterministic, enterprise-grade hybrid RAG engine with:

NDJSON-based streaming index
Full FAISS rebuild strategy
Structured ingest pipeline
Hybrid retrieval
Admin governance layer
Strict guardrails

It is designed for controlled enterprise deployment, not experimental AI workflows.
