MtoRagSystem/patch_history/RETRIEX_LANGUAGE_CLEANUP_GUIDE.md
2026-05-04 19:15:22 +02:00


# RetrieX Language Cleanup Guide
Status: binding for RetrieX 1.5.3+ cleanup-profile work.
This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.
## 1. Central language cleanup lives in `language.yaml`
Use `config/retriex/language.yaml` for generic language noise only.
Allowed here:
- German function words: `der`, `die`, `das`, `ein`, `eine`, `mit`, `und`, `oder`, `ist`, `sind`, `kann`
- conversation filler words: `bitte`, `mal`, `gerne`, `noch`, `dazu`, `also`
- user instruction phrases: `ich suche`, `suche nach`, `zeige mir`, `gib mir`, `habt ihr`, `gibt es`
- presentation/meta terms: `tabelle`, `liste`, `übersicht`, `tabellarisch`, `auflistung`
- protected terms that must not be removed generically
Do not add product families, measurement parameters, intent terms or shop semantics here.
## 2. Use cleanup profiles instead of copying lists
Domain configs should reference a cleanup profile whenever they need generic language cleanup.
Current profiles:
- `commerce_query`: cleanup for shop/search query text
- `rag_evidence`: cleanup for evidence/answer-consistency checks
- `shop_context_fallback`: cleanup for history-based shop context fallback
Preferred pattern:
```yaml
cleanup_profile: commerce_query
```
Avoid adding the same generic words again to `commerce.yaml`, `agent.yaml`, `retrieval.yaml` or `intent.yaml`.
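
The rule against copied lists can be checked mechanically. The following is a minimal sketch, not part of RetrieX: the function name `find_copied_generic_words` and the in-memory dictionaries are hypothetical, and a real check would load the token lists from the YAML files instead of hard-coding them.

```python
from typing import Dict, Iterable, List

# Generic words quoted in section 1 of this guide. Hard-coded for the sketch;
# the authoritative list lives in config/retriex/language.yaml.
GENERIC_WORDS = frozenset({
    "der", "die", "das", "ein", "eine", "mit", "und", "oder",
    "bitte", "mal", "gerne",
})


def find_copied_generic_words(
    domain_lists: Dict[str, Iterable[str]],
    generic: Iterable[str] = GENERIC_WORDS,
) -> Dict[str, List[str]]:
    """Return, per domain config, the generic words it repeats."""
    generic_set = {word.lower() for word in generic}
    report: Dict[str, List[str]] = {}
    for config_name, tokens in domain_lists.items():
        copied = sorted({token.lower() for token in tokens} & generic_set)
        if copied:
            report[config_name] = copied
    return report


# Hypothetical token lists, standing in for parsed domain configs.
example = {
    "commerce.yaml": ["shop", "preis", "bitte", "und"],
    "intent.yaml": ["kaufen", "bestellen"],
}
report = find_copied_generic_words(example)
print(report)
```

Any config that shows up in the report is a candidate for replacing its copied words with a `cleanup_profile` reference.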
## 3. Keep domain semantics in domain configs
These belong outside `language.yaml`:
- commerce intent terms: `shop`, `produkt`, `artikel`, `preis`, `kosten`, `kaufen`, `bestellen`
- measurement/domain terms: `wasserhärte`, `chlor`, `redox`, `leitfähigkeit`, `ph`, `rx`, `th`, `tc`
- product-role terms: device, accessory, reagent, spare part and document-role vocabulary
- routing and answer behavior rules
- prompt-specific role or grounding rules
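
The split between sections 1 and 3 amounts to a lookup by token ownership. The sketch below only illustrates that decision with tokens quoted in this guide; `owning_config` is a hypothetical helper, and it deliberately returns the generic label "domain config" because this guide does not state which domain YAML owns which list.

```python
# Sample tokens quoted in this guide; the real lists live in the YAML configs.
GENERIC_LANGUAGE = {"der", "die", "das", "bitte", "mal", "tabelle", "liste"}
COMMERCE_INTENT = {"shop", "produkt", "artikel", "preis", "kosten", "kaufen", "bestellen"}
MEASUREMENT_DOMAIN = {"wasserhärte", "chlor", "redox", "leitfähigkeit", "ph", "rx", "th", "tc"}


def owning_config(token: str) -> str:
    """Say where a token belongs according to sections 1 and 3."""
    normalized = token.lower()
    if normalized in GENERIC_LANGUAGE:
        return "language.yaml"
    if normalized in COMMERCE_INTENT or normalized in MEASUREMENT_DOMAIN:
        return "domain config"
    # Unknown tokens need a human decision (see the change process).
    return "unclassified"


for sample in ["bitte", "Preis", "redox", "foo"]:
    print(sample, "->", owning_config(sample))
```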
## 4. Protected terms are mandatory guardrails
Never remove these generically unless a later patch explicitly changes the guardrail:
- negations: `nicht`, `kein`, `keine`
- core product/domain anchors: `testomat`, `indikator`, `indikatortyp`
- short model/parameter tokens: `ph`, `rx`, `th`, `tc` (protected here only; their domain meaning stays in the domain configs, see section 3)
- important numeric anchors: `0,02`
When in doubt, add a term to `protected_terms` rather than risk having a broad stopword group remove it.
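
To illustrate why the guardrail matters, here is a minimal sketch of stopword removal in which protected terms always win. It is an assumption about the intended behavior, not the RetrieX implementation: `remove_stopwords` and `broad_group` are hypothetical, and whitespace tokenization with lowercasing is a simplification.

```python
from typing import Iterable

# Protected terms quoted in section 4 of this guide.
PROTECTED_TERMS = frozenset({
    "nicht", "kein", "keine",
    "testomat", "indikator", "indikatortyp",
    "ph", "rx", "th", "tc",
    "0,02",
})


def remove_stopwords(
    text: str,
    stopwords: Iterable[str],
    protected: Iterable[str] = PROTECTED_TERMS,
) -> str:
    """Drop stopwords from text, but never drop a protected term."""
    stop = {word.lower() for word in stopwords}
    keep = {word.lower() for word in protected}
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t in keep or t not in stop)


# A deliberately over-broad (hypothetical) stopword group that also lists
# a negation and a short parameter token.
broad_group = ["bitte", "eine", "mit", "und", "nicht", "ph"]
cleaned = remove_stopwords(
    "Bitte eine Liste mit Testomat und nicht pH 0,02", broad_group
)
print(cleaned)
```

Without the protected set, the same call would silently drop `nicht` and `ph` and invert the meaning of the query.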
## 5. Change process
Before adding a new token list:
1. Ask whether it is generic language noise.
2. If yes, add it to `language.yaml` under the correct group/profile.
3. If no, keep it in the owning domain YAML.
4. Do not introduce PHP-only token lists.
5. Run the required checks.
Required checks:
```bash
bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details
```
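
The four checks can also be run from one small wrapper. This is a sketch under two assumptions that the guide does not confirm: that each console command signals failure with a non-zero exit code, and that the script is started from the project root where `bin/console` lives. The names `REQUIRED_CHECKS` and `run_required_checks` are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

# The commands listed under "Required checks", in the same order.
REQUIRED_CHECKS: List[List[str]] = [
    ["bin/console", "mto:agent:config:validate"],
    ["bin/console", "mto:agent:regression:test"],
    ["bin/console", "mto:agent:config:audit-source", "--details"],
    ["bin/console", "mto:agent:config:audit-patterns", "--details"],
]


def _run(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_required_checks(runner: Callable[[Sequence[str]], int] = _run) -> List[str]:
    """Run every check in order and return the commands that failed.

    Assumes a non-zero exit code means failure. The runner is injectable
    so the wrapper can be exercised without a RetrieX checkout.
    """
    failed = []
    for command in REQUIRED_CHECKS:
        if runner(command) != 0:
            failed.append(" ".join(command))
    return failed
```

Calling `run_required_checks()` with no arguments runs the real commands; an empty result means all four passed.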