MtoRagSystem/patch_history/RETRIEX_LANGUAGE_CLEANUP_GUIDE.md

# RetrieX Language Cleanup Guide

Status: binding for RetrieX 1.5.3+ cleanup-profile work.

This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.

## 1. Central language cleanup lives in `language.yaml`

Use `config/retriex/language.yaml` for generic language noise only.

Allowed here:

- German function words: `der`, `die`, `das`, `ein`, `eine`, `mit`, `und`, `oder`, `ist`, `sind`, `kann`
- conversation filler words: `bitte`, `mal`, `gerne`, `noch`, `dazu`, `also`
- user instruction phrases: `ich suche`, `suche nach`, `zeige mir`, `gib mir`, `habt ihr`, `gibt es`
- presentation/meta terms: `tabelle`, `liste`, `übersicht`, `tabellarisch`, `auflistung`
- protected terms that must not be removed generically (listed in section 4)

Do not add product families, measurement parameters, intent terms or shop semantics here.
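
For orientation, the allowed groups could be laid out roughly as follows. This is only a sketch: `stopword_groups` and the group names below are assumptions made for illustration and may not match the real schema. Only `protected_terms` is a name this guide actually uses.

```yaml
# Hypothetical layout of config/retriex/language.yaml.
# All key names except protected_terms are illustrative assumptions.
stopword_groups:
  function_words: [der, die, das, ein, eine, mit, und, oder, ist, sind, kann]
  conversation_fillers: [bitte, mal, gerne, noch, dazu, also]
  instruction_phrases:
    - ich suche
    - suche nach
    - zeige mir
    - gib mir
    - habt ihr
    - gibt es
  presentation_terms: [tabelle, liste, übersicht, tabellarisch, auflistung]

# Guardrail list: tokens that must not be removed generically.
protected_terms: [nicht, kein, keine]
```

Nothing in this sketch names a product family, measurement parameter or shop concept outside the guardrail list, which is the property the file is meant to keep.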

## 2. Use cleanup profiles instead of copying lists

Domain configs should reference a cleanup profile whenever they need generic language cleanup.

Current profiles:

- `commerce_query`: cleanup for shop/search query text
- `rag_evidence`: cleanup for evidence/answer-consistency checks
- `shop_context_fallback`: cleanup for history-based shop context fallback

Preferred pattern:

```yaml
cleanup_profile: commerce_query
```

Do not repeat generic words that a profile already covers in `commerce.yaml`, `agent.yaml`, `retrieval.yaml` or `intent.yaml`.
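
The contrast between a copied list and a profile reference might look like this in a domain config. The file excerpt and the `query_stopwords` key are invented for illustration; only `cleanup_profile` and the profile name come from this guide.

```yaml
# Hypothetical excerpt of a domain config such as commerce.yaml.

# Avoid: a local copy of words that language.yaml already owns.
# (query_stopwords is an invented key, shown only as the anti-pattern.)
# query_stopwords: [der, die, das, bitte, mal, ich suche]

# Prefer: reference the shared profile.
cleanup_profile: commerce_query
```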

## 3. Keep domain semantics in domain configs

These belong outside `language.yaml`:

- commerce intent terms: `shop`, `produkt`, `artikel`, `preis`, `kosten`, `kaufen`, `bestellen`
- measurement/domain terms: `wasserhärte`, `chlor`, `redox`, `leitfähigkeit`, `ph`, `rx`, `th`, `tc`
- product-role terms: device, accessory, reagent, spare part and document-role vocabulary
- routing and answer behavior rules
- prompt-specific role or grounding rules
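
As a rough sketch, the first two bullets above could stay in the owning domain config like this. The key names `intent_terms` and `measurement_terms` are assumptions, and which domain file owns each list is not specified here; only the tokens themselves are taken from this guide.

```yaml
# Hypothetical excerpt of an owning domain config.
# Key names are assumptions; only the tokens come from this guide.
intent_terms: [shop, produkt, artikel, preis, kosten, kaufen, bestellen]
measurement_terms: [wasserhärte, chlor, redox, leitfähigkeit, ph, rx, th, tc]
```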

## 4. Protected terms are mandatory guardrails

Never remove these generically unless a later patch explicitly changes the guardrail:

- negations: `nicht`, `kein`, `keine`
- core product/domain anchors: `testomat`, `indikator`, `indikatortyp`
- short model/parameter tokens: `ph`, `rx`, `th`, `tc`
- important numeric anchors: `0,02` (German decimal notation)

When in doubt, add terms to `protected_terms` rather than removing them through a broad stopword group.
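
A minimal sketch of the guardrail list, assuming `protected_terms` is a flat list in `language.yaml` (the exact shape is not confirmed by this guide):

```yaml
# Hypothetical protected_terms entry; the flat-list form is an assumption.
protected_terms:
  - nicht
  - kein
  - keine
  - testomat
  - indikator
  - indikatortyp
  - ph
  - rx
  - th
  - tc
  - "0,02"  # quoted so it is always read as a literal string
```

Quoting `0,02` matters most in inline form: in a flow sequence such as `[ph, 0,02]` the comma would split the value into two entries.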

## 5. Change process

When adding a new token list:

1. Ask whether it is generic language noise.
2. If yes, add it to `language.yaml` under the correct group/profile.
3. If no, keep it in the owning domain YAML.
4. Do not introduce PHP-only token lists.
5. Run the required checks.

Required checks:

```bash
bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details
```