# RetrieX Language Cleanup Guide

Status: binding for RetrieX 1.5.3+ cleanup-profile work.

This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.

## 1. Central language cleanup lives in `language.yaml`

Use `config/retriex/language.yaml` for generic language noise only.

Allowed here:

- German function words: `der`, `die`, `das`, `ein`, `eine`, `mit`, `und`, `oder`, `ist`, `sind`, `kann`
- conversation filler words: `bitte`, `mal`, `gerne`, `noch`, `dazu`, `also`
- user instruction phrases: `ich suche`, `suche nach`, `zeige mir`, `gib mir`, `habt ihr`, `gibt es`
- presentation/meta terms: `tabelle`, `liste`, `übersicht`, `tabellarisch`, `auflistung`
- protected terms that must not be removed generically

Do not add product families, measurement parameters, intent terms or shop semantics here.

## 2. Use cleanup profiles instead of copying lists

Domain configs should reference a cleanup profile whenever they need generic language cleanup.

Current profiles:

- `commerce_query`: cleanup for shop/search query text
- `rag_evidence`: cleanup for evidence/answer-consistency checks
- `shop_context_fallback`: cleanup for history-based shop context fallback

Preferred pattern:

```yaml
cleanup_profile: commerce_query
```

Avoid adding the same generic words again to `commerce.yaml`, `agent.yaml`, `retrieval.yaml` or `intent.yaml`.

## 3. Keep domain semantics in domain configs

These belong outside `language.yaml`:

- commerce intent terms: `shop`, `produkt`, `artikel`, `preis`, `kosten`, `kaufen`, `bestellen`
- measurement/domain terms: `wasserhärte`, `chlor`, `redox`, `leitfähigkeit`, `ph`, `rx`, `th`, `tc`
- product-role terms: device, accessory, reagent, spare part and document-role vocabulary
- routing and answer behavior rules
- prompt-specific role or grounding rules

## 4. Protected terms are mandatory guardrails

Never remove these generically unless a later patch explicitly changes the guardrail:

- negations: `nicht`, `kein`, `keine`
- core product/domain anchors: `testomat`, `indikator`, `indikatortyp`
- short model/parameter tokens: `ph`, `rx`, `th`, `tc`
- important numeric anchors: `0,02`

When in doubt, add terms to `protected_terms` rather than removing them through a broad stopword group.

## 5. Change process

Before adding a new token list:

1. Ask whether it is generic language noise.
2. If yes, add it to `language.yaml` under the correct group/profile.
3. If no, keep it in the owning domain YAML.
4. Do not introduce PHP-only token lists.
5. Run the required checks.

Required checks:

```bash
bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details
```
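
To illustrate how the protected-terms guardrail from section 4 is meant to interact with the generic stopword groups from section 1, here is a minimal Python sketch. It is not RetrieX code: the function `clean_query`, its parameters and the in-memory word sets are hypothetical and only reuse the example tokens listed in this guide. The real `language.yaml` schema and the actual cleanup implementation may differ.

```python
import re

# Hypothetical in-memory copies of the example tokens from this guide.
# The real language.yaml layout is not shown here and may differ.
STOPWORDS = {
    "der", "die", "das", "ein", "eine", "mit", "und", "oder", "ist", "sind",
    "kann", "bitte", "mal", "gerne", "noch", "dazu", "also",
}
PHRASES = ("ich suche", "suche nach", "zeige mir", "gib mir", "habt ihr", "gibt es")
PROTECTED = {
    "nicht", "kein", "keine", "testomat", "indikator", "indikatortyp",
    "ph", "rx", "th", "tc", "0,02",
}


def clean_query(text, stopwords=STOPWORDS, phrases=PHRASES, protected=PROTECTED):
    """Remove generic language noise while never dropping a protected term."""
    lowered = text.lower()
    # Strip multi-word instruction phrases first, matching whole words only.
    for phrase in phrases:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        lowered = re.sub(pattern, " ", lowered)
    kept = []
    for raw in lowered.split():
        # Trim surrounding punctuation; inner characters such as the comma
        # in "0,02" are left untouched.
        token = raw.strip(".,;:!?")
        if not token:
            continue
        # The guardrail wins: a protected term survives even if a broad
        # stopword group also lists it.
        if token in protected or token not in stopwords:
            kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    print(clean_query("Habt ihr noch das Testomat Zubehör mit ph 0,02?"))
```

The design choice shown here is that the protected check runs before the stopword check, so adding a term to `protected_terms` is always safe even when a broad stopword group happens to contain the same token.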