RetrieX Language Cleanup Guide
Status: binding for RetrieX 1.5.3+ cleanup-profile work.
This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.
1. Central language cleanup lives in language.yaml
Use config/retriex/language.yaml for generic language noise only.
Allowed here:
- German function words: der,die,das,ein,eine,mit,und,oder,ist,sind,kann
- conversation filler words: bitte,mal,gerne,noch,dazu,also
- user instruction phrases: ich suche,suche nach,zeige mir,gib mir,habt ihr,gibt es
- presentation/meta terms: tabelle,liste,übersicht,tabellarisch,auflistung
- protected terms that must not be removed generically
Do not add product families, measurement parameters, intent terms or shop semantics here.
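As a rough illustration, the allowed groups could be laid out in language.yaml along these lines. This is only a sketch: the key names (stopword_groups, function_words, filler_words, instruction_phrases, presentation_terms) are assumptions made for the example and are not taken from the actual RetrieX schema. Only the file path and the group categories come from this guide.

```yaml
# config/retriex/language.yaml
# Illustrative sketch only: all key names below are assumed, not the real schema.
stopword_groups:                # hypothetical top-level key
  function_words: [der, die, das, ein, eine, mit, und, oder, ist, sind, kann]
  filler_words: [bitte, mal, gerne, noch, dazu, also]
  instruction_phrases:          # multi-word phrases, one per line for readability
    - ich suche
    - suche nach
    - zeige mir
    - gib mir
    - habt ihr
    - gibt es
  presentation_terms: [tabelle, liste, übersicht, tabellarisch, auflistung]

# Protected terms also live in this file; section 4 lists the mandatory entries.
# Product families, measurement parameters, intent terms and shop semantics
# do not belong here.
```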
2. Use cleanup profiles instead of copying lists
Domain configs should reference a cleanup profile whenever they need generic language cleanup.
Current profiles:
- commerce_query: cleanup for shop/search query text
- rag_evidence: cleanup for evidence/answer-consistency checks
- shop_context_fallback: cleanup for history-based shop context fallback
Preferred pattern:
cleanup_profile: commerce_query
Avoid adding the same generic words again to commerce.yaml, agent.yaml, retrieval.yaml or intent.yaml.
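The difference between referencing a profile and copying a list might look like this in a domain config. The surrounding key (query_cleanup) and the commented-out stopwords key are hypothetical and exist only to give the example some context. cleanup_profile and the profile name commerce_query are the parts taken from this guide.

```yaml
# commerce.yaml
# Illustrative sketch only: query_cleanup and stopwords are assumed key names.
query_cleanup:                       # hypothetical section name
  cleanup_profile: commerce_query    # preferred: reference the shared profile

  # Avoid: re-listing generic words that language.yaml already owns.
  # stopwords: [der, die, das, bitte, mal]
```

Referencing the profile keeps a single source of truth, so a change to the generic word lists does not need to be repeated in every domain file.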
3. Keep domain semantics in domain configs
These belong outside language.yaml:
- commerce intent terms: shop,produkt,artikel,preis,kosten,kaufen,bestellen
- measurement/domain terms: wasserhärte,chlor,redox,leitfähigkeit,ph,rx,th,tc
- product-role terms: device, accessory, reagent, spare part and document-role vocabulary
- routing and answer behavior rules
- prompt-specific role or grounding rules
4. Protected terms are mandatory guardrails
Never remove these generically unless a later patch explicitly changes the guardrail:
- negations: nicht,kein,keine
- core product/domain anchors: testomat,indikator,indikatortyp
- short model/parameter tokens: ph,rx,th,tc
- important numeric anchors: 0,02
When in doubt, add terms to protected_terms rather than removing them through a broad stopword group.
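A protected_terms entry covering the guardrails above might be written as follows. This is a sketch under the assumption that protected_terms is a flat list of strings. The guide names the key but does not show its structure, so verify it against the real file.

```yaml
# Illustrative sketch only: a flat string list is an assumed structure.
protected_terms:
  # negations
  - nicht
  - kein
  - keine
  # core product/domain anchors
  - testomat
  - indikator
  - indikatortyp
  # short model/parameter tokens
  - ph
  - rx
  - th
  - tc
  # important numeric anchors
  - "0,02"    # quoted so the decimal comma cannot be read as a list separator
```

Quoting the numeric anchor matters mainly if the list is ever rewritten in flow style, where an unquoted 0,02 inside square brackets would split into two items.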
5. Change process
Before adding a new token list:
- Ask whether it is generic language noise.
- If yes, add it to language.yaml under the correct group/profile.
- If no, keep it in the owning domain YAML.
- Do not introduce PHP-only token lists.
- Run the required checks.
Required checks:
bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details