MtoRagSystem/RETRIEX_LANGUAGE_CLEANUP_GUIDE.md
team 1 c00cb3a9b9 p28
2026-05-04 08:38:53 +02:00

RetrieX Language Cleanup Guide

Status: binding for RetrieX 1.5.3+ cleanup-profile work.

This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.

1. Central language cleanup lives in language.yaml

Use config/retriex/language.yaml for generic language noise only.

Allowed here:

  • German function words: der, die, das, ein, eine, mit, und, oder, ist, sind, kann
  • conversation filler words: bitte, mal, gerne, noch, dazu, also
  • user instruction phrases: ich suche, suche nach, zeige mir, gib mir, habt ihr, gibt es
  • presentation/meta terms: tabelle, liste, übersicht, tabellarisch, auflistung
  • protected terms that generic cleanup must never remove (see section 4)

Do not add product families, measurement parameters, intent terms or shop semantics here.

2. Use cleanup profiles instead of copying lists

Domain configs should reference a cleanup profile whenever they need generic language cleanup.

Current profiles:

  • commerce_query: cleanup for shop/search query text
  • rag_evidence: cleanup for evidence/answer-consistency checks
  • shop_context_fallback: cleanup for history-based shop context fallback

Preferred pattern:

cleanup_profile: commerce_query

Do not repeat these generic words in commerce.yaml, agent.yaml, retrieval.yaml or intent.yaml.
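
To make the "reference, do not copy" idea concrete, here is a minimal Python sketch of how a profile name could be resolved into a set of removable tokens. The guide does not show the real schema of language.yaml or the actual RetrieX loader, so the `groups` and `profiles` keys, the group names, the profile-to-group mapping and the `resolve_profile` helper are all assumptions made for illustration. Only the three profile names and the example words come from this guide.

```python
# Hypothetical in-memory stand-in for config/retriex/language.yaml.
# The key names, group names and profile-to-group mapping are assumptions.
LANGUAGE_CONFIG = {
    "groups": {
        "function_words": ["der", "die", "das", "mit", "und"],
        "filler_words": ["bitte", "mal", "gerne"],
        "instruction_phrases": ["ich suche", "zeige mir", "gibt es"],
        "presentation_terms": ["tabelle", "liste", "übersicht"],
    },
    "profiles": {
        "commerce_query": [
            "function_words",
            "filler_words",
            "instruction_phrases",
            "presentation_terms",
        ],
        "rag_evidence": ["function_words", "filler_words"],
        "shop_context_fallback": [
            "function_words",
            "filler_words",
            "instruction_phrases",
        ],
    },
}


def resolve_profile(name, config=LANGUAGE_CONFIG):
    """Return the set of removable tokens for one cleanup profile."""
    try:
        group_names = config["profiles"][name]
    except KeyError:
        raise ValueError(f"unknown cleanup profile: {name!r}") from None
    terms = set()
    for group in group_names:
        terms.update(config["groups"][group])
    return terms


# A domain config carries only the reference, never a copy of the words.
commerce_config = {"cleanup_profile": "commerce_query"}
removable = resolve_profile(commerce_config["cleanup_profile"])
```

Because every domain config resolves its words through the same central mapping, a change to one group reaches all profiles that use it without touching any domain file.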

3. Keep domain semantics in domain configs

These belong outside language.yaml:

  • commerce intent terms: shop, produkt, artikel, preis, kosten, kaufen, bestellen
  • measurement/domain terms: wasserhärte, chlor, redox, leitfähigkeit, ph, rx, th, tc
  • product-role terms (device, accessory, reagent, spare part) and document-role vocabulary
  • routing and answer behavior rules
  • prompt-specific role or grounding rules

4. Protected terms are mandatory guardrails

Never remove these generically unless a later patch explicitly changes the guardrail:

  • negations: nicht, kein, keine
  • core product/domain anchors: testomat, indikator, indikatortyp
  • short model/parameter tokens: ph, rx, th, tc
  • important numeric anchors: 0,02

When in doubt, add terms to protected_terms rather than removing them through a broad stopword group.
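
The guardrail can be sketched as a cleanup step in which protection always wins over removal. This is an illustrative Python sketch, not the RetrieX implementation: the `clean_query` function, its whitespace tokenization and the order of phrase and token removal are assumptions. The protected terms are the ones listed in this section.

```python
import re

# Protected terms taken from the list above; a real setup would load
# them from protected_terms in language.yaml (loader not shown here).
PROTECTED_TERMS = {
    "nicht", "kein", "keine",
    "testomat", "indikator", "indikatortyp",
    "ph", "rx", "th", "tc",
    "0,02",
}


def clean_query(text, removable, protected=PROTECTED_TERMS):
    """Remove generic noise from text but never drop a protected token.

    Hypothetical helper: tokenization is plain whitespace splitting,
    so punctuation attached to a word is not handled.
    """
    lowered = text.lower()
    # Remove multi-word phrases first, longest first, on word boundaries.
    phrases = sorted((p for p in removable if " " in p), key=len, reverse=True)
    for phrase in phrases:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        lowered = re.sub(pattern, " ", lowered)
    single_words = {p for p in removable if " " not in p}
    kept = [
        token
        for token in lowered.split()
        # The protected check comes first, so it overrides any stopword group.
        if token in protected or token not in single_words
    ]
    return " ".join(kept)
```

In this sketch a short token such as th survives even if a broad stopword group happens to contain it, which is the behavior the guardrail asks for.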

5. Change process

Before adding a new token list:

  1. Ask whether it is generic language noise.
  2. If yes, add it to language.yaml under the correct group/profile.
  3. If no, keep it in the owning domain YAML.
  4. Do not introduce PHP-only token lists.
  5. Run the required checks.

Required checks:

bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details
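
If the checks are run often, a small wrapper can run them in the listed order and report the first one that fails. The following Python sketch makes two assumptions the guide does not state: that a failing check exits with a non-zero status, and that it is acceptable to stop at the first failure. The `run_required_checks` helper is hypothetical; the commands themselves are copied from the list above.

```python
import subprocess

# Commands copied from the "Required checks" list, in the same order.
REQUIRED_CHECKS = [
    ["bin/console", "mto:agent:config:validate"],
    ["bin/console", "mto:agent:regression:test"],
    ["bin/console", "mto:agent:config:audit-source", "--details"],
    ["bin/console", "mto:agent:config:audit-patterns", "--details"],
]


def run_required_checks(run=subprocess.run):
    """Run each check in order; return the first failing command or None.

    Assumes a non-zero exit status signals failure. The run parameter
    is injectable so the logic can be exercised without the project.
    """
    for command in REQUIRED_CHECKS:
        result = run(command)
        if result.returncode != 0:
            return command
    return None

# Usage from the project root (not executed here):
#   failed = run_required_checks()
#   if failed is not None:
#       raise SystemExit("check failed: " + " ".join(failed))
```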