new struct md files
patch_history/RETRIEX_LANGUAGE_CLEANUP_GUIDE.md
@@ -0,0 +1,77 @@
# RetrieX Language Cleanup Guide

Status: binding for RetrieX 1.5.3+ cleanup-profile work.

This guide defines where language, interaction, commerce and domain tokens belong. Its goal is to keep YAML maintenance simple and avoid duplicated keyword lists.

## 1. Central language cleanup lives in `language.yaml`

Use `config/retriex/language.yaml` for generic language noise only.

Allowed here:

- German function words: `der`, `die`, `das`, `ein`, `eine`, `mit`, `und`, `oder`, `ist`, `sind`, `kann`
- conversation filler words: `bitte`, `mal`, `gerne`, `noch`, `dazu`, `also`
- user instruction phrases: `ich suche`, `suche nach`, `zeige mir`, `gib mir`, `habt ihr`, `gibt es`
- presentation/meta terms: `tabelle`, `liste`, `übersicht`, `tabellarisch`, `auflistung`
- protected terms that must not be removed generically

Do not add product families, measurement parameters, intent terms or shop semantics here.
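
To make the scope of "generic language noise" concrete, here is a minimal Python sketch of what such a cleanup pass might do to a query. It is an illustration only: the group names, the `LANGUAGE_NOISE` dictionary and the `strip_language_noise` function are hypothetical, and the real layout of `language.yaml` and the RetrieX implementation are not shown in this guide. Protected terms are left out here for brevity.

```python
import re

# Hypothetical in-memory stand-in for the groups in language.yaml.
# The group names are assumptions; the word lists come from this guide.
LANGUAGE_NOISE = {
    "function_words": ["der", "die", "das", "ein", "eine", "mit", "und",
                       "oder", "ist", "sind", "kann"],
    "filler_words": ["bitte", "mal", "gerne", "noch", "dazu", "also"],
    "instruction_phrases": ["ich suche", "suche nach", "zeige mir",
                            "gib mir", "habt ihr", "gibt es"],
    "presentation_terms": ["tabelle", "liste", "übersicht", "tabellarisch",
                           "auflistung"],
}


def strip_language_noise(text, groups=LANGUAGE_NOISE):
    """Return the tokens of `text` that are not generic language noise."""
    cleaned = text.lower()
    # Multi-word phrases go first so "zeige mir" is removed as a unit.
    for phrase in groups["instruction_phrases"]:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        cleaned = re.sub(pattern, " ", cleaned)
    # Everything else is matched token by token, never as a substring.
    noise_words = {
        word
        for name, words in groups.items()
        if name != "instruction_phrases"
        for word in words
    }
    return [tok for tok in cleaned.split() if tok not in noise_words]


print(strip_language_noise("Zeige mir bitte eine Liste der Testomat Indikatoren"))
```

Only the instruction phrase, the filler word, the function words and the presentation term are dropped; the product and domain tokens survive because they were never part of the generic groups.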

## 2. Use cleanup profiles instead of copying lists

Domain configs should reference a cleanup profile whenever they need generic language cleanup.

Current profiles:

- `commerce_query`: cleanup for shop/search query text
- `rag_evidence`: cleanup for evidence/answer-consistency checks
- `shop_context_fallback`: cleanup for history-based shop context fallback

Preferred pattern:

```yaml
cleanup_profile: commerce_query
```

Avoid adding the same generic words again to `commerce.yaml`, `agent.yaml`, `retrieval.yaml` or `intent.yaml`.
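
The idea of referencing a profile by name instead of copying word lists can be sketched in Python as follows. This is a hedged illustration, not the RetrieX implementation: the `GROUPS` and `PROFILES` dictionaries, the assignment of groups to profiles, and both function names are assumptions made for the example. Only the three profile names and the `cleanup_profile` key come from this guide.

```python
# Hypothetical stand-ins for the groups and profiles in language.yaml.
# Which groups each profile includes is an assumption for illustration.
GROUPS = {
    "function_words": {"der", "die", "das", "mit", "und"},
    "filler_words": {"bitte", "mal", "gerne"},
    "presentation_terms": {"tabelle", "liste", "übersicht"},
}
PROFILES = {
    "commerce_query": ["function_words", "filler_words", "presentation_terms"],
    "rag_evidence": ["function_words", "filler_words"],
    "shop_context_fallback": ["function_words", "filler_words"],
}


def resolve_cleanup_profile(domain_config):
    """Expand the profile named in a domain config into its word set."""
    name = domain_config["cleanup_profile"]
    if name not in PROFILES:
        raise KeyError(f"unknown cleanup profile: {name}")
    words = set()
    for group in PROFILES[name]:
        words |= GROUPS[group]
    return words


def find_duplicated_noise(domain_tokens):
    """Return domain tokens that repeat words already owned centrally."""
    central = set().union(*GROUPS.values())
    return sorted(set(domain_tokens) & central)


# A domain config only names the profile; it carries no word list itself.
commerce_config = {"cleanup_profile": "commerce_query"}
print(sorted(resolve_cleanup_profile(commerce_config)))

# A domain list that copied generic words would be flagged.
print(find_duplicated_noise(["preis", "bitte", "shop", "mit"]))
```

With this shape, a change to a generic word list is made once in the central file and every domain config that names the profile picks it up.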

## 3. Keep domain semantics in domain configs

These belong outside `language.yaml`:

- commerce intent terms: `shop`, `produkt`, `artikel`, `preis`, `kosten`, `kaufen`, `bestellen`
- measurement/domain terms: `wasserhärte`, `chlor`, `redox`, `leitfähigkeit`, `ph`, `rx`, `th`, `tc`
- product-role terms: device, accessory, reagent, spare part and document-role vocabulary
- routing and answer behavior rules
- prompt-specific role or grounding rules

## 4. Protected terms are mandatory guardrails

Never remove these generically unless a later patch explicitly changes the guardrail:

- negations: `nicht`, `kein`, `keine`
- core product/domain anchors: `testomat`, `indikator`, `indikatortyp`
- short model/parameter tokens: `ph`, `rx`, `th`, `tc`
- important numeric anchors: `0,02`

When in doubt, add terms to `protected_terms` rather than removing them through a broad stopword group.
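
The guardrail can be pictured as a check that runs before any stopword removal. The following Python sketch assumes a simple token-level filter; `PROTECTED_TERMS` mirrors the list above, while the function name and the deliberately over-broad stopword group are hypothetical and only serve the example.

```python
# Protected terms as listed in this guide; how RetrieX stores them under
# `protected_terms` is not shown here, so a plain set is assumed.
PROTECTED_TERMS = {
    "nicht", "kein", "keine",
    "testomat", "indikator", "indikatortyp",
    "ph", "rx", "th", "tc",
    "0,02",
}


def remove_stopwords(tokens, stopwords, protected=PROTECTED_TERMS):
    """Drop stopwords, but never drop a protected term."""
    kept = []
    for tok in tokens:
        key = tok.lower()
        # The protected check wins even if a broad group lists the token.
        if key in protected or key not in stopwords:
            kept.append(tok)
    return kept


# Hypothetical broad group that wrongly contains a negation and a parameter.
too_broad = {"der", "ist", "nicht", "ph"}
print(remove_stopwords(["der", "ph", "wert", "ist", "nicht", "0,02"], too_broad))
```

Even though the broad group lists `nicht` and `ph`, both survive, which is why adding a term to `protected_terms` is the safer default when a token's role is unclear.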

## 5. Change process

Before adding a new token list:

1. Ask whether it is generic language noise.
2. If yes, add it to `language.yaml` under the correct group/profile.
3. If no, keep it in the owning domain YAML.
4. Do not introduce PHP-only token lists.
5. Run the required checks.

Required checks:

```bash
bin/console mto:agent:config:validate
bin/console mto:agent:regression:test
bin/console mto:agent:config:audit-source --details
bin/console mto:agent:config:audit-patterns --details
```
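
If the checks are run from a script, a small wrapper that stops at the first failure can help. The Python sketch below is one possible shape, not part of RetrieX: it assumes the four commands above exit with a non-zero status on failure, and the `run_required_checks` helper and its injectable `run` parameter are hypothetical.

```python
import subprocess

# The four commands listed in this guide, in the same order.
REQUIRED_CHECKS = [
    ["bin/console", "mto:agent:config:validate"],
    ["bin/console", "mto:agent:regression:test"],
    ["bin/console", "mto:agent:config:audit-source", "--details"],
    ["bin/console", "mto:agent:config:audit-patterns", "--details"],
]


def run_required_checks(run=subprocess.run):
    """Run each check in order; return the first failing command or None.

    Assumes a failing check exits with a non-zero return code.
    """
    for cmd in REQUIRED_CHECKS:
        result = run(cmd)
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_required_checks()` from the project root returns `None` when every check passes; otherwise it returns the command that failed so the caller can report it.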