diff --git a/RETRIEX_PATCH_21_LANGUAGE_CLEANUP_PROFILES_README.md b/RETRIEX_PATCH_21_LANGUAGE_CLEANUP_PROFILES_README.md
new file mode 100644
index 0000000..6bca05d
--- /dev/null
+++ b/RETRIEX_PATCH_21_LANGUAGE_CLEANUP_PROFILES_README.md
@@ -0,0 +1,36 @@
+# RetrieX Patch 21 - Language Cleanup Profiles groundwork
+
+## Goal
+
+Prepare RetrieX 1.5.3 for simpler, centralized language cleanup without changing runtime behavior yet.
+
+## Changes
+
+- Extends `config/retriex/language.yaml` additively.
+- Keeps legacy `retriex.stopwords.config.words` unchanged.
+- Adds central groups for protected terms, German core stopwords, conversation noise, user instruction phrases, presentation/meta terms, and cleanup profiles.
+- Introduces initial profiles: `commerce_query`, `rag_evidence`, `shop_context_fallback`.
+
+## Non-goals
+
+- No external stopword library.
+- No Commerce/Agent runtime wiring yet.
+- No removal of existing lists in `commerce.yaml`, `agent.yaml`, or `retrieval.yaml`.
+- No domain-specific special cases.
+
+## Install
+
+Copy the files from this patch over the current RetrieX root.
+
+```bash
+unzip retriex-p21-language-cleanup-profiles-patch-only.zip -d /path/to/retriex
+cd /path/to/retriex
+bin/console mto:agent:config:validate
+bin/console mto:agent:regression:test
+bin/console mto:agent:config:audit-source --details
+bin/console mto:agent:config:audit-patterns --details
+```
+
+## Expected result
+
+All checks should remain green. This patch should not change answers yet.
diff --git a/config/retriex/language.yaml b/config/retriex/language.yaml
index 42081c3..75a937b 100644
--- a/config/retriex/language.yaml
+++ b/config/retriex/language.yaml
@@ -50,3 +50,106 @@ parameters:
             - würde
             - würdest
             - würden
+
+        # Central language cleanup structure for RetrieX 1.5.3+.
+        # Legacy key `words` above remains the runtime-compatible default list.
+        # New cleanup profiles are introduced additively and are not yet wired into
+        # Commerce/Agent runtime logic in this patch.
+        protected_terms:
+            - nicht
+            - kein
+            - keine
+            - testomat
+            - indikator
+            - indikatortyp
+            - ph
+            - rx
+            - th
+            - tc
+            - '0,02'
+
+        stopword_groups:
+            de_core:
+                - der
+                - die
+                - das
+                - den
+                - dem
+                - des
+                - ein
+                - eine
+                - einer
+                - eines
+                - und
+                - oder
+                - mit
+                - für
+                - fuer
+                - ist
+                - sind
+                - kann
+                - können
+                - koennen
+
+            conversation:
+                - bitte
+                - mal
+                - gerne
+                - gern
+                - auch
+                - noch
+                - nochmal
+                - dazu
+                - davon
+                - also
+                - danke
+
+        phrase_groups:
+            user_instruction:
+                - ich suche
+                - suche nach
+                - zeige mir
+                - zeig mir
+                - gib mir
+                - gebe mir
+                - nenne mir
+                - habt ihr
+                - gibt es
+                - suche im shop
+
+        meta_term_groups:
+            presentation:
+                - tabelle
+                - tabellarisch
+                - liste
+                - übersicht
+                - uebersicht
+                - auflistung
+
+        cleanup_profiles:
+            commerce_query:
+                stopword_groups:
+                    - de_core
+                    - conversation
+                phrase_groups:
+                    - user_instruction
+                protected_term_groups:
+                    - protected_terms
+
+            rag_evidence:
+                stopword_groups:
+                    - de_core
+                    - conversation
+                protected_term_groups:
+                    - protected_terms
+
+            shop_context_fallback:
+                stopword_groups:
+                    - de_core
+                    - conversation
+                phrase_groups:
+                    - user_instruction
+                meta_term_groups:
+                    - presentation
+                protected_term_groups:
+                    - protected_terms
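
Since this patch deliberately adds no runtime wiring, the following is only a rough sketch of how a later patch might consume the new keys; it is not RetrieX code. It assumes that a profile's `stopword_groups`, `phrase_groups`, and `meta_term_groups` all name things to strip, that each entry in `protected_term_groups` refers to a top-level list such as `protected_terms`, and that protected terms always win over stopwords. The names `CONFIG`, `resolve_profile`, and `clean_query` are hypothetical, and `CONFIG` is a hand-copied dictionary standing in for the parsed YAML.

```python
import re

# Hypothetical in-memory copy of the keys added to language.yaml by this patch.
CONFIG = {
    "protected_terms": ["nicht", "kein", "keine", "testomat", "indikator",
                        "indikatortyp", "ph", "rx", "th", "tc", "0,02"],
    "stopword_groups": {
        "de_core": ["der", "die", "das", "den", "dem", "des", "ein", "eine",
                    "einer", "eines", "und", "oder", "mit", "für", "fuer",
                    "ist", "sind", "kann", "können", "koennen"],
        "conversation": ["bitte", "mal", "gerne", "gern", "auch", "noch",
                         "nochmal", "dazu", "davon", "also", "danke"],
    },
    "phrase_groups": {
        "user_instruction": ["ich suche", "suche nach", "zeige mir", "zeig mir",
                             "gib mir", "gebe mir", "nenne mir", "habt ihr",
                             "gibt es", "suche im shop"],
    },
    "meta_term_groups": {
        "presentation": ["tabelle", "tabellarisch", "liste", "übersicht",
                         "uebersicht", "auflistung"],
    },
    "cleanup_profiles": {
        "commerce_query": {
            "stopword_groups": ["de_core", "conversation"],
            "phrase_groups": ["user_instruction"],
            "protected_term_groups": ["protected_terms"],
        },
        "rag_evidence": {
            "stopword_groups": ["de_core", "conversation"],
            "protected_term_groups": ["protected_terms"],
        },
        "shop_context_fallback": {
            "stopword_groups": ["de_core", "conversation"],
            "phrase_groups": ["user_instruction"],
            "meta_term_groups": ["presentation"],
            "protected_term_groups": ["protected_terms"],
        },
    },
}


def resolve_profile(config, name):
    """Flatten one cleanup profile into removable words, phrases, and protected terms."""
    profile = config["cleanup_profiles"][name]
    removable = set()
    for group in profile.get("stopword_groups", []):
        removable.update(config["stopword_groups"][group])
    # Assumption: meta terms are stripped the same way as stopwords.
    for group in profile.get("meta_term_groups", []):
        removable.update(config["meta_term_groups"][group])
    phrases = []
    for group in profile.get("phrase_groups", []):
        phrases.extend(config["phrase_groups"][group])
    protected = set()
    # Assumption: each protected group name is a top-level list in the config.
    for group in profile.get("protected_term_groups", []):
        protected.update(config[group])
    return {
        "removable": removable - protected,  # protected terms are never removed
        "phrases": sorted(phrases, key=len, reverse=True),  # longest phrase first
        "protected": protected,
    }


def clean_query(text, resolved):
    """Lowercase the text, drop whole-phrase matches, then drop removable tokens."""
    cleaned = text.lower()
    for phrase in resolved["phrases"]:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        cleaned = re.sub(pattern, " ", cleaned)
    kept = [tok for tok in cleaned.split() if tok not in resolved["removable"]]
    return " ".join(kept)


if __name__ == "__main__":
    resolved = resolve_profile(CONFIG, "commerce_query")
    print(clean_query("Zeige mir bitte den Testomat Indikator mit pH 0,02", resolved))
    # prints: testomat indikator ph 0,02
```

Removing phrases before single tokens keeps multi-word instructions such as `zeige mir` from being half-stripped, and subtracting the protected set up front means a negation like `kein` survives even if a later group were to list it as a stopword. Whether the real wiring tokenizes on whitespace or treats meta terms differently is left open by this patch.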