p98
@@ -311,6 +311,7 @@ Wichtig: `genre.yaml` ist in v1.6.0 eine zentrale Entlastung des PHP-Cores. Doma
 | `min_chunk_distance` | Minimum distance between selected chunks. |
 | `dominant_doc_*` | Preference for dominant documents when the hit pattern is clear-cut. |
 | `exact_document_max_chunks` | Maximum number of chunks for an exact document focus. |
+| `query_cleanup_profile` | YAML cleanup profile for the generic retrieval query cleanup. |
 | `focused_product_*` | Focused product selection in retrieval. |
 | `catalog_list_shortcut_patterns` | Direct catalog/list routes. |
 | `exact_selection_*` | Precision logic for tables, indicators, limit values and measuring ranges. |
@@ -759,6 +759,15 @@ parameters:
 Grenzwert: Überwachungsbereich
 store: shop
 Indikatortyp: Indikator
+geraet: gerät analysegerät
+geraete: geräte analysegeräte
+wasserhaerte: wasserhärte
+haerte: härte
+ueberwachung: überwachung
+chlorueberwachung: chlor überwachung chlorüberwachung
+haerteueberwachung: härteüberwachung härte überwachung
+haerteueberwachungsgeraet: härteüberwachungsgerät härteüberwachung analysegerät
+lieferbedingungen: lieferung versand verkaufsbedingungen allgemeine lieferbedingungen
 accessory_focus_variants:
 origin: genre_native
 map:
@@ -2008,6 +2017,8 @@ parameters:
 - tm
 - ph
 - rx
+- v
+- c
 family_descriptor_tokens:
 - evo
 - eco
@@ -22,6 +22,7 @@ parameters:
 dominant_doc_min_hits: 3
 dominant_doc_max_chunks: 4
 exact_document_max_chunks: 6
+query_cleanup_profile: retrieval_reference_cleanup
 focused_product_window: 8
 focused_product_min_score: 10.0
 focused_product_min_gap: 4.0
@@ -0,0 +1,79 @@
# RetrieX Patch p98 - Retrieval Eval Green Baseline

## Goal

p98 tightens the retrieval baseline for the four eval cases that were most recently red, without introducing any new product-specific or test-case-specific special-case logic in PHP.

Red cases covered, from `tests/evals/cases/retrieval.ndjson`:

- `welcher testomat ist ein verschneideregler`
- `welches geraet ist fuer chlorueberwachung gedacht`
- `lieferbedingungen versand testomat`
- `testomat 2000 th 2005 sicherheitsdatenblatt`

## Changes

### 1. YAML-configurable retrieval query cleanup

In addition to the existing legacy stop-word set, `QueryCleaner` now uses a YAML cleanup profile that is selected in `retrieval.yaml`:

```yaml
query_cleanup_profile: retrieval_reference_cleanup
```

As a result, generic question words such as `welcher` and `welches` are removed via the existing cleanup profile, without writing them back into the old legacy lists.
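
The combined cleanup described in this section can be sketched in a few lines. This is a minimal illustration rather than the patch's `QueryCleaner`: the function name `cleanQuery()` and both word lists are assumptions made up for the example. The real terms come from the legacy stop-word set and from the cleanup profile named by `query_cleanup_profile`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: a token is dropped if it appears in the legacy
 * stop-word list OR in the YAML cleanup profile. List contents are example data.
 *
 * @param string[] $legacyStopWords
 * @param string[] $profileTerms
 */
function cleanQuery(string $query, array $legacyStopWords, array $profileTerms): string
{
    // Lowercase in a Unicode-safe way, then replace everything that is not
    // a letter, a digit or whitespace with a space.
    $query = mb_strtolower($query, 'UTF-8');
    $query = preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $query) ?? '';

    // One lookup table, so each token needs a single isset() check.
    $blocked = array_fill_keys(array_merge($legacyStopWords, $profileTerms), true);

    $tokens = preg_split('/\s+/u', $query, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $kept = array_filter($tokens, static fn (string $t): bool => !isset($blocked[$t]));

    return implode(' ', $kept);
}

// Assumed example data: 'ist' and 'ein' stand in for the legacy list,
// 'welcher' and 'welches' for the profile terms.
echo cleanQuery(
    'Welcher Testomat ist ein Verschneideregler?',
    ['ist', 'ein'],
    ['welcher', 'welches']
), PHP_EOL; // prints: testomat verschneideregler
```

Keeping the two sources separate and merging them only at lookup time mirrors the stated intent of the patch: question words can be maintained in YAML without touching the legacy list.
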

### 2. ASCII/umlaut and semantic bridges in genre enrichment

`genre.yaml` adds conservative query enrichment rules for common ASCII spellings and compound search terms:

- `geraet` -> `gerät analysegerät`
- `chlorueberwachung` -> `chlor überwachung chlorüberwachung`
- `haerteueberwachungsgeraet` -> `härteüberwachungsgerät härteüberwachung analysegerät`
- `lieferbedingungen` -> `lieferung versand verkaufsbedingungen allgemeine lieferbedingungen`

The rules stay in the genre-specific configuration section `brands_and_canonical_terms.query_enrichment_rules`.
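
How such rules could act on a query is sketched below. The patch does not show the enrichment code itself, so this is an assumption-laden illustration: the function `enrichQuery()` is hypothetical, and appending the expansion terms (instead of replacing the original token) is an assumed strategy. Only the two rule entries are taken from the patch.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: append the expansion terms of every matching rule.
 * Whether the real enrichment appends or replaces is not visible in the patch.
 *
 * @param array<string, string> $rules token => space-separated expansion terms
 */
function enrichQuery(string $query, array $rules): string
{
    $tokens = preg_split('/\s+/u', trim($query), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $out = $tokens;

    foreach ($tokens as $token) {
        if (!isset($rules[$token])) {
            continue;
        }
        foreach (explode(' ', $rules[$token]) as $extra) {
            $out[] = $extra;
        }
    }

    // Deduplicate while keeping first-seen order.
    return implode(' ', array_values(array_unique($out)));
}

// Two of the rules that this patch adds to genre.yaml.
$rules = [
    'geraet' => 'gerät analysegerät',
    'chlorueberwachung' => 'chlor überwachung chlorüberwachung',
];

echo enrichQuery('geraet fuer chlorueberwachung', $rules), PHP_EOL;
```

With these two rules the call prints `geraet fuer chlorueberwachung gerät analysegerät chlor überwachung chlorüberwachung`, so the ASCII spelling and the umlaut spelling both reach the lexical index.
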

### 3. Stricter exact-title fallback for short model variants

Short model/variant tokens from the retrieval vocabulary view can now count as significant in exact-title token matches.

For example, in `Testomat 2000 V` the `v` now counts as a relevant part of the title. A query such as `testomat 2000 th 2005 sicherheitsdatenblatt` therefore no longer falls back incorrectly to `Testomat 2000 V`; instead it proceeds into the normal retrieval fusion, where it can hit the TH 2005 safety data sheets.
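
The effect on the TH 2005 query can be reproduced with a small sketch. The token filter follows the condition shown in the `NdjsonChunkLookup` diff (shorter than 3 characters, no digit, not an important short token). Everything else is assumed for illustration: the function names, the rule that a title matches only when all of its significant tokens occur in the query, and the idea that the `v` and `c` entries added to `genre.yaml` feed the short-token list.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the token filter: tokens shorter than 3 characters
 * are dropped unless they contain a digit or are listed as important.
 *
 * @param string[] $importantShortTokens
 * @return string[]
 */
function significantTitleTokens(string $title, array $importantShortTokens): array
{
    $tokens = preg_split('/\s+/u', mb_strtolower(trim($title), 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $out = [];

    foreach ($tokens as $token) {
        $isShort = mb_strlen($token, 'UTF-8') < 3;
        $hasDigit = preg_match('/\d/u', $token) === 1;

        if ($isShort && !$hasDigit && !in_array($token, $importantShortTokens, true)) {
            continue;
        }
        $out[] = $token;
    }

    return array_values(array_unique($out));
}

/**
 * Assumed matching rule for this sketch: a title matches only if every
 * significant title token occurs in the query.
 *
 * @param string[] $importantShortTokens
 */
function isExactTitleMatch(string $query, string $title, array $importantShortTokens): bool
{
    $queryTokens = preg_split('/\s+/u', mb_strtolower(trim($query), 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return array_diff(significantTitleTokens($title, $importantShortTokens), $queryTokens) === [];
}

$query = 'testomat 2000 th 2005 sicherheitsdatenblatt';

// Without a short-token list, "v" is ignored and the title matches by accident.
var_dump(isExactTitleMatch($query, 'Testomat 2000 V', []));         // bool(true)
// With "v" marked as important, the title no longer matches this query.
var_dump(isExactTitleMatch($query, 'Testomat 2000 V', ['v', 'c'])); // bool(false)
```

The second call is the behavior the patch is after: the title is rejected as an exact match, so the query is free to continue into the regular fusion step.
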

### 4. Config validation and docs

- `NdjsonHybridRetrieverConfig` exports `query_cleanup_profile`.
- `RetriexEffectiveConfigProvider` validates that the profile exists.
- `CONFIG_PARAMS.md` documents the new parameter.

## Not changed

- No shop query logic changed.
- No follow-up actions changed.
- No agent/prompt answer rules changed.
- No Testomat-specific PHP special-case logic added.
- No retrieval parameters such as thresholds, RRF weights or top-k changed.

## Validation in the patch build

Because the local execution environment does not provide the full set of PHP extensions and vendor dependencies, the Symfony eval command could not be run here. The following checks were run instead:

- YAML parsing for `retrieval.yaml`, `genre.yaml`, `language.yaml`
- PHP syntax check for all changed PHP files
- Local NDJSON/lexical index simulation against the provided `knowledge.zip`

For the four red baseline cases, the simulation shows the expected target hit among the top results:

- Verschneideregler -> `Testomat 2000 V`
- Chlorüberwachung -> `Testomat 2000 THCL`
- Lieferbedingungen/Versand -> `Lieferung und Versand`
- TH 2005 Sicherheitsdatenblatt -> `Testomat 2000 Indikator TH 2005`

## Recommended regression test after applying the patch

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
```

Expectation: the retrieval baseline should go from `15/19` to `19/19`. If a single semantic case still fails against the production vector/lexical index state, rebuild the knowledge index first before changing any retrieval parameters.
@@ -118,6 +118,11 @@ final class NdjsonHybridRetrieverConfig
         return $this->requiredInt('exact_document_max_chunks', 1);
     }
 
+    public function queryCleanupProfile(): string
+    {
+        return $this->requiredString('query_cleanup_profile');
+    }
+
     public function focusedProductWindow(): int
     {
         return $this->requiredInt('focused_product_window', 1);
@@ -350,6 +355,7 @@ final class NdjsonHybridRetrieverConfig
             'dominant_doc_min_hits' => $this->dominantDocMinHits(),
             'dominant_doc_max_chunks' => $this->dominantDocMaxChunks(),
             'exact_document_max_chunks' => $this->exactDocumentMaxChunks(),
+            'query_cleanup_profile' => $this->queryCleanupProfile(),
             'focused_product_window' => $this->focusedProductWindow(),
             'focused_product_min_score' => $this->focusedProductMinScore(),
             'focused_product_min_gap' => $this->focusedProductMinGap(),
@@ -49,7 +49,6 @@ final readonly class RetriexEffectiveConfigProvider
             'llm' => [
                 'timeout_seconds' => $this->param('retriex.llm.timeout_seconds'),
                 'num_predict' => $this->param('retriex.llm.num_predict'),
-                'call_models' => $this->param('retriex.llm.call_models'),
             ],
             'retrieval' => $this->retrievalConfig(),
             'prompt' => $this->promptConfig(),
@@ -86,7 +85,6 @@ final readonly class RetriexEffectiveConfigProvider
         $this->validateRuntime($config['runtime'], $errors, $warnings);
         $this->validateIndex($config['index'], $errors, $warnings);
         $this->validateModel($config['model_generation'], $errors, $warnings);
-        $this->validateLlm($config['llm'], $errors, $warnings);
         $this->validateRetrieval($config['retrieval'], $errors, $warnings);
         $this->validatePrompt($config['prompt'], $errors, $warnings);
         $this->validateAgent($config['agent'], $errors, $warnings);
@@ -1716,46 +1714,6 @@ final readonly class RetriexEffectiveConfigProvider
         }
     }
 
-    /**
-     * @param array<string, mixed> $llm
-     * @param list<string> $errors
-     * @param list<string> $warnings
-     */
-    private function validateLlm(array $llm, array &$errors, array &$warnings): void
-    {
-        $callModels = $llm['call_models'] ?? [];
-        if (!is_array($callModels)) {
-            $errors[] = 'llm.call_models must be a map.';
-            return;
-        }
-
-        $knownCalls = [
-            'input_normalization',
-            'shop_query_optimization',
-            'final_answer',
-        ];
-
-        foreach ($callModels as $callName => $modelName) {
-            if (!is_string($callName) || trim($callName) === '') {
-                $errors[] = 'llm.call_models contains an invalid call name.';
-                continue;
-            }
-
-            if (!in_array($callName, $knownCalls, true)) {
-                $warnings[] = 'llm.call_models contains an unknown call name: ' . $callName . '.';
-            }
-
-            if ($modelName !== null && !is_string($modelName)) {
-                $errors[] = 'llm.call_models.' . $callName . ' must be null or a string model name.';
-                continue;
-            }
-
-            if (is_string($modelName) && trim($modelName) === '') {
-                $warnings[] = 'llm.call_models.' . $callName . ' is empty and will use the default model.';
-            }
-        }
-    }
-
     /**
      * @param array<string, mixed> $retrieval
      * @param list<string> $errors
@@ -1782,6 +1740,13 @@ final readonly class RetriexEffectiveConfigProvider
             $errors[] = 'retrieval.generic_exact_selection_cleanup_profile references unknown language cleanup profile: ' . trim($cleanupProfile) . '.';
         }
 
+        $queryCleanupProfile = $retrieval['query_cleanup_profile'] ?? null;
+        if (!is_string($queryCleanupProfile) || trim($queryCleanupProfile) === '') {
+            $errors[] = 'retrieval.query_cleanup_profile must be a non-empty string.';
+        } elseif (!in_array(trim($queryCleanupProfile), $this->languageCleanupConfig->getCleanupProfileNames(), true)) {
+            $errors[] = 'retrieval.query_cleanup_profile references unknown language cleanup profile: ' . trim($queryCleanupProfile) . '.';
+        }
+
         $this->validateStringListMap($retrieval['vocabulary'] ?? [], 'retrieval.vocabulary', $errors, $warnings);
 
         $inventory = $retrieval['inventory_parameter'] ?? [];
@@ -357,7 +357,11 @@ final readonly class NdjsonChunkLookup
                 continue;
             }
 
-            if (mb_strlen($token, 'UTF-8') < 3 && preg_match('/\d/u', $token) !== 1) {
+            if (
+                mb_strlen($token, 'UTF-8') < 3
+                && preg_match('/\d/u', $token) !== 1
+                && !$this->isImportantShortTitleToken($token)
+            ) {
                 continue;
             }
 
@@ -367,6 +371,15 @@ final readonly class NdjsonChunkLookup
         return array_values(array_unique($out));
     }
 
+    private function isImportantShortTitleToken(string $token): bool
+    {
+        if ($token === '' || mb_strlen($token, 'UTF-8') >= 3) {
+            return false;
+        }
+
+        return in_array($token, $this->retrieverConfig->importantShortModelTokens(), true);
+    }
+
     /**
      * @return array<string,bool>
      */
@@ -5,13 +5,15 @@ declare(strict_types=1);
 namespace App\Knowledge\Retrieval;
 
 use App\Config\LanguageCleanupConfig;
+use App\Config\NdjsonHybridRetrieverConfig;
 use App\Knowledge\StopWords;
 
 final readonly class QueryCleaner
 {
     public function __construct(
         private StopWords $stopWords,
-        private LanguageCleanupConfig $languageCleanupConfig
+        private LanguageCleanupConfig $languageCleanupConfig,
+        private NdjsonHybridRetrieverConfig $retrieverConfig
     ) {
     }
 
@@ -21,9 +23,8 @@ final readonly class QueryCleaner
      * Important:
      * - Unicode-safe
      * - Numbers are preserved
-     * - Negations are preserved
-     * - No aggressive token-length filtering
-     * - Stop words are removed
+     * - Negations are preserved by protected-term aware cleanup profiles
+     * - Stop words are resolved from the generic legacy list plus YAML cleanup profile terms
      */
     public function clean(string $query): string
     {
@@ -31,49 +32,49 @@ final readonly class QueryCleaner
             return '';
         }
 
-        // 1. Convert to lowercase in a Unicode-safe way
+        $profile = $this->loadCleanupProfile();
+
+        // 1. Convert to lowercase in a Unicode-safe way.
         $query = mb_strtolower($query, 'UTF-8');
 
-        // 2. Treat hyphens and slashes as word separators
+        // 2. Treat hyphens and slashes as word separators.
         $query = $this->languageCleanupConfig->replaceWordSeparatorsWithSpace($query);
 
-        // 3. Remove special characters, but keep:
-        // - letters
-        // - numbers
-        // - other Unicode letters
+        // 3. Remove configured cleanup phrases before punctuation stripping.
+        $query = $this->removePhrases($query, $profile['phrases']);
+
+        // 4. Remove special characters, but keep letters, numbers and other Unicode letters.
         $query = preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $query);
 
         if ($query === null) {
             return '';
         }
 
-        // 4. Normalize multiple whitespace characters
+        // 5. Normalize multiple whitespace characters.
         $query = preg_replace('/\s+/u', ' ', $query);
-        $query = trim($query);
+        $query = trim((string) $query);
 
         if ($query === '') {
             return '';
         }
 
-        // 5. Tokenize the query
         $tokens = preg_split('/\s+/u', $query);
 
         if ($tokens === false) {
             return '';
         }
 
+        $profileTerms = array_fill_keys(array_merge($profile['stopwords'], $profile['meta_terms']), true);
         $cleanTokens = [];
 
         foreach ($tokens as $token) {
 
             $token = trim($token);
 
             if ($token === '') {
                 continue;
             }
 
-            // Remove stop words
-            if ($this->stopWords->isStopWord($token)) {
+            if ($this->stopWords->isStopWord($token) || isset($profileTerms[$token])) {
                 continue;
             }
 
@@ -86,4 +87,42 @@ final readonly class QueryCleaner
 
         return implode(' ', $cleanTokens);
     }
-}
+
+    /**
+     * @return array{stopwords:string[], phrases:string[], meta_terms:string[], protected_terms:string[]}
+     */
+    private function loadCleanupProfile(): array
+    {
+        return $this->languageCleanupConfig->getCleanupProfile($this->retrieverConfig->queryCleanupProfile());
+    }
+
+    /**
+     * @param string[] $phrases
+     */
+    private function removePhrases(string $query, array $phrases): string
+    {
+        foreach ($phrases as $phrase) {
+            $phrase = trim(mb_strtolower($phrase, 'UTF-8'));
+
+            if ($phrase === '') {
+                continue;
+            }
+
+            $normalizedPhrase = $this->languageCleanupConfig->replaceWordSeparatorsWithSpace($phrase);
+            $parts = preg_split('/\s+/u', $normalizedPhrase, -1, PREG_SPLIT_NO_EMPTY) ?: [];
+
+            if ($parts === []) {
+                continue;
+            }
+
+            $pattern = implode('\\s+', array_map(
+                static fn (string $part): string => preg_quote($part, '/'),
+                $parts
+            ));
+
+            $query = preg_replace('/(?<!\p{L})(?:' . $pattern . ')(?!\p{L})/u', ' ', $query) ?? $query;
+        }
+
+        return $query;
+    }
+}
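
The `removePhrases()` method added in the last hunk can be tried in isolation with the standalone variant below. It is a sketch under stated assumptions: the real method delegates separator handling to `LanguageCleanupConfig::replaceWordSeparatorsWithSpace()`, which is replaced here by a plain `str_replace()` on hyphens and slashes, and the phrase `gibt es` is an invented example, not an entry from the real cleanup profile.

```php
<?php
declare(strict_types=1);

/**
 * Standalone variant of the phrase removal from the diff.
 * Assumption: hyphens and slashes are the only word separators.
 *
 * @param string[] $phrases
 */
function removePhrases(string $query, array $phrases): string
{
    foreach ($phrases as $phrase) {
        $phrase = trim(mb_strtolower($phrase, 'UTF-8'));
        $parts = preg_split('/\s+/u', str_replace(['-', '/'], ' ', $phrase), -1, PREG_SPLIT_NO_EMPTY) ?: [];

        if ($parts === []) {
            continue;
        }

        // Quote every word, then allow any run of whitespace between the words.
        $pattern = implode('\\s+', array_map(
            static fn (string $part): string => preg_quote($part, '/'),
            $parts
        ));

        // The lookarounds keep the phrase from matching inside a longer word.
        $query = preg_replace('/(?<!\p{L})(?:' . $pattern . ')(?!\p{L})/u', ' ', $query) ?? $query;
    }

    return $query;
}

// The phrase is removed even though the query has extra spaces between its words.
$cleaned = removePhrases('gibt es   einen testomat', ['gibt es']);
echo trim(preg_replace('/\s+/u', ' ', $cleaned) ?? ''), PHP_EOL; // prints: einen testomat

// No match inside a longer word: "es" is not cut out of "testomat".
echo removePhrases('testomat', ['es']), PHP_EOL; // prints: testomat
```

Because the lookarounds test only for letters (`\p{L}`), a phrase directly followed by a digit would still match. That is one reason the method runs before punctuation stripping and tokenization rather than after.
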