Compare commits
10 Commits
77f4b4c871...64d1ec71e8
| Author | SHA1 | Date |
|---|---|---|
| | 64d1ec71e8 | |
| | 3f914c1efd | |
| | 6e2ca15e97 | |
| | 6dced1c4df | |
| | feaec9bbaf | |
| | 0d55c0a439 | |
| | 03d4a1d7c3 | |
| | 3d0092b753 | |
| | e072a8e15e | |
| | aa80acb10f | |
@@ -311,6 +311,7 @@ Important: in v1.6.0, `genre.yaml` is a central way to relieve the PHP core. Doma
 | `min_chunk_distance` | Minimum distance between selected chunks. |
 | `dominant_doc_*` | Preference for dominant documents when the hits are clear-cut. |
 | `exact_document_max_chunks` | Maximum number of chunks for exact document focus. |
+| `query_cleanup_profile` | YAML cleanup profile for the generic retrieval query cleanup. |
 | `focused_product_*` | Focused product selection in retrieval. |
 | `catalog_list_shortcut_patterns` | Direct catalog/list routes. |
 | `exact_selection_*` | Precision logic for tables, indicators, limit values, and measuring ranges. |
731
RETRIEX-EVAL-CASE-HOWTO.md
Normal file
@@ -0,0 +1,731 @@
# RetrieX How-to: Creating New Eval Cases Correctly

This how-to describes how to create new regression tests for the RetrieX eval suite through the admin area.

The goal is to lock in new red or functionally important cases permanently, without directly changing core logic, retrieval rules, or shop-query heuristics.

## Getting Started

Admin path:

```text
/admin/evals/
```

In the **„Eval-Case erstellen“** ("Create eval case") section, new cases can be created for the following types:

```text
retrieval
shop_query
followup
answer_guard
```

After saving, the case is written to the matching file:

```text
tests/evals/cases/retrieval.ndjson
tests/evals/cases/shop_query.ndjson
tests/evals/cases/followup.ndjson
tests/evals/cases/answer_guard.ndjson
```
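
The per-type file layout can be illustrated with a minimal Python sketch of an NDJSON append (one compact JSON object per line). This is not the admin service's actual code: the function name `append_case` and the record fields `id`, `prompt`, and `assert` are assumptions made for illustration, since the stored record schema is not shown here.

```python
import json
from pathlib import Path

# Eval types named in this how-to.
EVAL_TYPES = {"retrieval", "shop_query", "followup", "answer_guard"}


def append_case(cases_dir: Path, eval_type: str, case: dict) -> Path:
    """Append one case as a single JSON line to <cases_dir>/<eval_type>.ndjson.

    Hypothetical helper: the real record layout may differ.
    """
    if eval_type not in EVAL_TYPES:
        raise ValueError(f"unknown eval type: {eval_type}")
    target = cases_dir / f"{eval_type}.ndjson"
    target.parent.mkdir(parents=True, exist_ok=True)
    # NDJSON: exactly one JSON object per line, no pretty-printing.
    line = json.dumps(case, ensure_ascii=False, separators=(",", ":"))
    with target.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return target
```

Appending rather than rewriting keeps existing cases untouched, which matters when the files are tracked in version control.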

---

## Basic Rule

A good eval case checks exactly **one clear fact**.

Good:

```json
{
  "expected_query": "testomat 808",
  "must_not_include_terms": [
    "indikator",
    "300"
  ]
}
```

Less good:

```json
{
  "expected_query": "testomat 808",
  "must_include_terms": [
    "testomat",
    "808",
    "gerät",
    "preis",
    "wasserhärte"
  ],
  "must_not_include_terms": [
    "indikator",
    "300",
    "testomat 2000",
    "chlor",
    "versand"
  ]
}
```

The smaller and more unambiguous the case, the better it works as a regression test.

---

# Fields in the Admin

## 1. Eval Type

Choose the type that matches the goal of the test.

```text
retrieval    → checks whether the right RAG documents/chunks are found
shop_query   → checks which shop query is produced from a direct prompt
followup     → checks which shop query is produced from prompt + chat history
answer_guard → checks no-answer, non-hallucination, or evidence cases
```

Rule of thumb:

```text
Is the right document found?                        → retrieval
Is the right shop query generated?                  → shop_query
Does RetrieX understand the follow-up in context?   → followup
Does RetrieX invent nothing when evidence is weak?  → answer_guard
```

---

## 2. New Case ID

The case ID must be unique and may contain only the following characters:

```text
letters
digits
_
-
```

Good examples:

```text
retrieval_semantic_chlor_clt_001
shop_query_indicator_300_exact_002
followup_main_device_price_002
answer_guard_unknown_medium_001
```

Do not use:

```text
Test 1
shop query indikator 300
gerät/frage/neue-version
```

Recommended scheme:

```text
<type>_<topic>_<goal>_<number>
```

Example:

```text
followup_testomat808_device_price_001
```
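
The character rule and the naming scheme can be checked with a few lines of Python. This is a sketch under one stated assumption: "letters" is read as ASCII letters only, so umlauts are rejected. The helper names `is_valid_case_id` and `build_case_id` are hypothetical and not part of RetrieX.

```python
import re

# Assumption: only ASCII letters, digits, "_" and "-" are allowed.
CASE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_case_id(case_id: str) -> bool:
    """Return True if the ID consists only of the allowed characters."""
    return CASE_ID_PATTERN.fullmatch(case_id) is not None


def build_case_id(eval_type: str, topic: str, goal: str, number: int) -> str:
    """Compose an ID following <type>_<topic>_<goal>_<number>."""
    case_id = f"{eval_type}_{topic}_{goal}_{number:03d}"
    if not is_valid_case_id(case_id):
        raise ValueError(f"invalid case id: {case_id!r}")
    return case_id
```

Uniqueness is a separate check against the existing case files and is not covered by this sketch.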

---

## 3. Prompt

Enter exactly the user prompt that is to be tested.

Examples:

```text
welches geraet ist fuer chlorueberwachung gedacht
```

```text
was kostet der indikator
```

```text
und was kostet das gerät selber
```

```text
welcher testomat misst drachenblut
```

Enter the prompt as close as possible to how it really occurs in the chat. Typos may be included deliberately if exactly that behavior is to be safeguarded.

---

## 4. Assert-JSON

The Assert-JSON describes what the test should check.

The field must always be a valid JSON object:

```json
{
}
```

Important:

- No comments in the JSON
- No trailing commas
- Use double quotes
- The field must be an object `{ ... }`, not an array
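
These rules can be pre-checked before saving. The sketch below uses Python's standard `json` module, which already rejects comments, trailing commas, and single-quoted strings; the function name `parse_assert_json` is a hypothetical helper, not the admin form's actual validation.

```python
import json


def parse_assert_json(raw: str) -> dict:
    """Parse the Assert-JSON field and require a JSON object at top level."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers comments, trailing commas and single quotes.
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(value, dict):
        raise ValueError("Assert-JSON must be an object {...}, not an array or scalar")
    return value
```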

---

# Eval Types and Examples

## A) Retrieval Case

Retrieval cases check whether the right RAG documents or chunks are found.

### Minimal positive retrieval case

```json
{
  "min_results": 1
}
```

### Retrieval case with an expected document ID

```json
{
  "min_results": 1,
  "must_include_one_of_document_ids": [
    "DOCUMENT-ID-HERE"
  ]
}
```

### Retrieval case with several possible target documents

```json
{
  "min_results": 1,
  "must_include_one_of_document_ids": [
    "DOCUMENT-ID-1",
    "DOCUMENT-ID-2"
  ]
}
```

### Retrieval case with required terms

```json
{
  "min_results": 1,
  "must_include_any_terms": [
    "lieferung",
    "versand"
  ]
}
```

### Retrieval case with forbidden documents

```json
{
  "min_results": 1,
  "must_not_include_document_ids": [
    "WRONG-DOCUMENT-ID"
  ]
}
```

### Retrieval case for no-result / nonsense

```json
{
  "max_results": 0
}
```

### Recommended retrieval structure

```json
{
  "min_results": 1,
  "must_include_one_of_document_ids": [
    "DOCUMENT-ID-HERE"
  ],
  "must_include_any_terms": [
    "important technical term",
    "product name"
  ]
}
```
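
To make the meaning of these keys concrete, here is a hedged Python sketch of how such assertions might be evaluated against a hit list. It is not the RetrieX runner: the hit shape (`document_id`, `text`), the case-insensitive substring matching for terms, and the function name `check_retrieval` are all assumptions.

```python
def check_retrieval(assertions: dict, hits: list) -> list:
    """Return a list of error messages; an empty list means PASS."""
    errors = []
    doc_ids = {hit["document_id"] for hit in hits}
    # Assumption: terms are matched case-insensitively against all hit texts.
    text = " ".join(hit.get("text", "") for hit in hits).lower()

    if "min_results" in assertions and len(hits) < assertions["min_results"]:
        errors.append("too few results")
    if "max_results" in assertions and len(hits) > assertions["max_results"]:
        errors.append("too many results")

    wanted = assertions.get("must_include_one_of_document_ids")
    if wanted and not doc_ids.intersection(wanted):
        errors.append("none of the expected documents was found")

    forbidden = assertions.get("must_not_include_document_ids", [])
    if doc_ids.intersection(forbidden):
        errors.append("a forbidden document was found")

    any_terms = assertions.get("must_include_any_terms")
    if any_terms and not any(term.lower() in text for term in any_terms):
        errors.append("none of the required terms was found")

    return errors
```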

---

## B) Shop-Query Case

Shop-query cases check which shop query is produced from a direct prompt.

### Exact shop query

Prompt:

```text
was kostet der Testomat 808 Indikator 300
```

Assert-JSON:

```json
{
  "expected_query": "testomat 808 300 indikator"
}
```

### Shop query with required and forbidden terms

```json
{
  "must_include_terms": [
    "testomat",
    "808",
    "300",
    "indikator"
  ],
  "must_not_include_terms": [
    "300 s",
    "301",
    "302",
    "303"
  ]
}
```

### Query must not collapse to noise

```json
{
  "must_not_equal_query": "information"
}
```

### Multi-product or link follow-up with individual queries

```json
{
  "expected_individual_queries": [
    "testomat 2000 self clean",
    "testomat 2000 cal",
    "testomat 808"
  ],
  "expected_individual_queries_exact": true
}
```

### Recommendation for shop-query cases

Do not lock every case down with `expected_query` right away. While query construction is still variable, this is often the better choice:

```json
{
  "must_include_terms": [
    "testomat",
    "808",
    "sio2"
  ],
  "must_not_include_terms": [
    "gerät",
    "möchte",
    "messen"
  ]
}
```

Use `expected_query` only when the query is already stable and is deliberately meant to be exact.
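
The difference between the strict `expected_query` and the looser term lists can be sketched in Python as follows. The matching semantics are an assumption (lowercasing, whitespace normalization, plain substring checks), and `check_shop_query` is a hypothetical name; the real runner may tokenize or compare differently.

```python
def check_shop_query(assertions: dict, query: str) -> list:
    """Return a list of error messages; an empty list means PASS."""
    # Assumption: comparison is case-insensitive on whitespace-normalized text.
    normalized = " ".join(query.lower().split())
    errors = []

    expected = assertions.get("expected_query")
    if expected is not None and normalized != expected.lower():
        errors.append(f"query differs from expected_query: {normalized!r}")

    banned = assertions.get("must_not_equal_query")
    if banned is not None and normalized == banned.lower():
        errors.append(f"query collapsed to {banned!r}")

    for term in assertions.get("must_include_terms", []):
        if term.lower() not in normalized:
            errors.append(f"missing term: {term}")

    for term in assertions.get("must_not_include_terms", []):
        if term.lower() in normalized:
            errors.append(f"forbidden term present: {term}")

    return errors
```

With plain substring checks, a term such as `300` would also match inside `3000`, which is one reason to keep term lists short and specific.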

---

## C) Follow-up Case

Follow-up cases check whether RetrieX uses the conversation history correctly.

For `followup`, **History-JSON is practically mandatory**, because otherwise no real conversation history is tested.

### Example: indicator price after a conversation

Prompt:

```text
was kostet der indikator
```

History-JSON:

```json
[
  {
    "prompt": "Was ist der niedrigste Grenzwert für die Wasserhärte, welcher mit einem Testomaten überwacht werden kann?",
    "answer": "Der niedrigste Grenzwert für die Wasserhärte beträgt 0,02 °dH. Dieser Wert wird vom Testomat 808 gemessen."
  },
  {
    "prompt": "mit welchem indikator",
    "answer": "Der niedrigste messbare Grenzwert für Wasserhärte mit dem Testomat 808 wird mit dem Indikatortyp 300 erreicht."
  }
]
```

Assert-JSON:

```json
{
  "expected_query": "testomat 808 300 indikator",
  "must_include_terms": [
    "testomat",
    "808",
    "300",
    "indikator"
  ],
  "must_not_include_terms": [
    "300 s",
    "301",
    "302",
    "303",
    "testomat 2000"
  ]
}
```

### Example: switching from the indicator back to the main device

Prompt:

```text
und was kostet das gerät selber
```

History-JSON:

```json
[
  {
    "prompt": "was kostet der indikator",
    "answer": "Shop-Suche abgeschlossen. Gesendete Suchquery: testomat 808 300 indikator. Testomat® 808 Indikator 300 500 ml, Produkt-Nummer 141001. Testomat® 808 Indikator 300 2 x 100 ml, Produkt-Nummer 140001. Der zugehörige Testomat ist Testomat 808."
  }
]
```

Assert-JSON:

```json
{
  "expected_query": "testomat 808",
  "must_include_terms": [
    "testomat",
    "808"
  ],
  "must_not_include_terms": [
    "indikator",
    "300",
    "141001",
    "140001"
  ]
}
```

### Recommendation for follow-up cases

The history should contain exactly the information that the real chat had beforehand.

Not too little:

```text
Only "Indikator 300" without a device anchor can be too ambiguous.
```

Not too much:

```text
A complete, long chat history can make the case unnecessarily unstable.
```

A short, relevant excerpt works well.

---

## D) Answer-Guard Case

Answer-guard cases check that RetrieX invents nothing when given nonsense, weak evidence, or wrong associations.

### Nonsense should return no hits

Prompt:

```text
dsgfsdgfsdgf
```

Assert-JSON:

```json
{
  "max_results": 0
}
```

### An invented medium should not be answered as a real product

Prompt:

```text
welcher testomat misst drachenblut
```

Assert-JSON:

```json
{
  "must_not_include_terms": [
    "drachenblut"
  ]
}
```

### The wrong document must not be pulled in

```json
{
  "min_results": 1,
  "must_not_include_document_ids": [
    "WRONG-DOCUMENT-ID"
  ]
}
```

### Recommendation for answer-guard cases

For answer-guard cases, avoid overreacting to individual words in the full retrieval text where possible. Better options are:

```text
document IDs
clear product names
clear forbidden target terms
max_results for nonsense
```

A word appearing somewhere in the retrieval context is not automatically a functional error.

---

# Optional Field: History-JSON

History-JSON is used mainly for `followup`.

Format:

```json
[
  {
    "prompt": "previous user question",
    "answer": "previous answer or relevant excerpt"
  }
]
```

Multiple turns:

```json
[
  {
    "prompt": "first question",
    "answer": "first answer"
  },
  {
    "prompt": "second question",
    "answer": "second answer"
  }
]
```

Important:

```text
History-JSON is an array [...]
Assert-JSON is an object {...}
```
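
The array-versus-object distinction is easy to get wrong when pasting into the form, so a pre-check can help. The following Python sketch validates the History-JSON shape shown in this section; `parse_history_json` is a hypothetical helper, and requiring both `prompt` and `answer` as strings is an assumption derived from the examples.

```python
import json


def parse_history_json(raw: str) -> list:
    """Parse History-JSON and require an array of prompt/answer turns."""
    value = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(value, list):
        raise ValueError("History-JSON must be an array [...]")
    for index, turn in enumerate(value):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {index} is not an object")
        for key in ("prompt", "answer"):
            if not isinstance(turn.get(key), str):
                raise ValueError(f"turn {index} needs a string '{key}'")
    return value
```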

---

# Optional Field: Request Context Hint

This field can usually stay empty.

It is only useful when a case needs to simulate additional context that cannot be represented cleanly through the history.

Example:

```text
Visible shop results contain Testomat 808 and Testomat 808 Indikator 300.
The user asks about the device itself.
```

Recommendation:

```text
For normal regressions, prefer History-JSON.
Use Request Context Hint only for special cases.
```

---

# Complete Example: Follow-up Device Price

## Eval Type

```text
followup
```

## New Case ID

```text
followup_testomat808_main_device_price_002
```

## Prompt

```text
und was kostet das gerät selber
```

## Assert-JSON

```json
{
  "expected_query": "testomat 808",
  "must_include_terms": [
    "testomat",
    "808"
  ],
  "must_not_include_terms": [
    "indikator",
    "300",
    "141001",
    "140001"
  ]
}
```

## History-JSON

```json
[
  {
    "prompt": "was kostet der indikator",
    "answer": "Shop-Suche abgeschlossen. Gesendete Suchquery: testomat 808 300 indikator. Testomat® 808 Indikator 300 500 ml, Produkt-Nummer 141001. Testomat® 808 Indikator 300 2 x 100 ml, Produkt-Nummer 140001. Der zugehörige Testomat ist Testomat 808."
  }
]
```

## Request Context Hint

Leave empty.

---

# Checking After Saving

After saving, run the matching eval type.

In the admin:

```text
/admin/evals/
```

Or via the CLI:

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

For a single type:

```bash
php bin/console mto:agent:eval:run followup
```

---

# Practical Checklist

Check before saving:

```text
[ ] Eval type matches the goal
[ ] Case ID is unique
[ ] Case ID contains only letters, digits, _ or -
[ ] Prompt is realistic and exact
[ ] Assert-JSON is a valid JSON object
[ ] History-JSON is present for follow-up cases
[ ] History-JSON is a valid JSON array
[ ] The case checks only one clear fact
[ ] Assertions are not unnecessarily strict
[ ] After saving, the matching eval type runs green
```

---

# When to Create a New Eval Case

A new case makes sense when:

```text
a real prompt was red
an important green flow should be safeguarded permanently
a typo/noise case should stay stable
a product identity must not get lost
a wrong document association should be prevented
a no-answer situation must not hallucinate
```

No new case is needed when:

```text
only the wording of an answer was slightly different
the prompt is not functionally relevant
the expectation cannot be defined unambiguously
the case would check several independent things at once
```

---

# Guideline

As of RetrieX v1.6.2, the following applies:

```text
No new precision logic without a concrete red or functionally important eval case.
```

New optimizations should therefore follow this sequence whenever possible:

```text
1. Test the prompt
2. Assess the behavior
3. If important: create an eval case
4. Get the eval green
5. Only then change logic, YAML, or parameters
```
@@ -29,7 +29,8 @@
 "symfony/twig-bundle": "7.4.*",
 "symfony/uid": "7.4.*",
 "symfony/yaml": "^7.4",
-"ext-sqlite3": "*"
+"ext-sqlite3": "*",
+"ext-mbstring": "*"
 },
 "config": {
 "optimize-autoloader": true,
@@ -759,6 +759,15 @@ parameters:
 Grenzwert: Überwachungsbereich
 store: shop
 Indikatortyp: Indikator
+geraet: gerät analysegerät
+geraete: geräte analysegeräte
+wasserhaerte: wasserhärte
+haerte: härte
+ueberwachung: überwachung
+chlorueberwachung: chlor überwachung chlorüberwachung
+haerteueberwachung: härteüberwachung härte überwachung
+haerteueberwachungsgeraet: härteüberwachungsgerät härteüberwachung analysegerät
+lieferbedingungen: lieferung versand verkaufsbedingungen allgemeine lieferbedingungen
 accessory_focus_variants:
 origin: genre_native
 map:
@@ -1277,6 +1286,13 @@ parameters:
 - schwimmbad
 - schwimmbecken
 - pool
+- silikat
+- silikatüberwachung
+- silikatueberwachung
+- sio2
+- si o2
+- kieselsäure
+- kieselsaeure
 - 0,02
 stopword_cleanup:
 origin: genre_native
@@ -2008,6 +2024,8 @@ parameters:
 - tm
 - ph
 - rx
+- v
+- c
 family_descriptor_tokens:
 - evo
 - eco
@@ -22,6 +22,7 @@ parameters:
 dominant_doc_min_hits: 3
 dominant_doc_max_chunks: 4
 exact_document_max_chunks: 6
+query_cleanup_profile: retrieval_reference_cleanup
 focused_product_window: 8
 focused_product_min_score: 10.0
 focused_product_min_gap: 4.0
@@ -0,0 +1,37 @@
# RetrieX Patch p100b - Admin Eval Case Selection Fix

## Goal

Fixes the admin eval UX for the situation where a single case is selected and the request ends with `No eval cases selected.`

## Cause

The p100/p100a page used a free `datalist` field for case IDs that contained the cases of all eval types. As a result, a case from `shop_query` could be selected while the form was still submitting a different eval type. The admin service then searched only the case file of the submitted type and found no matching cases.

## Changes

- The free case-ID field was replaced with a filtered select.
- The case list is filtered client-side to match the selected eval type.
- When the eval type changes, a case selection that no longer matches is cleared automatically.
- The admin service is more robust: if a case ID is not found in the submitted type, it is looked up across all supported eval types and run with the correct type.
- The controller redirects to the effectively executed eval type after the run.
- The old, unclear message `No eval cases selected.` is replaced with specific error texts.
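
The fallback lookup described in the list above can be sketched in Python as follows. This is an illustration only, not the code in `EvalAdminService.php`: the function name `resolve_case_type`, the in-memory mapping of eval type to case IDs, and the error text are assumptions.

```python
EVAL_TYPES = ("retrieval", "shop_query", "followup", "answer_guard")


def resolve_case_type(case_id: str, submitted_type: str, cases_by_type: dict) -> str:
    """Return the eval type under which the case should actually be run."""
    # First choice: the type the form submitted.
    if case_id in cases_by_type.get(submitted_type, ()):
        return submitted_type
    # Fallback: search every supported type for the case ID.
    for eval_type in EVAL_TYPES:
        if case_id in cases_by_type.get(eval_type, ()):
            return eval_type
    # Specific error instead of a generic "No eval cases selected."
    raise LookupError(f"case '{case_id}' was not found in any eval type")
```

Returning the effective type also gives the controller what it needs to redirect to the type that was actually run.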

## Scope

No changes to:

- Retrieval logic
- Shop-query logic
- Follow-up logic
- Answer-guard logic
- Eval cases
- YAML configuration
- Model parameters
- Database/migrations

## Changed files

- `src/Controller/Admin/AdminEvalController.php`
- `src/Service/Admin/EvalAdminService.php`
- `templates/admin/evals/index.html.twig`
@@ -0,0 +1,45 @@
# RetrieX Patch p100c - Admin Eval Document Labels

## Goal

For retrieval and answer-guard cases, the admin eval results should show not only the technical `document_id` and `chunk_id` values but also human-readable document information, so that a found document is easier to identify in the admin area or file inventory.

## Changes

- `NdjsonHybridRetriever::retrieveDebug()` additionally returns, per debug hit:
  - `document_title`
  - `file_path`
  - `version_number`
- `RetrievalDebugRunner` additionally writes to eval reports:
  - `document_refs`: unique document overview with title, file, version, ranks, and chunk IDs
  - `result_rows`: rank-accurate hit list with title, file, chunk ID, and text preview
- The admin eval template shows this information directly in the result details:
  - table "Gefundene Dokumente" (found documents)
  - collapsible table "Treffer / Chunks anzeigen" (show hits / chunks)
  - JSON details remain available
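
How a rank-accurate hit list might be collapsed into a unique per-document overview is sketched below in Python. This is not the `RetrievalDebugRunner` implementation: the row keys `rank` and `document_id` on each row, the output keys `ranks` and `chunk_ids`, and the function name `build_document_refs` are assumptions chosen to mirror the field descriptions above.

```python
def build_document_refs(result_rows: list) -> list:
    """Group per-hit rows into one entry per document, keeping first-seen order."""
    refs = {}
    for row in result_rows:
        ref = refs.setdefault(row["document_id"], {
            "document_id": row["document_id"],
            "document_title": row.get("document_title"),
            "file_path": row.get("file_path"),
            "version_number": row.get("version_number"),
            "ranks": [],
            "chunk_ids": [],
        })
        # Collect every rank and chunk at which this document appeared.
        ref["ranks"].append(row["rank"])
        ref["chunk_ids"].append(row["chunk_id"])
    return list(refs.values())
```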

## Not changed

- No eval assertions changed
- No retrieval weights changed
- No shop-query, follow-up, or answer logic changed
- No YAML or parameter change
- No database migration

## Verification

After applying:

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run answer_guard
```

Then in the admin:

```text
/admin/evals/
```

Open a retrieval or answer-guard eval and check that the results show title and file in addition to the doc ID.
@@ -0,0 +1,44 @@
# RetrieX Patch p100d – Admin Eval Prompt Context

Status: patch-only follow-up for p100 Admin Eval UX.

## Goal

Make eval results easier to understand in the Admin UI by showing the actual case prompt directly next to the case id. For follow-up and shopquery cases, show a compact history/context preview as well.

## Changes

- Admin eval result table now displays the case prompt below the case id.
- Follow-up/shopquery eval details now include a compact history preview.
- Admin eval result table shows history/context in a collapsible section when available.

## Files changed

- `src/Eval/ShopQueryEvalRunner.php`
- `templates/admin/evals/index.html.twig`

## Non-goals

No production answer logic is changed:

- no retrieval logic changes
- no shopquery logic changes
- no follow-up logic changes
- no answer-guard logic changes
- no eval assertion changes
- no YAML or parameter changes
- no database migration

## Validation

Recommended after applying:

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Then open `/admin/evals/` and verify that each result row shows the case prompt and that follow-up/shopquery rows can reveal context/history.
75
patch_history/RETRIEX_PATCH_100_ADMIN_EVAL_UX_README.md
Normal file
@@ -0,0 +1,75 @@
# RetrieX Patch p100 - Admin Eval UX

Status: patch-only candidate
Basis: confirmed v1.6.2 + p99/p99b/p99c green eval suite

## Goal

p100 makes the eval suite introduced with p99 visible and usable in the admin, without functionally changing the production RAG, shop, prompt, scoring, or answer logic.

## Included

- New admin area `/admin/evals/`
- Overview of the eval types:
  - `retrieval`
  - `shop_query`
  - `followup`
  - `answer_guard`
- Display of the number of cases per type
- Display of the latest type-specific reports from `tests/evals/reports/<type>-last-run.json`
- Run buttons per eval type
- Form for running a complete type or a single case ID
- Detail view for PASS/FAIL, errors, and result details
- CLI reference in the admin
- Sidebar link under AI endpoints
- Link from the AI/LLM setup page to the eval suite

## Report behavior

Admin runs write two reports:

- `tests/evals/reports/<type>-last-run.json`
- `tests/evals/reports/last-run.json`

The CLI remains unchanged and continues to write the familiar `last-run.json`.
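
As an illustration of the two-report behavior for admin runs, the Python sketch below writes the same payload to both the type-specific file and the shared `last-run.json`. It is a sketch only: the function name `write_admin_reports` and the pretty-printed JSON layout are assumptions, and the real report contents are not shown here.

```python
import json
from pathlib import Path


def write_admin_reports(reports_dir: Path, eval_type: str, report: dict) -> list:
    """Write one report to <type>-last-run.json and to last-run.json."""
    reports_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(report, ensure_ascii=False, indent=2)
    targets = [
        reports_dir / f"{eval_type}-last-run.json",  # type-specific report
        reports_dir / "last-run.json",               # shared report, same name the CLI uses
    ]
    for target in targets:
        target.write_text(payload, encoding="utf-8")
    return targets
```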

## Roles

The new area is protected at controller level by `ROLE_KNOWLEDGE_ADMIN`.

## Not changed

- no retrieval weights
- no shop-query generation logic
- no follow-up logic
- no answer-guard logic
- no prompt change
- no YAML vocabulary change
- no model parameter change
- no database migration

## Changed files

- `src/Controller/Admin/AdminEvalController.php`
- `src/Service/Admin/EvalAdminService.php`
- `templates/admin/evals/index.html.twig`
- `templates/admin/base.html.twig`
- `templates/admin/model_config/list.html.twig`
- `patch_history/RETRIEX_PATCH_100_ADMIN_EVAL_UX_README.md`

## Verification after applying

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Additionally check in the browser:

- `/admin/evals/`
- Run an eval type
- Open the detail report
- Sidebar link visible for Knowledge Admins
@@ -0,0 +1,54 @@
# RetrieX Patch p101a - Admin Eval Case Creator Separate Page

## Goal

The eval case creator becomes its own admin page so that the eval suite overview stays lean and is not bloated by the full case creation form.

## New / changed admin routes

- `GET /admin/evals/` remains the focused eval suite overview for runs and reports.
- `GET /admin/evals/cases/new` shows the separate form for creating new eval cases.
- `POST /admin/evals/cases` saves new eval cases to `tests/evals/cases/<type>.ndjson`.

## UX changes

- The eval suite overview only gets a compact `Eval-Case erstellen` (create eval case) button.
- Report results get the button `Als neuen Case vorbereiten` (prepare as new case).
- For prepared cases, the new page carries over:
  - eval type
  - prompt
  - history/context, if present in the report
  - suggested assertions from the query, individual queries, or document IDs
- The actual case creation lives outside the report/run overview.

## Validation

The following are checked on save:

- CSRF token
- `ROLE_KNOWLEDGE_ADMIN`
- supported eval type
- case ID unique across all eval types
- allowed case ID format
- non-empty prompt
- valid assert JSON object
- valid history JSON list
- DTO validation via `EvalCase::fromArray()`
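
The data-level checks in this list can be sketched roughly as below. This is a Python illustration under stated assumptions, not the PHP in `EvalAdminService`: the function name, the error strings, and the exact case ID pattern are hypothetical, and the CSRF, role, and DTO checks are left out because they depend on the framework.

```python
import json
import re

SUPPORTED_TYPES = {"retrieval", "shop_query", "followup", "answer_guard"}
CASE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")  # assumed format, not confirmed by the patch notes


def _check_json(raw, expected_type, message):
    """Return [message] unless `raw` parses as JSON of the expected top-level type."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return [message]
    return [] if isinstance(value, expected_type) else [message]


def validate_new_case(eval_type, case_id, prompt, assert_json, history_json, existing_ids):
    """Return a list of error messages; an empty list means the case may be saved."""
    errors = []
    if eval_type not in SUPPORTED_TYPES:
        errors.append("unsupported eval type")
    if not CASE_ID_PATTERN.fullmatch(case_id):
        errors.append("invalid case ID format")
    elif case_id in existing_ids:  # existing_ids spans all eval types
        errors.append("case ID already exists")
    if not prompt.strip():
        errors.append("prompt must not be empty")
    errors.extend(_check_json(assert_json, dict, "assert must be a JSON object"))
    if history_json.strip():  # history is optional
        errors.extend(_check_json(history_json, list, "history must be a JSON list"))
    return errors
```

Collecting all errors instead of stopping at the first one lets a form show every problem in a single round trip; whether the real form does this is not stated in the patch notes.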

## Not changed

- No retrieval logic
- No shop query logic
- No follow-up logic
- No answer guard logic
- No eval cases
- No YAML/parameter changes
- No migration

## Affected files

- `src/Controller/Admin/AdminEvalController.php`
- `src/Service/Admin/EvalAdminService.php`
- `templates/admin/evals/index.html.twig`
- `templates/admin/evals/case_new.html.twig`
- `patch_history/RETRIEX_PATCH_101A_ADMIN_EVAL_CASE_CREATOR_PAGE_README.md`
@@ -0,0 +1,52 @@
# RetrieX Patch p101b - Admin Eval Case Help Texts

## Goal

Improves the help texts on the admin page for creating new eval cases so that less technical users also understand which values belong in which fields.

## Scope

Changed:

- `templates/admin/evals/case_new.html.twig`

New:

- `patch_history/RETRIEX_PATCH_101B_ADMIN_EVAL_CASE_HELP_TEXTS_README.md`

## Changes

- More detailed descriptions under all input fields
- Beginner-friendly explanation of the eval types
- Examples of good case IDs
- Clearer explanation of prompt vs. expected answer
- Copy-paste examples for assert JSON
- Explanation of when history JSON is needed
- Note that Request Context Hint can almost always stay empty
- Additional checklist before saving

## Not changed

- No eval logic
- No retrieval logic
- No shop query logic
- No follow-up logic
- No answer guard logic
- No existing eval cases
- No YAML or parameter changes
- No migration

## Checks

After applying the patch:

```bash
php bin/console mto:agent:config:validate
```

Then check in the admin:

- `/admin/evals/cases/new`
- Help texts are visible under all fields
- Template from a report result can still be used
- Saving a case is still possible
@@ -0,0 +1,50 @@
# RetrieX Patch p101c - Admin Eval Case Delete

## Goal

Adds a safe delete function for individual eval cases to the admin eval case management.

This allows incorrectly created or no longer needed cases to be removed directly in the admin without further bloating the eval suite overview.

## Scope

- New POST route `admin_evals_case_delete` at `/admin/evals/cases/delete`
- CSRF protection per eval type and case ID
- Role check via `ROLE_KNOWLEDGE_ADMIN`
- Removes exactly the selected case from `tests/evals/cases/<type>.ndjson`
- Aborts without any change if the NDJSON file is invalid or the case is not found
- Delete section on the separate case page `/admin/evals/cases/new`
- Confirmation dialog before deleting
- Note that the affected eval type should be re-run after deleting
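
The "remove exactly one case, otherwise abort without change" rule can be sketched as follows. This is a Python illustration, not the PHP in `EvalAdminService::deleteCase()`; the `id` field name and the error messages are assumptions.

```python
import json


def delete_case(ndjson_text: str, case_id: str) -> str:
    """Return the NDJSON text without `case_id`.

    Raises ValueError, leaving the caller's data untouched, if any line is
    invalid or the case does not exist.
    """
    kept, found = [], False
    for number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid NDJSON on line {number}") from exc
        if not isinstance(case, dict):
            raise ValueError(f"line {number} is not a JSON object")
        if case.get("id") == case_id:  # "id" is an assumed field name
            found = True
            continue
        kept.append(line)
    if not found:
        raise ValueError(f"case {case_id!r} not found")
    return "".join(line + "\n" for line in kept)
```

Because the function only returns new text after the whole input has been parsed, a caller can write the file back in one step and never ends up with a half-edited NDJSON file.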

## Not changed

- No retrieval logic
- No shop query logic
- No follow-up logic
- No answer guard logic
- No eval assertions
- No existing cases deleted automatically
- No YAML/parameter changes
- No migration

## Checks

After applying the patch:

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

In the admin:

1. Open `/admin/evals/cases/new`.
2. Create a test case or select an existing test case.
3. Click `Case löschen` (delete case).
4. Confirm the confirmation dialog.
5. Check that the case disappears from the list.
6. Re-run the affected eval type.
@@ -0,0 +1,53 @@
# RetrieX Patch p101d - Admin Eval Case Delete Hotfix

## Goal

Fixes a bug from p101c in which deleting an eval case could raise the following exception:

```text
Call to undefined method App\Service\Admin\EvalAdminService::normalizeExistingCaseId()
```

## Cause

`EvalAdminService::deleteCase()` calls a validation helper for existing case IDs. This method was referenced in p101c but never added to the service class.

## Change

Adds `normalizeExistingCaseId()` to `EvalAdminService`.

The method:

- trims the given case ID,
- rejects empty IDs,
- allows only letters, digits, underscores, and hyphens,
- returns an understandable error message for invalid IDs.
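
A minimal sketch of the four behaviours listed above, written in Python for illustration rather than as the actual PHP method. Restricting "letters" to ASCII and the wording of the error messages are assumptions.

```python
import re


def normalize_existing_case_id(raw_case_id: str) -> str:
    """Trim a case ID and reject empty or malformed values."""
    case_id = raw_case_id.strip()
    if case_id == "":
        raise ValueError("Case ID must not be empty.")
    # Assumption: only ASCII letters count as "letters".
    if re.fullmatch(r"[A-Za-z0-9_-]+", case_id) is None:
        raise ValueError("Case ID may only contain letters, digits, underscores and hyphens.")
    return case_id
```

Raising an exception is one way to surface the error message; the PHP method may report it differently.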

## Changed files

```text
src/Service/Admin/EvalAdminService.php
patch_history/RETRIEX_PATCH_101D_ADMIN_EVAL_CASE_DELETE_HOTFIX_README.md
```

## Not changed

```text
no eval logic
no retrieval logic
no shop query logic
no follow-up logic
no answer guard logic
no YAML/parameter changes
no existing eval cases
no migration
```

## Checks

```bash
php -l src/Service/Admin/EvalAdminService.php
php bin/console mto:agent:config:validate
```

Afterwards, delete an eval case in the admin.
@@ -0,0 +1,66 @@
# RetrieX Patch p101 - Admin Eval Case Creator

## Goal

p101 adds a small case creator to the existing admin eval suite so that new regression cases can be written to the matching NDJSON files directly from the admin.

The patch builds on the green p100/p100a/p100b/p100c/p100d state and changes no production RAG, shop query, follow-up, or answer logic.

## Changes

- New POST route in the admin:
  - `/admin/evals/case/create`
  - Route name: `admin_evals_case_create`
- `EvalAdminService::createCase()` for validated writing of new eval cases.
- New form on `/admin/evals/`:
  - eval type
  - case ID
  - prompt
  - assert JSON
  - optional history JSON
  - optional Request Context Hint
- Button per report result:
  - `Als neuen Case vorbereiten` (prepare as new case)
  - carries over prompt, type, history preview, query, or document ID as a template into the creator.
- JSON/ID validation before writing.
- Duplicate guard across all eval types.

## Written files

New cases are appended to the following files:

- `tests/evals/cases/retrieval.ndjson`
- `tests/evals/cases/shop_query.ndjson`
- `tests/evals/cases/followup.ndjson`
- `tests/evals/cases/answer_guard.ndjson`

## Safety / scope

Not changed:

- no retrieval weights
- no shop query logic
- no follow-up logic
- no answer guard logic
- no prompt/YAML/parameter changes
- no migration

## Manual check

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Additionally in the admin:

1. Open `/admin/evals/`.
2. Run an eval.
3. On a result, click `Als neuen Case vorbereiten`.
4. Adjust or check the case ID.
5. Check the assert JSON.
6. Save.
7. Re-run the affected eval type.
@@ -0,0 +1,79 @@
# RetrieX Patch p98 - Retrieval Eval Green Baseline

## Goal

p98 sharpens the retrieval baseline for the four eval cases that were most recently red, without introducing new product-specific or test-case-specific PHP special-case logic.

Red cases covered from `tests/evals/cases/retrieval.ndjson`:

- `welcher testomat ist ein verschneideregler`
- `welches geraet ist fuer chlorueberwachung gedacht`
- `lieferbedingungen versand testomat`
- `testomat 2000 th 2005 sicherheitsdatenblatt`

## Changes

### 1. YAML-configurable retrieval query cleanup

In addition to the existing legacy stopword set, `QueryCleaner` uses a YAML cleanup profile from `retrieval.yaml`:

```yaml
query_cleanup_profile: retrieval_reference_cleanup
```

As a result, generic question words such as `welcher` and `welches` are removed through the existing cleanup profile without writing them back into the old legacy lists.

### 2. ASCII/umlaut and meaning bridges in genre enrichment

`genre.yaml` adds conservative query enrichment rules for common ASCII spellings and compound search terms:

- `geraet` -> `gerät analysegerät`
- `chlorueberwachung` -> `chlor überwachung chlorüberwachung`
- `haerteueberwachungsgeraet` -> `härteüberwachungsgerät härteüberwachung analysegerät`
- `lieferbedingungen` -> `lieferung versand verkaufsbedingungen allgemeine lieferbedingungen`

The rules stay in the genre-specific configuration area `brands_and_canonical_terms.query_enrichment_rules`.
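
How such rules might act on a query can be sketched as below. This is a Python illustration only: the patch notes do not say whether the expansion terms are appended to or substituted into the query, so the append-and-deduplicate behaviour and the function name `enrich_query` are assumptions. The rule table itself is copied from the list above.

```python
ENRICHMENT_RULES = {
    "geraet": "gerät analysegerät",
    "chlorueberwachung": "chlor überwachung chlorüberwachung",
    "haerteueberwachungsgeraet": "härteüberwachungsgerät härteüberwachung analysegerät",
    "lieferbedingungen": "lieferung versand verkaufsbedingungen allgemeine lieferbedingungen",
}


def enrich_query(query: str, rules: dict[str, str] = ENRICHMENT_RULES) -> str:
    """Append the configured expansion terms for every query token that has a rule."""
    tokens = query.lower().split()
    seen = set(tokens)
    enriched = list(tokens)
    for token in tokens:
        for extra in rules.get(token, "").split():
            if extra not in seen:  # avoid repeating terms already in the query
                seen.add(extra)
                enriched.append(extra)
    return " ".join(enriched)


enriched_delivery = enrich_query("lieferbedingungen versand testomat")
```

Keeping the original tokens and only adding terms is the conservative choice: a query that already matched before still matches afterwards.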

### 3. Stricter exact-title fallback for short model variants

Short model/variant tokens from the retrieval vocabulary view can now count as significant in exact-title token matches.

For `Testomat 2000 V`, for example, `v` now counts as a relevant part of the title. A query such as `testomat 2000 th 2005 sicherheitsdatenblatt` therefore no longer falls back incorrectly to `Testomat 2000 V`; instead it can proceed into the normal retrieval fusion and hit the TH 2005 safety data sheets there.
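
The effect can be illustrated with a toy model of exact-title matching. Everything here is an assumption made for the sake of the example: the Python function names, the whitespace tokenization, and the rule that tokens shorter than two characters were previously ignored. Only the titles and the query come from the text above.

```python
def significant_title_tokens(title: str, variant_tokens: set[str], min_length: int = 2) -> list[str]:
    """Title tokens that must all appear in the query for an exact-title match."""
    return [
        token
        for token in title.lower().split()
        # Short tokens only count if the vocabulary marks them as model variants.
        if len(token) >= min_length or token in variant_tokens
    ]


def is_exact_title_match(query: str, title: str, variant_tokens: set[str]) -> bool:
    query_tokens = set(query.lower().split())
    required = significant_title_tokens(title, variant_tokens)
    return bool(required) and all(token in query_tokens for token in required)


query = "testomat 2000 th 2005 sicherheitsdatenblatt"
before = is_exact_title_match(query, "Testomat 2000 V", variant_tokens=set())
after = is_exact_title_match(query, "Testomat 2000 V", variant_tokens={"v"})
```

In this model, `before` is the false positive described above (the `V` is ignored, so `testomat 2000` is enough), while `after` rejects the title because the query contains no `v`.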

### 4. Config validation and docs

- `NdjsonHybridRetrieverConfig` exports `query_cleanup_profile`.
- `RetriexEffectiveConfigProvider` validates that the profile exists.
- `CONFIG_PARAMS.md` documents the new parameter.

## Not changed

- No shop query logic changed.
- No follow-up actions changed.
- No agent/prompt answer rules changed.
- No Testomat-specific PHP special-case logic added.
- No retrieval parameters such as thresholds, RRF weights, or top-k changed.

## Validation in the patch build

Because the local execution environment does not provide the full set of PHP extensions and vendor dependencies, the Symfony eval command could not be run here. The following checks were run instead:

- YAML parsing for `retrieval.yaml`, `genre.yaml`, `language.yaml`
- PHP syntax check for all changed PHP files
- local NDJSON/lexical index simulation against the provided `knowledge.zip`

For the four red baseline cases, the simulation shows the expected target hit among the top results:

- Blending controller (Verschneideregler) -> `Testomat 2000 V`
- Chlorine monitoring (Chlorüberwachung) -> `Testomat 2000 THCL`
- Delivery terms/shipping -> `Lieferung und Versand`
- TH 2005 safety data sheet -> `Testomat 2000 Indikator TH 2005`

## Recommended regression test after applying the patch

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
```

Expectation: the retrieval baseline should go from `15/19` to `19/19`. If a single semantic case still fails with the production vector/lexical index, rebuild the knowledge index first before changing any retrieval parameters.
@@ -0,0 +1,85 @@
# RetrieX Patch p99b - Eval Suite Alignment

## Goal

p99 successfully activated the new eval suite, but three new cases showed red signals after the first run. p99b separates false-positive assertions from two real robustness gaps, without restructuring the existing retrieval baseline or the shop/follow-up architecture.

## Starting point

After p99:

- `mto:agent:config:validate`: OK
- `mto:agent:eval:run retrieval`: 19/19 OK
- `mto:agent:eval:run shop_query`: 4/5 OK
- `mto:agent:eval:run followup`: 3/4 OK
- `mto:agent:eval:run answer_guard`: 3/4 OK

Red cases:

- `shop_query_sio2_anchor_001`: the normalized shop query could shrink down to `gerät`.
- `followup_main_device_price_001`: the main-device follow-up could stay stuck on the previous indicator query `testomat 808 indikator 300`.
- `answer_guard_delivery_not_sdb_001`: the assertion was too strict, because the term `Sicherheitsdatenblatt` appearing in the retrieval text is not sufficient proof of an error as long as the wrong document does not dominate.

## Changes

### 1. Protect SiO2/silicate as current input

`config/retriex/genre.yaml`

Adds the following to `shop_query_runtime.current_input_preservation_terms`:

- `silikat`
- `silikatüberwachung`
- `silikatueberwachung`
- `sio2`
- `si o2`
- `kieselsäure`
- `kieselsaeure`

As a result, a normalized standalone shop question such as `suche gerät kühlsysteme Silikatüberwachung` no longer loses the domain-specific measurement parameter before the generic device anchor rule `testomat 808 sio2` can apply.

### 2. Main-device follow-up may remove accessory remnants

`src/Agent/AgentRunner.php`

`guardMainDeviceReferentialShopQueryWithHistoryModelAnchor()` was adjusted so that a shop query such as `testomat 808 indikator 300`, given a prompt such as `und was kostet das gerät selber`, is no longer accepted merely because it already contains a model anchor.

The guard now checks whether accessory or code residual tokens remain after the model anchor. If so, the query is reduced to the pure model anchor from the history, e.g. `testomat 808`.

### 3. Make the answer guard case less brittle

`tests/evals/cases/answer_guard.ndjson`

The case `answer_guard_delivery_not_sdb_001` still checks:

- a matching delivery/shipping document must be included
- the specific safety data sheet (SDB) document must not be included

The overly broad text assertion on the term `sicherheitsdatenblatt` was removed because it can also match legitimate side notes and hint texts.

## Deliberately not changed

- No retrieval weights
- No Shopware search
- No prompt texts
- No model parameters
- No new product-specific special-case logic
- No change to the p98 retrieval eval cases

## Expected checks

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Expectation:

- Config valid
- Retrieval 19/19
- Shop query 5/5
- Follow-up 4/4
- Answer guard 4/4
@@ -0,0 +1,60 @@
# RETRIEX PATCH 99C - Main Device Follow-up Eval Alignment

Status: patch-only follow-up for p99/p99b.

## Goal

Keep the new p99 follow-up eval suite aligned with the already confirmed manual
reference flow:

1. lowest water-hardness threshold
2. indicator type
3. indicator price
4. main device price

The main-device follow-up `und was kostet das gerät selber` must resolve back to
the main device anchor (`testomat 808`) and must not keep accessory remnants such
as `indikator` or exact indicator code `300`.

## Root cause

p99b added a residual accessory guard, but the main-device history-anchor guard
returned early for non-generic shop queries before the residual check could run.
A query like `testomat 808 indikator 300` contains digits, so it was not treated
as a generic main-device query and stayed unchanged.

## Change

`AgentRunner::guardMainDeviceReferentialShopQueryWithHistoryModelAnchor()` now:

1. detects the main-device referential prompt,
2. extracts the latest history model anchor,
3. if the generated shop query already contains that model anchor, checks for
   accessory/code residuals,
4. reduces the query to the pure model anchor when such residuals are present.

This keeps explicit non-generic product queries untouched unless they contain the
current history model anchor plus accessory leftovers in a main-device follow-up.
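
The decision order of the reworked guard can be sketched as follows. This is a Python rendering for illustration, not the project's PHP: whitespace tokenization, the `accessory_tokens` set, and the boolean `query_is_generic` (standing in for `isGenericMainDeviceReferentialShopQuery()`) are simplifying assumptions, and steps 1 and 2 (prompt detection and anchor extraction) are taken as given inputs.

```python
import re


def contains_accessory_residual(query_tokens, model_tokens, accessory_tokens) -> bool:
    """True if a non-anchor token is an accessory keyword or a 1-5 digit code."""
    for token in query_tokens:
        if token in model_tokens:
            continue
        if token in accessory_tokens or re.fullmatch(r"[0-9]{1,5}", token):
            return True
    return False


def guard_main_device_query(shop_query, model_anchor, accessory_tokens, query_is_generic) -> str:
    """Return the shop query to use for a main-device follow-up."""
    if model_anchor == "":
        return shop_query  # nothing in the history to anchor on
    query_tokens = shop_query.lower().split()
    model_tokens = set(model_anchor.lower().split())
    if model_tokens.issubset(query_tokens):
        # The residual check now runs before the "generic query" early return.
        if contains_accessory_residual(query_tokens, model_tokens, accessory_tokens):
            return model_anchor
        return shop_query
    if not query_is_generic:
        return shop_query  # explicit product queries stay untouched
    return model_anchor
```

With `accessory_tokens = {"indikator"}`, the query `testomat 808 indikator 300` and the anchor `testomat 808` reduce to `testomat 808`, while an unrelated explicit query such as `testomat 2000 cal` is returned unchanged.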

## Expected eval result

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Expected:

- retrieval: 19/19
- shop_query: 5/5
- followup: 4/4
- answer_guard: 4/4

## Productive logic impact

Minimal. The patch only changes the already existing main-device follow-up guard
for prompts asking for the main device itself. It does not modify retrieval,
ranking, prompt templates, YAML vocabulary, shop result guards, or answer logic.
157
patch_history/RETRIEX_PATCH_99_EVAL_SUITE_EXPANSION_README.md
Normal file
@@ -0,0 +1,157 @@
# RetrieX Patch p99 - Eval Suite Expansion

## Goal

p99 extends the previously retrieval-only eval baseline with additional, manually known regression types from v1.6.2:

- shop query generation
- follow-up resolution with chat history
- answer/hallucination guardrails at the retrieval-evidence level

The patch deliberately changes no production RAG, retrieval, shop, prompt, or answer logic. It only adds eval infrastructure and eval cases.

## New eval types

### `shop_query`

Checks the shop search query prepared by `AgentRunner` using the shop meta output. The runner stops as soon as the first shop search meta card has been produced. This exercises the query guards, the routing/history logic, and the final shop query filters without depending on the live Shopware search.

Example:

```bash
php bin/console mto:agent:eval:run shop_query
```

Cases are located in:

```text
tests/evals/cases/shop_query.ndjson
```

Covered cases include:

- exact indicator code `Testomat 808 Indikator 300`
- brewery/brewing water query cleanup
- swimming pool typo correction
- preservation of the LAB-CL abbreviation
- SIO2 device anchor for silicate monitoring

### `followup`

Checks referential shop follow-up questions with prepared history turns. For each eval case, the history is written to an isolated temporary eval user and deleted again afterwards.

Example:

```bash
php bin/console mto:agent:eval:run followup
```

Cases are located in:

```text
tests/evals/cases/followup.ndjson
```

Covered cases include:

- `0,02 °dH -> Testomat 808 -> Indikatortyp 300 -> was kostet der indikator`
- switching from the indicator price back to the main device price
- weak shop follow-up `suche im shop nach der information` with THCL history anchor
- product link follow-up with individual queries instead of a combined multi-product query

### `answer_guard`

Checks answer guardrails before the final LLM answer on the basis of the retrieval evidence. This is intentionally not a generative LLM answer test but a stable pre-answer guard against wrong evidence or hallucination risks.

Example:

```bash
php bin/console mto:agent:eval:run answer_guard
```

Cases are located in:

```text
tests/evals/cases/answer_guard.ndjson
```

Covered cases include:

- noise prompt without evidence
- fantasy media such as Drachenblut / Mondwasser (dragon blood / moon water)
- delivery terms must not tip over into safety data sheets

## New assertion fields

### For `shop_query` and `followup`

```json
{
  "expected_query": "testomat 808 300 indikator",
  "must_include_terms": ["testomat", "808", "300", "indikator"],
  "must_not_include_terms": ["300 s", "301", "302"],
  "must_not_equal_query": "information"
}
```
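
One plausible reading of these four fields is sketched below. It is a Python illustration, not the PHP in `ShopQueryEvalRunner`: the case-insensitive comparison, the substring matching for terms, and the failure message format are assumptions.

```python
def check_shop_query(query: str, asserts: dict) -> list[str]:
    """Return failure messages for a generated shop query; empty means PASS."""
    failures = []
    normalized = query.lower().strip()

    expected = asserts.get("expected_query")
    if expected is not None and normalized != expected.lower():
        failures.append(f"expected_query: got {query!r}, expected {expected!r}")

    for term in asserts.get("must_include_terms", []):
        if term.lower() not in normalized:  # assumed: plain substring match
            failures.append(f"must_include_terms: missing {term!r}")

    for term in asserts.get("must_not_include_terms", []):
        if term.lower() in normalized:
            failures.append(f"must_not_include_terms: found {term!r}")

    forbidden = asserts.get("must_not_equal_query")
    if forbidden is not None and normalized == forbidden.lower():
        failures.append(f"must_not_equal_query: query equals {forbidden!r}")

    return failures


example_asserts = {
    "expected_query": "testomat 808 300 indikator",
    "must_include_terms": ["testomat", "808", "300", "indikator"],
    "must_not_include_terms": ["300 s", "301", "302"],
    "must_not_equal_query": "information",
}
passing = check_shop_query("testomat 808 300 indikator", example_asserts)
failing = check_shop_query("information", example_asserts)
```

Returning all failures instead of a single boolean matches the idea of a detail view that lists errors per result.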

For multi-product follow-ups:

```json
{
  "expected_individual_queries": [
    "testomat 2000 self clean",
    "testomat 2000 cal",
    "testomat 808"
  ],
  "expected_individual_queries_exact": true,
  "min_individual_queries": 3,
  "max_individual_queries": 3
}
```

### For `retrieval` and `answer_guard`

`RetrievalDebugRunner` additionally supports:

```json
{
  "must_not_include_terms": ["sicherheitsdatenblatt"],
  "must_not_match_patterns": ["/forbidden/u"]
}
```
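
The two negative assertions could be evaluated against retrieved text roughly as follows. This is a Python illustration with several assumptions: the real runner is PHP and presumably hands the delimited pattern straight to its regex engine, so the helper `pcre_to_python` exists only because Python's `re` does not accept `/body/flags` literals. Only the `i` and `u` modifiers are handled, and the patch notes do not say which text fields are searched.

```python
import re


def pcre_to_python(pattern: str) -> re.Pattern:
    """Translate a simple `/body/flags` pattern literal into a compiled Python pattern."""
    if len(pattern) < 2:
        raise ValueError("pattern too short")
    delimiter = pattern[0]
    end = pattern.rfind(delimiter)
    if end <= 0:
        raise ValueError("missing closing delimiter")
    body, modifiers = pattern[1:end], pattern[end + 1:]
    flags = 0
    for modifier in modifiers:
        if modifier == "i":
            flags |= re.IGNORECASE
        elif modifier == "u":
            pass  # Python str patterns are Unicode-aware already
        else:
            raise ValueError(f"unsupported modifier {modifier!r}")
    return re.compile(body, flags)


def check_evidence(text: str, asserts: dict) -> list[str]:
    """Return failure messages for forbidden terms or patterns found in `text`."""
    failures = []
    lowered = text.lower()
    for term in asserts.get("must_not_include_terms", []):
        if term.lower() in lowered:
            failures.append(f"must_not_include_terms: found {term!r}")
    for pattern in asserts.get("must_not_match_patterns", []):
        if pcre_to_python(pattern).search(text):
            failures.append(f"must_not_match_patterns: matched {pattern!r}")
    return failures
```

A broad term assertion like `sicherheitsdatenblatt` fires on any mention of the word, which is worth keeping in mind when choosing between a term check and a document-level check.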

## Changed files

```text
src/Command/AgentEvalRunCommand.php
src/Eval/AgentEvalRunner.php
src/Eval/AnswerGuardEvalRunner.php
src/Eval/Dto/EvalCase.php
src/Eval/RetrievalDebugRunner.php
src/Eval/ShopQueryEvalRunner.php
tests/evals/cases/answer_guard.ndjson
tests/evals/cases/followup.ndjson
tests/evals/cases/shop_query.ndjson
patch_history/RETRIEX_PATCH_99_EVAL_SUITE_EXPANSION_README.md
```

## Not changed

- No retrieval weights changed.
- No shop query production logic changed.
- No prompt rules changed.
- No YAML vocabulary rules changed.
- No LLM/model parameters changed.
- No admin/frontend logic changed.

## Recommended validation after applying the patch

```bash
php bin/console mto:agent:config:validate
php bin/console mto:agent:eval:run retrieval
php bin/console mto:agent:eval:run shop_query
php bin/console mto:agent:eval:run followup
php bin/console mto:agent:eval:run answer_guard
```

Important: `shop_query` and `followup` run through `AgentRunner` up to the shop meta card. They stop before the live shop search but, depending on the active configuration, may still attempt input normalization or shop query optimization via the configured LLM. If the LLM is not reachable, the agent's existing fallback logic takes over.
@@ -4155,7 +4155,6 @@ final readonly class AgentRunner
             $shopSearchQuery === ''
             || trim($commerceHistoryContext) === ''
             || $this->referenceAnchorExtractor->extractFirstProductModelAnchor($prompt) !== ''
-            || $this->referenceAnchorExtractor->extractFirstProductModelAnchor($shopSearchQuery) !== ''
         ) {
             return $shopSearchQuery;
         }
@@ -4164,10 +4163,6 @@ final readonly class AgentRunner
             return $shopSearchQuery;
         }
 
-        if (!$this->isGenericMainDeviceReferentialShopQuery($shopSearchQuery)) {
-            return $shopSearchQuery;
-        }
-
         $modelAnchor = $this->normalizeShopQueryAnchor(
             $this->extractLatestHistoryProductModelAnchor($commerceHistoryContext)
         );
@@ -4176,9 +4171,43 @@ final readonly class AgentRunner
             return $shopSearchQuery;
         }
 
-        return $this->queryAlreadyContainsAllAnchorTokens($shopSearchQuery, $modelAnchor)
-            ? $shopSearchQuery
-            : $modelAnchor;
+        if ($this->queryAlreadyContainsAllAnchorTokens($shopSearchQuery, $modelAnchor)) {
+            return $this->containsMainDeviceFollowUpAccessoryResidual($shopSearchQuery, $modelAnchor)
+                ? $modelAnchor
+                : $shopSearchQuery;
+        }
+
+        if (!$this->isGenericMainDeviceReferentialShopQuery($shopSearchQuery)) {
+            return $shopSearchQuery;
+        }
+
+        return $modelAnchor;
+    }
+
+    private function containsMainDeviceFollowUpAccessoryResidual(string $shopSearchQuery, string $modelAnchor): bool
+    {
+        $queryTokens = $this->tokenizeShopQueryCandidate($shopSearchQuery);
+        if ($queryTokens === []) {
+            return false;
+        }
+
+        $modelTokens = array_fill_keys($this->tokenizeShopQueryCandidate($modelAnchor), true);
+        $accessoryTokens = $this->buildShopQueryTokenSet($this->mergeUniqueStrings(
+            $this->agentRunnerConfig->getNoLlmAccessoryProductRoleKeywords(),
+            $this->agentRunnerConfig->getRequestedAccessoryCodeTerms()
+        ));
+
+        foreach ($queryTokens as $token) {
+            if (isset($modelTokens[$token])) {
+                continue;
+            }
+
+            if (isset($accessoryTokens[$token]) || preg_match('/^\d{1,5}$/u', $token) === 1) {
+                return true;
+            }
+        }
+
+        return false;
     }
 
     private function guardWeakReferentialShopQueryWithHistoryModelAnchor(
@@ -37,7 +37,7 @@ final class AgentEvalRunCommand extends Command
             ->addArgument(
                 'type',
                 InputArgument::OPTIONAL,
-                'Eval type to run',
+                'Eval type to run (retrieval, shop_query, followup, answer_guard)',
                 'retrieval'
             )
             ->addOption(
@@ -118,6 +118,11 @@ final class NdjsonHybridRetrieverConfig
         return $this->requiredInt('exact_document_max_chunks', 1);
     }

+    public function queryCleanupProfile(): string
+    {
+        return $this->requiredString('query_cleanup_profile');
+    }
+
     public function focusedProductWindow(): int
     {
         return $this->requiredInt('focused_product_window', 1);
@@ -350,6 +355,7 @@ final class NdjsonHybridRetrieverConfig
             'dominant_doc_min_hits' => $this->dominantDocMinHits(),
             'dominant_doc_max_chunks' => $this->dominantDocMaxChunks(),
             'exact_document_max_chunks' => $this->exactDocumentMaxChunks(),
+            'query_cleanup_profile' => $this->queryCleanupProfile(),
             'focused_product_window' => $this->focusedProductWindow(),
             'focused_product_min_score' => $this->focusedProductMinScore(),
             'focused_product_min_gap' => $this->focusedProductMinGap(),
@@ -49,7 +49,6 @@ final readonly class RetriexEffectiveConfigProvider
             'llm' => [
                 'timeout_seconds' => $this->param('retriex.llm.timeout_seconds'),
                 'num_predict' => $this->param('retriex.llm.num_predict'),
-                'call_models' => $this->param('retriex.llm.call_models'),
             ],
             'retrieval' => $this->retrievalConfig(),
             'prompt' => $this->promptConfig(),
@@ -86,7 +85,6 @@ final readonly class RetriexEffectiveConfigProvider
         $this->validateRuntime($config['runtime'], $errors, $warnings);
         $this->validateIndex($config['index'], $errors, $warnings);
         $this->validateModel($config['model_generation'], $errors, $warnings);
-        $this->validateLlm($config['llm'], $errors, $warnings);
         $this->validateRetrieval($config['retrieval'], $errors, $warnings);
         $this->validatePrompt($config['prompt'], $errors, $warnings);
         $this->validateAgent($config['agent'], $errors, $warnings);
@@ -1716,46 +1714,6 @@ final readonly class RetriexEffectiveConfigProvider
         }
     }

-    /**
-     * @param array<string, mixed> $llm
-     * @param list<string> $errors
-     * @param list<string> $warnings
-     */
-    private function validateLlm(array $llm, array &$errors, array &$warnings): void
-    {
-        $callModels = $llm['call_models'] ?? [];
-        if (!is_array($callModels)) {
-            $errors[] = 'llm.call_models must be a map.';
-            return;
-        }
-
-        $knownCalls = [
-            'input_normalization',
-            'shop_query_optimization',
-            'final_answer',
-        ];
-
-        foreach ($callModels as $callName => $modelName) {
-            if (!is_string($callName) || trim($callName) === '') {
-                $errors[] = 'llm.call_models contains an invalid call name.';
-                continue;
-            }
-
-            if (!in_array($callName, $knownCalls, true)) {
-                $warnings[] = 'llm.call_models contains an unknown call name: ' . $callName . '.';
-            }
-
-            if ($modelName !== null && !is_string($modelName)) {
-                $errors[] = 'llm.call_models.' . $callName . ' must be null or a string model name.';
-                continue;
-            }
-
-            if (is_string($modelName) && trim($modelName) === '') {
-                $warnings[] = 'llm.call_models.' . $callName . ' is empty and will use the default model.';
-            }
-        }
-    }
-
     /**
      * @param array<string, mixed> $retrieval
      * @param list<string> $errors
@@ -1782,6 +1740,13 @@ final readonly class RetriexEffectiveConfigProvider
             $errors[] = 'retrieval.generic_exact_selection_cleanup_profile references unknown language cleanup profile: ' . trim($cleanupProfile) . '.';
         }

+        $queryCleanupProfile = $retrieval['query_cleanup_profile'] ?? null;
+        if (!is_string($queryCleanupProfile) || trim($queryCleanupProfile) === '') {
+            $errors[] = 'retrieval.query_cleanup_profile must be a non-empty string.';
+        } elseif (!in_array(trim($queryCleanupProfile), $this->languageCleanupConfig->getCleanupProfileNames(), true)) {
+            $errors[] = 'retrieval.query_cleanup_profile references unknown language cleanup profile: ' . trim($queryCleanupProfile) . '.';
+        }
+
         $this->validateStringListMap($retrieval['vocabulary'] ?? [], 'retrieval.vocabulary', $errors, $warnings);

         $inventory = $retrieval['inventory_parameter'] ?? [];
192
src/Controller/Admin/AdminEvalController.php
Normal file
@@ -0,0 +1,192 @@
+<?php
+
+declare(strict_types=1);
+
+namespace App\Controller\Admin;
+
+use App\Security\ApplicationRoles;
+use App\Service\Admin\EvalAdminService;
+use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
+use Symfony\Component\HttpFoundation\Request;
+use Symfony\Component\HttpFoundation\Response;
+use Symfony\Component\Routing\Attribute\Route;
+
+#[Route('/admin/evals')]
+final class AdminEvalController extends AbstractController
+{
+    #[Route('/', name: 'admin_evals_index', methods: ['GET'])]
+    public function index(Request $request, EvalAdminService $evals): Response
+    {
+        $this->denyAccessUnlessGranted(ApplicationRoles::ROLE_KNOWLEDGE_ADMIN);
+
+        $selectedType = trim((string) $request->query->get('type', ''));
+        if ($selectedType === '' || !in_array($selectedType, $evals->supportedTypeNames(), true)) {
+            $selectedType = 'retrieval';
+        }
+
+        return $this->render('admin/evals/index.html.twig', [
+            'types' => $evals->supportedTypes(),
+            'overview' => $evals->overview(),
+            'cases_by_type' => $evals->casesByType(),
+            'selected_type' => $selectedType,
+            'selected_report' => $evals->readTypeReport($selectedType),
+            'last_report' => $evals->readLastReport(),
+        ]);
+    }
+
+    #[Route('/run', name: 'admin_evals_run', methods: ['POST'])]
+    public function run(Request $request, EvalAdminService $evals): Response
+    {
+        $this->denyAccessUnlessGranted(ApplicationRoles::ROLE_KNOWLEDGE_ADMIN);
+
+        if (!$this->isCsrfTokenValid('admin_eval_run', (string) $request->request->get('_token'))) {
+            throw $this->createAccessDeniedException();
+        }
+
+        $type = trim((string) $request->request->get('type', 'retrieval'));
+        $caseId = trim((string) $request->request->get('case_id', ''));
+
+        try {
+            $report = $evals->run($type, $caseId !== '' ? $caseId : null);
+            $type = trim((string) ($report['type'] ?? $type));
+
+            $this->addFlash(
+                ((int) ($report['failed'] ?? 0)) === 0 ? 'success' : 'danger',
+                sprintf(
+                    'Eval %s abgeschlossen: %d/%d bestanden.',
+                    $type,
+                    (int) ($report['passed'] ?? 0),
+                    (int) ($report['total'] ?? 0)
+                )
+            );
+        } catch (\Throwable $e) {
+            $this->addFlash('danger', $e->getMessage());
+        }
+
+        return $this->redirectToRoute('admin_evals_index', [
+            'type' => $type,
+        ]);
+    }
+
+    #[Route('/cases/new', name: 'admin_evals_case_new', methods: ['GET'])]
+    public function newCase(Request $request, EvalAdminService $evals): Response
+    {
+        $this->denyAccessUnlessGranted(ApplicationRoles::ROLE_KNOWLEDGE_ADMIN);
+
+        $type = trim((string) $request->query->get('type', 'retrieval'));
+        if (!in_array($type, $evals->supportedTypeNames(), true)) {
+            $type = 'retrieval';
+        }
+
+        $sourceType = trim((string) $request->query->get('source_type', ''));
+        $sourceCaseId = trim((string) $request->query->get('source_case_id', ''));
+
+        try {
+            $draft = $sourceType !== '' && $sourceCaseId !== ''
+                ? $evals->caseDraftFromReportResult($sourceType, $sourceCaseId)
+                : $evals->emptyCaseDraft($type);
+        } catch (\Throwable $e) {
+            $this->addFlash('warning', $e->getMessage());
+            $draft = $evals->emptyCaseDraft($type);
+        }
+
+        return $this->render('admin/evals/case_new.html.twig', [
+            'types' => $evals->supportedTypes(),
+            'cases_by_type' => $evals->casesByType(),
+            'case_draft' => $draft,
+        ]);
+    }
+
+    #[Route('/cases', name: 'admin_evals_case_create', methods: ['POST'])]
+    public function createCase(Request $request, EvalAdminService $evals): Response
+    {
+        $this->denyAccessUnlessGranted(ApplicationRoles::ROLE_KNOWLEDGE_ADMIN);
+
+        if (!$this->isCsrfTokenValid('admin_eval_case_create', (string) $request->request->get('_token'))) {
+            throw $this->createAccessDeniedException();
+        }
+
+        $type = trim((string) $request->request->get('type', 'retrieval'));
+        $draft = [
+            'type' => $type,
+            'id' => (string) $request->request->get('id', ''),
+            'prompt' => (string) $request->request->get('prompt', ''),
+            'assert_json' => (string) $request->request->get('assert_json', ''),
+            'history_json' => (string) $request->request->get('history_json', ''),
+            'request_context_hint' => (string) $request->request->get('request_context_hint', ''),
+            'source_label' => '',
+        ];
+
+        try {
+            $created = $evals->createCase(
+                type: $type,
+                id: (string) $request->request->get('id', ''),
+                prompt: (string) $request->request->get('prompt', ''),
+                assertJson: (string) $request->request->get('assert_json', ''),
+                historyJson: (string) $request->request->get('history_json', ''),
+                requestContextHint: (string) $request->request->get('request_context_hint', ''),
+            );
+
+            $type = (string) ($created['type'] ?? $type);
+
+            $this->addFlash(
+                'success',
+                sprintf('Eval-Case "%s" wurde in %s.ndjson gespeichert.', (string) ($created['id'] ?? ''), $type)
+            );
+
+            return $this->redirectToRoute('admin_evals_index', [
+                'type' => $type,
+            ]);
+        } catch (\Throwable $e) {
+            $this->addFlash('danger', $e->getMessage());
+        }
+
+        if (!in_array($type, $evals->supportedTypeNames(), true)) {
+            $draft['type'] = 'retrieval';
+        }
+
+        return $this->render('admin/evals/case_new.html.twig', [
+            'types' => $evals->supportedTypes(),
+            'cases_by_type' => $evals->casesByType(),
+            'case_draft' => $draft,
+        ], new Response('', Response::HTTP_UNPROCESSABLE_ENTITY));
+    }
+
+
+    #[Route('/cases/delete', name: 'admin_evals_case_delete', methods: ['POST'])]
+    public function deleteCase(Request $request, EvalAdminService $evals): Response
+    {
+        $this->denyAccessUnlessGranted(ApplicationRoles::ROLE_KNOWLEDGE_ADMIN);
+
+        $type = trim((string) $request->request->get('type', 'retrieval'));
+        $caseId = trim((string) $request->request->get('case_id', ''));
+
+        if (!$this->isCsrfTokenValid(
+            sprintf('admin_eval_case_delete_%s_%s', $type, $caseId),
+            (string) $request->request->get('_token')
+        )) {
+            throw $this->createAccessDeniedException();
+        }
+
+        try {
+            $deleted = $evals->deleteCase($type, $caseId);
+            $type = (string) ($deleted['type'] ?? $type);
+
+            $this->addFlash(
+                'success',
+                sprintf('Eval-Case "%s" wurde aus %s.ndjson entfernt.', (string) ($deleted['id'] ?? $caseId), $type)
+            );
+        } catch (\Throwable $e) {
+            $this->addFlash('danger', $e->getMessage());
+        }
+
+        if (!in_array($type, $evals->supportedTypeNames(), true)) {
+            $type = 'retrieval';
+        }
+
+        return $this->redirectToRoute('admin_evals_case_new', [
+            'type' => $type,
+        ]);
+    }
+
+}
@@ -11,6 +11,8 @@ final readonly class AgentEvalRunner
 {
     public function __construct(
         private RetrievalDebugRunner $retrievalDebugRunner,
+        private ShopQueryEvalRunner $shopQueryEvalRunner,
+        private AnswerGuardEvalRunner $answerGuardEvalRunner,
     ) {
     }

@@ -20,6 +22,14 @@ final readonly class AgentEvalRunner
             return $this->retrievalDebugRunner->run($case);
         }

+        if ($case->isShopQueryCase() || $case->isFollowUpCase()) {
+            return $this->shopQueryEvalRunner->run($case);
+        }
+
+        if ($case->isAnswerGuardCase()) {
+            return $this->answerGuardEvalRunner->run($case);
+        }
+
         throw new \InvalidArgumentException(sprintf(
             'Unsupported eval case type: %s',
             $case->type
32
src/Eval/AnswerGuardEvalRunner.php
Normal file
@@ -0,0 +1,32 @@
+<?php
+
+declare(strict_types=1);
+
+namespace App\Eval;
+
+use App\Eval\Dto\EvalCase;
+use App\Eval\Dto\EvalResult;
+
+final readonly class AnswerGuardEvalRunner
+{
+    public function __construct(
+        private RetrievalDebugRunner $retrievalDebugRunner,
+    ) {
+    }
+
+    public function run(EvalCase $case): EvalResult
+    {
+        $result = $this->retrievalDebugRunner->run($case);
+        $details = $result->details;
+        $details['guard_scope'] = 'retrieval_evidence_pre_answer';
+
+        return new EvalResult(
+            caseId: $result->caseId,
+            type: $case->type,
+            passed: $result->passed,
+            durationMs: $result->durationMs,
+            failures: $result->failures,
+            details: $details,
+        );
+    }
+}
@@ -8,12 +8,15 @@ final readonly class EvalCase
 {
     /**
      * @param array<string, mixed> $assert
+     * @param array<int, array{prompt:string,answer:string}> $history
      */
     public function __construct(
         public string $id,
         public string $type,
         public string $prompt,
         public array $assert = [],
+        public array $history = [],
+        public string $requestContextHint = '',
     ) {
     }

@@ -26,6 +29,8 @@ final readonly class EvalCase
         $type = trim((string) ($row['type'] ?? ''));
         $prompt = trim((string) ($row['prompt'] ?? ''));
         $assert = is_array($row['assert'] ?? null) ? $row['assert'] : [];
+        $history = self::normalizeHistory($row['history'] ?? []);
+        $requestContextHint = trim((string) ($row['request_context_hint'] ?? ''));

         if ($id === '') {
             throw new \InvalidArgumentException('Eval case id must not be empty.');
@@ -50,6 +55,8 @@ final readonly class EvalCase
             type: $type,
             prompt: $prompt,
             assert: $assert,
+            history: $history,
+            requestContextHint: $requestContextHint,
         );
     }

@@ -57,4 +64,64 @@ final readonly class EvalCase
     {
         return $this->type === 'retrieval';
     }
+
+    public function isShopQueryCase(): bool
+    {
+        return $this->type === 'shop_query';
+    }
+
+    public function isFollowUpCase(): bool
+    {
+        return $this->type === 'followup';
+    }
+
+    public function isAnswerGuardCase(): bool
+    {
+        return $this->type === 'answer_guard';
+    }
+
+    /**
+     * @return array<int, array{prompt:string,answer:string}>
+     */
+    private static function normalizeHistory(mixed $value): array
+    {
+        if (!is_array($value)) {
+            return [];
+        }
+
+        $history = [];
+
+        foreach ($value as $entry) {
+            if (is_string($entry)) {
+                $entry = trim($entry);
+
+                if ($entry !== '') {
+                    $history[] = [
+                        'prompt' => 'Eval-Kontext',
+                        'answer' => $entry,
+                    ];
+                }
+
+                continue;
+            }
+
+            if (!is_array($entry)) {
+                continue;
+            }
+
+            $prompt = trim((string) ($entry['prompt'] ?? ''));
+            $answer = trim((string) ($entry['answer'] ?? $entry['response'] ?? ''));
+
+            if ($prompt === '' && $answer === '') {
+                continue;
+            }
+
+            $history[] = [
+                'prompt' => $prompt !== '' ? $prompt : 'Eval-Kontext',
+                'answer' => $answer,
+            ];
+        }
+
+        return $history;
+    }
 }
@@ -33,6 +33,8 @@ final readonly class RetrievalDebugRunner

         $documentIds = $this->extractUniqueStringValues($rows, 'document_id');
         $chunkIds = $this->extractUniqueStringValues($rows, 'chunk_id');
+        $documentRefs = $this->buildDocumentRefs($rows);
+        $resultRows = $this->buildResultRows($rows);
         $joinedText = $this->extractJoinedText($rows);

         $assert = $case->assert;
@@ -187,6 +189,25 @@ final readonly class RetrievalDebugRunner
             }
         }

+        $forbiddenTerms = $this->normalizeStringList($assert['must_not_include_terms'] ?? []);
+        foreach ($forbiddenTerms as $forbiddenTerm) {
+            if ($this->containsTerm($joinedText, $forbiddenTerm)) {
+                $failures[] = sprintf(
+                    'forbidden term "%s" was present in the retrieval text.',
+                    $forbiddenTerm
+                );
+            }
+        }
+
+        foreach ($this->normalizeStringList($assert['must_not_match_patterns'] ?? []) as $pattern) {
+            if (@preg_match($pattern, $joinedText) === 1) {
+                $failures[] = sprintf(
+                    'forbidden pattern "%s" matched the retrieval text.',
+                    $pattern
+                );
+            }
+        }
+
         return new EvalResult(
             caseId: $case->id,
             type: $case->type,
@@ -201,8 +222,11 @@ final readonly class RetrievalDebugRunner
                 'intent' => $intent,
                 'document_ids' => $documentIds,
                 'chunk_ids' => $chunkIds,
+                'document_refs' => $documentRefs,
+                'result_rows' => $resultRows,
                 'matched_any_terms' => $matchedAnyTerms,
                 'matched_all_terms' => $matchedAllTerms,
+                'forbidden_terms_checked' => $this->normalizeStringList($assert['must_not_include_terms'] ?? []),
             ],
         );
     }
@@ -248,6 +272,122 @@ final readonly class RetrievalDebugRunner
         return array_keys($values);
     }

+    /**
+     * @param array<int, array<string, mixed>> $rows
+     * @return array<int, array{id:string,title:string,file_path:string,version_number:string,chunk_ids:array<int,string>,ranks:array<int,int>}>
+     */
+    private function buildDocumentRefs(array $rows): array
+    {
+        $refs = [];
+
+        foreach ($rows as $row) {
+            $documentId = $this->extractNullableString($row, 'document_id');
+
+            if ($documentId === '') {
+                continue;
+            }
+
+            if (!isset($refs[$documentId])) {
+                $refs[$documentId] = [
+                    'id' => $documentId,
+                    'title' => $this->extractNullableString($row, 'document_title'),
+                    'file_path' => $this->extractNullableString($row, 'file_path'),
+                    'version_number' => $this->extractNullableString($row, 'version_number'),
+                    'chunk_ids' => [],
+                    'ranks' => [],
+                ];
+            }
+
+            $chunkId = $this->extractNullableString($row, 'chunk_id');
+            if ($chunkId !== '' && !in_array($chunkId, $refs[$documentId]['chunk_ids'], true)) {
+                $refs[$documentId]['chunk_ids'][] = $chunkId;
+            }
+
+            $rank = $this->extractNullableInt($row, 'rank');
+            if ($rank !== null && !in_array($rank, $refs[$documentId]['ranks'], true)) {
+                $refs[$documentId]['ranks'][] = $rank;
+            }
+        }
+
+        return array_values($refs);
+    }
+
+    /**
+     * @param array<int, array<string, mixed>> $rows
+     * @return array<int, array<string, mixed>>
+     */
+    private function buildResultRows(array $rows): array
+    {
+        $out = [];
+
+        foreach ($rows as $row) {
+            $out[] = [
+                'rank' => $this->extractNullableInt($row, 'rank'),
+                'document_id' => $this->extractNullableString($row, 'document_id'),
+                'document_title' => $this->extractNullableString($row, 'document_title'),
+                'file_path' => $this->extractNullableString($row, 'file_path'),
+                'chunk_id' => $this->extractNullableString($row, 'chunk_id'),
+                'chunk_index' => $this->extractNullableInt($row, 'chunk_index'),
+                'raw_score' => $row['raw_score'] ?? null,
+                'rrf_score' => $row['rrf_score'] ?? null,
+                'text_preview' => $this->previewText($this->extractNullableString($row, 'text')),
+            ];
+        }
+
+        return $out;
+    }
+
+    /**
+     * @param array<string, mixed> $row
+     */
+    private function extractNullableString(array $row, string $key): string
+    {
+        $value = $row[$key] ?? null;
+
+        if ($value === null || is_array($value) || is_object($value)) {
+            return '';
+        }
+
+        return trim((string)$value);
+    }
+
+    /**
+     * @param array<string, mixed> $row
+     */
+    private function extractNullableInt(array $row, string $key): ?int
+    {
+        $value = $row[$key] ?? null;
+
+        if ($value === null || $value === '') {
+            return null;
+        }
+
+        if (is_int($value)) {
+            return $value;
+        }
+
+        if (is_string($value) && preg_match('/^-?\d+$/', trim($value)) === 1) {
+            return (int)$value;
+        }
+
+        return null;
+    }
+
+    private function previewText(string $text, int $limit = 240): string
+    {
+        $text = preg_replace('/\s+/u', ' ', trim($text)) ?? trim($text);
+
+        if ($text === '') {
+            return '';
+        }
+
+        if (mb_strlen($text, 'UTF-8') <= $limit) {
+            return $text;
+        }
+
+        return mb_substr($text, 0, $limit, 'UTF-8') . '...';
+    }
+
     /**
      * @param array<int, array<string, mixed>> $rows
      */
389
src/Eval/ShopQueryEvalRunner.php
Normal file
@@ -0,0 +1,389 @@
+<?php
+
+declare(strict_types=1);
+
+namespace App\Eval;
+
+use App\Agent\AgentRunner;
+use App\Context\ContextService;
+use App\Eval\Dto\EvalCase;
+use App\Eval\Dto\EvalResult;
+
+final readonly class ShopQueryEvalRunner
+{
+    public function __construct(
+        private AgentRunner $agentRunner,
+        private ContextService $contextService,
+    ) {
+    }
+
+    public function run(EvalCase $case): EvalResult
+    {
+        $start = microtime(true);
+        $failures = [];
+        $userId = $this->buildUserId($case);
+        $transcript = '';
+        $shopMeta = null;
+
+        $this->contextService->deleteHistory($userId);
+        $this->seedHistory($userId, $case->history);
+
+        try {
+            foreach ($this->agentRunner->run($case->prompt, $userId, false, $case->requestContextHint) as $chunk) {
+                if (!is_string($chunk) || $chunk === '') {
+                    continue;
+                }
+
+                $transcript .= $chunk . "\n";
+
+                if (!str_contains($chunk, 'retriex-shop-meta')) {
+                    if (mb_strlen($transcript, 'UTF-8') > 120000) {
+                        $transcript = mb_substr($transcript, -120000, null, 'UTF-8');
+                    }
+                    continue;
+                }
+
+                $shopMeta = $this->extractShopMeta($chunk);
+                break;
+            }
+        } catch (\Throwable $e) {
+            $failures[] = sprintf('agent run failed before shop-query meta was emitted: %s', $e->getMessage());
+        } finally {
+            $this->contextService->deleteHistory($userId);
+        }
+
+        $durationMs = round((microtime(true) - $start) * 1000, 2);
+
+        if ($shopMeta === null) {
+            $failures[] = 'no shop-query meta message was emitted before the runner stopped.';
+            $shopMeta = [
+                'query' => '',
+                'individual_queries' => [],
+                'raw_html' => '',
+            ];
+        }
+
+        $this->assertShopQuery($failures, $case, $shopMeta);
+
+        return new EvalResult(
+            caseId: $case->id,
+            type: $case->type,
+            passed: $failures === [],
+            durationMs: $durationMs,
+            failures: $failures,
+            details: [
+                'prompt' => $case->prompt,
+                'history_turns' => count($case->history),
+                'history' => $this->buildHistoryPreview($case->history),
+                'has_request_context_hint' => $case->requestContextHint !== '',
+                'query' => $shopMeta['query'],
+                'individual_queries' => $shopMeta['individual_queries'],
+                'transcript_preview' => $this->previewText($transcript),
+            ],
+        );
+    }
+
+    /**
+     * @param array<int, array{prompt:string,answer:string}> $history
+     * @return array<int, array{prompt:string,answer_preview:string}>
+     */
+    private function buildHistoryPreview(array $history): array
+    {
+        $preview = [];
+
+        foreach ($history as $turn) {
+            $prompt = trim((string) ($turn['prompt'] ?? ''));
+            $answer = trim((string) ($turn['answer'] ?? ''));
+
+            if ($prompt === '' && $answer === '') {
+                continue;
+            }
+
+            $preview[] = [
+                'prompt' => $prompt !== '' ? $prompt : 'Eval-Kontext',
+                'answer_preview' => $this->previewText($answer, 260),
+            ];
+        }
+
+        return $preview;
+    }
+
+    private function buildUserId(EvalCase $case): string
+    {
+        $safeId = preg_replace('/[^a-zA-Z0-9_-]+/', '_', $case->id) ?? $case->id;
+        $safeId = trim($safeId, '_');
+
+        return 'eval_' . ($safeId !== '' ? $safeId : sha1($case->id));
+    }
+
+    /**
+     * @param array<int, array{prompt:string,answer:string}> $history
+     */
+    private function seedHistory(string $userId, array $history): void
+    {
+        foreach ($history as $turn) {
+            $prompt = trim($turn['prompt'] ?? '');
+            $answer = trim($turn['answer'] ?? '');
+
+            if ($prompt === '' && $answer === '') {
+                continue;
+            }
+
+            if ($prompt === '') {
+                $prompt = 'Eval-Kontext';
+            }
+
+            $this->contextService->appendHistory($userId, $prompt, $answer);
+        }
+    }
+
+    /**
+     * @return array{query:string,individual_queries:array<int,string>,raw_html:string}
+     */
+    private function extractShopMeta(string $html): array
+    {
+        $isMultiQuery = str_contains($html, 'retriex-meta-query--multi');
+        $codes = [];
+
+        if (preg_match_all('/<code>(.*?)<\/code>/su', $html, $matches) !== false) {
+            foreach ($matches[1] ?? [] as $value) {
+                $decoded = html_entity_decode(strip_tags((string) $value), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
+                $decoded = $this->normalizeOneLine($decoded);
+
+                if ($decoded !== '') {
+                    $codes[] = $decoded;
+                }
+            }
+        }
+
+        $codes = array_values(array_unique($codes));
+
+        if ($isMultiQuery) {
+            return [
+                'query' => '',
+                'individual_queries' => $codes,
+                'raw_html' => $html,
+            ];
+        }
+
+        return [
+            'query' => $codes[0] ?? '',
+            'individual_queries' => [],
+            'raw_html' => $html,
+        ];
+    }
+
+    /**
+     * @param array<int, string> $failures
+     * @param array{query:string,individual_queries:array<int,string>,raw_html:string} $shopMeta
+     */
+    private function assertShopQuery(array &$failures, EvalCase $case, array $shopMeta): void
+    {
+        $assert = $case->assert;
+        $query = $shopMeta['query'];
+        $individualQueries = $shopMeta['individual_queries'];
+        $joined = trim($query . ' ' . implode(' ', $individualQueries));
+
+        $expectedQuery = $this->stringOrNull($assert['expected_query'] ?? null);
+        if ($expectedQuery !== null && $this->normalizeQuery($query) !== $this->normalizeQuery($expectedQuery)) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'shop query mismatch: expected "%s", got "%s".',
|
||||||
|
$expectedQuery,
|
||||||
|
$query
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
$forbiddenExactQuery = $this->stringOrNull($assert['must_not_equal_query'] ?? null);
|
||||||
|
if ($forbiddenExactQuery !== null && $this->normalizeQuery($query) === $this->normalizeQuery($forbiddenExactQuery)) {
|
||||||
|
$failures[] = sprintf('shop query must not equal "%s".', $forbiddenExactQuery);
|
||||||
|
}
|
||||||
|
|
||||||
|
$expectedIndividualQueries = $this->normalizeStringList($assert['expected_individual_queries'] ?? []);
|
||||||
|
if ($expectedIndividualQueries !== []) {
|
||||||
|
foreach ($expectedIndividualQueries as $expectedIndividualQuery) {
|
||||||
|
if (!$this->containsNormalizedQuery($individualQueries, $expectedIndividualQuery)) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'missing expected individual shop query "%s". Got [%s].',
|
||||||
|
$expectedIndividualQuery,
|
||||||
|
implode(', ', $individualQueries)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (($assert['expected_individual_queries_exact'] ?? false) === true) {
|
||||||
|
$expected = array_map(fn(string $value): string => $this->normalizeQuery($value), $expectedIndividualQueries);
|
||||||
|
$actual = array_map(fn(string $value): string => $this->normalizeQuery($value), $individualQueries);
|
||||||
|
|
||||||
|
sort($expected);
|
||||||
|
sort($actual);
|
||||||
|
|
||||||
|
if ($expected !== $actual) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'individual shop queries differ from expected exact set. Expected [%s], got [%s].',
|
||||||
|
implode(', ', $expectedIndividualQueries),
|
||||||
|
implode(', ', $individualQueries)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (isset($assert['min_individual_queries']) && count($individualQueries) < (int) $assert['min_individual_queries']) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'too few individual shop queries: expected >= %d, got %d.',
|
||||||
|
(int) $assert['min_individual_queries'],
|
||||||
|
count($individualQueries)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (isset($assert['max_individual_queries']) && count($individualQueries) > (int) $assert['max_individual_queries']) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'too many individual shop queries: expected <= %d, got %d.',
|
||||||
|
(int) $assert['max_individual_queries'],
|
||||||
|
count($individualQueries)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach ($this->normalizeStringList($assert['must_include_terms'] ?? []) as $term) {
|
||||||
|
if (!$this->containsTerm($joined, $term)) {
|
||||||
|
$failures[] = sprintf('shop query output does not contain required term "%s".', $term);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
$requiredAnyTerms = $this->normalizeStringList($assert['must_include_any_terms'] ?? []);
|
||||||
|
if ($requiredAnyTerms !== []) {
|
||||||
|
$matched = false;
|
||||||
|
foreach ($requiredAnyTerms as $term) {
|
||||||
|
if ($this->containsTerm($joined, $term)) {
|
||||||
|
$matched = true;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!$matched) {
|
||||||
|
$failures[] = sprintf(
|
||||||
|
'shop query output contains none of the required any-terms: [%s].',
|
||||||
|
implode(', ', $requiredAnyTerms)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach ($this->normalizeStringList($assert['must_not_include_terms'] ?? []) as $term) {
|
||||||
|
if ($this->containsTerm($joined, $term)) {
|
||||||
|
$failures[] = sprintf('shop query output contains forbidden term "%s".', $term);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach ($this->normalizeStringList($assert['query_must_match_patterns'] ?? []) as $pattern) {
|
||||||
|
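// Note: an invalid pattern makes preg_match() return false and is therefore reported as a failed match.
if (@preg_match($pattern, $joined) !== 1) {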
|
||||||
|
$failures[] = sprintf('shop query output does not match required pattern "%s".', $pattern);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach ($this->normalizeStringList($assert['query_must_not_match_patterns'] ?? []) as $pattern) {
|
||||||
|
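// Note: an invalid pattern makes preg_match() return false and is silently treated as "no match" here.
if (@preg_match($pattern, $joined) === 1) {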
|
||||||
|
$failures[] = sprintf('shop query output matches forbidden pattern "%s".', $pattern);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @param array<int, string> $queries
|
||||||
|
*/
|
||||||
|
private function containsNormalizedQuery(array $queries, string $needle): bool
|
||||||
|
{
|
||||||
|
$needle = $this->normalizeQuery($needle);
|
||||||
|
|
||||||
|
foreach ($queries as $query) {
|
||||||
|
if ($this->normalizeQuery($query) === $needle) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
private function containsTerm(string $haystack, string $term): bool
|
||||||
|
{
|
||||||
|
$haystack = $this->normalizeText($haystack);
|
||||||
|
$term = $this->normalizeText($term);
|
||||||
|
|
||||||
|
return $term !== '' && str_contains($haystack, $term);
|
||||||
|
}
|
||||||
|
|
||||||
|
private function normalizeQuery(string $value): string
|
||||||
|
{
|
||||||
|
$value = $this->normalizeText($value);
|
||||||
|
$value = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $value) ?? $value;
|
||||||
|
$value = preg_replace('/\s+/u', ' ', $value) ?? $value;
|
||||||
|
|
||||||
|
return trim($value);
|
||||||
|
}
|
||||||
|
|
||||||
|
private function normalizeText(string $value): string
|
||||||
|
{
|
||||||
|
$value = html_entity_decode(strip_tags($value), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
|
||||||
|
$value = mb_strtolower(trim($value), 'UTF-8');
|
||||||
|
$value = preg_replace('/\s+/u', ' ', $value) ?? $value;
|
||||||
|
|
||||||
|
return trim($value);
|
||||||
|
}
|
||||||
|
|
||||||
|
private function normalizeOneLine(string $value): string
|
||||||
|
{
|
||||||
|
$value = trim($value);
|
||||||
|
$value = preg_replace('/\s+/u', ' ', $value) ?? $value;
|
||||||
|
|
||||||
|
return trim($value);
|
||||||
|
}
|
||||||
|
|
||||||
|
private function stringOrNull(mixed $value): ?string
|
||||||
|
{
|
||||||
|
if (!is_string($value)) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
$value = trim($value);
|
||||||
|
|
||||||
|
return $value !== '' ? $value : null;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<int, string>
|
||||||
|
*/
|
||||||
|
private function normalizeStringList(mixed $value): array
|
||||||
|
{
|
||||||
|
if (!is_array($value)) {
|
||||||
|
return [];
|
||||||
|
}
|
||||||
|
|
||||||
|
$out = [];
|
||||||
|
|
||||||
|
foreach ($value as $item) {
|
||||||
|
if (!is_string($item)) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
$item = trim($item);
|
||||||
|
|
||||||
|
if ($item === '') {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
$out[] = $item;
|
||||||
|
}
|
||||||
|
|
||||||
|
return array_values(array_unique($out));
|
||||||
|
}
|
||||||
|
|
||||||
|
private function previewText(string $value, int $maxLength = 1200): string
|
||||||
|
{
|
||||||
|
$value = $this->normalizeOneLine($value);
|
||||||
|
$maxLength = max(40, $maxLength);
|
||||||
|
|
||||||
|
if (mb_strlen($value, 'UTF-8') <= $maxLength) {
|
||||||
|
return $value;
|
||||||
|
}
|
||||||
|
|
||||||
|
return rtrim(mb_substr($value, 0, $maxLength, 'UTF-8')) . '...';
|
||||||
|
}
|
||||||
|
}
|
||||||
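The shop query assertions above compare normalized strings, not raw output. A minimal standalone sketch of that normalization, assuming the same steps as `normalizeText()` and `normalizeQuery()` in the class above; the free functions `normalizeTextSketch()` and `normalizeQuerySketch()` are hypothetical names used only for illustration:

```php
<?php

declare(strict_types=1);

// Hypothetical standalone copy of the normalizeText() steps, for illustration only.
function normalizeTextSketch(string $value): string
{
    // Strip markup, decode entities, lowercase, collapse whitespace.
    $value = html_entity_decode(strip_tags($value), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    $value = mb_strtolower(trim($value), 'UTF-8');
    $value = preg_replace('/\s+/u', ' ', $value) ?? $value;

    return trim($value);
}

// Hypothetical standalone copy of the normalizeQuery() steps, for illustration only.
function normalizeQuerySketch(string $value): string
{
    $value = normalizeTextSketch($value);
    // Every run of characters that are neither letters nor digits becomes one space.
    $value = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $value) ?? $value;
    $value = preg_replace('/\s+/u', ' ', $value) ?? $value;

    return trim($value);
}

// Punctuation, markup and case differences do not cause an expected_query mismatch.
var_dump(normalizeQuerySketch('PH-Meter,  <b>Pro</b>') === normalizeQuerySketch('ph meter pro'));
```

Term checks such as `must_include_terms` go through `containsTerm()`, which only applies the lighter text normalization, so punctuation inside a term still matters there.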
@@ -357,7 +357,11 @@ final readonly class NdjsonChunkLookup
 continue;
 }

-if (mb_strlen($token, 'UTF-8') < 3 && preg_match('/\d/u', $token) !== 1) {
+if (
+mb_strlen($token, 'UTF-8') < 3
+&& preg_match('/\d/u', $token) !== 1
+&& !$this->isImportantShortTitleToken($token)
+) {
 continue;
 }

@@ -367,6 +371,15 @@ final readonly class NdjsonChunkLookup
 return array_values(array_unique($out));
 }

+private function isImportantShortTitleToken(string $token): bool
+{
+if ($token === '' || mb_strlen($token, 'UTF-8') >= 3) {
+return false;
+}
+
+return in_array($token, $this->retrieverConfig->importantShortModelTokens(), true);
+}
+
 /**
 * @return array<string,bool>
 */
|
|||||||
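The changed condition keeps a short token when it either contains a digit or is on a configured allowlist. A minimal sketch of that filter, assuming the allowlist is a plain list of strings; `filterTitleTokensSketch()` and its `$importantShortTokens` parameter are hypothetical stand-ins for the class method and `importantShortModelTokens()`:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone version of the short-token filter, for illustration only.
 *
 * @param string[] $tokens
 * @param string[] $importantShortTokens assumed allowlist of short model tokens
 * @return string[]
 */
function filterTitleTokensSketch(array $tokens, array $importantShortTokens): array
{
    $out = [];

    foreach ($tokens as $token) {
        $isShort = mb_strlen($token, 'UTF-8') < 3;
        $hasDigit = preg_match('/\d/u', $token) === 1;
        $isAllowlisted = in_array($token, $importantShortTokens, true);

        // Drop short tokens unless they carry a digit or are explicitly allowlisted.
        if ($isShort && !$hasDigit && !$isAllowlisted) {
            continue;
        }

        $out[] = $token;
    }

    return array_values(array_unique($out));
}

print_r(filterTitleTokensSketch(['ph', 'x2', 'zu', 'meter'], ['ph']));
```

With the allowlist `['ph']`, the tokens `ph`, `x2` and `meter` survive and `zu` is dropped.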
@@ -133,13 +133,17 @@ final readonly class NdjsonHybridRetriever implements RetrieverInterface
 continue;
 }

+$row = $result['rows'][$chunkId];
 $rank++;

 $out[] = [
 'rank' => $rank,
 'chunk_id' => $chunkId,
-'document_id' => $result['rows'][$chunkId]['document_id'] ?? null,
-'chunk_index' => $result['rows'][$chunkId]['chunk_index'] ?? null,
+'document_id' => $row['document_id'] ?? null,
+'document_title' => $this->extractDocumentTitle($row),
+'file_path' => $this->extractMetadataString($row, 'file_path'),
+'version_number' => $this->extractMetadataString($row, 'version_number'),
+'chunk_index' => $row['chunk_index'] ?? null,
 'raw_score' => $result['rawScores'][$chunkId] ?? null,
 'rrf_score' => $result['rrfScores'][$chunkId] ?? null,
 'threshold' => $result['threshold'],
@@ -148,7 +152,7 @@ final readonly class NdjsonHybridRetriever implements RetrieverInterface
 'entity_label' => $result['entityLabel'],
 'is_list_query' => $result['isListQuery'],
 'selection_mode' => $result['selectionMode'],
-'text' => trim((string)$result['rows'][$chunkId]['text']),
+'text' => trim((string)($row['text'] ?? '')),
 ];
 }

@@ -1683,6 +1687,20 @@ final readonly class NdjsonHybridRetriever implements RetrieverInterface
 return '';
 }

+/**
+* Extracts a scalar metadata value for debug/eval output.
+*/
+private function extractMetadataString(array $row, string $key): string
+{
+$value = $row['metadata'][$key] ?? null;
+
+if (is_scalar($value)) {
+return trim((string)$value);
+}
+
+return '';
+}
+
 /**
 * Normalizes text for token-safe product comparisons.
 */
|
|||||||
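The new debug fields are read defensively: missing or non-scalar metadata yields an empty string instead of a warning. A small sketch of that behaviour, assuming rows carry an optional `metadata` array; `metadataStringSketch()` and the sample row are hypothetical:

```php
<?php

declare(strict_types=1);

// Hypothetical standalone version of the metadata accessor, for illustration only.
function metadataStringSketch(array $row, string $key): string
{
    $value = $row['metadata'][$key] ?? null;

    // Only scalars (string, int, float, bool) are rendered; arrays and null become ''.
    return is_scalar($value) ? trim((string) $value) : '';
}

// Made-up sample row, not taken from the real index.
$row = [
    'metadata' => [
        'file_path' => ' docs/manual.pdf ',
        'version_number' => 3,
        'tags' => ['a', 'b'],
    ],
];

var_dump(metadataStringSketch($row, 'file_path'));
var_dump(metadataStringSketch($row, 'version_number'));
var_dump(metadataStringSketch($row, 'tags'));
var_dump(metadataStringSketch([], 'file_path'));
```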
@@ -5,13 +5,15 @@ declare(strict_types=1);
 namespace App\Knowledge\Retrieval;

 use App\Config\LanguageCleanupConfig;
+use App\Config\NdjsonHybridRetrieverConfig;
 use App\Knowledge\StopWords;

 final readonly class QueryCleaner
 {
 public function __construct(
 private StopWords $stopWords,
-private LanguageCleanupConfig $languageCleanupConfig
+private LanguageCleanupConfig $languageCleanupConfig,
+private NdjsonHybridRetrieverConfig $retrieverConfig
 ) {
 }

@@ -21,9 +23,8 @@ final readonly class QueryCleaner
 * Important:
 * - Unicode-safe
 * - Numbers are preserved
-* - Negations are preserved
-* - No aggressive token-length filtering
-* - Stop words are removed
+* - Negations are preserved by protected-term aware cleanup profiles
+* - Stop words are resolved from the generic legacy list plus YAML cleanup profile terms
 */
 public function clean(string $query): string
 {
@@ -31,49 +32,49 @@ final readonly class QueryCleaner
 return '';
 }

-// 1. Convert to lowercase in a Unicode-safe way
+$profile = $this->loadCleanupProfile();
+
+// 1. Convert to lowercase in a Unicode-safe way.
 $query = mb_strtolower($query, 'UTF-8');

-// 2. Treat hyphens and slashes as word separators
+// 2. Treat hyphens and slashes as word separators.
 $query = $this->languageCleanupConfig->replaceWordSeparatorsWithSpace($query);

-// 3. Remove special characters, but keep:
-// - letters
-// - numbers
-// - other Unicode letters
+// 3. Remove configured cleanup phrases before punctuation stripping.
+$query = $this->removePhrases($query, $profile['phrases']);
+
+// 4. Remove special characters, but keep letters, numbers and other Unicode letters.
 $query = preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $query);

 if ($query === null) {
 return '';
 }

-// 4. Normalize multiple whitespace characters
+// 5. Normalize multiple whitespace characters.
 $query = preg_replace('/\s+/u', ' ', $query);
-$query = trim($query);
+$query = trim((string) $query);

 if ($query === '') {
 return '';
 }

-// 5. Tokenize the query
 $tokens = preg_split('/\s+/u', $query);

 if ($tokens === false) {
 return '';
 }

+$profileTerms = array_fill_keys(array_merge($profile['stopwords'], $profile['meta_terms']), true);
 $cleanTokens = [];

 foreach ($tokens as $token) {

 $token = trim($token);

 if ($token === '') {
 continue;
 }

-// Remove stop words
-if ($this->stopWords->isStopWord($token)) {
+if ($this->stopWords->isStopWord($token) || isset($profileTerms[$token])) {
 continue;
 }

@@ -86,4 +87,42 @@ final readonly class QueryCleaner

 return implode(' ', $cleanTokens);
 }
+
+/**
+* @return array{stopwords:string[], phrases:string[], meta_terms:string[], protected_terms:string[]}
+*/
+private function loadCleanupProfile(): array
+{
+return $this->languageCleanupConfig->getCleanupProfile($this->retrieverConfig->queryCleanupProfile());
+}
+
+/**
+* @param string[] $phrases
+*/
+private function removePhrases(string $query, array $phrases): string
+{
+foreach ($phrases as $phrase) {
+$phrase = trim(mb_strtolower($phrase, 'UTF-8'));
+
+if ($phrase === '') {
+continue;
+}
+
+$normalizedPhrase = $this->languageCleanupConfig->replaceWordSeparatorsWithSpace($phrase);
+$parts = preg_split('/\s+/u', $normalizedPhrase, -1, PREG_SPLIT_NO_EMPTY) ?: [];
+
+if ($parts === []) {
+continue;
+}
+
+$pattern = implode('\\s+', array_map(
+static fn (string $part): string => preg_quote($part, '/'),
+$parts
+));
+
+$query = preg_replace('/(?<!\p{L})(?:' . $pattern . ')(?!\p{L})/u', ' ', $query) ?? $query;
+}
+
+return $query;
+}
 }
||||||
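The phrase removal added to `QueryCleaner` matches whole phrases only, using letter-aware lookarounds so that a phrase is not cut out of the middle of a longer word. A minimal sketch under simplifying assumptions: the word-separator step from `LanguageCleanupConfig` is omitted, and `removePhrasesSketch()` is a hypothetical free function that also collapses whitespace at the end for readability:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone version of the phrase removal, for illustration only.
 *
 * @param string[] $phrases
 */
function removePhrasesSketch(string $query, array $phrases): string
{
    foreach ($phrases as $phrase) {
        $phrase = mb_strtolower(trim($phrase), 'UTF-8');
        $parts = preg_split('/\s+/u', $phrase, -1, PREG_SPLIT_NO_EMPTY) ?: [];

        if ($parts === []) {
            continue;
        }

        // Quote each word and allow any whitespace run between the words.
        $pattern = implode('\\s+', array_map(
            static fn (string $part): string => preg_quote($part, '/'),
            $parts
        ));

        // The lookarounds reject matches that start or end inside a longer word.
        $query = preg_replace('/(?<!\p{L})(?:' . $pattern . ')(?!\p{L})/u', ' ', $query) ?? $query;
    }

    return trim((string) preg_replace('/\s+/u', ' ', $query));
}

var_dump(removePhrasesSketch('ich suche nach einem ph meter', ['ich suche nach']));
var_dump(removePhrasesSketch('besuche nach plan', ['suche nach']));
```

The first call removes the configured phrase and leaves `einem ph meter`. The second call leaves the query unchanged, because `suche nach` only occurs inside the longer word `besuche`.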
774
src/Service/Admin/EvalAdminService.php
Normal file
@@ -0,0 +1,774 @@
|
|||||||
|
<?php
|
||||||
|
|
||||||
|
declare(strict_types=1);
|
||||||
|
|
||||||
|
namespace App\Service\Admin;
|
||||||
|
|
||||||
|
use App\Eval\AgentEvalRunner;
|
||||||
|
use App\Eval\Dto\EvalCase;
|
||||||
|
use App\Eval\Dto\EvalResult;
|
||||||
|
use App\Eval\EvalCaseLoader;
|
||||||
|
use App\Eval\EvalReportWriter;
|
||||||
|
|
||||||
|
final readonly class EvalAdminService
|
||||||
|
{
|
||||||
|
/**
|
||||||
|
* @var array<string, string>
|
||||||
|
*/
|
||||||
|
private const TYPES = [
|
||||||
|
'retrieval' => 'Retrieval',
|
||||||
|
'shop_query' => 'Shopquery',
|
||||||
|
'followup' => 'Follow-up',
|
||||||
|
'answer_guard' => 'Answer-Guard',
|
||||||
|
];
|
||||||
|
|
||||||
|
public function __construct(
|
||||||
|
private EvalCaseLoader $caseLoader,
|
||||||
|
private AgentEvalRunner $runner,
|
||||||
|
private EvalReportWriter $reportWriter,
|
||||||
|
private string $projectDir,
|
||||||
|
) {
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<string, string>
|
||||||
|
*/
|
||||||
|
public function supportedTypes(): array
|
||||||
|
{
|
||||||
|
return self::TYPES;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<int, string>
|
||||||
|
*/
|
||||||
|
public function supportedTypeNames(): array
|
||||||
|
{
|
||||||
|
return array_keys(self::TYPES);
|
||||||
|
}
|
||||||
|
|
||||||
|
public function assertSupportedType(string $type): string
|
||||||
|
{
|
||||||
|
$type = trim($type);
|
||||||
|
|
||||||
|
if (!array_key_exists($type, self::TYPES)) {
|
||||||
|
throw new \InvalidArgumentException(sprintf('Unsupported eval type: %s', $type));
|
||||||
|
}
|
||||||
|
|
||||||
|
return $type;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<string, array<int, array{id:string,prompt:string,type:string}>>
|
||||||
|
*/
|
||||||
|
public function casesByType(): array
|
||||||
|
{
|
||||||
|
$casesByType = [];
|
||||||
|
|
||||||
|
foreach (array_keys(self::TYPES) as $type) {
|
||||||
|
$casesByType[$type] = array_map(
|
||||||
|
static fn (EvalCase $case): array => [
|
||||||
|
'id' => $case->id,
|
||||||
|
'type' => $case->type,
|
||||||
|
'prompt' => $case->prompt,
|
||||||
|
],
|
||||||
|
$this->loadCases($type)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
return $casesByType;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<int, array<string, mixed>>
|
||||||
|
*/
|
||||||
|
public function overview(): array
|
||||||
|
{
|
||||||
|
$overview = [];
|
||||||
|
|
||||||
|
foreach (self::TYPES as $type => $label) {
|
||||||
|
$cases = $this->loadCases($type);
|
||||||
|
$report = $this->readTypeReport($type);
|
||||||
|
|
||||||
|
$overview[] = [
|
||||||
|
'type' => $type,
|
||||||
|
'label' => $label,
|
||||||
|
'case_count' => count($cases),
|
||||||
|
'report' => $report,
|
||||||
|
'status' => $this->statusFromReport($report),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
return $overview;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array<string, mixed>
|
||||||
|
*/
|
||||||
|
public function run(string $type, ?string $caseId = null): array
|
||||||
|
{
|
||||||
|
$type = $this->assertSupportedType($type);
|
||||||
|
$caseId = trim((string) $caseId);
|
||||||
|
$cases = $this->loadCases($type);
|
||||||
|
|
||||||
|
if ($caseId !== '') {
|
||||||
|
$cases = $this->filterCasesById($cases, $caseId);
|
||||||
|
|
||||||
|
if ($cases === []) {
|
||||||
|
[$type, $cases] = $this->findCasesByIdAcrossTypes($caseId);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if ($cases === []) {
|
||||||
|
if ($caseId !== '') {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Eval case "%s" was not found. Please select a case from the list for the chosen eval type.',
|
||||||
|
$caseId
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'No eval cases available for eval type "%s".',
|
||||||
|
$type
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
$results = $this->runner->runAll($cases);
|
||||||
|
$report = $this->buildReport($type, $caseId !== '' ? $caseId : null, $results);
|
||||||
|
|
||||||
|
$typeReportPath = $this->reportWriter->write($report, sprintf('%s-last-run.json', $type));
|
||||||
|
$lastReportPath = $this->reportWriter->write($report);
|
||||||
|
|
||||||
|
$report['written_to'] = $typeReportPath;
|
||||||
|
$report['last_run_written_to'] = $lastReportPath;
|
||||||
|
|
||||||
|
return $report;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array{type:string,id:string,prompt:string,assert_json:string,history_json:string,request_context_hint:string,source_label:string}
|
||||||
|
*/
|
||||||
|
public function emptyCaseDraft(string $type = 'retrieval'): array
|
||||||
|
{
|
||||||
|
$type = $this->assertSupportedType($type);
|
||||||
|
|
||||||
|
return [
|
||||||
|
'type' => $type,
|
||||||
|
'id' => '',
|
||||||
|
'prompt' => '',
|
||||||
|
'assert_json' => $this->encodePrettyJson($this->defaultAssertForType($type)),
|
||||||
|
'history_json' => '',
|
||||||
|
'request_context_hint' => '',
|
||||||
|
'source_label' => '',
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array{type:string,id:string,prompt:string,assert_json:string,history_json:string,request_context_hint:string,source_label:string}
|
||||||
|
*/
|
||||||
|
public function caseDraftFromReportResult(string $type, string $caseId): array
|
||||||
|
{
|
||||||
|
$type = $this->assertSupportedType($type);
|
||||||
|
$caseId = trim($caseId);
|
||||||
|
|
||||||
|
if ($caseId === '') {
|
||||||
|
throw new \InvalidArgumentException('Es wurde keine Quell-Case-ID übergeben.');
|
||||||
|
}
|
||||||
|
|
||||||
|
$report = $this->readTypeReport($type);
|
||||||
|
if ($report === null) {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Für den Eval-Typ "%s" liegt kein Report vor. Bitte den Eval zuerst ausführen.',
|
||||||
|
$type
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
$result = null;
|
||||||
|
foreach (($report['results'] ?? []) as $candidate) {
|
||||||
|
if (is_array($candidate) && (string) ($candidate['case_id'] ?? '') === $caseId) {
|
||||||
|
$result = $candidate;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!is_array($result)) {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Der Report enthält keinen Case "%s" für Eval-Typ "%s".',
|
||||||
|
$caseId,
|
||||||
|
$type
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
$details = is_array($result['details'] ?? null) ? $result['details'] : [];
|
||||||
|
$prompt = trim((string) ($result['prompt'] ?? $details['prompt'] ?? ''));
|
||||||
|
$history = $this->historyDraftFromDetails($details);
|
||||||
|
$assert = $this->suggestAssertFromReportResult($type, $result, $details);
|
||||||
|
|
||||||
|
return [
|
||||||
|
'type' => $type,
|
||||||
|
'id' => $this->suggestUniqueCaseId($type . '_' . $caseId . '_new'),
|
||||||
|
'prompt' => $prompt,
|
||||||
|
'assert_json' => $this->encodePrettyJson($assert),
|
||||||
|
'history_json' => $history === [] ? '' : $this->encodePrettyJson($history),
|
||||||
|
'request_context_hint' => '',
|
||||||
|
'source_label' => sprintf('Vorlage aus Report-Case %s (%s)', $caseId, self::TYPES[$type]),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array{type:string,id:string,path:string,row:array<string,mixed>,case_count:int}
|
||||||
|
*/
|
||||||
|
public function createCase(
|
||||||
|
string $type,
|
||||||
|
string $id,
|
||||||
|
string $prompt,
|
||||||
|
string $assertJson,
|
||||||
|
string $historyJson = '',
|
||||||
|
string $requestContextHint = '',
|
||||||
|
): array {
|
||||||
|
$type = $this->assertSupportedType($type);
|
||||||
|
$id = $this->normalizeNewCaseId($id);
|
||||||
|
$prompt = trim($prompt);
|
||||||
|
$requestContextHint = trim($requestContextHint);
|
||||||
|
|
||||||
|
if ($prompt === '') {
|
||||||
|
throw new \InvalidArgumentException('Der Eval-Prompt darf nicht leer sein.');
|
||||||
|
}
|
||||||
|
|
||||||
|
if ($this->caseIdExists($id)) {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Ein Eval-Case mit der ID "%s" existiert bereits. Bitte eine neue ID verwenden.',
|
||||||
|
$id
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
$assert = $this->decodeJsonObject($assertJson, 'Assert-JSON');
|
||||||
|
$history = $this->decodeHistoryJson($historyJson);
|
||||||
|
|
||||||
|
$row = [
|
||||||
|
'id' => $id,
|
||||||
|
'type' => $type,
|
||||||
|
'prompt' => $prompt,
|
||||||
|
'assert' => $assert,
|
||||||
|
];
|
||||||
|
|
||||||
|
if ($history !== []) {
|
||||||
|
$row['history'] = $history;
|
||||||
|
}
|
||||||
|
|
||||||
|
if ($requestContextHint !== '') {
|
||||||
|
$row['request_context_hint'] = $requestContextHint;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Validate with the same DTO that the eval runner uses.
|
||||||
|
EvalCase::fromArray($row);
|
||||||
|
|
||||||
|
$path = $this->caseFilePath($type);
|
||||||
|
$line = json_encode(
|
||||||
|
$row,
|
||||||
|
JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
|
||||||
|
);
|
||||||
|
|
||||||
|
$prefix = '';
|
||||||
|
if (is_file($path) && filesize($path) > 0) {
|
||||||
|
$contents = file_get_contents($path);
|
||||||
|
if (is_string($contents) && $contents !== '' && !str_ends_with($contents, "\n")) {
|
||||||
|
$prefix = "\n";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
$written = file_put_contents($path, $prefix . $line . PHP_EOL, FILE_APPEND | LOCK_EX);
|
||||||
|
if ($written === false) {
|
||||||
|
throw new \RuntimeException(sprintf('Eval-Case-Datei konnte nicht geschrieben werden: %s', $path));
|
||||||
|
}
|
||||||
|
|
||||||
|
return [
|
||||||
|
'type' => $type,
|
||||||
|
'id' => $id,
|
||||||
|
'path' => $path,
|
||||||
|
'row' => $row,
|
||||||
|
'case_count' => count($this->loadCases($type)),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array{type:string,id:string,path:string,case_count:int}
|
||||||
|
*/
|
||||||
|
public function deleteCase(string $type, string $caseId): array
|
||||||
|
{
|
||||||
|
$type = $this->assertSupportedType($type);
|
||||||
|
$caseId = $this->normalizeExistingCaseId($caseId);
|
||||||
|
$path = $this->caseFilePath($type);
|
||||||
|
|
||||||
|
if (!is_file($path)) {
|
||||||
|
throw new \RuntimeException(sprintf('Eval-Case-Datei wurde nicht gefunden: %s', $path));
|
||||||
|
}
|
||||||
|
|
||||||
|
$lines = file($path, FILE_IGNORE_NEW_LINES);
|
||||||
|
if ($lines === false) {
|
||||||
|
throw new \RuntimeException(sprintf('Eval-Case-Datei konnte nicht gelesen werden: %s', $path));
|
||||||
|
}
|
||||||
|
|
||||||
|
$keptLines = [];
|
||||||
|
$deleted = false;
|
||||||
|
|
||||||
|
foreach ($lines as $line) {
|
||||||
|
$trimmed = trim((string) $line);
|
||||||
|
if ($trimmed === '') {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
try {
|
||||||
|
$decoded = json_decode($trimmed, true, 512, JSON_THROW_ON_ERROR);
|
||||||
|
} catch (\JsonException $e) {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Eval-Case-Datei enthält ungültiges JSON und wurde nicht verändert: %s',
|
||||||
|
$e->getMessage()
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!is_array($decoded)) {
|
||||||
|
throw new \RuntimeException('Eval-Case-Datei enthält eine ungültige NDJSON-Zeile und wurde nicht verändert.');
|
||||||
|
}
|
||||||
|
|
||||||
|
if ((string) ($decoded['id'] ?? '') === $caseId) {
|
||||||
|
$deleted = true;
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
$keptLines[] = $trimmed;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!$deleted) {
|
||||||
|
throw new \RuntimeException(sprintf(
|
||||||
|
'Eval-Case "%s" wurde im Typ "%s" nicht gefunden.',
|
||||||
|
$caseId,
|
||||||
|
$type
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
$contents = $keptLines === [] ? '' : implode(PHP_EOL, $keptLines) . PHP_EOL;
|
||||||
|
$written = file_put_contents($path, $contents, LOCK_EX);
|
||||||
|
if ($written === false) {
|
||||||
|
throw new \RuntimeException(sprintf('Eval-Case-Datei konnte nicht geschrieben werden: %s', $path));
|
||||||
|
}
|
||||||
|
|
||||||
|
return [
|
||||||
|
'type' => $type,
|
||||||
|
'id' => $caseId,
|
||||||
|
'path' => $path,
|
||||||
|
'case_count' => count($this->loadCases($type)),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @param array<int, EvalCase> $cases
|
||||||
|
* @return array<int, EvalCase>
|
||||||
|
*/
|
||||||
|
private function filterCasesById(array $cases, string $caseId): array
|
||||||
|
{
|
||||||
|
return array_values(array_filter(
|
||||||
|
$cases,
|
||||||
|
static fn (EvalCase $case): bool => $case->id === $caseId
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @return array{0:string,1:array<int, EvalCase>}
|
||||||
|
*/
|
||||||
|
private function findCasesByIdAcrossTypes(string $caseId): array
|
||||||
|
{
|
||||||
|
foreach (array_keys(self::TYPES) as $candidateType) {
|
||||||
|
$cases = $this->filterCasesById($this->loadCases($candidateType), $caseId);
|
||||||
|
|
||||||
|
if ($cases !== []) {
|
||||||
|
return [$candidateType, $cases];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return ['', []];
|
||||||
|
}
|
||||||
|
|
||||||
|
    /**
     * @return array<string, mixed>|null
     */
    public function readTypeReport(string $type): ?array
    {
        $type = $this->assertSupportedType($type);

        return $this->readReportFile(sprintf('%s/tests/evals/reports/%s-last-run.json', $this->projectDir, $type));
    }

    /**
     * @return array<string, mixed>|null
     */
    public function readLastReport(): ?array
    {
        return $this->readReportFile(sprintf('%s/tests/evals/reports/last-run.json', $this->projectDir));
    }

    /**
     * @return array<int, EvalCase>
     */
    private function loadCases(string $type): array
    {
        return $this->caseLoader->load($this->assertSupportedType($type));
    }

    /**
     * @param array<int, EvalResult> $results
     * @return array<string, mixed>
     */
    private function buildReport(string $type, ?string $caseId, array $results): array
    {
        $passed = count(array_filter(
            $results,
            static fn (EvalResult $result): bool => $result->passed
        ));
        $failed = count($results) - $passed;

        return [
            'type' => $type,
            'case_filter' => $caseId,
            'total' => count($results),
            'passed' => $passed,
            'failed' => $failed,
            'generated_at' => (new \DateTimeImmutable())->format(\DateTimeInterface::ATOM),
            'results' => array_map(
                static fn (EvalResult $result): array => $result->toArray(),
                $results
            ),
        ];
    }

    /**
     * @return array<string, mixed>|null
     */
    private function readReportFile(string $path): ?array
    {
        if (!is_file($path)) {
            return null;
        }

        $raw = file_get_contents($path);

        if (!is_string($raw) || trim($raw) === '') {
            return null;
        }

        $decoded = json_decode($raw, true);

        if (!is_array($decoded)) {
            return null;
        }

        return $decoded;
    }

    private function normalizeNewCaseId(string $id): string
    {
        $id = trim($id);

        if ($id === '') {
            throw new \InvalidArgumentException('Die Eval-Case-ID darf nicht leer sein.');
        }

        if (preg_match('/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/', $id) !== 1) {
            throw new \InvalidArgumentException(
                'Die Eval-Case-ID darf nur Buchstaben, Zahlen, Unterstriche und Bindestriche enthalten und muss mit einem Buchstaben oder einer Zahl beginnen.'
            );
        }

        return $id;
    }

    private function normalizeExistingCaseId(string $id): string
    {
        $id = trim($id);

        if ($id === '') {
            throw new \InvalidArgumentException('Es wurde keine Eval-Case-ID zum Löschen übergeben.');
        }

        if (preg_match('/^[a-zA-Z0-9][a-zA-Z0-9_-]*$/', $id) !== 1) {
            throw new \InvalidArgumentException(
                'Die Eval-Case-ID ist ungültig. Erlaubt sind nur Buchstaben, Zahlen, Unterstriche und Bindestriche.'
            );
        }

        return $id;
    }

    private function caseIdExists(string $id): bool
    {
        foreach (array_keys(self::TYPES) as $type) {
            foreach ($this->loadCases($type) as $case) {
                if ($case->id === $id) {
                    return true;
                }
            }
        }

        return false;
    }

    /**
     * @return array<string, mixed>
     */
    private function decodeJsonObject(string $json, string $label): array
    {
        $json = trim($json);

        if ($json === '') {
            return [];
        }

        try {
            $decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
        } catch (\JsonException $e) {
            throw new \InvalidArgumentException(sprintf('%s ist ungültig: %s', $label, $e->getMessage()));
        }

        if (!is_array($decoded) || !str_starts_with($json, '{') || ($decoded !== [] && array_is_list($decoded))) {
            throw new \InvalidArgumentException(sprintf('%s muss ein JSON-Objekt sein.', $label));
        }

        return $decoded;
    }

    /**
     * @return array<int, array{prompt:string,answer:string}>
     */
    private function decodeHistoryJson(string $json): array
    {
        $json = trim($json);

        if ($json === '') {
            return [];
        }

        try {
            $decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
        } catch (\JsonException $e) {
            throw new \InvalidArgumentException(sprintf('History-JSON ist ungültig: %s', $e->getMessage()));
        }

        if (!is_array($decoded) || !str_starts_with($json, '[') || !array_is_list($decoded)) {
            throw new \InvalidArgumentException('History-JSON muss eine JSON-Liste sein.');
        }

        $history = [];

        foreach ($decoded as $entry) {
            if (is_string($entry)) {
                $entry = trim($entry);
                if ($entry !== '') {
                    $history[] = [
                        'prompt' => 'Eval-Kontext',
                        'answer' => $entry,
                    ];
                }
                continue;
            }

            if (!is_array($entry)) {
                continue;
            }

            $prompt = trim((string) ($entry['prompt'] ?? ''));
            $answer = trim((string) ($entry['answer'] ?? $entry['response'] ?? $entry['answer_preview'] ?? ''));

            if ($prompt === '' && $answer === '') {
                continue;
            }

            $history[] = [
                'prompt' => $prompt !== '' ? $prompt : 'Eval-Kontext',
                'answer' => $answer,
            ];
        }

        return $history;
    }

    private function caseFilePath(string $type): string
    {
        $type = $this->assertSupportedType($type);

        return sprintf('%s/tests/evals/cases/%s.ndjson', $this->projectDir, $type);
    }

    private function statusFromReport(?array $report): string
    {
        if ($report === null) {
            return 'not_run';
        }

        $failed = (int) ($report['failed'] ?? 0);
        $total = (int) ($report['total'] ?? 0);

        if ($total <= 0) {
            return 'empty';
        }

        return $failed === 0 ? 'green' : 'red';
    }

    /**
     * @return array<string, mixed>
     */
    private function defaultAssertForType(string $type): array
    {
        return match ($type) {
            'retrieval', 'answer_guard' => [
                'min_results' => 1,
            ],
            'shop_query', 'followup' => [
                'expected_query' => '',
            ],
            default => [],
        };
    }

    /**
     * @param array<string, mixed> $result
     * @param array<string, mixed> $details
     * @return array<string, mixed>
     */
    private function suggestAssertFromReportResult(string $type, array $result, array $details): array
    {
        if (($type === 'shop_query' || $type === 'followup') && is_string($details['query'] ?? null)) {
            $query = trim($details['query']);
            if ($query !== '') {
                return [
                    'expected_query' => $query,
                ];
            }
        }

        if (($type === 'shop_query' || $type === 'followup') && is_array($details['individual_queries'] ?? null)) {
            $queries = array_values(array_filter(array_map(
                static fn (mixed $value): string => trim((string) $value),
                $details['individual_queries']
            )));

            if ($queries !== []) {
                return [
                    'expected_individual_queries' => $queries,
                    'expected_individual_queries_exact' => true,
                ];
            }
        }

        if (is_array($details['document_refs'] ?? null)) {
            $documentIds = [];
            foreach ($details['document_refs'] as $documentRef) {
                if (!is_array($documentRef)) {
                    continue;
                }

                $documentId = trim((string) ($documentRef['id'] ?? ''));
                if ($documentId !== '') {
                    $documentIds[] = $documentId;
                }
            }

            if ($documentIds !== []) {
                return [
                    'min_results' => 1,
                    'must_include_one_of_document_ids' => array_values(array_unique($documentIds)),
                ];
            }
        }

        if (is_array($details['document_ids'] ?? null)) {
            $documentIds = array_values(array_filter(array_map(
                static fn (mixed $value): string => trim((string) $value),
                $details['document_ids']
            )));

            if ($documentIds !== []) {
                return [
                    'min_results' => 1,
                    'must_include_one_of_document_ids' => array_values(array_unique($documentIds)),
                ];
            }
        }

        $resultCount = (int) ($details['result_count'] ?? -1);
        if ($resultCount === 0) {
            return [
                'max_results' => 0,
            ];
        }

        return $this->defaultAssertForType($type);
    }

    /**
     * @param array<string, mixed> $details
     * @return array<int, array{prompt:string,answer:string}>
     */
    private function historyDraftFromDetails(array $details): array
    {
        if (!is_array($details['history'] ?? null)) {
            return [];
        }

        $history = [];
        foreach ($details['history'] as $entry) {
            if (!is_array($entry)) {
                continue;
            }

            $prompt = trim((string) ($entry['prompt'] ?? ''));
            $answer = trim((string) ($entry['answer'] ?? $entry['answer_preview'] ?? ''));

            if ($prompt === '' && $answer === '') {
                continue;
            }

            $history[] = [
                'prompt' => $prompt !== '' ? $prompt : 'Eval-Kontext',
                'answer' => $answer,
            ];
        }

        return $history;
    }

    private function suggestUniqueCaseId(string $base): string
    {
        $base = strtolower(trim($base));
        $base = preg_replace('/[^a-z0-9_-]+/', '_', $base) ?? 'eval_case';
        $base = trim($base, '_-');

        if ($base === '') {
            $base = 'eval_case';
        }

        if (!$this->caseIdExists($base)) {
            return $base;
        }

        for ($i = 2; $i <= 999; ++$i) {
            $candidate = sprintf('%s_%d', $base, $i);
            if (!$this->caseIdExists($candidate)) {
                return $candidate;
            }
        }

        return sprintf('%s_%s', $base, (new \DateTimeImmutable())->format('YmdHis'));
    }

    /**
     * @param array<mixed> $value
     */
    private function encodePrettyJson(array $value): string
    {
        return json_encode(
            $value,
            JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
        );
    }
}
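The delete path at the top of this hunk reads the NDJSON file line by line, drops the line whose `id` matches, and writes the remainder back. A minimal standalone sketch of that filtering step follows. The function name `removeCaseFromNdjson` is hypothetical and not part of this commit, and skipping blank lines is an assumption, since the part of the method that reads the file is not visible here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the commit): removes the case with the
 * given ID from NDJSON text and returns the remaining lines.
 *
 * @throws RuntimeException if a line is not a JSON object/array or the ID is missing
 */
function removeCaseFromNdjson(string $contents, string $caseId): string
{
    $keptLines = [];
    $deleted = false;

    foreach (preg_split('/\R/', $contents) ?: [] as $line) {
        $trimmed = trim($line);

        // Assumption: blank lines are ignored rather than treated as errors.
        if ($trimmed === '') {
            continue;
        }

        $decoded = json_decode($trimmed, true);

        // Abort before anything is written if a line cannot be decoded.
        if (!is_array($decoded)) {
            throw new RuntimeException('Invalid NDJSON line; nothing was changed.');
        }

        if ((string) ($decoded['id'] ?? '') === $caseId) {
            $deleted = true;
            continue;
        }

        $keptLines[] = $trimmed;
    }

    if (!$deleted) {
        throw new RuntimeException(sprintf('Eval case "%s" was not found.', $caseId));
    }

    // Same convention as the method above: empty file, or one trailing newline.
    return $keptLines === [] ? '' : implode(PHP_EOL, $keptLines) . PHP_EOL;
}

$ndjson = '{"id":"a_001","prompt":"x"}' . "\n" . '{"id":"b_002","prompt":"y"}' . "\n";
echo removeCaseFromNdjson($ndjson, 'a_001');
```

Keeping the filter free of file I/O makes it easy to test. The committed method still reads and rewrites the file as separate steps, so `LOCK_EX` on the write alone probably does not protect against two admins editing the same file at once.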
@@ -134,6 +134,10 @@
    href="{{ path('admin_model_config_list') }}#agentLiveTest">
     <i class="bi bi-rocket-takeoff-fill"></i> KI-Agent Live-Test
 </a>
+<a class="nav-link text-light {% if route starts with 'admin_evals' %}active fw-bold{% endif %}"
+   href="{{ path('admin_evals_index') }}">
+    <i class="bi bi-clipboard2-check"></i> Eval Suite
+</a>
 {% endif %}
 <hr class="border-secondary">
 <div class="text-info text-uppercase small mb-2">
351
templates/admin/evals/case_new.html.twig
Normal file
@@ -0,0 +1,351 @@
{% extends 'admin/base.html.twig' %}

{% block title %}Eval-Cases verwalten{% endblock %}

{% block body %}

    <div class="d-flex justify-content-between align-items-center mb-4 flex-wrap gap-2">
        <div>
            <h1 class="h3 mb-1">
                <i class="bi bi-journal-plus"></i> Eval-Cases verwalten
            </h1>
            <div class="small text-secondary">
                Neue Regression-Cases separat anlegen oder bestehende Cases entfernen, ohne die Eval-Suite-Übersicht aufzublähen.
            </div>
        </div>

        <a href="{{ path('admin_evals_index', {type: case_draft.type|default('retrieval')}) }}"
           class="btn btn-sm btn-outline-secondary">
            Zurück zur Eval Suite
        </a>
    </div>

    {% for label in ['success', 'danger', 'warning', 'info'] %}
        {% for message in app.flashes(label) %}
            <div class="alert alert-{{ label }} shadow-sm">
                {{ message }}
            </div>
        {% endfor %}
    {% endfor %}

    {% if case_draft.source_label|default('') %}
        <div class="alert alert-info border-info bg-black text-light shadow-sm">
            <strong>Vorlage geladen:</strong> {{ case_draft.source_label }}<br>
            <span class="small text-secondary">
                Bitte Case-ID, Prompt und Assertions prüfen, bevor du den Case speicherst.
            </span>
        </div>
    {% endif %}

    <div class="alert alert-secondary border-secondary bg-black text-light shadow-sm mb-4">
        <div class="fw-semibold text-warning mb-1">
            <i class="bi bi-compass"></i> Kurz erklärt
        </div>
        <div class="small text-secondary">
            Ein Eval-Case ist ein wiederholbarer Test. Du trägst ein, <strong class="text-light">was der Nutzer fragt</strong>
            und <strong class="text-light">woran RetrieX gemessen werden soll</strong>. Der Test verändert keine Daten im Shop oder im RAG-Wissen,
            sondern prüft nur, ob ein bekannter Fall weiterhin richtig läuft.
        </div>
    </div>

    <div class="row g-4">
        <div class="col-xl-8">
            <div class="card bg-black border-secondary text-light shadow-sm">
                <div class="card-body">
                    <h5 class="text-warning mb-3">
                        <i class="bi bi-pencil-square"></i> Neuer Eval-Case
                    </h5>

                    <form method="post" action="{{ path('admin_evals_case_create') }}">
                        <input type="hidden" name="_token" value="{{ csrf_token('admin_eval_case_create') }}">

                        <div class="mb-4">
                            <label class="form-label">Eval-Typ</label>
                            <select name="type" class="form-select bg-dark text-light border-secondary">
                                {% for type, label in types %}
                                    <option value="{{ type }}" {% if type == case_draft.type|default('retrieval') %}selected{% endif %}>
                                        {{ label }}
                                    </option>
                                {% endfor %}
                            </select>
                            <div class="form-text text-secondary">
                                Wähle zuerst, <strong class="text-light">was genau geprüft werden soll</strong>. Der Typ entscheidet auch,
                                in welche Datei der Case geschrieben wird: <code>tests/evals/cases/&lt;type&gt;.ndjson</code>.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                <div class="mb-1"><strong class="text-light">retrieval</strong>: prüft, ob die richtige Wissensquelle oder das richtige Dokument gefunden wird.</div>
                                <div class="mb-1"><strong class="text-light">shop_query</strong>: prüft, welche Suchquery an den Shop geschickt würde.</div>
                                <div class="mb-1"><strong class="text-light">followup</strong>: prüft eine Folgefrage, die den vorherigen Chatverlauf braucht.</div>
                                <div><strong class="text-light">answer_guard</strong>: prüft, dass RetrieX bei Unsinn oder fehlender Evidenz nichts erfindet.</div>
                            </div>
                        </div>

                        <div class="mb-4">
                            <label class="form-label">Neue Case-ID</label>
                            <input type="text"
                                   name="id"
                                   value="{{ case_draft.id|default('') }}"
                                   class="form-control bg-dark text-light border-secondary"
                                   placeholder="followup_testomat808_device_price_001"
                                   required>
                            <div class="form-text text-secondary">
                                Das ist der <strong class="text-light">interne Name des Tests</strong>. Er erscheint später in der Eval-Auswertung,
                                damit du den Fall wiedererkennst. Verwende keine Leerzeichen. Erlaubt sind Buchstaben, Zahlen, <code>_</code> und <code>-</code>.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                Gute Beispiele: <code>retrieval_lieferbedingungen_versand_001</code>,
                                <code>shop_query_testomat808_indikator300_001</code>,
                                <code>followup_testomat808_device_price_001</code>.<br>
                                Faustregel: <code>typ_thema_ziel_nummer</code>.
                            </div>
                        </div>

                        <div class="mb-4">
                            <label class="form-label">Prompt</label>
                            <textarea name="prompt"
                                      rows="3"
                                      class="form-control bg-dark text-light border-secondary"
                                      placeholder="und was kostet das gerät selber"
                                      required>{{ case_draft.prompt|default('') }}</textarea>
                            <div class="form-text text-secondary">
                                Hier kommt <strong class="text-light">genau die Nutzerfrage</strong> hinein, die getestet werden soll.
                                Nicht die erwartete Antwort eintragen, sondern den Satz, den ein Nutzer in den Chat schreiben würde.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                Tippfehler dürfen bewusst drin bleiben, wenn genau dieser Tippfehler abgesichert werden soll.
                                Beispiel: <code>ich würde gern chlor im schwinnbad messen</code> prüft dann auch die Korrektur Richtung <code>schwimmbad</code>.
                            </div>
                        </div>

                        <div class="mb-4">
                            <label class="form-label">Assert-JSON</label>
                            <textarea name="assert_json"
                                      rows="9"
                                      class="form-control bg-dark text-light border-secondary font-monospace"
                                      spellcheck="false">{{ case_draft.assert_json|default('{}') }}</textarea>
                            <div class="form-text text-secondary">
                                Hier steht, <strong class="text-light">was der Test erwarten soll</strong>. Das Feld muss gültiges JSON sein,
                                also mit <code>{</code> anfangen und mit <code>}</code> enden. Keine Kommentare und kein Komma nach dem letzten Eintrag.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                <div class="mb-2"><strong class="text-light">Wenn eine Shopquery exakt stimmen soll:</strong></div>
                                <pre class="bg-black border border-secondary rounded p-2 small text-light mb-3"><code>{
    "expected_query": "testomat 808"
}</code></pre>
                                <div class="mb-2"><strong class="text-light">Wenn bestimmte Wörter enthalten sein müssen:</strong></div>
                                <pre class="bg-black border border-secondary rounded p-2 small text-light mb-3"><code>{
    "must_include_terms": [
        "testomat",
        "808"
    ]
}</code></pre>
                                <div class="mb-2"><strong class="text-light">Wenn ein Dokument gefunden werden muss:</strong></div>
                                <pre class="bg-black border border-secondary rounded p-2 small text-light mb-0"><code>{
    "min_results": 1,
    "must_include_one_of_document_ids": [
        "DOKUMENT-ID"
    ]
}</code></pre>
                            </div>
                        </div>

                        <div class="mb-4">
                            <label class="form-label">History-JSON <span class="text-secondary">optional</span></label>
                            <textarea name="history_json"
                                      rows="8"
                                      class="form-control bg-dark text-light border-secondary font-monospace"
                                      spellcheck="false"
                                      placeholder='[{"prompt":"vorherige Frage","answer":"vorherige Antwort"}]'>{{ case_draft.history_json|default('') }}</textarea>
                            <div class="form-text text-secondary">
                                Nur ausfüllen, wenn die aktuelle Frage den <strong class="text-light">vorherigen Chatverlauf</strong> braucht.
                                Für direkte Einzelprompts leer lassen. Das Feld muss eine JSON-Liste sein, also mit <code>[</code> anfangen und mit <code>]</code> enden.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                Typischer Einsatz: Der Nutzer fragt zuerst nach dem niedrigsten Grenzwert, danach nach dem Indikator
                                und anschließend <code>was kostet der indikator</code>. Dann braucht der Test die vorherigen Fragen und Antworten als History.
                                <pre class="bg-black border border-secondary rounded p-2 small text-light mt-2 mb-0"><code>[
    {
        "prompt": "mit welchem indikator",
        "answer": "Der Wert 0,02 °dH wird beim Testomat 808 mit Indikatortyp 300 gemessen."
    }
]</code></pre>
                            </div>
                        </div>

                        <div class="mb-4">
                            <label class="form-label">Request Context Hint <span class="text-secondary">optional</span></label>
                            <textarea name="request_context_hint"
                                      rows="3"
                                      class="form-control bg-dark text-light border-secondary"
                                      placeholder="Nur für Spezialfälle, wenn History nicht ausreicht.">{{ case_draft.request_context_hint|default('') }}</textarea>
                            <div class="form-text text-secondary">
                                Dieses Feld kannst du fast immer <strong class="text-light">leer lassen</strong>. Es ist nur für Sonderfälle gedacht,
                                wenn der Test Zusatzkontext braucht, der nicht sauber als History darstellbar ist.
                            </div>
                            <div class="small text-secondary mt-2 border border-secondary rounded p-3 bg-dark">
                                Beispiel für einen Sonderfall: <code>Im vorherigen Ergebnis waren mehrere Shop-Produkte sichtbar, aber keine normale Chatantwort.</code>
                                Für normale Regressionen ist <strong class="text-light">History-JSON die bessere Wahl</strong>.
                            </div>
                        </div>

                        <div class="d-flex flex-wrap gap-2">
                            <button type="submit" class="btn btn-warning">
                                <i class="bi bi-save"></i> Eval-Case speichern
                            </button>
                            <a href="{{ path('admin_evals_index', {type: case_draft.type|default('retrieval')}) }}"
                               class="btn btn-outline-secondary">
                                Abbrechen
                            </a>
                        </div>
                    </form>
                </div>
            </div>
        </div>

        <div class="col-xl-4">
            <div class="card bg-black border-danger text-light shadow-sm mb-4">
                <div class="card-body">
                    <h5 class="text-danger mb-3">
                        <i class="bi bi-trash3"></i> Bestehende Eval-Cases entfernen
                    </h5>
                    <p class="small text-secondary mb-3">
                        Hier kannst du falsch angelegte oder nicht mehr benötigte Cases aus den
                        <code>tests/evals/cases/*.ndjson</code>-Dateien entfernen. Das Löschen betrifft nur den Eval-Case,
                        nicht das RAG-Wissen, nicht den Shop und nicht die bestehenden Reports.
                    </p>

                    {% for type, label in types %}
                        {% set cases = cases_by_type[type]|default([]) %}
                        <details class="border border-secondary rounded p-3 mb-3" {% if type == case_draft.type|default('retrieval') %}open{% endif %}>
                            <summary class="text-info" style="cursor:pointer;">
                                {{ label }} <span class="text-secondary">({{ cases|length }} Cases)</span>
                            </summary>

                            {% if cases is empty %}
                                <div class="small text-secondary mt-3">
                                    Für diesen Typ gibt es aktuell keine Cases.
                                </div>
                            {% else %}
                                <div class="mt-3">
                                    {% for case in cases %}
                                        <div class="border-top border-secondary pt-3 mt-3">
                                            <div class="small mb-2">
                                                <code>{{ case.id }}</code>
                                                <div class="text-secondary mt-1">{{ case.prompt }}</div>
                                            </div>
                                            <form method="post"
                                                  action="{{ path('admin_evals_case_delete') }}"
                                                  onsubmit="return confirm('Eval-Case {{ case.id }} wirklich löschen? Diese Änderung entfernt die NDJSON-Zeile dauerhaft.');">
                                                <input type="hidden" name="_token" value="{{ csrf_token('admin_eval_case_delete_' ~ type ~ '_' ~ case.id) }}">
                                                <input type="hidden" name="type" value="{{ type }}">
                                                <input type="hidden" name="case_id" value="{{ case.id }}">
                                                <button type="submit" class="btn btn-sm btn-outline-danger">
                                                    <i class="bi bi-trash3"></i> Case löschen
                                                </button>
                                            </form>
                                        </div>
                                    {% endfor %}
                                </div>
                            {% endif %}
                        </details>
                    {% endfor %}

                    <div class="small text-secondary">
                        Nach dem Löschen solltest du den betroffenen Eval-Typ einmal ausführen, damit der Report zum neuen Case-Bestand passt.
                    </div>
                </div>
            </div>

            <div class="card bg-black border-secondary text-light shadow-sm mb-4">
                <div class="card-body">
                    <h5 class="text-info mb-3">
                        <i class="bi bi-info-circle"></i> Welcher Typ ist richtig?
                    </h5>
                    <div class="small text-secondary">
                        <div class="mb-3">
                            <strong class="text-light">Du willst prüfen, ob das richtige Dokument gefunden wird?</strong><br>
                            Dann nimm <code>retrieval</code>.
                        </div>
                        <div class="mb-3">
                            <strong class="text-light">Du willst prüfen, welche Suchwörter an den Shop gehen?</strong><br>
                            Dann nimm <code>shop_query</code>.
                        </div>
                        <div class="mb-3">
                            <strong class="text-light">Die Frage bezieht sich auf die vorherige Antwort?</strong><br>
                            Dann nimm <code>followup</code> und fülle <code>History-JSON</code> aus.
                        </div>
                        <div>
                            <strong class="text-light">RetrieX soll bei Unsinn nichts erfinden?</strong><br>
                            Dann nimm <code>answer_guard</code>.
                        </div>
                    </div>
                </div>
            </div>

            <div class="card bg-black border-secondary text-light shadow-sm mb-4">
                <div class="card-body">
                    <h5 class="text-info mb-3">
                        <i class="bi bi-braces"></i> Häufige Assertions
                    </h5>
                    <div class="small text-secondary mb-2">Exakte Query:</div>
                    <pre class="bg-dark border border-secondary rounded p-2 small text-light"><code>{
    "expected_query": "testomat 808"
}</code></pre>

                    <div class="small text-secondary mb-2">Begriffe müssen enthalten sein:</div>
                    <pre class="bg-dark border border-secondary rounded p-2 small text-light"><code>{
    "must_include_terms": [
        "testomat",
        "808"
    ]
}</code></pre>

                    <div class="small text-secondary mb-2">Begriffe dürfen nicht enthalten sein:</div>
                    <pre class="bg-dark border border-secondary rounded p-2 small text-light"><code>{
    "must_not_include_terms": [
        "indikator",
        "300"
    ]
}</code></pre>

                    <div class="small text-secondary mb-2">Dokument muss enthalten sein:</div>
                    <pre class="bg-dark border border-secondary rounded p-2 small text-light"><code>{
    "min_results": 1,
    "must_include_one_of_document_ids": [
        "DOKUMENT-ID"
    ]
}</code></pre>
                </div>
            </div>

            <div class="card bg-black border-secondary text-light shadow-sm mb-4">
                <div class="card-body">
                    <h5 class="text-info mb-3">
                        <i class="bi bi-check2-square"></i> Vor dem Speichern prüfen
                    </h5>
                    <ul class="small text-secondary mb-0">
                        <li>Prüft der Case genau einen Zweck?</li>
                        <li>Ist die Case-ID eindeutig und ohne Leerzeichen?</li>
                        <li>Ist der Prompt eine echte Nutzerfrage?</li>
                        <li>Ist Assert-JSON gültiges JSON?</li>
                        <li>Ist History nur bei echten Folgefragen gefüllt?</li>
                    </ul>
                </div>
            </div>

            <div class="card bg-black border-secondary text-light shadow-sm">
                <div class="card-body">
                    <h5 class="text-info mb-3">
                        <i class="bi bi-lightbulb"></i> Empfehlung
                    </h5>
                    <p class="small text-secondary mb-0">
                        Ein guter Eval-Case prüft genau einen Zweck. Lieber mehrere kleine Cases anlegen als einen großen, empfindlichen Case.
                        Wenn du unsicher bist, starte mit <code>expected_query</code> bei Shop-/Follow-up-Fällen oder mit
                        <code>must_include_one_of_document_ids</code> bei Retrieval-Fällen.
                    </p>
                </div>
            </div>
        </div>
    </div>

{% endblock %}
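The form help above describes `expected_query`, `must_include_terms` and `must_not_include_terms`, but the evaluator that interprets them is not part of this hunk. The sketch below shows one plausible reading of those three keys for a produced shop query. The function `checkQueryAssertions` is hypothetical, and case-insensitive substring matching and ignoring an empty `expected_query` are both assumptions; the real RetrieX evaluator may behave differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical checker (NOT the RetrieX evaluator): applies three of the
 * assertion keys documented in the form to a produced query string.
 *
 * @param array<string, mixed> $assert decoded Assert-JSON
 * @return array<int, string> failure messages; an empty array means "passed"
 */
function checkQueryAssertions(string $query, array $assert): array
{
    $failures = [];
    // Assumption: comparison ignores case and surrounding whitespace.
    $normalized = strtolower(trim($query));

    // Assumption: an empty expected_query (the form default) means "not set".
    $expected = $assert['expected_query'] ?? '';
    if (is_string($expected) && trim($expected) !== '') {
        if ($normalized !== strtolower(trim($expected))) {
            $failures[] = sprintf('expected query "%s", got "%s"', $expected, $query);
        }
    }

    foreach ((array) ($assert['must_include_terms'] ?? []) as $term) {
        $term = (string) $term;
        if (!str_contains($normalized, strtolower($term))) {
            $failures[] = sprintf('missing term "%s"', $term);
        }
    }

    foreach ((array) ($assert['must_not_include_terms'] ?? []) as $term) {
        $term = (string) $term;
        if (str_contains($normalized, strtolower($term))) {
            $failures[] = sprintf('forbidden term "%s"', $term);
        }
    }

    return $failures;
}

$assert = json_decode(
    '{"must_include_terms":["testomat","808"],"must_not_include_terms":["indikator","300"]}',
    true
);
var_dump(checkQueryAssertions('Testomat 808', $assert));
```

Returning a list of failure messages instead of a single boolean keeps a report readable when one case violates several assertions at once, which fits the recommendation above to keep each case focused on a single purpose.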
547
templates/admin/evals/index.html.twig
Normal file
@@ -0,0 +1,547 @@
{% extends 'admin/base.html.twig' %}

{% block title %}RetrieX Eval Suite{% endblock %}

{% block body %}

    <div class="d-flex justify-content-between align-items-center mb-4 flex-wrap gap-2">
        <div>
            <h1 class="h3 mb-1">
                <i class="bi bi-clipboard2-check"></i> RetrieX Eval Suite
            </h1>
            <div class="small text-secondary">
                Regressionen für Retrieval, Shopquery, Follow-up und Answer-Guard direkt im Admin prüfen.
            </div>
        </div>

        <div class="d-flex flex-wrap gap-2">
            <a href="{{ path('admin_evals_case_new', {type: selected_type|default('retrieval')}) }}"
               class="btn btn-sm btn-outline-warning">
                <i class="bi bi-journal-plus"></i> Eval-Cases verwalten
            </a>
            <a href="{{ path('admin_model_config_list') }}"
               class="btn btn-sm btn-outline-secondary">
                Zurück zum KI-/LLM-Setup
            </a>
        </div>
    </div>

    {% for label in ['success', 'danger', 'warning', 'info'] %}
        {% for message in app.flashes(label) %}
            <div class="alert alert-{{ label }} shadow-sm">
                {{ message }}
            </div>
        {% endfor %}
    {% endfor %}


    <div id="adminEvalRunOverlay"
         class="position-fixed top-0 start-0 w-100 h-100 d-none"
         style="background: rgba(0, 0, 0, .72); z-index: 1080;">
        <div class="h-100 d-flex align-items-center justify-content-center px-3">
            <div class="card bg-black border-warning text-light shadow-lg" style="max-width: 520px; width: 100%;">
                <div class="card-body text-center py-5">
                    <div class="spinner-border text-warning mb-3" role="status" aria-hidden="true"></div>
                    <h5 class="text-warning mb-2" id="adminEvalRunOverlayLabel">Eval läuft ...</h5>
                    <div class="small text-secondary">
                        Die Regressionstests werden ausgeführt. Bitte die Seite nicht neu laden.
                    </div>
                </div>
            </div>
        </div>
    </div>

<div class="row g-4 mb-4">
|
||||||
|
{% for item in overview %}
|
||||||
|
{% set report = item.report %}
|
||||||
|
{% set status = item.status %}
|
||||||
|
{% set badgeClass = status == 'green'
|
||||||
|
? 'bg-success'
|
||||||
|
: (status == 'red' ? 'bg-danger' : 'bg-secondary')
|
||||||
|
%}
|
||||||
|
<div class="col-md-6 col-xl-3">
|
||||||
|
<div class="card bg-black border-secondary text-light h-100 shadow-sm">
|
||||||
|
<div class="card-body">
|
||||||
|
<div class="d-flex justify-content-between align-items-start gap-2 mb-2">
|
||||||
|
<h5 class="text-info mb-0">{{ item.label }}</h5>
|
||||||
|
<span class="badge {{ badgeClass }}">
|
||||||
|
{% if status == 'green' %}
|
||||||
|
grün
|
||||||
|
{% elseif status == 'red' %}
|
||||||
|
rot
|
||||||
|
{% elseif status == 'empty' %}
|
||||||
|
leer
|
||||||
|
{% else %}
|
||||||
|
nicht gelaufen
|
||||||
|
{% endif %}
|
||||||
|
</span>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="small text-secondary mb-3">
|
||||||
|
{{ item.case_count }} Cases
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{% if report %}
|
||||||
|
<div class="small">
|
||||||
|
<div><strong>Total:</strong> {{ report.total|default(0) }}</div>
|
||||||
|
<div><strong>Passed:</strong> {{ report.passed|default(0) }}</div>
|
||||||
|
<div><strong>Failed:</strong> {{ report.failed|default(0) }}</div>
|
||||||
|
<div class="text-secondary mt-2">
|
||||||
|
{{ report.generated_at|default('') }}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
{% else %}
|
||||||
|
<div class="small text-secondary">
|
||||||
|
Für diesen Typ liegt noch kein Admin-Report vor.
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
<div class="d-flex flex-wrap gap-2 mt-3">
|
||||||
|
<form method="post"
|
||||||
|
action="{{ path('admin_evals_run') }}"
|
||||||
|
class="d-inline js-admin-eval-run-form"
|
||||||
|
data-eval-type-label="{{ item.label|e('html_attr') }}">
|
||||||
|
<input type="hidden" name="_token" value="{{ csrf_token('admin_eval_run') }}">
|
||||||
|
<input type="hidden" name="type" value="{{ item.type }}">
|
||||||
|
<button type="submit" class="btn btn-sm btn-outline-warning js-admin-eval-run-button">
|
||||||
|
<span class="js-admin-eval-button-label">Run</span>
|
||||||
|
<span class="spinner-border spinner-border-sm ms-2 d-none js-admin-eval-button-spinner"
|
||||||
|
role="status"
|
||||||
|
aria-hidden="true"></span>
|
||||||
|
</button>
|
||||||
|
</form>
|
||||||
|
|
||||||
|
<a class="btn btn-sm btn-outline-info"
|
||||||
|
href="{{ path('admin_evals_index', {type: item.type}) }}">
|
||||||
|
Details
|
||||||
|
</a>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="row g-4 mb-4">
|
||||||
|
<div class="col-xl-5">
|
||||||
|
<div class="card bg-black border-secondary text-light h-100 shadow-sm">
|
||||||
|
<div class="card-body">
|
||||||
|
<h5 class="text-warning mb-3">
|
||||||
|
<i class="bi bi-play-circle"></i> Eval ausführen
|
||||||
|
</h5>
|
||||||
|
|
||||||
|
<form method="post"
|
||||||
|
action="{{ path('admin_evals_run') }}"
|
||||||
|
class="js-admin-eval-run-form"
|
||||||
|
data-eval-type-label="Ausgewählter Eval">
|
||||||
|
<input type="hidden" name="_token" value="{{ csrf_token('admin_eval_run') }}">
|
||||||
|
|
||||||
|
<div class="mb-3">
|
||||||
|
<label class="form-label">Eval-Typ</label>
|
||||||
|
<select name="type" class="form-select bg-dark text-light border-secondary js-admin-eval-type-select">
|
||||||
|
{% for type, label in types %}
|
||||||
|
<option value="{{ type }}" {% if type == selected_type %}selected{% endif %}>
|
||||||
|
{{ label }}
|
||||||
|
</option>
|
||||||
|
{% endfor %}
|
||||||
|
</select>
|
||||||
|
<div class="form-text text-secondary">
|
||||||
|
Ohne Case-ID wird der komplette Typ ausgeführt.
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="mb-3">
|
||||||
|
<label class="form-label">Optional: Case</label>
|
||||||
|
<select name="case_id"
|
||||||
|
class="form-select bg-dark text-light border-secondary js-admin-eval-case-select">
|
||||||
|
<option value="">Alle Cases des ausgewählten Typs</option>
|
||||||
|
{% for type, cases in cases_by_type %}
|
||||||
|
{% for case in cases %}
|
||||||
|
<option value="{{ case.id }}"
|
||||||
|
data-eval-type="{{ type }}"
|
||||||
|
{% if type != selected_type %}hidden disabled{% endif %}>
|
||||||
|
{{ case.id }} — {{ case.prompt }}
|
||||||
|
</option>
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
</select>
|
||||||
|
<div class="form-text text-secondary">
|
||||||
|
Die Case-Liste wird passend zum Eval-Typ gefiltert. Leer lassen, um alle Cases des Typs auszuführen.
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<button type="submit" class="btn btn-outline-warning js-admin-eval-run-button">
|
||||||
|
<span class="js-admin-eval-button-label">Eval starten</span>
|
||||||
|
<span class="spinner-border spinner-border-sm ms-2 d-none js-admin-eval-button-spinner"
|
||||||
|
role="status"
|
||||||
|
aria-hidden="true"></span>
|
||||||
|
</button>
|
||||||
|
</form>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="col-xl-7">
|
||||||
|
<div class="card bg-black border-secondary text-light h-100 shadow-sm">
|
||||||
|
<div class="card-body">
|
||||||
|
<h5 class="text-info mb-3">
|
||||||
|
<i class="bi bi-terminal"></i> CLI-Referenz
|
||||||
|
</h5>
|
||||||
|
|
||||||
|
<p class="small text-secondary mb-3">
|
||||||
|
Die Admin-Runs schreiben typspezifische Reports nach
|
||||||
|
<code>tests/evals/reports/&lt;type&gt;-last-run.json</code>
|
||||||
|
und zusätzlich den bekannten <code>last-run.json</code>.
|
||||||
|
</p>
|
||||||
|
|
||||||
|
<div class="small">
|
||||||
|
{% for type, label in types %}
|
||||||
|
<div class="mb-2">
|
||||||
|
<span class="text-info">{{ label }}</span><br>
|
||||||
|
<code>php bin/console mto:agent:eval:run {{ type }}</code>
|
||||||
|
</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{% if last_report %}
|
||||||
|
<hr class="border-secondary">
|
||||||
|
<div class="small text-secondary">
|
||||||
|
Letzter generischer Report:
|
||||||
|
<span class="text-light">{{ last_report.type|default('unknown') }}</span>,
|
||||||
|
{{ last_report.passed|default(0) }}/{{ last_report.total|default(0) }} bestanden,
|
||||||
|
{{ last_report.generated_at|default('') }}
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="card bg-black border-secondary text-light shadow-sm">
|
||||||
|
<div class="card-body">
|
||||||
|
<div class="d-flex justify-content-between align-items-center flex-wrap gap-2 mb-3">
|
||||||
|
<h5 class="text-warning mb-0">
|
||||||
|
<i class="bi bi-list-check"></i>
|
||||||
|
Report-Details: {{ types[selected_type]|default(selected_type) }}
|
||||||
|
</h5>
|
||||||
|
|
||||||
|
<div class="btn-group btn-group-sm" role="group" aria-label="Eval report types">
|
||||||
|
{% for type, label in types %}
|
||||||
|
<a class="btn {{ type == selected_type ? 'btn-info' : 'btn-outline-info' }}"
|
||||||
|
href="{{ path('admin_evals_index', {type: type}) }}">
|
||||||
|
{{ label }}
|
||||||
|
</a>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{% if selected_report %}
|
||||||
|
{% set selectedFailed = selected_report.failed|default(0) %}
|
||||||
|
<div class="row g-3 mb-3 small">
|
||||||
|
<div class="col-md-3">
|
||||||
|
<div class="border border-secondary rounded p-3 h-100">
|
||||||
|
<div class="text-secondary">Total</div>
|
||||||
|
<div class="h5 mb-0">{{ selected_report.total|default(0) }}</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="col-md-3">
|
||||||
|
<div class="border border-secondary rounded p-3 h-100">
|
||||||
|
<div class="text-secondary">Passed</div>
|
||||||
|
<div class="h5 text-success mb-0">{{ selected_report.passed|default(0) }}</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="col-md-3">
|
||||||
|
<div class="border border-secondary rounded p-3 h-100">
|
||||||
|
<div class="text-secondary">Failed</div>
|
||||||
|
<div class="h5 {{ selectedFailed == 0 ? 'text-success' : 'text-danger' }} mb-0">
|
||||||
|
{{ selectedFailed }}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="col-md-3">
|
||||||
|
<div class="border border-secondary rounded p-3 h-100">
|
||||||
|
<div class="text-secondary">Generated</div>
|
||||||
|
<div class="small text-light">{{ selected_report.generated_at|default('') }}</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="table-responsive">
|
||||||
|
<table class="table table-dark table-striped table-hover align-middle mb-0">
|
||||||
|
<thead class="table-secondary text-dark">
|
||||||
|
<tr>
|
||||||
|
<th>Status</th>
|
||||||
|
<th>Case</th>
|
||||||
|
<th>Dauer</th>
|
||||||
|
<th>Failures / Details</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
{% for result in selected_report.results|default([]) %}
|
||||||
|
<tr>
|
||||||
|
<td style="width: 110px;">
|
||||||
|
{% if result.passed|default(false) %}
|
||||||
|
<span class="badge bg-success">PASS</span>
|
||||||
|
{% else %}
|
||||||
|
<span class="badge bg-danger">FAIL</span>
|
||||||
|
{% endif %}
|
||||||
|
</td>
|
||||||
|
<td style="min-width: 260px;">
|
||||||
|
<code>{{ result.case_id|default('') }}</code>
|
||||||
|
<div class="small text-secondary mb-2">{{ result.type|default('') }}</div>
|
||||||
|
|
||||||
|
{% set casePrompt = result.prompt|default(result.details.prompt|default('')) %}
|
||||||
|
{% if casePrompt %}
|
||||||
|
<div class="small mb-2">
|
||||||
|
<span class="text-secondary">Prompt:</span><br>
|
||||||
|
<span class="text-light">{{ casePrompt }}</span>
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
<div class="mt-2">
|
||||||
|
<a href="{{ path('admin_evals_case_new', {source_type: selected_type, source_case_id: result.case_id|default('')}) }}"
|
||||||
|
class="btn btn-sm btn-outline-warning">
|
||||||
|
<i class="bi bi-journal-plus"></i> Als neuen Case vorbereiten
|
||||||
|
</a>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{% set historyRows = result.details.history|default([]) %}
|
||||||
|
{% if historyRows is not empty %}
|
||||||
|
<details class="small">
|
||||||
|
<summary class="text-info" style="cursor:pointer;">
|
||||||
|
Kontext / History anzeigen
|
||||||
|
</summary>
|
||||||
|
<div class="mt-2 ps-2 border-start border-secondary">
|
||||||
|
{% for turn in historyRows %}
|
||||||
|
<div class="mb-2">
|
||||||
|
<div class="text-secondary">Vorheriger Prompt:</div>
|
||||||
|
<div class="text-light">{{ turn.prompt|default('') }}</div>
|
||||||
|
{% if turn.answer_preview|default('') %}
|
||||||
|
<div class="text-secondary mt-1">Antwort-Auszug:</div>
|
||||||
|
<div class="text-secondary">{{ turn.answer_preview }}</div>
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</details>
|
||||||
|
{% endif %}
|
||||||
|
</td>
|
||||||
|
<td style="width: 120px;">
|
||||||
|
{{ result.duration_ms|default(0) }} ms
|
||||||
|
</td>
|
||||||
|
<td>
|
||||||
|
{% if result.failures|default([]) is not empty %}
|
||||||
|
<ul class="mb-2 small text-danger">
|
||||||
|
{% for failure in result.failures %}
|
||||||
|
<li>{{ failure }}</li>
|
||||||
|
{% endfor %}
|
||||||
|
</ul>
|
||||||
|
{% else %}
|
||||||
|
<div class="small text-success mb-2">Keine Fehler.</div>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% set documentRefs = result.details.document_refs|default([]) %}
|
||||||
|
{% if documentRefs is not empty %}
|
||||||
|
<div class="mb-2">
|
||||||
|
<div class="small text-secondary mb-1">Gefundene Dokumente</div>
|
||||||
|
<div class="table-responsive">
|
||||||
|
<table class="table table-dark table-sm table-bordered border-secondary align-middle mb-2">
|
||||||
|
<thead>
|
||||||
|
<tr class="small text-secondary">
|
||||||
|
<th style="width: 90px;">Ranks</th>
|
||||||
|
<th>Titel / Datei</th>
|
||||||
|
<th style="width: 170px;">Doc-ID</th>
|
||||||
|
<th style="width: 220px;">Chunks</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
{% for doc in documentRefs %}
|
||||||
|
<tr>
|
||||||
|
<td class="small">{{ doc.ranks|default([])|join(', ') }}</td>
|
||||||
|
<td>
|
||||||
|
<div class="fw-semibold">{{ doc.title|default('Ohne Titel') }}</div>
|
||||||
|
{% if doc.file_path|default('') %}
|
||||||
|
<div class="small text-secondary" style="word-break: break-all;">
|
||||||
|
{{ doc.file_path }}
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
{% if doc.version_number|default('') %}
|
||||||
|
<div class="small text-secondary">Version: {{ doc.version_number }}</div>
|
||||||
|
{% endif %}
|
||||||
|
</td>
|
||||||
|
<td><code class="small">{{ doc.id|default('') }}</code></td>
|
||||||
|
<td class="small" style="word-break: break-all;">
|
||||||
|
{% for chunkId in doc.chunk_ids|default([]) %}
|
||||||
|
<code>{{ chunkId }}</code>{% if not loop.last %}<br>{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
{% endfor %}
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% set resultRows = result.details.result_rows|default([]) %}
|
||||||
|
{% if resultRows is not empty %}
|
||||||
|
<details class="mb-2">
|
||||||
|
<summary class="small text-info" style="cursor:pointer;">
|
||||||
|
Treffer / Chunks anzeigen
|
||||||
|
</summary>
|
||||||
|
<div class="table-responsive mt-2">
|
||||||
|
<table class="table table-dark table-sm table-bordered border-secondary align-middle mb-0">
|
||||||
|
<thead>
|
||||||
|
<tr class="small text-secondary">
|
||||||
|
<th style="width: 60px;">Rank</th>
|
||||||
|
<th>Titel / Datei</th>
|
||||||
|
<th style="width: 180px;">Chunk</th>
|
||||||
|
<th>Preview</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
{% for row in resultRows %}
|
||||||
|
<tr>
|
||||||
|
<td>{{ row.rank|default('') }}</td>
|
||||||
|
<td>
|
||||||
|
<div class="fw-semibold">{{ row.document_title|default('Ohne Titel') }}</div>
|
||||||
|
{% if row.file_path|default('') %}
|
||||||
|
<div class="small text-secondary" style="word-break: break-all;">{{ row.file_path }}</div>
|
||||||
|
{% endif %}
|
||||||
|
<div class="small text-secondary">Doc-ID: <code>{{ row.document_id|default('') }}</code></div>
|
||||||
|
</td>
|
||||||
|
<td class="small" style="word-break: break-all;">
|
||||||
|
<code>{{ row.chunk_id|default('') }}</code>
|
||||||
|
{% if row.chunk_index is defined and row.chunk_index is not same as(null) %}
|
||||||
|
<div class="text-secondary">Index: {{ row.chunk_index }}</div>
|
||||||
|
{% endif %}
|
||||||
|
</td>
|
||||||
|
<td class="small text-secondary">{{ row.text_preview|default('') }}</td>
|
||||||
|
</tr>
|
||||||
|
{% endfor %}
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
</details>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary class="small text-info" style="cursor:pointer;">
|
||||||
|
JSON-Details anzeigen
|
||||||
|
</summary>
|
||||||
|
<pre class="bg-dark border border-secondary rounded p-2 mt-2 small text-light" style="white-space: pre-wrap; max-height: 260px; overflow: auto;">{{ result.details|default({})|json_encode(constant('JSON_PRETTY_PRINT')) }}</pre>
|
||||||
|
</details>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
{% else %}
|
||||||
|
<tr>
|
||||||
|
<td colspan="4" class="text-center text-secondary py-4">
|
||||||
|
Dieser Report enthält keine Resultate.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
{% endfor %}
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
{% else %}
|
||||||
|
<div class="alert alert-secondary mb-0">
|
||||||
|
Für {{ types[selected_type]|default(selected_type) }} liegt noch kein typspezifischer Admin-Report vor.
|
||||||
|
Starte den Eval oben oder per CLI.
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<script>
|
||||||
|
document.addEventListener('DOMContentLoaded', function () {
|
||||||
|
const forms = Array.from(document.querySelectorAll('.js-admin-eval-run-form'));
|
||||||
|
const overlay = document.getElementById('adminEvalRunOverlay');
|
||||||
|
const overlayLabel = document.getElementById('adminEvalRunOverlayLabel');
|
||||||
|
|
||||||
|
function resolveEvalLabel(form) {
|
||||||
|
const select = form.querySelector('.js-admin-eval-type-select');
|
||||||
|
if (select && select.selectedOptions.length > 0) {
|
||||||
|
return select.selectedOptions[0].textContent.trim();
|
||||||
|
}
|
||||||
|
|
||||||
|
return (form.dataset.evalTypeLabel || 'Eval').trim();
|
||||||
|
}
|
||||||
|
|
||||||
|
function syncCaseSelect(form) {
|
||||||
|
const typeSelect = form.querySelector('.js-admin-eval-type-select');
|
||||||
|
const caseSelect = form.querySelector('.js-admin-eval-case-select');
|
||||||
|
|
||||||
|
if (!typeSelect || !caseSelect) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const selectedType = typeSelect.value;
|
||||||
|
|
||||||
|
Array.from(caseSelect.options).forEach(function (option) {
|
||||||
|
if (option.value === '') {
|
||||||
|
option.hidden = false;
|
||||||
|
option.disabled = false;
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const matchesType = option.dataset.evalType === selectedType;
|
||||||
|
option.hidden = !matchesType;
|
||||||
|
option.disabled = !matchesType;
|
||||||
|
|
||||||
|
if (!matchesType && option.selected) {
|
||||||
|
caseSelect.value = '';
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
function setAllRunButtonsDisabled() {
|
||||||
|
document.querySelectorAll('.js-admin-eval-run-button').forEach(function (button) {
|
||||||
|
button.disabled = true;
|
||||||
|
button.classList.add('disabled');
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
forms.forEach(function (form) {
|
||||||
|
syncCaseSelect(form);
|
||||||
|
|
||||||
|
const typeSelect = form.querySelector('.js-admin-eval-type-select');
|
||||||
|
if (typeSelect) {
|
||||||
|
typeSelect.addEventListener('change', function () {
|
||||||
|
syncCaseSelect(form);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
form.addEventListener('submit', function (event) {
|
||||||
|
const button = event.submitter && event.submitter.classList.contains('js-admin-eval-run-button')
|
||||||
|
? event.submitter
|
||||||
|
: form.querySelector('.js-admin-eval-run-button');
|
||||||
|
const label = resolveEvalLabel(form);
|
||||||
|
|
||||||
|
if (overlay && overlayLabel) {
|
||||||
|
overlayLabel.textContent = label + ' läuft ...';
|
||||||
|
overlay.classList.remove('d-none');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (button) {
|
||||||
|
const buttonLabel = button.querySelector('.js-admin-eval-button-label');
|
||||||
|
const spinner = button.querySelector('.js-admin-eval-button-spinner');
|
||||||
|
|
||||||
|
if (buttonLabel) {
|
||||||
|
buttonLabel.textContent = 'Läuft ...';
|
||||||
|
}
|
||||||
|
|
||||||
|
if (spinner) {
|
||||||
|
spinner.classList.remove('d-none');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
setAllRunButtonsDisabled();
|
||||||
|
document.body.style.cursor = 'progress';
|
||||||
|
});
|
||||||
|
});
|
||||||
|
});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
{% endblock %}
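The case-filtering rule inside `syncCaseSelect` can be checked without a browser. The following is a minimal sketch and not part of this commit: `filterCaseOptions` and the plain-object option shape (`value`, `evalType`, `selected`) are hypothetical stand-ins for the `<option>` elements and their `data-eval-type` attribute.

```javascript
// Hypothetical DOM-free version of the filtering rule used by syncCaseSelect.
// Each option is a plain object: { value, evalType, selected }.
function filterCaseOptions(options, selectedType) {
    let resetSelection = false;

    const next = options.map(function (option) {
        // The empty "all cases" entry always stays visible and selectable.
        if (option.value === '') {
            return { ...option, hidden: false, disabled: false };
        }

        const matchesType = option.evalType === selectedType;

        // A selected case of another type must be cleared, as in the template script.
        if (!matchesType && option.selected) {
            resetSelection = true;
        }

        return {
            ...option,
            hidden: !matchesType,
            disabled: !matchesType,
            selected: matchesType && Boolean(option.selected),
        };
    });

    return { options: next, resetSelection: resetSelection };
}
```

Keeping the rule in a pure function like this would let it be unit-tested separately from the `change` and `submit` handlers, which only need to apply the returned flags to the real elements.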
|
||||||
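The template reads `total`, `passed`, `failed`, and a `results` list (with `case_id`, `passed`, and `failures` per entry) from each report. As a rough sketch under that assumption, a report object could be condensed like this; `summarizeReport` is a hypothetical helper, and the fallback counts are illustrative rather than the suite's actual behavior.

```javascript
// Hypothetical helper: condense a report object into counts plus failed cases.
// Field names mirror those the template reads; everything else is assumed.
function summarizeReport(report) {
    const results = Array.isArray(report.results) ? report.results : [];

    const failedCases = results
        .filter(function (result) { return !result.passed; })
        .map(function (result) {
            return {
                caseId: result.case_id || '',
                failures: Array.isArray(result.failures) ? result.failures : [],
            };
        });

    return {
        // Prefer the counters stored in the report, fall back to derived values.
        total: report.total ?? results.length,
        passed: report.passed ?? (results.length - failedCases.length),
        failed: report.failed ?? failedCases.length,
        failedCases: failedCases,
    };
}
```

Such a summary could be fed with the parsed contents of a typed report file, for example after a CLI run.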
@@ -4,15 +4,24 @@
|
|||||||
|
|
||||||
{% block body %}
|
{% block body %}
|
||||||
|
|
||||||
<div class="d-flex justify-content-between align-items-center mb-4">
|
<div class="d-flex justify-content-between align-items-center mb-4 flex-wrap gap-2">
|
||||||
<h1 class="h3 mb-0"><i class="bi bi-rocket-takeoff-fill"></i> KI Modell-Generierung</h1>
|
<h1 class="h3 mb-0"><i class="bi bi-rocket-takeoff-fill"></i> KI Modell-Generierung</h1>
|
||||||
|
|
||||||
{% if is_granted('ROLE_SUPER_ADMIN') %}
|
<div class="d-flex flex-wrap gap-2">
|
||||||
<a href="{{ path('admin_model_config_create') }}"
|
{% if is_granted('ROLE_KNOWLEDGE_ADMIN') %}
|
||||||
class="btn btn-sm btn-outline-info">
|
<a href="{{ path('admin_evals_index') }}"
|
||||||
Neue Konfiguration
|
class="btn btn-sm btn-outline-warning">
|
||||||
</a>
|
Eval Suite
|
||||||
{% endif %}
|
</a>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% if is_granted('ROLE_SUPER_ADMIN') %}
|
||||||
|
<a href="{{ path('admin_model_config_create') }}"
|
||||||
|
class="btn btn-sm btn-outline-info">
|
||||||
|
Neue Konfiguration
|
||||||
|
</a>
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
{# ========================================================= #}
|
{# ========================================================= #}
|
||||||
|
|||||||
4
tests/evals/cases/answer_guard.ndjson
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
{"id":"answer_guard_noise_no_evidence_001","type":"answer_guard","prompt":"dsgfsdgfsdgf","assert":{"max_results":0}}
|
||||||
|
{"id":"answer_guard_mythical_medium_no_direct_evidence_001","type":"answer_guard","prompt":"gibt es einen testomat für drachenblut","assert":{"must_not_include_terms":["drachenblut"]}}
|
||||||
|
{"id":"answer_guard_lunar_water_no_direct_evidence_001","type":"answer_guard","prompt":"welcher testomat misst mondwasser im vakuum","assert":{"must_not_include_terms":["mondwasser","vakuum"]}}
|
||||||
|
{"id":"answer_guard_delivery_not_sdb_001","type":"answer_guard","prompt":"lieferbedingungen versand testomat","assert":{"min_results":1,"must_include_one_of_document_ids":["26ddf03d-9108-4a65-aa0e-a5df7613fa77"],"must_not_include_document_ids":["7166592f-85f2-425c-997b-73e323ae184d"]}}
|
||||||
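The result-based assertion keys in these cases (`min_results`, `max_results`, `must_include_one_of_document_ids`, `must_not_include_document_ids`) could be evaluated roughly as sketched below. This is an illustration only: the semantics are inferred from the key names, `checkResultAssertions` is a hypothetical function, and the real runner may behave differently.

```javascript
// Hypothetical evaluator for the result-based assertion keys.
// `results` is assumed to be a list of rows carrying a `document_id`.
function checkResultAssertions(spec, results) {
    const failures = [];
    const docIds = new Set(results.map(function (row) { return row.document_id; }));

    if (typeof spec.min_results === 'number' && results.length < spec.min_results) {
        failures.push('expected at least ' + spec.min_results + ' results, got ' + results.length);
    }
    if (typeof spec.max_results === 'number' && results.length > spec.max_results) {
        failures.push('expected at most ' + spec.max_results + ' results, got ' + results.length);
    }

    // At least one of the listed documents must appear in the results.
    const oneOf = spec.must_include_one_of_document_ids || [];
    if (oneOf.length > 0 && !oneOf.some(function (id) { return docIds.has(id); })) {
        failures.push('none of the expected documents was found');
    }

    // None of the forbidden documents may appear.
    (spec.must_not_include_document_ids || []).forEach(function (id) {
        if (docIds.has(id)) {
            failures.push('forbidden document found: ' + id);
        }
    });

    return failures;
}
```

Under these assumed semantics, the `answer_guard_delivery_not_sdb_001` case passes only when the delivery document is present and the excluded document is absent.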
4
tests/evals/cases/followup.ndjson
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
{"id":"followup_indicator_price_001","type":"followup","prompt":"was kostet der indikator","history":[{"prompt":"Was ist der niedrigste Grenzwert für die Wasserhärte, welcher mit einem Testomaten überwacht werden kann?","answer":"Der niedrigste Grenzwert für die Wasserhärte beträgt 0,02 °dH. Dieser Wert wird vom Testomat 808 gemessen."},{"prompt":"mit welchem indikator","answer":"Der niedrigste messbare Grenzwert für Wasserhärte mit dem Testomat 808 wird mit dem Indikatortyp 300 erreicht."}],"assert":{"expected_query":"testomat 808 300 indikator","must_include_terms":["testomat","808","300","indikator"],"must_not_include_terms":["300 s","301","302","303","testomat 2000"]}}
|
||||||
|
{"id":"followup_main_device_price_001","type":"followup","prompt":"und was kostet das gerät selber","history":[{"prompt":"was kostet der indikator","answer":"Shop-Suche abgeschlossen. Gesendete Suchquery: testomat 808 300 indikator. Testomat® 808 Indikator 300 500 ml, Produkt-Nummer 141001. Testomat® 808 Indikator 300 2 x 100 ml, Produkt-Nummer 140001. Der zugehörige Testomat ist Testomat 808."}],"assert":{"expected_query":"testomat 808","must_include_terms":["testomat","808"],"must_not_include_terms":["indikator","300","141001","140001"]}}
|
||||||
|
{"id":"followup_weak_shop_information_anchor_001","type":"followup","prompt":"suche im shop nach der information","history":[{"prompt":"welche grenzwerte kann der testomat 2000 thcl messen","answer":"Der relevante Produktanker ist Testomat 2000 THCL. Das Gerät ist für Chlorüberwachung / freies Chlor relevant."}],"assert":{"expected_query":"testomat 2000 thcl","must_include_terms":["testomat","2000","thcl"],"must_not_equal_query":"information","must_not_include_terms":["information"]}}
|
||||||
|
{"id":"followup_product_links_split_001","type":"followup","prompt":"gebe mir links zu den produkten aus dem shop","history":[{"prompt":"gerät zur messung Prozesswasser in medizinischen Geräten","answer":"Geeignete Produktanker sind Testomat 2000 Self Clean, Testomat 2000 CAL und Testomat 808."}],"assert":{"expected_individual_queries":["testomat 2000 self clean","testomat 2000 cal","testomat 808"],"expected_individual_queries_exact":true,"min_individual_queries":3,"max_individual_queries":3,"must_not_include_terms":["links zu aus"]}}
|
||||||
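The query-based keys used by the follow-up cases (`expected_query`, `must_not_equal_query`, `must_include_terms`, `must_not_include_terms`) might be checked along these lines. This is a sketch under stated assumptions: matching is taken to be case-insensitive substring matching on the generated query, and `checkQueryAssertions` is a hypothetical name, not the suite's API.

```javascript
// Hypothetical evaluator for the query-based assertion keys.
// Assumes case-insensitive comparison and substring matching for terms.
function checkQueryAssertions(spec, query) {
    const normalized = String(query).trim().toLowerCase();
    const failures = [];

    if (typeof spec.expected_query === 'string'
        && normalized !== spec.expected_query.toLowerCase()) {
        failures.push('query differs from expected_query');
    }
    if (typeof spec.must_not_equal_query === 'string'
        && normalized === spec.must_not_equal_query.toLowerCase()) {
        failures.push('query equals must_not_equal_query');
    }

    (spec.must_include_terms || []).forEach(function (term) {
        if (!normalized.includes(term.toLowerCase())) {
            failures.push('missing term: ' + term);
        }
    });
    (spec.must_not_include_terms || []).forEach(function (term) {
        if (normalized.includes(term.toLowerCase())) {
            failures.push('forbidden term: ' + term);
        }
    });

    return failures;
}
```

With substring matching, a forbidden term such as `300 s` does not collide with the required term `300`, which is why `followup_indicator_price_001` can demand one and exclude the other.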
@@ -16,4 +16,4 @@
|
|||||||
{"id":"retrieval_negative_003","type":"retrieval","prompt":"testomat 2000 self clean reinigungsloesung","assert":{"min_results":1,"must_include_one_of_document_ids":["51589532-a1a1-46e0-94b2-a139dce78543","b8c3343b-931e-4994-9d53-a2130efc846f"],"must_include_any_terms":["reinigungslösung","self clean"],"must_not_include_document_ids":["26129c01-c09f-4c71-9c80-7ddffb6c77fb"]}}
|
{"id":"retrieval_negative_003","type":"retrieval","prompt":"testomat 2000 self clean reinigungsloesung","assert":{"min_results":1,"must_include_one_of_document_ids":["51589532-a1a1-46e0-94b2-a139dce78543","b8c3343b-931e-4994-9d53-a2130efc846f"],"must_include_any_terms":["reinigungslösung","self clean"],"must_not_include_document_ids":["26129c01-c09f-4c71-9c80-7ddffb6c77fb"]}}
|
||||||
{"id":"retrieval_short_001","type":"retrieval","prompt":"evo th","assert":{"min_results":1,"must_include_one_of_document_ids":["eb91c1be-4546-4ed5-8b01-f075519d675b","74fdad85-5e4e-4f08-8d95-402f3180ed55"],"must_include_any_terms":["evo"]}}
|
{"id":"retrieval_short_001","type":"retrieval","prompt":"evo th","assert":{"min_results":1,"must_include_one_of_document_ids":["eb91c1be-4546-4ed5-8b01-f075519d675b","74fdad85-5e4e-4f08-8d95-402f3180ed55"],"must_include_any_terms":["evo"]}}
|
||||||
{"id":"retrieval_short_002","type":"retrieval","prompt":"808","assert":{"min_results":1,"must_include_one_of_document_ids":["26129c01-c09f-4c71-9c80-7ddffb6c77fb"],"must_include_any_terms":["808"]}}
|
{"id":"retrieval_short_002","type":"retrieval","prompt":"808","assert":{"min_results":1,"must_include_one_of_document_ids":["26129c01-c09f-4c71-9c80-7ddffb6c77fb"],"must_include_any_terms":["808"]}}
|
||||||
{"id":"retrieval_noise_001","type":"retrieval","prompt":"dsgfsdgfsdgf","assert":{"max_results":0}}
|
{"id":"retrieval_notfound_doc","type":"retrieval","prompt":"hdfghdfghdfhg","assert":{"min_results":0}}
|
||||||
|
|||||||
5
tests/evals/cases/shop_query.ndjson
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
{"id":"shop_query_indicator_exact_001","type":"shop_query","prompt":"was kostet der Testomat 808 Indikator 300","assert":{"must_include_terms":["testomat","808","300","indikator"],"must_not_include_terms":["300 s","301","302","303","gerät selber"]}}
|
||||||
|
{"id":"shop_query_brewing_water_cleanup_001","type":"shop_query","prompt":"ich möchte für brauerei das brauwasser messen","assert":{"expected_query":"brauerei brauwasser","must_include_terms":["brauerei","brauwasser"],"must_not_include_terms":["möchte","messen","think"]}}
|
||||||
|
{"id":"shop_query_swimming_pool_typo_001","type":"shop_query","prompt":"ich würde gern chlor im schwinnbad messen","assert":{"expected_query":"chlor schwimmbad","must_include_terms":["chlor","schwimmbad"],"must_not_include_terms":["schwinnbad","messen"]}}
|
||||||
|
{"id":"shop_query_lab_cl_acronym_001","type":"shop_query","prompt":"Zeige mir die Preise zu Testomat LAB CL.","assert":{"expected_query":"testomat lab cl","must_include_terms":["testomat","lab","cl"],"must_not_equal_query":"testomat"}}
|
||||||
|
{"id":"shop_query_sio2_anchor_001","type":"shop_query","prompt":"suche gerät kühlsysteme Silikatüberwachung","assert":{"expected_query":"testomat 808 sio2","must_include_terms":["testomat","808","sio2"],"must_not_include_terms":["kühlsysteme","silikatüberwachung"]}}
|
||||||
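Because each case file is NDJSON (one JSON object per line), a malformed line or a duplicated `id` is easy to introduce by hand. The sketch below shows one way to pre-check a file's text; `parseEvalCases` is a hypothetical helper, and the required fields (`id`, `type`, `prompt`, `assert`) are inferred from the cases shown here rather than from a documented schema.

```javascript
// Hypothetical pre-check for an NDJSON case file given as a string.
// Returns the accepted cases plus a list of human-readable errors.
function parseEvalCases(text) {
    const cases = [];
    const errors = [];
    const seenIds = new Set();

    text.split('\n').forEach(function (line, index) {
        const lineNo = index + 1;
        if (line.trim() === '') {
            return; // blank lines are ignored
        }

        let obj;
        try {
            obj = JSON.parse(line);
        } catch (e) {
            errors.push('line ' + lineNo + ': invalid JSON');
            return;
        }

        const isObject = obj !== null && typeof obj === 'object' && !Array.isArray(obj);
        const hasStrings = isObject && ['id', 'type', 'prompt'].every(function (key) {
            return typeof obj[key] === 'string' && obj[key] !== '';
        });
        const hasAssert = isObject && obj.assert !== null
            && typeof obj.assert === 'object' && !Array.isArray(obj.assert);

        if (!hasStrings || !hasAssert) {
            errors.push('line ' + lineNo + ': missing id, type, prompt, or assert');
            return;
        }
        if (seenIds.has(obj.id)) {
            errors.push('line ' + lineNo + ': duplicate id ' + obj.id);
            return;
        }

        seenIds.add(obj.id);
        cases.push(obj);
    });

    return { cases: cases, errors: errors };
}
```

Running such a check before an eval run would surface file-level problems separately from genuine regression failures.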