
Evaluation Models

Data structures for agent success evaluation results.

AgentSuccessEvaluationResult

Represents an agent performance evaluation result.

Prop

Type

GetAgentSuccessEvaluationResultsRequest

Request model for getting agent success evaluation results.

Prop

Type

GetAgentSuccessEvaluationResultsResponse

Response model for getting agent success evaluation results.

Prop

Type

RetrievedLearningEvaluationResult

The latest per-learning relevance/impact verdict for one target interaction. One row per (user_id, session_id, interaction_id, kind, learning_id); the stored set is the most recent successfully persisted evaluation for the session, not an append-only history. The same learning used on multiple interactions is evaluated separately against each response. Produced automatically when a session that published interactions with retrieved_learnings goes through group evaluation.

Prop

Type
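The row identity and "latest set, not history" behavior described for RetrievedLearningEvaluationResult can be illustrated with a small in-memory sketch. It is not the actual storage code: the `Verdict` class, its `verdict` field, the `"relevance"` kind value, and the `replace_session_results` helper are hypothetical names, and replacing all of a session's rows on each persisted evaluation is one plausible reading of the description.

```python
from dataclasses import dataclass

# Hypothetical key mirroring the documented row identity:
# (user_id, session_id, interaction_id, kind, learning_id)
Key = tuple[str, str, str, str, str]


@dataclass(frozen=True)
class Verdict:
    user_id: str
    session_id: str
    interaction_id: str
    kind: str
    learning_id: str
    verdict: str  # hypothetical payload field


def replace_session_results(
    store: dict[Key, Verdict],
    user_id: str,
    session_id: str,
    new_rows: list[Verdict],
) -> None:
    """Swap in the newest evaluation for a session instead of appending."""
    # Drop every row left over from the session's previous evaluation.
    stale = [k for k in store if k[0] == user_id and k[1] == session_id]
    for key in stale:
        del store[key]
    # One row per (user, session, interaction, kind, learning).
    for row in new_rows:
        key = (row.user_id, row.session_id, row.interaction_id,
               row.kind, row.learning_id)
        store[key] = row
```

In this sketch the same learning used on two interactions produces two rows, and a later evaluation of the session leaves only its own rows behind.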

GetRetrievedLearningEvaluationResultsRequest

Request model for POST /api/get_retrieved_learning_evaluation_results.

Prop

Type

GetRetrievedLearningEvaluationResultsResponse

Response model for POST /api/get_retrieved_learning_evaluation_results.

Prop

Type

EvaluationSourceSetRequest

One labeled request-source cohort for evaluation comparison. Sources match Request.source exactly, including the empty string for requests published without a source.

Prop

Type
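The exact-match rule for EvaluationSourceSetRequest has one easy-to-miss consequence: the empty string is a real source value, so a truthiness check would silently drop requests published without a source. A minimal sketch, assuming a hypothetical `in_source_set` helper and treating "exactly" as case-sensitive equality:

```python
def in_source_set(request_source: str, sources: list[str]) -> bool:
    """Return True when the request's source is one of the cohort's sources.

    Comparison is plain string equality, so "" matches only a cohort
    that lists "" explicitly, and "Web" does not match "web".
    """
    return request_source in sources


def in_source_set_buggy(request_source: str, sources: list[str]) -> bool:
    # Anti-pattern shown for contrast: `if request_source` treats "" as
    # missing and excludes requests that were published without a source.
    return bool(request_source) and request_source in sources
```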

SourceSetEvaluationMetrics

Metrics for one source set in SourceSetComparison.

Prop

Type

SourceSetComparison

Request-source comparison payload. Each evaluation result is assigned by joining to the session's first request and reading that request's source.

Prop

Type
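The assignment rule for SourceSetComparison (join each result to its session's first request, then read that request's source) can be sketched as below. The `Request` stand-in, its `created_at` ordering field, and both helper functions are assumptions made for illustration; only `Request.source` is named in the description above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    session_id: str
    created_at: float  # hypothetical field used to find the first request
    source: str


def first_request_source(requests: list[Request]) -> dict[str, str]:
    """Map each session_id to the source of its earliest request."""
    first: dict[str, Request] = {}
    for req in requests:
        seen = first.get(req.session_id)
        if seen is None or req.created_at < seen.created_at:
            first[req.session_id] = req
    return {sid: req.source for sid, req in first.items()}


def group_results(
    result_session_ids: list[str],
    requests: list[Request],
    source_sets: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Group evaluation results (by session id) under each labeled source set."""
    source_by_session = first_request_source(requests)
    grouped: dict[str, list[str]] = {label: [] for label in source_sets}
    for sid in result_session_ids:
        source = source_by_session.get(sid)
        if source is None:
            continue  # no request found for this session
        for label, sources in source_sets.items():
            if source in sources:  # exact match, "" included
                grouped[label].append(sid)
    return grouped
```

Later requests in a session do not change the assignment: a session that starts with source `"web"` and continues with `"mobile"` is counted under `"web"`.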

GetEvaluationOverviewRequest

Input for the Evaluation Overview endpoint (POST /api/get_evaluation_overview).

Prop

Type

GetEvaluationOverviewResponse

Response model for the Evaluation Overview endpoint. The global fields populate the all-source hero, context, and rule-attribution sections. Source-set metrics are returned separately in source_set_comparison; the web portal uses those source-set metrics to scope the page when a user selects a single source.

Prop

Type

ShadowComparisonOutput

LLM judge verdict for a single per-turn Reflexio-vs-shadow comparison (F1). The position of the two responses shown to the judge is randomized per call to mitigate position bias; the mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1.

Prop

Type
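The position randomization described for ShadowComparisonOutput can be sketched as follows. This is an illustration under assumptions, not the judge's actual code: `order_for_judge`, `reflexio_preferred`, and the `judge_preferred_request` value (1 or 2) are hypothetical; only `reflexio_is_request_1` comes from the description above.

```python
import random


def order_for_judge(
    reflexio_response: str,
    shadow_response: str,
    rng: random.Random,
) -> tuple[str, str, bool]:
    """Pick a random presentation order and report which slot Reflexio got."""
    reflexio_is_request_1 = rng.random() < 0.5
    if reflexio_is_request_1:
        return reflexio_response, shadow_response, True
    return shadow_response, reflexio_response, False


def reflexio_preferred(judge_preferred_request: int,
                       reflexio_is_request_1: bool) -> bool:
    """Translate the judge's positional pick (1 or 2) back to Reflexio vs shadow."""
    return (judge_preferred_request == 1) == reflexio_is_request_1
```

Recording the flag alongside the verdict is what makes the positional answer interpretable later; without it, "request 1 was better" cannot be attributed to either system.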

ShadowComparisonVerdict

One per-turn comparison verdict, stored per (interaction_id, judge_prompt_version) in the shadow_comparison_verdicts table.

Prop

Type

ShadowWinRateTrendPoint

One daily bucket of per-turn shadow comparison verdicts (F1). Buckets are UTC-aligned and surfaced in ascending date order.

Prop

Type
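UTC-aligned daily bucketing, as described for ShadowWinRateTrendPoint, can be sketched like this. The input shape (a timezone-aware timestamp plus a win flag) and the output tuple `(date, wins, total)` are assumptions for illustration; the real model's fields are not shown here.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_buckets(
    verdicts: list[tuple[datetime, bool]],
) -> list[tuple[str, int, int]]:
    """Group verdicts into UTC calendar days, returned in ascending date order.

    Each input item is (created_at, reflexio_won); created_at must be
    timezone-aware so the conversion to UTC is unambiguous.
    """
    wins: Counter[str] = Counter()
    totals: Counter[str] = Counter()
    for created_at, reflexio_won in verdicts:
        day = created_at.astimezone(timezone.utc).date().isoformat()
        totals[day] += 1
        wins[day] += int(reflexio_won)
    # ISO dates sort lexicographically in chronological order.
    return [(day, wins[day], totals[day]) for day in sorted(totals)]
```

A verdict recorded at 23:30 in a UTC-5 zone lands in the next UTC day's bucket, which is the practical effect of aligning buckets to UTC rather than to the viewer's local time.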

ShadowWinRateTrendWindowTotal

Aggregate of all shadow verdicts across the trend window (F1). Used to render the headline win-rate tile on the Evaluation page.

Prop

Type

ShadowWinRateTrend

F1 shadow win-rate trend payload. Carried on GetEvaluationOverviewResponse.shadow_win_rate_trend. Verdicts produced under a previous rubric epoch are filtered out at storage time by judge_prompt_version, so the dashboard never silently mixes incompatible rubrics into the headline number.

Prop

Type

GetRecentShadowComparisonsResponse

Returned by GET /api/evaluations/shadow_comparisons/recent.

Prop

Type

GET /api/evaluations/shadow_comparisons/recent

Returns the N most recent per-turn shadow comparison verdicts (F1). Powers two surfaces on the Evaluation page:

The drawer triggered from the per-turn comparison tile — shows the N most recent verdicts so you can spot-check the judge.
The "Top 10 disagreements" widget — fetches a wider pool and the frontend filters to is_significantly_better=True losses to surface actionable rule-correction candidates.
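The frontend filter behind the "Top 10 disagreements" widget can be approximated as below. Only `is_significantly_better` is named in the description; the `reflexio_won` key, the dict-shaped verdicts, and the assumption that the pool arrives newest-first are hypothetical.

```python
def top_disagreements(verdicts: list[dict], limit: int = 10) -> list[dict]:
    """Keep losses the judge marked as significant, preserving input order.

    Assumes `verdicts` is already sorted newest-first, as a "recent"
    endpoint would plausibly return it.
    """
    significant_losses = [
        v for v in verdicts
        if v["is_significantly_better"] and not v["reflexio_won"]
    ]
    return significant_losses[:limit]
```

Fetching a wider pool than 10 matters because most verdicts are filtered out; a pool of exactly 10 would rarely yield 10 significant losses.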

Verdicts are restricted to the org's currently pinned Config.shadow_comparison_judge_prompt_version so verdicts from an older rubric never mix into the drawer. A 30-day lookback is enforced server-side so the storage layer can use an index range scan instead of a full table read.
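The two server-side restrictions (pinned prompt version and 30-day lookback) combine into a simple selection rule. The sketch below applies it to in-memory rows purely for illustration; a real backend would push both conditions into an indexed query. The `created_at` key, the dict-shaped rows, and the `recent_verdicts` helper are assumptions.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(days=30)


def recent_verdicts(
    rows: list[dict],
    pinned_version: str,
    limit: int,
    now: datetime,
) -> list[dict]:
    """Return up to `limit` newest verdicts for the pinned rubric version."""
    cutoff = now - LOOKBACK
    eligible = [
        r for r in rows
        if r["judge_prompt_version"] == pinned_version
        and r["created_at"] >= cutoff
    ]
    # Newest first, then truncate to the requested count.
    eligible.sort(key=lambda r: r["created_at"], reverse=True)
    return eligible[:limit]
```

The bounded lookback is what allows a range scan: the query has a fixed lower bound on the timestamp, so the storage layer does not need to read rows older than the cutoff.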

Query parameters

Prop

Type

Response

GetRecentShadowComparisonsResponse.

Errors

503 Storage not configured — the server has no storage backend wired.
An empty result is not an error: the endpoint returns 200 with verdicts: [] when the storage backend does not implement the shadow_comparison_verdicts feature (e.g. the disk backend), or when no verdicts in the 30-day window match the pinned prompt version.

RegenerateRequest

Input for POST /api/evaluations/regenerate — kicks off a replay-the-judge job over a closed window.

Prop

Type

RegenerateStartResponse

Returned by POST /api/evaluations/regenerate.

Prop

Type

RegenerateFailure

One failed session in a regen job's failure list.

Prop

Type

RegenerateStatusResponse

Returned by GET /api/evaluations/regenerate/{job_id}. Carries the live lifecycle state of the regen worker plus the F3 sampling and concurrency counters described in Measuring Reflexio's Impact.

Prop

Type

GradeOnDemandRequest

Input for POST /api/evaluations/grade_on_demand — single-session click-through grading triggered when a dashboard user opens a session that wasn't in the sampled regen set. See Sampling and freshness.

Prop

Type

GradeOnDemandResponse

Returned by POST /api/evaluations/grade_on_demand. Results are cached for 24 hours via the operation-state mechanism: a repeat call for the same (session, agent_version) within the cache window returns cached: true with the original result_id.

Prop

Type
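The 24-hour caching behavior described for GradeOnDemandResponse can be modeled with a small in-memory stand-in. This is not the operation-state mechanism itself: the `GradeCache` class, the injected `clock`, and the `run_grading` callback are hypothetical, and only the `cached` and `result_id` fields come from the description above.

```python
import time
from typing import Callable

CACHE_TTL_SECONDS = 24 * 60 * 60


class GradeCache:
    """In-memory stand-in for a 24-hour cache keyed by (session, agent_version)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # key -> (stored_at, result_id)
        self._entries: dict[tuple[str, str], tuple[float, str]] = {}

    def grade(
        self,
        session_id: str,
        agent_version: str,
        run_grading: Callable[[str, str], str],
    ) -> dict:
        key = (session_id, agent_version)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
            # Within the window: return the original result_id, no re-grading.
            return {"result_id": hit[1], "cached": True}
        result_id = run_grading(session_id, agent_version)
        self._entries[key] = (now, result_id)
        return {"result_id": result_id, "cached": False}
```

Because the key includes the agent version, grading the same session under a different agent version is a cache miss in this sketch and triggers a fresh grading run.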