Evaluation Models
Data structures for agent success evaluation results.
AgentSuccessEvaluationResult
Represents an agent performance evaluation result.
Prop
Type
GetAgentSuccessEvaluationResultsRequest
Request model for getting agent success evaluation results.
Prop
Type
GetAgentSuccessEvaluationResultsResponse
Response model for getting agent success evaluation results.
Prop
Type
TrendPoint
One point on a grouped success-rate trend curve. Used by F2's session-level A/B comparison (see overview).
Prop
Type
SuccessRateTrendByGroup
Group-split trend data for the Evaluation page's dual-curve chart (F2).
Grouping is by Request.metadata.reflexio_retrieval_enabled, read from the
first request of each session in the window. Sessions whose first request
has the key absent OR a non-bool value land in untagged — surfaced (not
silently coerced) so customers can see how many sessions are tagged
inconsistently.
Prop
Type
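The grouping rule above can be sketched as follows. This is a minimal illustration, not the service's implementation: the function name `classify_session` and the group labels `"enabled"` and `"disabled"` are assumptions made for the example; only the `untagged` group and the `reflexio_retrieval_enabled` key come from the description above.

```python
from typing import Any, Dict, Optional

METADATA_KEY = "reflexio_retrieval_enabled"


def classify_session(first_request_metadata: Optional[Dict[str, Any]]) -> str:
    """Pick the trend group for one session from its first request's metadata.

    Group labels "enabled" / "disabled" are hypothetical; "untagged" is the
    bucket named in the docs for absent or non-bool values.
    """
    value = (first_request_metadata or {}).get(METADATA_KEY)
    # Only a real bool counts. Strings such as "true" and ints such as 1 are
    # deliberately not coerced, so inconsistent tagging stays visible.
    if isinstance(value, bool):
        return "enabled" if value else "disabled"
    return "untagged"
```

Note that `isinstance(1, bool)` is `False` in Python, so an integer flag lands in `untagged` rather than being treated as true.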
GetEvaluationOverviewRequest
Input for the Evaluation Overview endpoint (POST /api/get_evaluation_overview).
Prop
Type
GetEvaluationOverviewResponse
Response model for the Evaluation Overview endpoint. The new
success_rate_trend_by_group field carries the F2 dual-curve trend; the
other fields populate the existing hero, context, and rule-attribution
sections of the Evaluation page.
Prop
Type
ShadowComparisonOutput
LLM judge verdict for a single per-turn Reflexio-vs-shadow comparison (F1).
The position of the two responses shown to the judge is randomized per call
to mitigate position bias; the mapping is recorded on
ShadowComparisonVerdict.reflexio_is_request_1.
Prop
Type
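To illustrate the position randomization described above, here is a small sketch under stated assumptions. The helper names `arrange_for_judge` and `reflexio_won`, and the idea that the judge reports which slot it preferred, are hypothetical; only the `reflexio_is_request_1` flag is taken from the text.

```python
import random
from typing import Optional, Tuple


def arrange_for_judge(
    reflexio_response: str,
    shadow_response: str,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str, bool]:
    """Return (request_1, request_2, reflexio_is_request_1) in random order."""
    rng = rng or random.Random()
    reflexio_is_request_1 = rng.random() < 0.5
    if reflexio_is_request_1:
        return reflexio_response, shadow_response, True
    return shadow_response, reflexio_response, False


def reflexio_won(judge_picked_request_1: bool, reflexio_is_request_1: bool) -> bool:
    """Map the judge's slot preference back to Reflexio vs. shadow."""
    # Reflexio wins when the slot the judge picked is the slot Reflexio was in.
    return judge_picked_request_1 == reflexio_is_request_1
```

Recording the flag alongside the verdict is what lets a reader decode "request 1 was better" after the fact without knowing how the coin landed.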
ShadowComparisonVerdict
One per-turn comparison verdict, stored per
(interaction_id, judge_prompt_version) in the
shadow_comparison_verdicts table.
Prop
Type
ShadowWinRateTrendPoint
One daily bucket of per-turn shadow comparison verdicts (F1). Buckets are UTC-aligned and surfaced in ascending date order.
Prop
Type
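As a rough sketch of UTC-aligned daily bucketing, the snippet below counts verdicts per UTC calendar day and returns the buckets in ascending date order. The function name `bucket_by_utc_day` is hypothetical, and the example assumes each verdict carries a timezone-aware timestamp; the real field names are not given here.

```python
from collections import Counter
from datetime import date, datetime, timezone
from typing import Iterable, List, Tuple


def bucket_by_utc_day(timestamps: Iterable[datetime]) -> List[Tuple[date, int]]:
    """Count timestamps per UTC day, sorted by ascending date.

    Assumes timezone-aware datetimes; naive values would be interpreted as
    local time by astimezone(), which is not what this sketch intends.
    """
    counts = Counter(ts.astimezone(timezone.utc).date() for ts in timestamps)
    return sorted(counts.items())
```

A verdict recorded at 23:30 in a UTC-5 zone falls into the next day's bucket, because it is 04:30 UTC on the following date.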
ShadowWinRateTrendWindowTotal
Aggregate of all shadow verdicts across the trend window (F1). Used to render the headline win-rate tile on the Evaluation page.
Prop
Type
ShadowWinRateTrend
F1 shadow win-rate trend payload. Carried on
GetEvaluationOverviewResponse.shadow_win_rate_trend.
Verdicts produced under a previous rubric epoch are filtered out at storage
time by judge_prompt_version, so the dashboard never silently mixes
incompatible rubrics into the headline number.
Prop
Type
GetRecentShadowComparisonsResponse
Returned by GET /api/evaluations/shadow_comparisons/recent.
Prop
Type
GET /api/evaluations/shadow_comparisons/recent
Returns the N most recent per-turn shadow comparison verdicts (F1). Powers two surfaces on the Evaluation page:
- The drawer triggered from the per-turn comparison tile — shows the N most recent verdicts so you can spot-check the judge.
- The "Top 10 disagreements" widget, which fetches a wider pool; the frontend then filters to is_significantly_better=True losses to surface actionable rule-correction candidates.
Verdicts are restricted to the org's currently pinned
Config.shadow_comparison_judge_prompt_version
so verdicts from an older rubric never mix into the drawer. A 30-day
lookback is enforced server-side so the storage layer can use an index
range scan instead of a full table read.
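The selection rule above (pinned prompt version, 30-day lookback, N most recent) can be sketched in memory as follows. This is an illustration only: the server applies the rule in the storage layer, and the function name `recent_verdicts`, the dict-shaped verdicts, and the `created_at` field are assumptions made for the example. `judge_prompt_version` is the key named in the verdict description.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

LOOKBACK = timedelta(days=30)


def recent_verdicts(
    verdicts: List[Dict[str, Any]],
    pinned_version: str,
    limit: int,
    now: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Return up to `limit` newest verdicts for the pinned rubric version."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - LOOKBACK
    eligible = [
        v
        for v in verdicts
        # Drop verdicts from other rubric versions and anything past the lookback.
        if v["judge_prompt_version"] == pinned_version and v["created_at"] >= cutoff
    ]
    eligible.sort(key=lambda v: v["created_at"], reverse=True)
    return eligible[:limit]
```

Bounding the query by a fixed cutoff is what makes an index range scan on the timestamp possible, instead of reading every row and sorting.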
Query parameters
Prop
Type
Response
GetRecentShadowComparisonsResponse.
Errors
- 503 Storage not configured: the server has no storage backend wired.
- Returns verdicts: [] (200) when the storage backend does not implement the shadow_comparison_verdicts feature (e.g. the disk backend), or when no verdicts in the 30-day window match the pinned prompt version.
RegenerateRequest
Input for POST /api/evaluations/regenerate — kicks off a replay-the-judge job
over a closed window.
Prop
Type
RegenerateStartResponse
Returned by POST /api/evaluations/regenerate.
Prop
Type
RegenerateFailure
One failed session in a regen job's failure list.
Prop
Type
RegenerateStatusResponse
Returned by GET /api/evaluations/regenerate/{job_id}. Carries the live
lifecycle state of the regen worker plus the F3 sampling and concurrency
counters described in Measuring Reflexio's Impact.
Prop
Type
GradeOnDemandRequest
Input for POST /api/evaluations/grade_on_demand — single-session
click-through grading triggered when a dashboard user opens a session that
wasn't in the sampled regen set. See
Sampling and freshness.
Prop
Type
GradeOnDemandResponse
Returned by POST /api/evaluations/grade_on_demand. Results are cached for
24 hours via the operation-state mechanism — the second call for the same
(session, agent_version) within the cache window will
return cached: true with the original result_id.
Prop
Type
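The 24-hour caching behavior can be pictured with a small in-memory sketch. It is not the operation-state mechanism itself: the class `GradeCache`, its `grade` method, and the use of a random UUID as a stand-in for running the judge are all assumptions for illustration. Only the `(session, agent_version)` key, the 24-hour window, and the `cached` / `result_id` fields come from the description above.

```python
import time
import uuid
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60


class GradeCache:
    """Toy cache keyed by (session_id, agent_version) with a 24-hour TTL."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # key -> (result_id, time the result was stored)
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def grade(self, session_id: str, agent_version: str) -> dict:
        key = (session_id, agent_version)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < CACHE_TTL_SECONDS:
            # Within the window: hand back the original result_id.
            return {"result_id": entry[0], "cached": True}
        result_id = str(uuid.uuid4())  # stand-in for actually grading the session
        self._entries[key] = (result_id, now)
        return {"result_id": result_id, "cached": False}
```

Because `agent_version` is part of the key, grading the same session under a different agent version produces a fresh result rather than a cache hit.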