Reflexio Docs

Evaluation Models

Data structures for agent success evaluation results.

AgentSuccessEvaluationResult

Represents an agent performance evaluation result.

Prop

Type

GetAgentSuccessEvaluationResultsRequest

Request model for getting agent success evaluation results.

Prop

Type

GetAgentSuccessEvaluationResultsResponse

Response model for getting agent success evaluation results.

Prop

Type

TrendPoint

One point on a grouped success-rate trend curve. Used by F2's session-level A/B comparison (see overview).

Prop

Type

SuccessRateTrendByGroup

Group-split trend data for the Evaluation page's dual-curve chart (F2).

Grouping is by Request.metadata.reflexio_retrieval_enabled, read from the first request of each session in the window. Sessions whose first request has the key absent OR a non-bool value land in untagged — surfaced (not silently coerced) so customers can see how many sessions are tagged inconsistently.

Prop

Type
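The grouping rule described for SuccessRateTrendByGroup can be illustrated with a minimal sketch. Only the metadata key and the untagged bucket come from this page; the function name and the "enabled" / "disabled" labels are hypothetical and chosen for illustration, not taken from the Reflexio codebase.

```python
from typing import Any, Mapping, Optional

METADATA_KEY = "reflexio_retrieval_enabled"


def group_for_session(first_request_metadata: Optional[Mapping[str, Any]]) -> str:
    """Pick a trend group from the metadata of a session's FIRST request.

    The "enabled" / "disabled" labels are hypothetical; "untagged" is the
    bucket named in the docs.
    """
    if not first_request_metadata or METADATA_KEY not in first_request_metadata:
        return "untagged"  # key absent
    value = first_request_metadata[METADATA_KEY]
    # Only genuine booleans count. Values such as "true", 1, or None are
    # not coerced; they are surfaced as untagged instead.
    if isinstance(value, bool):
        return "enabled" if value else "disabled"
    return "untagged"
```

Under this sketch, a session tagged with the string "true" is counted as untagged, which is what makes inconsistent tagging visible rather than hidden.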

GetEvaluationOverviewRequest

Input for the Evaluation Overview endpoint (POST /api/get_evaluation_overview).

Prop

Type

GetEvaluationOverviewResponse

Response model for the Evaluation Overview endpoint. The new success_rate_trend_by_group field carries the F2 dual-curve trend; the other fields populate the existing hero, context, and rule-attribution sections of the Evaluation page.

Prop

Type

ShadowComparisonOutput

LLM judge verdict for a single per-turn Reflexio-vs-shadow comparison (F1). The position of the two responses shown to the judge is randomized per call to mitigate position bias; the mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1.

Prop

Type
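To show why recording reflexio_is_request_1 matters, here is a rough sketch of position randomization and of mapping a positional judge pick back to the real source. The function names and the assumption that the judge returns a pick of 1 or 2 are hypothetical; the actual judge output shape is defined by ShadowComparisonOutput and may differ (for example, it may allow ties).

```python
import random
from typing import Tuple


def present_to_judge(
    reflexio_response: str, shadow_response: str, rng: random.Random
) -> Tuple[str, str, bool]:
    """Return (request_1, request_2, reflexio_is_request_1) in random order."""
    reflexio_is_request_1 = rng.random() < 0.5
    if reflexio_is_request_1:
        return reflexio_response, shadow_response, True
    return shadow_response, reflexio_response, False


def winner_from_judge(judge_pick: int, reflexio_is_request_1: bool) -> str:
    """Map the judge's positional pick (1 or 2, an assumed shape) to a source."""
    if judge_pick not in (1, 2):
        raise ValueError("judge_pick must be 1 or 2")
    picked_first = judge_pick == 1
    return "reflexio" if picked_first == reflexio_is_request_1 else "shadow"
```

Without the stored flag, a verdict of "response 1 is better" could not be attributed to either side after the fact.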

ShadowComparisonVerdict

One per-turn comparison verdict, stored per (interaction_id, judge_prompt_version) in the shadow_comparison_verdicts table.

Prop

Type

ShadowWinRateTrendPoint

One daily bucket of per-turn shadow comparison verdicts (F1). Buckets are UTC-aligned and surfaced in ascending date order.

Prop

Type
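As an illustration of UTC-aligned daily bucketing in ascending date order, the sketch below counts timestamps per UTC calendar day. It is not the server's implementation; the function name and the (date, count) output shape are assumptions, and the real ShadowWinRateTrendPoint carries its own fields.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def daily_utc_buckets(timestamps: Iterable[datetime]) -> List[Tuple[str, int]]:
    """Count timestamps per UTC calendar day, sorted by ascending date."""
    counts: Counter = Counter()
    for ts in timestamps:
        if ts.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        # Convert to UTC before taking the date so buckets are UTC-aligned.
        day = ts.astimezone(timezone.utc).date().isoformat()
        counts[day] += 1
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(counts.items())
```

A verdict created late in the evening in a UTC-5 timezone falls into the next UTC day under this scheme.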

ShadowWinRateTrendWindowTotal

Aggregate of all shadow verdicts across the trend window (F1). Used to render the headline win-rate tile on the Evaluation page.

Prop

Type

ShadowWinRateTrend

F1 shadow win-rate trend payload. Carried on GetEvaluationOverviewResponse.shadow_win_rate_trend. Verdicts produced under a previous rubric epoch are filtered out at storage time by judge_prompt_version, so the dashboard never silently mixes incompatible rubrics into the headline number.

Prop

Type

GetRecentShadowComparisonsResponse

Returned by GET /api/evaluations/shadow_comparisons/recent.

GET /api/evaluations/shadow_comparisons/recent

Returns the N most recent per-turn shadow comparison verdicts (F1). Powers two surfaces on the Evaluation page:

  1. The drawer triggered from the per-turn comparison tile — shows the N most recent verdicts so you can spot-check the judge.
  2. The "Top 10 disagreements" widget: fetches a wider pool, which the frontend filters down to losses with is_significantly_better=True to surface actionable rule-correction candidates.

Verdicts are restricted to the org's currently pinned Config.shadow_comparison_judge_prompt_version so verdicts from an older rubric never mix into the drawer. A 30-day lookback is enforced server-side so the storage layer can use an index range scan instead of a full table read.
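The two server-side restrictions above (pinned prompt version, 30-day lookback) can be sketched as an in-memory filter. This is only an illustration of the selection logic: the real filtering happens in the storage layer, and the created_at field name, the dict-shaped verdicts, and the function name are assumptions.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

LOOKBACK = timedelta(days=30)


def recent_verdicts(
    verdicts: List[Dict[str, Any]],
    pinned_version: str,
    limit: int,
    now: datetime,
) -> List[Dict[str, Any]]:
    """Return up to `limit` newest verdicts for the pinned version in the window."""
    cutoff = now - LOOKBACK
    eligible = [
        v
        for v in verdicts
        if v["judge_prompt_version"] == pinned_version
        and v["created_at"] >= cutoff  # "created_at" is a hypothetical field
    ]
    eligible.sort(key=lambda v: v["created_at"], reverse=True)
    return eligible[:limit]
```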

Query parameters

Prop

Type

Response

GetRecentShadowComparisonsResponse.

Errors

  • 503 Storage not configured — the server has no storage backend wired.
  • 200 with verdicts: [] (not an error): returned when the storage backend does not implement the shadow_comparison_verdicts feature (e.g. the disk backend), or when no verdicts in the 30-day window match the pinned prompt version.
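A client may want to treat the 503 case as a configuration problem and the empty 200 case as a normal "nothing to show" result. The sketch below takes an already-received status code and decoded JSON body, so it makes no assumptions about the HTTP library; the exception class and function name are hypothetical.

```python
from typing import Any, Dict, List


class StorageNotConfigured(RuntimeError):
    """Raised when the server reports 503 Storage not configured."""


def verdicts_from_response(status_code: int, body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Interpret a response from GET /api/evaluations/shadow_comparisons/recent."""
    if status_code == 503:
        raise StorageNotConfigured("server has no storage backend wired")
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    # An empty list is a valid result: unsupported backend or no matching verdicts.
    return list(body.get("verdicts", []))
```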

RegenerateRequest

Input for POST /api/evaluations/regenerate — kicks off a replay-the-judge job over a closed window.

Prop

Type

RegenerateStartResponse

Returned by POST /api/evaluations/regenerate.

Prop

Type

RegenerateFailure

One failed session in a regen job's failure list.

Prop

Type

RegenerateStatusResponse

Returned by GET /api/evaluations/regenerate/{job_id}. Carries the live lifecycle state of the regen worker plus the F3 sampling and concurrency counters described in Measuring Reflexio's Impact.

Prop

Type

GradeOnDemandRequest

Input for POST /api/evaluations/grade_on_demand — single-session click-through grading triggered when a dashboard user opens a session that wasn't in the sampled regen set. See Sampling and freshness.

Prop

Type

GradeOnDemandResponse

Returned by POST /api/evaluations/grade_on_demand. Results are cached for 24 hours via the operation-state mechanism: a repeat call for the same (session, agent_version) within the cache window returns cached: true with the original result_id.

Prop

Type
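The 24-hour caching behavior of grade-on-demand can be sketched with a small in-memory stand-in. The real server uses its operation-state mechanism, which is not shown here; the class name, the injected clock, and the run_grader callback are hypothetical, and only the (session, agent_version) key, the 24-hour window, and the cached / result_id fields come from this page.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour cache window


class GradeCache:
    """In-memory stand-in for the operation-state cache (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # (session_id, agent_version) -> (graded_at, result_id)
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def grade(
        self, session_id: str, agent_version: str, run_grader: Callable[[], str]
    ) -> Dict[str, object]:
        key = (session_id, agent_version)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            # Repeat call inside the window: reuse the original result_id.
            return {"cached": True, "result_id": entry[1]}
        result_id = run_grader()  # hypothetical callback that grades the session
        self._entries[key] = (now, result_id)
        return {"cached": False, "result_id": result_id}
```

Keying on agent_version as well as the session means a new agent version triggers a fresh grade instead of reusing a stale one.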