Measuring Reflexio's Impact
Session-level A/B comparison and per-turn shadow win-rate on the Evaluation page — dual-curve trend, lift estimate, and head-to-head judge verdicts.
Enterprise Feature
The Evaluation page surfaces two complementary causal-measurement signals:
- Session-level A/B comparison (F2) — splits your sessions into "with Reflexio" and "control" groups and shows a side-by-side success-rate trend, plus a lift estimate with a 95% confidence interval.
- Per-turn shadow win-rate (F1) — for individual agent turns, a judge compares Reflexio's response with a shadow response generated without Reflexio's retrieved context and reports a head-to-head win rate.
This page covers what the dashboard renders, how to integrate the per-request metadata your agent must stamp (F2), how to publish shadow responses (F1), and the methodology contract Reflexio expects of the caller.
What the Dashboard Shows
The Evaluation page adds a Reflexio Impact row above the existing trend
chart whenever any session in the selected window carries the
reflexio_retrieval_enabled metadata key.
Dual-curve trend chart
The hero trend chart renders two curves where data is present:
| Curve | Meaning |
|---|---|
| Treatment (green) | Sessions whose first request had metadata.reflexio_retrieval_enabled = true |
| Control (gray) | Sessions whose first request had metadata.reflexio_retrieval_enabled = false |
| Untagged (dashed gray) | Sessions whose first request had the key absent or set to a non-boolean value |
Untagged sessions are surfaced (not silently coerced) so you can see how many of your sessions are missing or mis-tagging the metadata key.
Lift Estimate tile
A headline tile shows:
- Lift: treatment_rate − control_rate, in percentage points.
- Per-group sample size: n_treatment and n_control over the window.
- 95% Wald confidence interval half-width, clamped at ±50pp. When the CI exceeds that cap, Reflexio displays it as >±50pp and the tile shows a low-confidence indicator; treat the lift number as "not enough data" rather than as a real measurement.
When either group has zero sessions, the tile shows "no estimate" instead of a fake zero.
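The arithmetic behind the tile can be sketched as follows. This is a minimal illustration that assumes the textbook Wald interval for a difference of two independent proportions; the function name, argument names, and the exact formula Reflexio applies are assumptions, not the product's implementation.

```python
import math
from typing import Optional, Tuple

Z_95 = 1.96       # two-sided 95% normal quantile
CI_CAP_PP = 50.0  # cap on the half-width, in percentage points


def lift_estimate(
    successes_t: int, n_t: int, successes_c: int, n_c: int
) -> Optional[Tuple[float, float, bool]]:
    """Return (lift_pp, half_width_pp, low_confidence), or None for "no estimate"."""
    if n_t == 0 or n_c == 0:
        return None  # an empty group leaves nothing to compare
    p_t = successes_t / n_t
    p_c = successes_c / n_c
    lift_pp = 100.0 * (p_t - p_c)
    # Wald standard error for a difference of two independent proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    half_width_pp = 100.0 * Z_95 * se
    low_confidence = half_width_pp > CI_CAP_PP
    return lift_pp, min(half_width_pp, CI_CAP_PP), low_confidence
```

With 80 of 100 treatment sessions and 70 of 100 control sessions succeeding, this sketch gives a lift of 10pp with a half-width of roughly 11.9pp, so the interval still spans zero.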
Where the metadata is read
Reflexio reads the metadata of the first request in each session and uses
that label for the entire session — sticky assignment. Later requests in the
same session can omit metadata entirely; they will inherit the session's
group from the first request.
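Sticky assignment can be pictured as "take the earliest request per session and ignore the rest". The sketch below illustrates that idea on plain dictionaries; the created_at and session_id field names are hypothetical stand-ins, not the actual Request schema.

```python
from typing import Any, Dict, List


def first_request_metadata(requests: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Map each session_id to the metadata of that session's earliest request."""
    first: Dict[str, Dict[str, Any]] = {}
    # Walk requests in time order; setdefault keeps only the first one per session.
    for req in sorted(requests, key=lambda r: r["created_at"]):
        first.setdefault(req["session_id"], req)
    return {sid: req.get("metadata") for sid, req in first.items()}


requests = [
    {"session_id": "s1", "created_at": 2, "metadata": {}},
    {"session_id": "s1", "created_at": 1,
     "metadata": {"reflexio_retrieval_enabled": True}},
    {"session_id": "s2", "created_at": 3,
     "metadata": {"reflexio_retrieval_enabled": False}},
]
labels = first_request_metadata(requests)
```

Here the later s1 request carries empty metadata, yet the session keeps the label stamped on its first request.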
Integration
In your agent's session-start code, decide whether the session is in the
treatment or control group (typically by RNG), then ensure the
reflexio_retrieval_enabled flag is stamped on the first Request of the
session.
Where the metadata lives
metadata is a free-form dict[str, Any] field on the
Request entity. It is persisted
across every storage backend (SQLite, Postgres, Supabase, and the disk YAML
layout) and read back by the aggregator from the first request of each
session in the window.
Reserved keys
| Key | Type | Purpose |
|---|---|---|
| reflexio_retrieval_enabled | bool | F2 group-by signal. Set to True for treatment sessions, False for control. |
Additional keys are free for customer use — Reflexio only reads the reserved ones.
Stamping metadata via the publish path
ReflexioClient.publish_interaction accepts a metadata kwarg that is
mirrored onto every Request row written for the call. Reflexio reads it
back from the first request of each session, so stamping it on the very
first publish of a session is sufficient — later publishes can omit the
kwarg.
Randomization sketch
Randomization is yours to own. A typical end-to-end pattern using
ReflexioClient.publish_interaction:
```python
import random

from reflexio import ReflexioClient
from reflexio.models.api_schema.domain.entities import InteractionData

client = ReflexioClient()
HOLDOUT_FRACTION = 0.10  # 10% control


def assign_session_group(session_id: str) -> bool:
    """Return True for treatment (Reflexio retrieval on), False for control."""
    # Use a deterministic RNG keyed on session_id if you want assignments to
    # be reproducible across retries; otherwise plain random is fine.
    return random.random() >= HOLDOUT_FRACTION


# session_id, user, user_message, and agent_response come from your agent's own code.
in_treatment = assign_session_group(session_id)
if in_treatment:
    rules = client.search(user_message)  # apply Reflexio retrieval
else:
    rules = []  # control: skip retrieval

interactions = [
    InteractionData(role="user", content=user_message),
    InteractionData(role="assistant", content=agent_response),
]

# `metadata` flows through the publish path onto the persisted Request row;
# the aggregator reads it back from the session's first request when
# computing the F2 group split.
client.publish_interaction(
    user_id=user.id,
    interactions=interactions,
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
    metadata={"reflexio_retrieval_enabled": in_treatment},
)
```

Group-assignment truth table
The aggregator (group_aggregation.assign_group_from_metadata) recognizes
only literal True and False. Everything else lands in untagged:
| metadata.reflexio_retrieval_enabled value | Group |
|---|---|
| True (literal Python bool) | treatment |
| False (literal Python bool) | control |
| Key absent | untagged |
| None, 1, 0, "true", "yes", etc. | untagged |
| metadata is not a dict | untagged |
This is deliberately strict: silent coercion of strings like "true" would
hide integration bugs. Surface them in the untagged curve and fix them at
the source.
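To check your own payloads against the truth table before publishing, a local re-implementation can help. The sketch below is an illustrative stand-in written from the table, not the source of group_aggregation.assign_group_from_metadata; the function name classify_group is hypothetical.

```python
from typing import Any

RESERVED_KEY = "reflexio_retrieval_enabled"


def classify_group(metadata: Any) -> str:
    """Return "treatment", "control", or "untagged" per the truth table."""
    if not isinstance(metadata, dict):
        return "untagged"
    value = metadata.get(RESERVED_KEY)
    # Identity checks, not equality: in Python 1 == True, but 1 is not True.
    if value is True:
        return "treatment"
    if value is False:
        return "control"
    return "untagged"
```

The identity comparison is the important design choice: an equality test would quietly accept 1 and 0, which the table routes to untagged.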
Methodology Contract
Reflexio computes the group split from session metadata. The validity of "lift" as a causal claim depends on you randomizing the assignment. Without random assignment, the comparison is observational, not causal, and the dashboard discloses this on the chart and tile.
In practice:
- Pick a holdout fraction (e.g., 5–10%) and assign new sessions to control at random.
- Keep the assignment sticky per session — don't flip it mid-session.
- Stamp the same reflexio_retrieval_enabled value on every request in the session if you want defense in depth, but Reflexio only requires it on the first request.
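One way to satisfy both the holdout and the stickiness points is to derive the assignment from a hash of the session ID, so retries and later requests always recompute the same answer. This is a sketch under assumptions: the salt value and function shape are illustrative, and hashing is only as good as random assignment when session IDs are unrelated to the outcomes you measure.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # share of sessions sent to control


def assign_session_group(session_id: str, salt: str = "reflexio-ab-v1") -> bool:
    """Return True for treatment, False for control, stable per session_id."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64), scaled to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket >= HOLDOUT_FRACTION
```

Changing the salt reshuffles every session, which is useful when starting a new experiment and harmful in the middle of one.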
Without random assignment, lift is observational
If you assign treatment/control based on user properties, time of day, or any non-random rule, the lift estimate measures correlation, not causation. The tile will still render, but treat the number as a directional signal — not a controlled experiment.
Sampling and freshness
Regen jobs sample at most 200 sessions per (day × group) stratum by default so
cost stays predictable as your traffic grows. The dashboard's trend chart will surface
the sampled n per point, so you can always see how many sessions back each number.
If you click into a session that wasn't in the sampled set, Reflexio grades it on demand
via POST /api/evaluations/grade_on_demand
and caches the result for 24 hours.
To tune the defaults, set eval_sample_n_per_stratum and eval_concurrency_limit in
your Config.
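The per-stratum cap can be pictured with a small sketch. It assumes simple uniform sampling without replacement inside each (day, group) bucket; the tuple layout and function name are hypothetical, and the sampling Reflexio actually performs may differ.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Stratum = Tuple[str, str]  # (day, group)


def sample_per_stratum(
    sessions: Iterable[Tuple[str, str, str]], cap: int = 200, seed: int = 0
) -> Dict[Stratum, List[str]]:
    """sessions yields (session_id, day, group); keep at most `cap` per stratum."""
    strata: Dict[Stratum, List[str]] = defaultdict(list)
    for session_id, day, group in sessions:
        strata[(day, group)].append(session_id)
    rng = random.Random(seed)  # seeded so the same input yields the same sample
    return {
        key: ids if len(ids) <= cap else rng.sample(ids, cap)
        for key, ids in strata.items()
    }
```

A stratum smaller than the cap is kept whole, which is why the sampled n per point can sit below 200 on quiet days.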
API Reference
The success_rate_trend_by_group field on
GetEvaluationOverviewResponse
carries the three curves. See the
SuccessRateTrendByGroup
and TrendPoint schemas for
the response shape.
The metadata field is declared on the
Request model and is preserved
across all storage backends (SQLite, Postgres, Supabase, and the disk YAML
layout).
Per-turn comparison (F1)
The Evaluation page also renders a per-turn head-to-head win rate comparing your agent's response with Reflexio's retrieved context vs. without. F2 measures session-level outcomes; F1 zooms in to individual turns so you can spot-check the judge and surface specific cases where the Reflexio-less response was actually better.
To produce this signal, your agent code generates two responses per
turn and uploads both on the same Interaction via the existing
publish_interaction()
API.
Integration
For each agent turn you want graded, generate the regular response WITH Reflexio rules in context, then re-run your LLM WITHOUT them to produce a shadow response. Publish both on the same agent interaction:
```python
from reflexio import ReflexioClient, InteractionData

client = ReflexioClient()

# `llm`, `user`, `user_message`, and `session_id` come from your agent's own code.

# Retrieve Reflexio rules for this turn (your retrieval call)
rules = client.search(user_message)

# Generate the regular response WITH rules
regular_response = llm.generate(user_message, rules=rules)

# Generate the shadow response WITHOUT rules
shadow_response = llm.generate(user_message, rules=[])

# Publish both on the same agent Interaction
client.publish_interaction(
    user_id=user.id,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,  # what was served to the user
            shadow_content=shadow_response,  # used only for grading
        ),
    ],
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
)
```

Where shadow_content lives
shadow_content is a field on every
InteractionData /
Interaction row, persisted
across every storage backend. Only the agent turn needs a shadow_content
— the judge compares Reflexio's response with the shadow against the same
user message. Turns without a shadow_content are skipped by the F1 pipeline.
What the dashboard shows
- A Per-turn comparison tile with the headline win rate over the selected window (e.g. 67% win rate · 134 wins · 52 ties · 14 losses · n=200 sampled), plus a daily trend sparkline.
- A View recent comparisons drawer with the most recent judged turns, each rendered side-by-side with the judge's rationale. Backed by GET /api/evaluations/shadow_comparisons/recent.
- A Top disagreements widget listing turns where the shadow response was significantly better than the Reflexio-augmented one (output.is_significantly_better=True losses). These are actionable cases for updating your Reflexio rules.
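The example figures on the tile are consistent with a win rate computed as wins over all judged turns, ties included (134 / 200 = 67%). That reading is an inference from the example numbers rather than a documented formula; a sketch under that assumption:

```python
from typing import Optional


def win_rate(wins: int, ties: int, losses: int) -> Optional[float]:
    """Share of judged turns won by the Reflexio response; ties count in n."""
    n = wins + ties + losses
    if n == 0:
        return None  # no judged turns in the window
    return wins / n
```

Under this definition a high tie count pulls the headline number down even when losses are rare, so read the three counts together.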
Methodology contract
- The judge compares the two responses against the same user message in isolation; no prior conversation history is shown to the judge.
- Position (Request 1 vs Request 2) is randomized per call to mitigate LLM judge position bias. The mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1 so wins/losses can be derived deterministically downstream.
- The judge prompt version is pinned per org via Config.shadow_comparison_judge_prompt_version. Verdicts are stored with the version that produced them; the dashboard filters to your currently pinned version so a future rubric bump never silently mixes epochs into the headline number.
- Sampling: regen jobs are stratified per day at Config.eval_sample_n_per_stratum (default 200) so cost stays predictable. Clicking into a non-sampled session triggers on-demand grading via POST /api/evaluations/grade_on_demand with a 24h cache.
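The position-randomization point can be illustrated with a short sketch: shuffle which response the judge sees first, remember the mapping, and translate the judge's positional preference back into a win, tie, or loss. Only reflexio_is_request_1 comes from the text above; the preferred values ("request_1", "request_2", "tie") and both function names are hypothetical.

```python
import random
from typing import Tuple


def randomize_positions(
    reflexio_response: str, shadow_response: str, rng: random.Random
) -> Tuple[str, str, bool]:
    """Return (request_1, request_2, reflexio_is_request_1)."""
    reflexio_is_request_1 = rng.random() < 0.5
    if reflexio_is_request_1:
        return reflexio_response, shadow_response, True
    return shadow_response, reflexio_response, False


def outcome_for_reflexio(preferred: str, reflexio_is_request_1: bool) -> str:
    """Translate a positional preference into win/tie/loss for Reflexio."""
    if preferred == "tie":
        return "tie"
    # Reflexio wins when the preferred slot is the one it occupied.
    reflexio_preferred = (preferred == "request_1") == reflexio_is_request_1
    return "win" if reflexio_preferred else "loss"
```

Because the mapping is stored with each verdict, the translation step is a pure function and can be re-run at any time without calling the judge again.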
Where the data lands in the API
The win-rate trend is carried on
GetEvaluationOverviewResponse.shadow_win_rate_trend
(see the ShadowWinRateTrend
schema). Individual verdicts are returned by the recent-verdicts endpoint
above; each verdict is a
ShadowComparisonVerdict
wrapping a ShadowComparisonOutput.