Measuring Reflexio's Impact
Session-level source-set comparison and per-turn shadow win-rate on the Evaluation page.
Enterprise Feature
The Evaluation page surfaces two complementary measurement signals:
- Session-level source-set comparison: groups evaluated sessions by the source on the first request in each session, then compares success rate, corrections, turns to resolution, escalation, distributions, and rule attribution across the source sets you choose.
- Per-turn shadow win-rate (F1): for individual agent turns, a judge compares Reflexio's response with a shadow response generated without Reflexio's retrieved context and reports a head-to-head win rate.
This page covers what the dashboard renders, how to use request sources for session-level comparisons, how to publish shadow responses (F1), and the methodology contract Reflexio expects of the caller.
What the Dashboard Shows
The Evaluation page defaults to an all-source view. Use the source filter at the top of the page to scope the hero chart, context tiles, learnings, and raw session detail to one first-request source. The Source-set comparison card still lets you build labeled cohorts from the request sources present in the selected evaluation window.
For each source set, Reflexio shows:
- success rate and success trend buckets
- average corrections per session
- average turns to resolution
- escalation rate
- corrections distribution
- rule attribution for sessions in that set
- imported scorer tiles when external scores can be matched by session_id
Where the source is read
Reflexio reads the source of the first request in each session and uses
that source for the entire session — sticky assignment. If later requests in
the same session use a different source, the comparison still uses the first
request's source.
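
To make the sticky rule concrete, here is a minimal sketch of resolving one source per session from a list of requests. `RequestRecord` and `session_sources` are hypothetical names used only for illustration; they are not Reflexio's `Request` model or its internal logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    """Hypothetical stand-in for a stored request (not Reflexio's Request model)."""
    session_id: str
    created_at: float  # seconds since epoch
    source: str


def session_sources(requests: List[RequestRecord]) -> Dict[str, str]:
    """Map each session_id to the source of its earliest request."""
    sources: Dict[str, str] = {}
    for req in sorted(requests, key=lambda r: r.created_at):
        # setdefault keeps the first value seen, so later requests in the
        # same session never overwrite the session's source.
        sources.setdefault(req.session_id, req.source)
    return sources


requests = [
    RequestRecord("s1", 10.0, "prod_with_reflexio"),
    RequestRecord("s1", 20.0, "offline_candidate"),  # ignored: not the first request
    RequestRecord("s2", 15.0, "prod_without_reflexio"),
]
print(session_sources(requests))
```

In this sketch, session `s1` stays in `prod_with_reflexio` even though its second request carries a different source.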
Integration
In your agent's session-start code, decide which source arm the session belongs
to, then publish the first request with that source value. Common examples:
| Source value | Meaning |
|---|---|
| prod_without_reflexio | Baseline sessions where Reflexio retrieval is disabled; publish with evaluation_only=True so they are evaluated but not learned from |
| prod_with_reflexio | Production sessions where Reflexio retrieval is enabled |
| offline_candidate | Offline or shadow candidate sessions published only for evaluation |
Randomization is yours to own. A typical end-to-end pattern using
ReflexioClient.publish_interaction:
import random

from reflexio import ReflexioClient
from reflexio.models.api_schema.domain.entities import InteractionData

client = ReflexioClient()
HOLDOUT_FRACTION = 0.10  # 10% control


def assign_session_source(session_id: str) -> str:
    """Return the source cohort for this session."""
    # Use a deterministic RNG keyed on session_id if you want assignments to
    # be reproducible across retries; otherwise plain random is fine.
    if random.random() < HOLDOUT_FRACTION:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


source = assign_session_source(session_id)
if source == "prod_with_reflexio":
    rules = client.search(user_message)  # apply Reflexio retrieval
    evaluation_only = False
else:
    rules = []  # control: skip retrieval
    evaluation_only = True  # evaluate the baseline, but do not learn from it

agent_response = run_agent(user_message, context=rules)
interactions = [
    InteractionData(role="user", content=user_message),
    InteractionData(role="assistant", content=agent_response),
]
client.publish_interaction(
    user_id=user.id,
    interactions=interactions,
    session_id=session_id,
    source=source,
    agent_version="v1",
    evaluation_only=evaluation_only,
)

The same flow with curl:

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"query": "<user_message>"
}
JSON
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "id",
"interaction_data_list": [
{
"role": "user",
"content": "<user_message>"
},
{
"role": "assistant",
"content": "<run_agent_result>"
}
],
"session_id": "<session_id>",
"source": "<assign_session_source_result>",
"agent_version": "v1",
"evaluation_only": "<evaluation_only>"
}
JSON

Methodology Contract
Reflexio computes source-set comparisons from Request.source. The validity
of a source-set gap as a causal claim depends on you randomizing source
assignment. Without random assignment, the comparison is observational.
In practice:
- Pick a holdout fraction (e.g., 5–10%) and assign new sessions to control at random.
- Keep the assignment sticky per session — don't flip it mid-session.
- Use stable, explicit source names for each source set.
- Make sure the first request in each session carries the intended source.
- Publish no-Reflexio baseline sessions with evaluation_only=True so they contribute to evaluation metrics without teaching Reflexio profiles or playbooks.
Without random assignment, lift is observational
If you assign source sets based on user properties, time of day, or any non-random rule, the comparison measures correlation, not causation. Treat the numbers as directional signals, not a controlled experiment.
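
One way to get assignment that is both random-looking and sticky across retries is to hash the session ID instead of calling `random.random()`. The sketch below is an illustration under stated assumptions: the salt string is a hypothetical value you would choose yourself, and the approach only behaves like randomization if your session IDs are not correlated with the outcomes you measure.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # 10% control, matching the integration example


def assign_session_source(session_id: str, salt: str = "reflexio-holdout-v1") -> str:
    """Deterministically assign a session to a source cohort.

    The same (salt, session_id) pair always yields the same cohort, so a
    retried session start cannot flip arms. The salt is a hypothetical
    experiment label; changing it reshuffles all assignments.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < HOLDOUT_FRACTION:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


# Repeated calls for the same session agree with each other.
first = assign_session_source("session-123")
second = assign_session_source("session-123")
print(first == second)
```

Because the cohort is a pure function of the session ID, no assignment table has to be stored to keep the arm stable mid-session.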
Sampling and freshness
Regen jobs sample at most 200 sessions per (day × group) stratum by default so
cost stays predictable as your traffic grows. The dashboard shows the sampled
n for each point, so you can see how many sessions back each number.
If you click into a session that wasn't in the sampled set, Reflexio grades it on demand
via POST /api/evaluations/grade_on_demand
and caches the result for 24 hours.
To tune the defaults, set eval_sample_n_per_stratum and eval_concurrency_limit in
your Config.
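
As a rough illustration of capped per-stratum sampling, the sketch below groups sessions by (day, group) and keeps at most n from each stratum. It is not Reflexio's implementation: the function name, the tuple layout, and the use of uniform random sampling are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Stratum = Tuple[str, str]  # (day, group)


def sample_per_stratum(
    sessions: Iterable[Tuple[str, str, str]],  # (session_id, day, group)
    n_per_stratum: int = 200,
    seed: int = 0,
) -> Dict[Stratum, List[str]]:
    """Keep at most n_per_stratum session IDs from each (day, group) stratum."""
    by_stratum: Dict[Stratum, List[str]] = defaultdict(list)
    for session_id, day, group in sessions:
        by_stratum[(day, group)].append(session_id)

    rng = random.Random(seed)  # seeded so the sample is reproducible
    sampled: Dict[Stratum, List[str]] = {}
    for stratum, ids in by_stratum.items():
        # Small strata are kept whole; large ones are capped.
        sampled[stratum] = rng.sample(ids, min(n_per_stratum, len(ids)))
    return sampled


sessions = [(f"a{i}", "2024-01-01", "prod_with_reflexio") for i in range(500)]
sessions += [(f"b{i}", "2024-01-01", "prod_without_reflexio") for i in range(50)]
sampled = sample_per_stratum(sessions)
for stratum, ids in sampled.items():
    print(stratum, len(ids))
```

Here the busy stratum is capped at 200 sessions while the 50-session stratum is kept in full, which is why the sampled n can differ from point to point.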
API Reference
The source_set_comparison field on
GetEvaluationOverviewResponse
carries available sources and per-set metrics. Send source_sets on
GetEvaluationOverviewRequest
to request labeled source-set comparisons.
The source field is declared on the
Request model and is preserved
across all storage backends (SQLite, Postgres, Supabase, and the disk YAML
layout).
Per-turn comparison (F1)
The Evaluation page also renders a per-turn head-to-head win rate that compares your agent's response with Reflexio's retrieved context against its response without that context. The session-level source-set comparison (F2) measures session-level outcomes; F1 zooms in to individual turns so you can spot-check the judge and surface specific cases where the response without Reflexio was actually better.
To produce this signal, your agent code generates two responses per
turn and uploads both on the same Interaction via the existing
publish_interaction()
API.
Integration
For each agent turn you want graded, generate the regular response WITH Reflexio rules in context, then re-run your LLM WITHOUT them to produce a shadow response. Publish both on the same agent interaction:
from reflexio import ReflexioClient, InteractionData

client = ReflexioClient()

# Retrieve Reflexio rules for this turn (your retrieval call)
rules = client.search(user_message)

# Generate the regular response WITH rules
regular_response = llm.generate(user_message, rules=rules)

# Generate the shadow response WITHOUT rules
shadow_response = llm.generate(user_message, rules=[])

# Publish both on the same agent Interaction
client.publish_interaction(
    user_id=user.id,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,  # what was served to the user
            shadow_content=shadow_response,  # used only for grading
        ),
    ],
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
)

The same flow with curl:

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"query": "<user_message>"
}
JSON
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "id",
"interaction_data_list": [
{
"role": "User",
"content": "<user_message>"
},
{
"role": "Agent",
"content": "<regular_response>",
"shadow_content": "<shadow_response>"
}
],
"session_id": "<session_id>",
"source": "my-integration",
"agent_version": "v1"
}
JSON

Where shadow_content lives
shadow_content is a field on every
InteractionData /
Interaction row, persisted
across every storage backend. Only the agent turn needs a shadow_content
— the judge compares Reflexio's response with the shadow against the same
user message. Turns without a shadow_content are skipped by the F1 pipeline.
What the dashboard shows
- A Per-turn comparison tile with the headline win rate over the selected window (e.g. 67% win rate · 134 wins · 52 ties · 14 losses · n=200 sampled), plus a daily trend sparkline.
- A View recent comparisons drawer with the most recent judged turns, each rendered side-by-side with the judge's rationale. Backed by GET /api/evaluations/shadow_comparisons/recent.
- A Top disagreements widget listing turns where the shadow response was significantly better than the Reflexio-augmented one (output.is_significantly_better=True losses); these are actionable cases for updating your Reflexio rules.
Methodology contract
- The judge compares the two responses against the same user message in isolation; no prior conversation history is shown to the judge.
- Position (Request 1 vs Request 2) is randomized per call to mitigate LLM judge position bias. The mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1 so wins/losses can be derived deterministically downstream.
- The judge prompt version is pinned per org via Config.shadow_comparison_judge_prompt_version. Verdicts are stored with the version that produced them; the dashboard filters to your currently pinned version so a future rubric bump never silently mixes epochs into the headline number.
- Sampling: regen jobs are stratified per day at Config.eval_sample_n_per_stratum (default 200) so cost stays predictable. Clicking into a non-sampled session triggers on-demand grading via POST /api/evaluations/grade_on_demand with a 24h cache.
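
The sketch below shows how wins and losses could be derived downstream once the randomized position is known. Only `reflexio_is_request_1` is named in this page; the `Verdict` class, its `winner` field, and the encoding of ties as `None` are hypothetical simplifications and may not match the real `ShadowComparisonVerdict` schema. The win rate is computed as wins divided by all judged turns, which is consistent with the example tile (134 wins of n=200 is 67%).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Verdict:
    """Hypothetical, simplified verdict used only for this illustration."""
    reflexio_is_request_1: bool  # field named in the docs
    winner: Optional[int]        # assumed encoding: 1 or 2, or None for a tie


def outcome(verdict: Verdict) -> str:
    """Translate a position-based verdict into win/tie/loss for Reflexio."""
    if verdict.winner is None:
        return "tie"
    reflexio_position = 1 if verdict.reflexio_is_request_1 else 2
    return "win" if verdict.winner == reflexio_position else "loss"


def tally(verdicts: Iterable[Verdict]) -> Dict[str, float]:
    """Count outcomes and compute win rate as wins / all judged turns."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for verdict in verdicts:
        counts[outcome(verdict)] += 1
    n = sum(counts.values())
    return {**counts, "n": n, "win_rate": counts["win"] / n if n else 0.0}


verdicts = [
    Verdict(reflexio_is_request_1=True, winner=1),     # Reflexio in slot 1 and won
    Verdict(reflexio_is_request_1=False, winner=1),    # shadow in slot 1 and won
    Verdict(reflexio_is_request_1=False, winner=2),    # Reflexio in slot 2 and won
    Verdict(reflexio_is_request_1=True, winner=None),  # tie
]
print(tally(verdicts))
```

Without the recorded position, a raw "Request 1 won" verdict could not be attributed to either response, which is why the mapping is stored alongside each verdict.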
Where the data lands in the API
The win-rate trend is carried on
GetEvaluationOverviewResponse.shadow_win_rate_trend
(see the ShadowWinRateTrend
schema). Individual verdicts are returned by the recent-verdicts endpoint
above; each verdict is a
ShadowComparisonVerdict
wrapping a ShadowComparisonOutput.