Session-level A/B comparison and per-turn shadow win-rate on the Evaluation page — dual-curve trend, lift estimate, and head-to-head judge verdicts.

Measuring Reflexio's Impact

Enterprise Feature

This feature requires Reflexio Enterprise (hosted at reflexio.com) or a self-hosted enterprise deployment. It is not available in the open-source version.

The Evaluation page surfaces two complementary causal-measurement signals:

Session-level A/B comparison (F2) — splits your sessions into "with Reflexio" and "control" groups and shows a side-by-side success-rate trend, plus a lift estimate with a 95% confidence interval.
Per-turn shadow win-rate (F1) — for individual agent turns, a judge compares Reflexio's response with a shadow response generated without Reflexio's retrieved context and reports a head-to-head win rate.

This page covers what the dashboard renders, how to integrate the per-request metadata your agent must stamp (F2), how to publish shadow responses (F1), and the methodology contract Reflexio expects of the caller.

What the Dashboard Shows

The Evaluation page adds a Reflexio Impact row above the existing trend chart whenever any session in the selected window carries the reflexio_retrieval_enabled metadata key.

Dual-curve trend chart

The hero trend chart renders two curves where data is present:

Curve	Meaning
Treatment (green)	Sessions whose first request had `metadata.reflexio_retrieval_enabled = true`
Control (gray)	Sessions whose first request had `metadata.reflexio_retrieval_enabled = false`
Untagged (dashed gray)	Sessions whose first request had the key absent or set to a non-boolean value

Untagged sessions are surfaced (not silently coerced) so you can see how many of your sessions are missing or mis-tagging the metadata key.

Lift Estimate tile

A headline tile shows:

Lift — treatment_rate − control_rate, in percentage points.
Per-group sample size — n_treatment and n_control over the window.
95% Wald confidence interval, shown as a half-width and clamped at ±50pp. When the half-width exceeds that cap, Reflexio displays it as >±50pp and the tile shows a low-confidence indicator. In that case, treat the lift number as "not enough data" rather than as a real measurement.

When either group has zero sessions, the tile shows "no estimate" instead of a fake zero.
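
The arithmetic behind the tile can be sketched as follows. This is a minimal illustration that assumes the standard unpooled Wald interval for a difference of two proportions; the page does not spell out Reflexio's exact formula, and the function name `lift_with_wald_ci` is hypothetical.

```python
import math
from typing import Optional, Tuple

Z_95 = 1.96       # two-sided 95% normal quantile
CI_CAP_PP = 50.0  # display cap described for the tile


def lift_with_wald_ci(
    successes_t: int, n_t: int, successes_c: int, n_c: int
) -> Optional[Tuple[float, float, bool]]:
    """Return (lift_pp, half_width_pp, capped), or None if a group is empty.

    Hypothetical helper: assumes an unpooled Wald interval.
    """
    if n_t == 0 or n_c == 0:
        return None  # corresponds to the tile's "no estimate" state
    p_t = successes_t / n_t
    p_c = successes_c / n_c
    lift_pp = (p_t - p_c) * 100.0
    # Unpooled standard error of the difference of two proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    half_width_pp = Z_95 * se * 100.0
    capped = half_width_pp > CI_CAP_PP
    return lift_pp, min(half_width_pp, CI_CAP_PP), capped
```

With 720 of 900 treatment sessions and 70 of 100 control sessions succeeding, this gives a lift of 10pp with a half-width of roughly ±9.4pp. The small control group dominates the interval width, which is why a very small holdout can leave the tile in its low-confidence state for a long time.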

Where the metadata is read

Reflexio reads the metadata of the first request in each session and uses that label for the entire session — sticky assignment. Later requests in the same session can omit metadata entirely; they will inherit the session's group from the first request.

Integration

In your agent's session-start code, decide whether the session is in the treatment or control group (typically by RNG), then ensure the reflexio_retrieval_enabled flag is stamped on the first Request of the session.

Where the metadata lives

metadata is a free-form dict[str, Any] field on the Request entity. It is persisted across every storage backend (SQLite, Postgres, Supabase, and the disk YAML layout) and read back by the aggregator from the first request of each session in the window.

Reserved keys

Key	Type	Purpose
`reflexio_retrieval_enabled`	`bool`	F2 group-by signal. Set to `True` for treatment sessions, `False` for control.

Additional keys are free for customer use — Reflexio only reads the reserved ones.

Stamping metadata via the publish path

ReflexioClient.publish_interaction accepts a metadata kwarg that is mirrored onto every Request row written for the call. Reflexio reads it back from the first request of each session, so stamping it on the very first publish of a session is sufficient — later publishes can omit the kwarg.

Randomization sketch

Randomization is yours to own. A typical end-to-end pattern using ReflexioClient.publish_interaction:

import random

from reflexio import ReflexioClient
from reflexio.models.api_schema.domain.entities import InteractionData

client = ReflexioClient()

HOLDOUT_FRACTION = 0.10  # 10% control


def assign_session_group(session_id: str) -> bool:
    """Return True for treatment (Reflexio retrieval on), False for control."""
    # Use a deterministic RNG keyed on session_id if you want assignments to
    # be reproducible across retries; otherwise plain random is fine.
    return random.random() >= HOLDOUT_FRACTION


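# session_id, user_message, and user come from your agent's own code.
in_treatment = assign_session_group(session_id)

if in_treatment:
    rules = client.search(user_message)  # apply Reflexio retrieval
else:
    rules = []                            # control: skip retrieval

# Your agent generates agent_response here, passing `rules` to its own LLM call.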

interactions = [
    InteractionData(role="user", content=user_message),
    InteractionData(role="assistant", content=agent_response),
]

# `metadata` flows through the publish path onto the persisted Request row;
# the aggregator reads it back from the session's first request when
# computing the F2 group split.
client.publish_interaction(
    user_id=user.id,
    interactions=interactions,
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
    metadata={"reflexio_retrieval_enabled": in_treatment},
)
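
If you want assignments to be reproducible across retries, as the comment in the sketch above suggests, one option is to hash the session ID instead of calling `random.random()`. The following is a minimal sketch under that assumption; the function name and the `salt` value are hypothetical and not part of the Reflexio API.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # 10% control, matching the sketch above


def assign_session_group_deterministic(
    session_id: str, salt: str = "reflexio-ab-v1"
) -> bool:
    """Return True for treatment, False for control, stable per session_id.

    Hypothetical helper: `salt` is an illustrative experiment label.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Keep the top 53 bits so the bucket is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket >= HOLDOUT_FRACTION
```

Because the result depends only on the salt and the session ID, a retried session-start call lands in the same group. Changing the salt reshuffles every session, which is useful when you start a new experiment and do not want to reuse the previous holdout.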

Group-assignment truth table

The aggregator (group_aggregation.assign_group_from_metadata) recognizes only literal True and False. Everything else lands in untagged:

`metadata.reflexio_retrieval_enabled` value	Group
`True` (literal Python `bool`)	`treatment`
`False` (literal Python `bool`)	`control`
Key absent	`untagged`
`None`, `1`, `0`, `"true"`, `"yes"`, etc.	`untagged`
`metadata` is not a dict	`untagged`

This is deliberately strict: silent coercion of strings like "true" would hide integration bugs. Surface them in the untagged curve and fix them at the source.
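
The truth table can be expressed as a small function. This is an illustrative re-implementation, not the source of `group_aggregation.assign_group_from_metadata`, whose signature is not shown on this page; the name `classify_group` is hypothetical.

```python
from typing import Any

RESERVED_KEY = "reflexio_retrieval_enabled"


def classify_group(metadata: Any) -> str:
    """Map first-request metadata to 'treatment', 'control', or 'untagged'."""
    if not isinstance(metadata, dict):
        return "untagged"
    value = metadata.get(RESERVED_KEY)
    # Identity checks are required: in Python, 1 == True and 0 == False,
    # so an equality test would silently accept integers.
    if value is True:
        return "treatment"
    if value is False:
        return "control"
    return "untagged"
```

A check like this is also handy as a unit test in your own integration: run it against the metadata dict you are about to publish and fail fast if the result is `untagged`.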

Methodology Contract

Reflexio computes the group split from session metadata. The validity of "lift" as a causal claim depends on you randomizing the assignment. Without random assignment, the comparison is observational, not causal, and the dashboard discloses this on the chart and tile.

In practice:

Pick a holdout fraction (e.g., 5–10%) and assign new sessions to control at random.
Keep the assignment sticky per session — don't flip it mid-session.
Stamp the same reflexio_retrieval_enabled value on every request in the session if you want defense in depth, but Reflexio only requires it on the first request.

Without random assignment, lift is observational

If you assign treatment/control based on user properties, time of day, or any non-random rule, the lift estimate measures correlation, not causation. The tile will still render, but treat the number as a directional signal — not a controlled experiment.

Sampling and freshness

Regen jobs sample at most 200 sessions per (day × group) stratum by default, so cost stays predictable as your traffic grows. The trend chart surfaces the sampled n for each point, so you can always see how many sessions back each number.

If you click into a session that wasn't in the sampled set, Reflexio grades it on demand via POST /api/evaluations/grade_on_demand and caches the result for 24 hours.

To tune the defaults, set eval_sample_n_per_stratum and eval_concurrency_limit in your Config.
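
To make the per-stratum cap concrete, here is a minimal sketch of stratified sampling over (day, group) pairs. It assumes uniform random sampling without replacement inside each stratum, which this page does not confirm; the function name and the tuple layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Session = Tuple[str, str, str]  # (session_id, day, group), illustrative layout


def sample_per_stratum(
    sessions: Iterable[Session], n_per_stratum: int = 200, seed: int = 0
) -> Dict[Tuple[str, str], List[str]]:
    """Keep at most n_per_stratum session IDs for each (day, group) stratum."""
    rng = random.Random(seed)
    strata: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for session_id, day, group in sessions:
        strata[(day, group)].append(session_id)

    sampled: Dict[Tuple[str, str], List[str]] = {}
    for key, ids in strata.items():
        if len(ids) <= n_per_stratum:
            sampled[key] = list(ids)  # small stratum: keep everything
        else:
            sampled[key] = rng.sample(ids, n_per_stratum)
    return sampled
```

Capping each stratum separately means a small control group is graded in full while a large treatment group is subsampled, so the cost bound does not come at the expense of the smaller arm.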

API Reference

The success_rate_trend_by_group field on GetEvaluationOverviewResponse carries the three curves. See the SuccessRateTrendByGroup and TrendPoint schemas for the response shape.

The metadata field is declared on the Request model and is preserved across all storage backends (SQLite, Postgres, Supabase, and the disk YAML layout).

Per-turn comparison (F1)

The Evaluation page also renders a per-turn head-to-head win rate that compares your agent's response generated with Reflexio's retrieved context against one generated without it. F2 measures session-level outcomes; F1 zooms in to individual turns so you can spot-check the judge and surface specific cases where the Reflexio-less response was actually better.

To produce this signal, your agent code generates two responses per turn and uploads both on the same Interaction via the existing publish_interaction() API.

Integration

For each agent turn you want graded, generate the regular response WITH Reflexio rules in context, then re-run your LLM WITHOUT them to produce a shadow response. Publish both on the same agent interaction:

from reflexio import ReflexioClient, InteractionData

client = ReflexioClient()

# Retrieve Reflexio rules for this turn (your retrieval call)
rules = client.search(user_message)

# Generate the regular response WITH rules
# (`llm` stands in for your own model client)
regular_response = llm.generate(user_message, rules=rules)

# Generate the shadow response WITHOUT rules
shadow_response = llm.generate(user_message, rules=[])

# Publish both on the same agent Interaction
client.publish_interaction(
    user_id=user.id,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,        # what was served to the user
            shadow_content=shadow_response,  # used only for grading
        ),
    ],
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
)

Where shadow_content lives

shadow_content is a field on every InteractionData / Interaction row, persisted across every storage backend. Only the agent turn needs a shadow_content: the judge compares the served response with the shadow response, both against the same user message. Turns without a shadow_content are skipped by the F1 pipeline.

What the dashboard shows

A Per-turn comparison tile with the headline win rate over the selected window (e.g. 67% win rate · 134 wins · 52 ties · 14 losses · n=200 sampled), plus a daily trend sparkline.
A View recent comparisons drawer with the most recent judged turns, each rendered side-by-side with the judge's rationale. Backed by GET /api/evaluations/shadow_comparisons/recent.
A Top disagreements widget listing turns where the shadow response was significantly better than the Reflexio-augmented one (output.is_significantly_better=True losses) — actionable cases for updating your Reflexio rules.

Methodology contract

The judge compares the two responses against the same user message in isolation; no prior conversation history is shown to the judge.
Position (Request 1 vs Request 2) is randomized per call to mitigate LLM judge position bias. The mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1 so wins/losses can be derived deterministically downstream.
The judge prompt version is pinned per org via Config.shadow_comparison_judge_prompt_version. Verdicts are stored with the version that produced them; the dashboard filters to your currently pinned version so a future rubric bump never silently mixes epochs into the headline number.
Sampling: regen jobs are stratified per day at Config.eval_sample_n_per_stratum (default 200) so cost stays predictable. Clicking into a non-sampled session triggers on-demand grading via POST /api/evaluations/grade_on_demand with a 24h cache.
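
The position mapping can be undone downstream along these lines. This is a sketch under stated assumptions: `reflexio_is_request_1` is the field named above, while the `preferred` field, the `Verdict` class, and both function names are hypothetical stand-ins for whatever the judge output actually exposes. The win-rate denominator includes ties, which is inferred from the example tile (134 wins out of n=200 shown as 67%).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Verdict:
    reflexio_is_request_1: bool  # field named on ShadowComparisonVerdict
    preferred: str               # hypothetical: "request_1", "request_2", or "tie"


def outcome_for_reflexio(verdict: Verdict) -> str:
    """Translate a position-based verdict into win/tie/loss for Reflexio."""
    if verdict.preferred == "tie":
        return "tie"
    reflexio_slot = "request_1" if verdict.reflexio_is_request_1 else "request_2"
    return "win" if verdict.preferred == reflexio_slot else "loss"


def win_rate(verdicts: Iterable[Verdict]) -> Optional[float]:
    """Wins divided by all judged turns (ties included); None if empty."""
    counts = Counter(outcome_for_reflexio(v) for v in verdicts)
    total = sum(counts.values())
    return counts["win"] / total if total else None
```

Storing the slot mapping next to the verdict, rather than storing only the final win or loss, keeps the raw judge output auditable when you spot-check for position bias.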

Where the data lands in the API

The win-rate trend is carried on GetEvaluationOverviewResponse.shadow_win_rate_trend (see the ShadowWinRateTrend schema). Individual verdicts are returned by GET /api/evaluations/shadow_comparisons/recent; each verdict is a ShadowComparisonVerdict wrapping a ShadowComparisonOutput.
