Measuring Reflexio's Impact

Session-level source-set comparison and per-turn shadow win-rate on the Evaluation page.

Enterprise Feature

This feature requires Reflexio Enterprise (hosted at reflexio.ai) or a self-hosted enterprise deployment. It is not available in the open-source version.

The Evaluation page surfaces two complementary measurement signals:

Session-level source-set comparison — groups evaluated sessions by the source on the first request in each session, then compares success rate, corrections, turns to resolution, escalation, distributions, and rule attribution across the source sets you choose.
Per-turn shadow win-rate (F1) — for individual agent turns, a judge compares Reflexio's response with a shadow response generated without Reflexio's retrieved context and reports a head-to-head win rate.

This page covers what the dashboard renders, how to use request sources for session-level comparisons, how to publish shadow responses (F1), and the methodology contract Reflexio expects of the caller.

What the Dashboard Shows

The Evaluation page defaults to an all-source view. Use the source filter at the top of the page to scope the hero chart, context tiles, learnings, and raw session detail to one first-request source. The Source-set comparison card still lets you build labeled cohorts from the request sources present in the selected evaluation window.

For each source set, Reflexio shows:

success rate and success trend buckets
average corrections per session
average turns to resolution
escalation rate
corrections distribution
rule attribution for sessions in that set
imported scorer tiles when external scores can be matched by session_id
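
As a rough illustration of how the scalar metrics in this list relate to the underlying sessions, the sketch below aggregates a handful of made-up sessions into labeled source sets. It is not Reflexio's implementation: the `EvaluatedSession` type, its field names, and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatedSession:
    """Hypothetical evaluated-session record; every field name is an assumption."""
    source: str
    succeeded: bool
    corrections: int
    turns_to_resolution: int
    escalated: bool


def summarize(sessions, source_sets):
    """Aggregate metrics per labeled source set.

    source_sets maps a label (e.g. "treatment") to a set of source values.
    """
    grouped = defaultdict(list)
    for session in sessions:
        for label, sources in source_sets.items():
            if session.source in sources:
                grouped[label].append(session)

    summary = {}
    for label, rows in grouped.items():
        n = len(rows)
        summary[label] = {
            "n": n,
            "success_rate": sum(r.succeeded for r in rows) / n,
            "avg_corrections": sum(r.corrections for r in rows) / n,
            "avg_turns": sum(r.turns_to_resolution for r in rows) / n,
            "escalation_rate": sum(r.escalated for r in rows) / n,
        }
    return summary


# Made-up sessions, for illustration only.
sessions = [
    EvaluatedSession("prod_with_reflexio", True, 0, 3, False),
    EvaluatedSession("prod_with_reflexio", True, 1, 5, False),
    EvaluatedSession("prod_without_reflexio", False, 2, 6, True),
    EvaluatedSession("prod_without_reflexio", True, 2, 4, False),
]
source_sets = {
    "treatment": {"prod_with_reflexio"},
    "control": {"prod_without_reflexio"},
}
summary = summarize(sessions, source_sets)
for label, metrics in summary.items():
    print(label, metrics)
```

Distributions, rule attribution, and imported scorer tiles are left out of the sketch because they depend on data this page does not describe.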

Where the source is read

Reflexio reads the source of the first request in each session and uses that source for the entire session — sticky assignment. If later requests in the same session use a different source, the comparison still uses the first request's source.
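
The sticky rule can be pictured with a short sketch. This is an illustration of the behavior described above, not Reflexio's code; the `RequestRecord` type and its fields are hypothetical stand-ins for whatever your own request log contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical request-log row; field names are assumptions."""
    session_id: str
    created_at: float  # any sortable timestamp
    source: str


def session_sources(requests):
    """Map each session_id to the source of its earliest request."""
    first = {}
    for req in requests:
        seen = first.get(req.session_id)
        if seen is None or req.created_at < seen.created_at:
            first[req.session_id] = req
    return {sid: req.source for sid, req in first.items()}


requests = [
    RequestRecord("s1", 1.0, "prod_with_reflexio"),
    RequestRecord("s1", 2.0, "prod_without_reflexio"),  # later source is ignored
    RequestRecord("s2", 1.5, "prod_without_reflexio"),
]
print(session_sources(requests))
```

Session `s1` stays in the `prod_with_reflexio` set even though its second request carries a different source.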

Integration

In your agent's session-start code, decide which source arm the session belongs to, then publish the first request with that source value. Common examples:

Source value	Meaning
`prod_without_reflexio`	Baseline sessions where Reflexio retrieval is disabled; publish with `evaluation_only=True` so they are evaluated but not learned from
`prod_with_reflexio`	Production sessions where Reflexio retrieval is enabled
`offline_candidate`	Offline or shadow candidate sessions published only for evaluation

Randomization is yours to own. A typical end-to-end pattern using ReflexioClient.publish_interaction:

import random

from reflexio import ReflexioClient
from reflexio.models.api_schema.domain.entities import InteractionData

client = ReflexioClient()

HOLDOUT_FRACTION = 0.10  # 10% control


def assign_session_source(session_id: str) -> str:
    """Return the source cohort for this session."""
    # Use a deterministic RNG keyed on session_id if you want assignments to
    # be reproducible across retries; otherwise plain random is fine.
    if random.random() < HOLDOUT_FRACTION:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


# session_id, user_message, user, and run_agent come from your own application.
source = assign_session_source(session_id)

if source == "prod_with_reflexio":
    rules = client.search(user_message)  # apply Reflexio retrieval
    evaluation_only = False
else:
    rules = []                            # control: skip retrieval
    evaluation_only = True                # evaluate the baseline, but do not learn from it

agent_response = run_agent(user_message, context=rules)

interactions = [
    InteractionData(role="user", content=user_message),
    InteractionData(role="assistant", content=agent_response),
]

client.publish_interaction(
    user_id=user.id,
    interactions=interactions,
    session_id=session_id,
    source=source,
    agent_version="v1",
    evaluation_only=evaluation_only,
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "query": "<user_message>"
}
JSON

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "id",
  "interaction_data_list": [
    {
      "role": "user",
      "content": "<user_message>"
    },
    {
      "role": "assistant",
      "content": "<run_agent_result>"
    }
  ],
  "session_id": "<session_id>",
  "source": "<assign_session_source_result>",
  "agent_version": "v1",
  "evaluation_only": "<evaluation_only>"
}
JSON

Methodology Contract

Reflexio computes source-set comparisons from Request.source. The validity of a source-set gap as a causal claim depends on you randomizing source assignment. Without random assignment, the comparison is observational.

In practice:

Pick a holdout fraction (e.g., 5–10%) and assign new sessions to control at random.
Keep the assignment sticky per session — don't flip it mid-session.
Use stable, explicit source names for each source set.
Make sure the first request in each session carries the intended source.
Publish no-Reflexio baseline sessions with evaluation_only=True so they contribute to evaluation metrics without teaching Reflexio profiles or playbooks.
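
One way to satisfy both the random-assignment and the sticky-per-session points is to derive the arm from a hash of the session ID, so retries and later requests in the same session always land in the same arm. The sketch below is a minimal example under that assumption; the `salt` parameter and its default value are hypothetical, and hashing is only one of several reasonable approaches.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # 10% control


def assign_session_source(session_id: str, salt: str = "reflexio-holdout-v1") -> str:
    """Deterministically map a session to a source arm.

    The salt is a hypothetical experiment label: changing it reshuffles
    which sessions fall into the holdout.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < HOLDOUT_FRACTION:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


# The same session always gets the same arm.
print(assign_session_source("session-123") == assign_session_source("session-123"))
```

Because the assignment depends only on the session ID and the salt, it does not need to be stored to stay sticky, and it is unrelated to user properties or time of day as long as session IDs themselves are not generated from those.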

Without random assignment, lift is observational

If you assign source sets based on user properties, time of day, or any non-random rule, the comparison measures correlation, not causation. Treat the numbers as directional signals, not a controlled experiment.

Sampling and freshness

Regen jobs sample at most 200 sessions per (day × group) stratum by default, which keeps cost predictable as your traffic grows. The dashboard shows the sampled n for each point, so you can always see how many sessions back each number.
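
To make the per-stratum cap concrete, here is a minimal sketch of capped stratified sampling. It only illustrates the idea; how Reflexio actually selects sessions is not specified on this page, and the tuple layout, `cap`, and `seed` parameters are assumptions.

```python
import random
from collections import defaultdict


def sample_per_stratum(sessions, cap=200, seed=0):
    """Keep at most `cap` sessions per (day, group) stratum.

    sessions: iterable of (session_id, day, group) tuples (hypothetical layout).
    Returns a dict mapping (day, group) to the sampled session IDs.
    """
    strata = defaultdict(list)
    for session_id, day, group in sessions:
        strata[(day, group)].append(session_id)

    rng = random.Random(seed)  # seeded so the sample is reproducible
    return {
        key: ids if len(ids) <= cap else rng.sample(ids, cap)
        for key, ids in strata.items()
    }


# Made-up traffic: one busy stratum and one small one.
sessions = [(f"t{i}", "day-1", "prod_with_reflexio") for i in range(500)]
sessions += [(f"c{i}", "day-1", "prod_without_reflexio") for i in range(30)]

sampled = sample_per_stratum(sessions)
print({key: len(ids) for key, ids in sampled.items()})
```

The busy stratum is cut down to the cap while the small one is kept whole, which is why the sampled n can differ from point to point.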

If you click into a session that wasn't in the sampled set, Reflexio grades it on demand via POST /api/evaluations/grade_on_demand and caches the result for 24 hours.

To tune the defaults, set eval_sample_n_per_stratum and eval_concurrency_limit in your Config.

API Reference

The source_set_comparison field on GetEvaluationOverviewResponse carries available sources and per-set metrics. Send source_sets on GetEvaluationOverviewRequest to request labeled source-set comparisons.

The source field is declared on the Request model and is preserved across all storage backends (SQLite, Postgres, Supabase, and the disk YAML layout).

Per-turn comparison (F1)

The Evaluation page also renders a per-turn head-to-head win rate comparing your agent's response generated with Reflexio's retrieved context against one generated without it. The session-level source-set comparison (F2) measures session-level outcomes; F1 zooms in to individual turns so you can spot-check the judge and surface specific cases where the response without Reflexio was actually better.

To produce this signal, your agent code generates two responses per turn and uploads both on the same Interaction via the existing publish_interaction() API.

Integration

For each agent turn you want graded, generate the regular response WITH Reflexio rules in context, then re-run your LLM WITHOUT them to produce a shadow response. Publish both on the same agent interaction:

from reflexio import ReflexioClient, InteractionData

client = ReflexioClient()

# Retrieve Reflexio rules for this turn (your retrieval call)
rules = client.search(user_message)

# Generate the regular response WITH rules
regular_response = llm.generate(user_message, rules=rules)

# Generate the shadow response WITHOUT rules
shadow_response = llm.generate(user_message, rules=[])

# Publish both on the same agent Interaction
client.publish_interaction(
    user_id=user.id,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,        # what was served to the user
            shadow_content=shadow_response,  # used only for grading
        ),
    ],
    session_id=session_id,
    source="my-integration",
    agent_version="v1",
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "query": "<user_message>"
}
JSON

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "id",
  "interaction_data_list": [
    {
      "role": "User",
      "content": "<user_message>"
    },
    {
      "role": "Agent",
      "content": "<generate_result>",
      "shadow_content": "<generate_result>"
    }
  ],
  "session_id": "<session_id>",
  "source": "my-integration",
  "agent_version": "v1"
}
JSON

Where shadow_content lives

shadow_content is a field on every InteractionData / Interaction row, persisted across every storage backend. Only the agent turn needs a shadow_content — the judge compares Reflexio's response with the shadow against the same user message. Turns without a shadow_content are skipped by the F1 pipeline.
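
The skip rule can be sketched as a simple filter over a published interaction list. The dictionaries below mirror the JSON payload shown in the Integration example; pairing each agent turn with the most recent user message is an assumption made for this illustration, and `gradable_turns` is a hypothetical helper, not part of the Reflexio client.

```python
def gradable_turns(interactions):
    """Return (user_message, regular, shadow) for agent turns with shadow_content."""
    pairs = []
    last_user_message = None
    for item in interactions:
        if item["role"] == "User":
            last_user_message = item["content"]
        elif item.get("shadow_content"):
            pairs.append((last_user_message, item["content"], item["shadow_content"]))
        # Agent turns without shadow_content fall through and are skipped.
    return pairs


interactions = [
    {"role": "User", "content": "How do I reset my password?"},
    {"role": "Agent", "content": "Use the reset link.", "shadow_content": "Contact support."},
    {"role": "User", "content": "Thanks!"},
    {"role": "Agent", "content": "You're welcome."},  # no shadow: not graded
]
print(gradable_turns(interactions))
```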

What the dashboard shows

A Per-turn comparison tile with the headline win rate over the selected window (e.g. 67% win rate · 134 wins · 52 ties · 14 losses · n=200 sampled), plus a daily trend sparkline.
A View recent comparisons drawer with the most recent judged turns, each rendered side-by-side with the judge's rationale. Backed by GET /api/evaluations/shadow_comparisons/recent.
A Top disagreements widget listing turns where the shadow response was significantly better than the Reflexio-augmented one (output.is_significantly_better=True losses) — actionable cases for updating your Reflexio rules.
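
The example figures on the tile are consistent with a win rate computed as wins divided by all judged turns, ties included. The exact formula is not stated on this page, so treat the sketch below as an inference from those numbers rather than a specification.

```python
def win_rate(wins: int, ties: int, losses: int) -> float:
    """Share of judged turns that Reflexio's response won, counting ties in the total."""
    n = wins + ties + losses
    return wins / n if n else 0.0


# Figures from the example tile: 134 wins, 52 ties, 14 losses, n=200.
print(f"{win_rate(134, 52, 14):.0%}")  # 67%
```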

Methodology contract

The judge compares the two responses against the same user message in isolation; no prior conversation history is shown to the judge.
Position (Request 1 vs Request 2) is randomized per call to mitigate LLM judge position bias. The mapping is recorded on ShadowComparisonVerdict.reflexio_is_request_1 so wins/losses can be derived deterministically downstream.
The judge prompt version is pinned per org via Config.shadow_comparison_judge_prompt_version. Verdicts are stored with the version that produced them; the dashboard filters to your currently pinned version so a future rubric bump never silently mixes epochs into the headline number.
Sampling: regen jobs are stratified per day at Config.eval_sample_n_per_stratum (default 200) so cost stays predictable. Clicking into a non-sampled session triggers on-demand grading via POST /api/evaluations/grade_on_demand with a 24h cache.
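
The position-randomization point can be sketched as follows. Only the `reflexio_is_request_1` name comes from this page; the helper functions and the 0/1/2 encoding of the judge's pick are hypothetical, chosen to show how a win or loss can be derived from the recorded mapping.

```python
import random


def assign_positions(reflexio_response, shadow_response, rng):
    """Randomly decide which response the judge sees as Request 1."""
    reflexio_is_request_1 = rng.random() < 0.5
    if reflexio_is_request_1:
        return reflexio_response, shadow_response, True
    return shadow_response, reflexio_response, False


def outcome_for_reflexio(judge_winner, reflexio_is_request_1):
    """Translate the judge's pick into a Reflexio win, loss, or tie.

    judge_winner: 1 or 2 for the winning position, 0 for a tie (assumed encoding).
    """
    if judge_winner == 0:
        return "tie"
    reflexio_won = (judge_winner == 1) == reflexio_is_request_1
    return "win" if reflexio_won else "loss"


rng = random.Random(42)
request_1, request_2, flag = assign_positions("with rules", "without rules", rng)
print(request_1, "|", request_2, "|", flag)
print(outcome_for_reflexio(1, flag))
```

Because the mapping is stored with the verdict, the outcome can be recomputed later without re-running the judge.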

Where the data lands in the API

The win-rate trend is carried on GetEvaluationOverviewResponse.shadow_win_rate_trend (see the ShadowWinRateTrend schema). Individual verdicts are returned by the recent-verdicts endpoint above; each verdict is a ShadowComparisonVerdict wrapping a ShadowComparisonOutput.