Evaluating Agent Performance
Configure session-level agent success evaluation, publish evaluation-only sessions, and compare evaluation signals.
Reflexio evaluates agent success at the session level. A session is the set of requests that share a session_id; the evaluator reads the user turns, agent turns, tools used, and your success rubric, then writes an AgentSuccessEvaluationResult.
For method-level details, see the Evaluation API Reference, Evaluation Models, and publish_interaction.
How Evaluation Runs
When you publish a request with a session_id, Reflexio can schedule a group evaluation for that session:
- The request and interactions are stored.
- The session passes the deterministic sampling gate from agent_success_config.sampling_rate.
- The scheduler waits for session inactivity, then evaluates the full session.
- The result is stored with user_id, session_id, agent_version, success/failure fields, and session metrics.
The default inactivity delay is 10 minutes after the latest request in the session. Publishing another request with the same session_id moves the scheduled evaluation later, so multi-turn sessions are evaluated after they settle.
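The way a new request pushes the evaluation later can be pictured with a small model. This is only a sketch, not Reflexio's scheduler: the `SessionSchedule` class and its method names are hypothetical, and the only value taken from this page is the 10-minute default delay.

```python
from dataclasses import dataclass, field

# Default inactivity delay stated on this page: 10 minutes.
INACTIVITY_DELAY_SECONDS = 10 * 60


@dataclass
class SessionSchedule:
    """Hypothetical model of per-session evaluation scheduling."""

    # session_id -> timestamp (in seconds) at which the evaluation is due
    due_at: dict[str, float] = field(default_factory=dict)

    def on_request(self, session_id: str, request_ts: float) -> float:
        """Record a request and move the session's evaluation later."""
        due = request_ts + INACTIVITY_DELAY_SECONDS
        # The latest request in the session decides when evaluation is due.
        self.due_at[session_id] = max(due, self.due_at.get(session_id, 0.0))
        return self.due_at[session_id]

    def ready(self, now: float) -> list[str]:
        """Return the sessions whose inactivity window has elapsed."""
        return sorted(s for s, due in self.due_at.items() if now >= due)


schedule = SessionSchedule()
schedule.on_request("session_001", request_ts=0.0)    # due at 600
schedule.on_request("session_001", request_ts=300.0)  # due moves to 900
print(schedule.ready(now=600.0))  # []
print(schedule.ready(now=900.0))  # ['session_001']
```

In this model a session with steady traffic is never evaluated mid-conversation; it only becomes ready once a full delay passes with no new request.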
Evaluation sampling happens once per session. The default sampling rate is 0.05, so only about 5% of sessions are evaluated automatically unless you raise it.
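A deterministic, once-per-session gate can be sketched by hashing the session_id into a number in [0, 1) and comparing it with the sampling rate. This page does not document how Reflexio actually computes the gate, so the SHA-256 bucketing and the `passes_sampling_gate` helper below are assumptions for illustration only.

```python
import hashlib


def passes_sampling_gate(session_id: str, sampling_rate: float) -> bool:
    """Deterministically decide whether a session is sampled.

    Assumption: the session_id is hashed to a bucket in [0, 1) and compared
    with sampling_rate. Reflexio's real scheme may differ.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # 6 bytes fit exactly in a float, so the bucket is always below 1.0.
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < sampling_rate


# The same session_id always gets the same answer, however many requests it has.
session_ids = [f"session_{i:05d}" for i in range(10_000)]
sampled = [s for s in session_ids if passes_sampling_gate(s, 0.05)]
print(f"sampled {len(sampled)} of {len(session_ids)} sessions")
```

Because the decision depends only on the session_id, later requests in a sampled session stay sampled and unsampled sessions stay out.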
Configure Evaluation
Use agent_success_config to define the success rubric and sampling rate. Use root-level tool_can_use to tell the evaluator what tools the agent had available; this same tool context is also shared with playbook extraction.
from reflexio import ReflexioClient
from reflexio.models.config_schema import AgentSuccessConfig, ToolUseConfig

client = ReflexioClient()
config = client.get_config()
config.agent_success_config = AgentSuccessConfig(
    success_definition_prompt="""
Evaluate whether the agent successfully resolved the user's task.
Success means:
- The agent understood the user's goal.
- The answer or action directly addressed that goal.
- Any required next step was clear.
- The user did not need to correct, repeat, or escalate the request.
""",
    request_sources_enabled=["prod_with_reflexio", "prod_without_reflexio"],
    sampling_rate=1.0,  # useful during launch or audit windows
)
config.tool_can_use = [
    ToolUseConfig(
        tool_name="search_docs",
        tool_description="Search product documentation for grounded answers.",
    ),
    ToolUseConfig(
        tool_name="create_ticket",
        tool_description="Create a support ticket when the user needs follow-up.",
    ),
]
client.set_config(config)

curl -X GET "${REFLEXIO_URL:-https://www.reflexio.ai}/api/get_config" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY"
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/set_config" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"...": "updated full config object"
}
JSON
Key fields:
| Field | Purpose |
|---|---|
| success_definition_prompt | The rubric the LLM judge uses to decide whether the session succeeded. |
| sampling_rate | Fraction of sessions to evaluate automatically, from 0.0 to 1.0. |
| request_sources_enabled | Optional allowlist of source values eligible for evaluation. |
| metadata_definition_prompt | Optional categories for evaluation metadata. |
| tool_can_use | Root config list describing tools available to the agent. |
Set agent_success_config=None to disable automatic agent success evaluation.
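When choosing sampling_rate and request_sources_enabled, it can help to estimate how many sessions the evaluator will grade. The sketch below is a planning aid, not Reflexio code: `EvalSettings` is a stand-in for the two documented fields, and treating a missing or empty allowlist as "every source is eligible" is an assumption based on the field being optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalSettings:
    """Stand-in for two agent_success_config fields (not the real class)."""

    sampling_rate: float = 0.05
    request_sources_enabled: Optional[list[str]] = None


def source_is_eligible(settings: Optional[EvalSettings], source: str) -> bool:
    """Return True if sessions from this source could be evaluated."""
    if settings is None:
        # agent_success_config=None disables automatic evaluation.
        return False
    allowlist = settings.request_sources_enabled
    # Assumption: no allowlist means every source is eligible.
    return not allowlist or source in allowlist


def expected_evaluations(
    settings: Optional[EvalSettings], sessions_by_source: dict[str, int]
) -> float:
    """Estimate how many sessions will be graded for a given traffic mix."""
    if settings is None:
        return 0.0
    eligible = sum(
        count
        for source, count in sessions_by_source.items()
        if source_is_eligible(settings, source)
    )
    return eligible * settings.sampling_rate


settings = EvalSettings(
    sampling_rate=0.05,
    request_sources_enabled=["prod_with_reflexio", "prod_without_reflexio"],
)
traffic = {"prod_with_reflexio": 9000, "prod_without_reflexio": 1000, "staging": 500}
print(f"about {expected_evaluations(settings, traffic):.0f} evaluated sessions")
```

The traffic numbers and the "staging" source are made up for the example.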
Publish Sessions for Normal Evaluation
For ordinary production traffic, publish the same interactions you use for learning. These requests can contribute to profiles, playbooks, reflection, aggregation, and evaluation.
from reflexio import InteractionData, ReflexioClient, UserActionType

client = ReflexioClient()
client.publish_interaction(
    user_id="user_123",
    session_id="session_001",
    source="prod_with_reflexio",
    agent_version="v2.1.0",
    interactions=[
        InteractionData(
            role="User",
            content="Can you help me reset my account password?",
            user_action=UserActionType.NONE,
        ),
        InteractionData(
            role="Agent",
            content="Yes. Open Account Settings, choose Security, then select Reset Password.",
            user_action=UserActionType.NONE,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "user_123",
"session_id": "session_001",
"source": "prod_with_reflexio",
"agent_version": "v2.1.0",
"interaction_data_list": [
{
"role": "User",
"content": "Can you help me reset my account password?",
"user_action": "none"
},
{
"role": "Agent",
"content": "Yes. Open Account Settings, choose Security, then select Reset Password.",
"user_action": "none"
}
]
}
JSON
Use stable agent_version values when comparing releases. Use stable source values when comparing cohorts.
Publish Evaluation-Only Sessions
Use evaluation_only=True for traffic you want Reflexio to grade but not learn from. The common comparison workflow is a no-Reflexio baseline arm: your agent skips Reflexio retrieval for that session, then still publishes the transcript with a baseline source so the Evaluation page can compare it against Reflexio-enabled sessions.
from reflexio import InteractionData, ReflexioClient, UserActionType

client = ReflexioClient()
# Baseline/control arm: the agent did not use Reflexio context, but the
# transcript is still published so it can be evaluated and compared.
client.publish_interaction(
    user_id="user_123",
    session_id="session_001",
    source="prod_without_reflexio",
    agent_version="v2.1.0",
    evaluation_only=True,
    interactions=[
        InteractionData(
            role="User",
            content="Can you help me reset my account password?",
            user_action=UserActionType.NONE,
        ),
        InteractionData(
            role="Agent",
            content="Open Account Settings, choose Security, then select Reset Password.",
            user_action=UserActionType.NONE,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "user_123",
"session_id": "session_001",
"source": "prod_without_reflexio",
"agent_version": "v2.1.0",
"evaluation_only": true,
"interaction_data_list": [
{
"role": "User",
"content": "Can you help me reset my account password?",
"user_action": "none"
},
{
"role": "Agent",
"content": "Open Account Settings, choose Security, then select Reset Password.",
"user_action": "none"
}
]
}
JSON
evaluation_only=True means:
- The request and interactions are stored.
- The session can be evaluated if it passes sampling_rate.
- The session is still grouped by source, such as prod_without_reflexio.
- The request is excluded from profile extraction, playbook extraction, reflection, and aggregation.
- The request still waits for the normal session-inactivity delay before evaluation.
- The flag requires a non-empty session_id.
- The flag cannot be combined with force_extraction=True.
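The last two rules in the list above can be checked on the client before a request is sent. The helper below is a hypothetical pre-flight check that builds the JSON body used in the curl examples; `build_publish_payload` is not part of the Reflexio SDK, and the `force_extraction` JSON key is assumed to match the Python parameter name.

```python
from typing import Any, Optional


def build_publish_payload(
    user_id: str,
    interactions: list[dict[str, Any]],
    session_id: Optional[str] = None,
    source: str = "",
    agent_version: str = "",
    evaluation_only: bool = False,
    force_extraction: bool = False,
) -> dict[str, Any]:
    """Build a publish_interaction body, enforcing the evaluation_only rules."""
    if evaluation_only and not session_id:
        raise ValueError("evaluation_only=True requires a non-empty session_id")
    if evaluation_only and force_extraction:
        raise ValueError(
            "evaluation_only=True cannot be combined with force_extraction=True"
        )
    payload: dict[str, Any] = {
        "user_id": user_id,
        "source": source,
        "agent_version": agent_version,
        "evaluation_only": evaluation_only,
        "interaction_data_list": interactions,
    }
    if session_id:
        payload["session_id"] = session_id
    if force_extraction:
        # Assumed key name; only the Python parameter is documented on this page.
        payload["force_extraction"] = True
    return payload


body = build_publish_payload(
    user_id="user_123",
    session_id="session_001",
    source="prod_without_reflexio",
    agent_version="v2.1.0",
    evaluation_only=True,
    interactions=[{"role": "User", "content": "Can you help me reset my account password?"}],
)
print(body["evaluation_only"], body["session_id"])  # True session_001
```

Failing fast on the client keeps invalid baseline requests out of retry queues; the server-side behavior for these combinations is not described here.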
Good uses for evaluation_only=True:
- baseline/control sessions where the agent did not use Reflexio
- offline eval sets that should not teach Reflexio
- candidate model or prompt runs
- replayed historical sessions that should not teach Reflexio new rules
- holdout or shadow traffic that you want to grade but keep out of learning windows
Compare Source Sets
The Evaluation page and overview API compare cohorts by Request.source. Reflexio assigns the whole session to the source on the first request in that session. For a Reflexio lift measurement, publish the no-Reflexio baseline with source="prod_without_reflexio" and evaluation_only=True, then publish the Reflexio-enabled test arm with source="prod_with_reflexio".
import random

from reflexio import InteractionData, ReflexioClient

client = ReflexioClient()

# user_id, session_id, user_message, and run_agent come from your application.
# Roughly 10% of traffic goes to the no-Reflexio baseline arm.
use_reflexio = random.random() >= 0.10
source = "prod_with_reflexio" if use_reflexio else "prod_without_reflexio"
if use_reflexio:
    reflexio_context = client.search(user_message)
    agent_response = run_agent(user_message, context=reflexio_context)
else:
    agent_response = run_agent(user_message, context=[])
client.publish_interaction(
    user_id=user_id,
    session_id=session_id,
    source=source,
    agent_version="v2.1.0",
    evaluation_only=not use_reflexio,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(role="Agent", content=agent_response),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"query": "<user_message>"
}
JSON
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "<user_id>",
"session_id": "<session_id>",
"source": "<'prod_with_reflexio' if use_reflexio else 'prod_without_reflexio'>",
"agent_version": "v2.1.0",
"evaluation_only": "<not use_reflexio>",
"interaction_data_list": [
{
"role": "User",
"content": "<user_message>"
},
{
"role": "Agent",
"content": "<agent_response>"
}
]
}
JSON
Use random assignment if you want a causal measurement. If the source sets are chosen by user type, geography, time of day, or any other non-random rule, treat the comparison as observational.
evaluation_only is not a cohort label. Use source for source-set comparison and evaluation_only=True only to control whether the request can teach Reflexio.
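Because the whole session takes its source from the first request, a multi-turn session should stay in the same arm on every turn. One way to get that without storing state is to derive the arm from a hash of the session_id. This is a sketch under assumptions: the `assign_source` helper and the `reflexio-lift` salt are hypothetical, and the assignment is only as good as random if session ids are unrelated to user type, geography, or time of day.

```python
import hashlib

BASELINE_FRACTION = 0.10  # share of sessions sent to the no-Reflexio arm


def assign_source(session_id: str, baseline_fraction: float = BASELINE_FRACTION) -> str:
    """Pick a stable arm for a session by hashing its id into [0, 1)."""
    # The salt keeps this split independent of other hash-based splits.
    digest = hashlib.sha256(f"reflexio-lift:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    if bucket < baseline_fraction:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


source = assign_source("session_001")
use_reflexio = source == "prod_with_reflexio"
evaluation_only = not use_reflexio  # baseline sessions are graded but not learned from

# Every turn of the same session lands in the same arm.
assert assign_source("session_001") == source
print(source, evaluation_only)
```

The resulting `source` and `evaluation_only` values can be passed to publish_interaction on each turn in place of a fresh random draw.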
Publish Shadow Responses
For per-turn comparison, publish the response served to the user in content and the alternate response in shadow_content. Reflexio judges the two responses against the same user message and stores per-turn shadow comparison verdicts.
from reflexio import InteractionData, ReflexioClient

client = ReflexioClient()

# run_agent_with_reflexio, run_agent_without_reflexio, user_id, session_id,
# and user_message come from your application.
regular_response = run_agent_with_reflexio(user_message)
shadow_response = run_agent_without_reflexio(user_message)
client.publish_interaction(
    user_id=user_id,
    session_id=session_id,
    source="prod_with_reflexio",
    agent_version="v2.1.0",
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,
            shadow_content=shadow_response,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"user_id": "<user_id>",
"session_id": "<session_id>",
"source": "prod_with_reflexio",
"agent_version": "v2.1.0",
"interaction_data_list": [
{
"role": "User",
"content": "<user_message>"
},
{
"role": "Agent",
"content": "<run_agent_with_reflexio_result>",
"shadow_content": "<run_agent_without_reflexio_result>"
}
]
}
JSON
Shadow comparison is per turn. Agent success evaluation is still session-level.
Grade or Regenerate Explicitly
Automatic evaluation waits for the session inactivity delay. Two API surfaces are available when you need explicit evaluation work:
| Mode | API surface | Behavior |
|---|---|---|
| On-demand grade (single session) | POST /api/evaluations/grade_on_demand. Best for UI actions, ops tools, and launch checklists. | Grades one session now. Skips the inactivity delay and already-evaluated checks; caches results for 24 hours per session/version. |
| Regenerate window (time range) | POST /api/evaluations/regenerate, then GET /api/evaluations/regenerate/{job_id}. Best for historical re-scoring after rubric, model, or prompt changes. | Replays evaluation across a time window. Returns a job id, then exposes progress and completion status through polling. |
Example on-demand grade:
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/evaluations/grade_on_demand" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"session_id": "session_001",
"agent_version": "v2.1.0"
}
JSON
Example regenerate job:
curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/evaluations/regenerate" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"from_ts": 1764547200,
"to_ts": 1765152000
}
JSON
See the Evaluation Models reference for job status and response shapes.
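The from_ts and to_ts values in the regenerate example look like Unix epoch seconds: read that way, they are 2025-12-01 and 2025-12-08 at 00:00 UTC, a seven-day window. Assuming that interpretation holds (this page does not state the unit), a request body can be built from timezone-aware datetimes. The `regenerate_window` helper is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def regenerate_window(start: datetime, days: int) -> dict[str, int]:
    """Build from_ts/to_ts (assumed epoch seconds) for a window of whole days."""
    if start.tzinfo is None:
        # Naive datetimes would be read in local time and shift the window.
        raise ValueError("start must be timezone-aware")
    end = start + timedelta(days=days)
    return {"from_ts": int(start.timestamp()), "to_ts": int(end.timestamp())}


body = regenerate_window(datetime(2025, 12, 1, tzinfo=timezone.utc), days=7)
print(json.dumps(body))  # {"from_ts": 1764547200, "to_ts": 1765152000}
```

Rejecting naive datetimes avoids windows that silently shift with the machine's local time zone.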
Read Results
Use get_agent_success_evaluation_results for raw session-level results.
from reflexio import ReflexioClient

client = ReflexioClient()
response = client.get_agent_success_evaluation_results(
    agent_version="v2.1.0",
    limit=100,
)
results = response.agent_success_evaluation_results
total = len(results)
successful = sum(1 for result in results if result.is_success)
success_rate = successful / total if total else 0
print(f"Success rate: {success_rate:.1%} ({successful}/{total})")
# List the failed sessions with the evaluator's failure classification.
for result in results:
    if not result.is_success:
        print(result.session_id, result.failure_type, result.failure_reason)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/get_agent_success_evaluation_results" \
-H "User-Agent: my-agent-reflexio" \
-H "Authorization: Bearer $REFLEXIO_API_KEY" \
-H "Content-Type: application/json" \
--data @- <<'JSON'
{
"agent_version": "v2.1.0",
"limit": 100
}
JSON
Result fields include:
- is_success
- failure_type
- failure_reason
- number_of_correction_per_session
- user_turns_to_resolution
- is_escalated
- agent_version
- user_id
- session_id
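Once results are fetched, comparing releases usually means grouping by agent_version and tallying failure types. The sketch below runs on a stand-in `EvalResult` dataclass that carries a few of the result fields named on this page, so it works without a Reflexio connection; the helper names and the sample failure_type value are made up.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class EvalResult:
    """Stand-in carrying a subset of the documented result fields."""

    session_id: str
    agent_version: str
    is_success: bool
    failure_type: Optional[str] = None


def success_rate_by_version(results: Iterable[EvalResult]) -> dict[str, float]:
    """Return the share of successful sessions for each agent_version."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for result in results:
        totals[result.agent_version][0] += int(result.is_success)
        totals[result.agent_version][1] += 1
    return {version: ok / total for version, (ok, total) in totals.items()}


def failure_breakdown(results: Iterable[EvalResult]) -> Counter:
    """Count failed sessions by failure_type."""
    return Counter(r.failure_type or "unknown" for r in results if not r.is_success)


sample = [
    EvalResult("s1", "v2.0.0", True),
    EvalResult("s2", "v2.0.0", False, "misunderstood_goal"),  # made-up failure type
    EvalResult("s3", "v2.1.0", True),
    EvalResult("s4", "v2.1.0", True),
]
print(success_rate_by_version(sample))  # {'v2.0.0': 0.5, 'v2.1.0': 1.0}
print(failure_breakdown(sample))
```

With real data, the same functions should accept the objects in response.agent_success_evaluation_results, provided they expose these attribute names.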
Choose the Right Evaluation Path
sampling_rate: Production monitoring
Continuously monitor production quality with normal publish traffic.
- Path: Normal publish
- Use when: You want a steady quality signal while controlling evaluator cost.

agent_version: Launch or migration check
Temporarily raise evaluation coverage and compare results by version.
- Path: Raise sampling, then filter by version
- Use when: You are validating a prompt, model, or agent release.

source + evaluation_only: Reflexio vs no Reflexio
Publish the no-Reflexio baseline and Reflexio-enabled test arm for comparison.
- Path: Baseline source-set comparison
- Use when: You need to measure Reflexio's impact across matched traffic.

evaluation_only=True: Grade without learning
Publish no-Reflexio baseline traffic while excluding it from learning.
- Path: Evaluation-only baseline publish
- Use when: You need baseline sessions in evaluation metrics but not in Reflexio memory.

shadow_content: Compare two responses
Attach the alternate response to the same user turn for per-turn judging.
- Path: Shadow response comparison
- Use when: You want to compare served and alternate answers side by side.

regenerate: Re-score historical sessions
Start a regenerate job after changing the rubric or evaluation window.
- Path: Regenerate job
- Use when: You need fresh scores for existing sessions.

grade_on_demand: Grade one session now
Trigger a single-session grade from the UI or an operations tool.
- Path: Grade on demand
- Use when: You need a result before the inactivity delay completes.