Evaluating Agent Performance

Configure session-level agent success evaluation, publish evaluation-only sessions, and compare evaluation signals.

Reflexio evaluates agent success at the session level. A session is the set of requests that share a session_id; the evaluator reads the user turns, agent turns, tools used, and your success rubric, then writes an AgentSuccessEvaluationResult.

For method-level details, see the Evaluation API Reference, Evaluation Models, and publish_interaction.

How Evaluation Runs

When you publish a request with a session_id, Reflexio can schedule a group evaluation for that session:

The request and interactions are stored.
The session passes the deterministic sampling gate from agent_success_config.sampling_rate.
The scheduler waits for session inactivity, then evaluates the full session.
The result is stored with user_id, session_id, agent_version, success/failure fields, and session metrics.

The default inactivity delay is 10 minutes after the latest request in the session. Publishing another request with the same session_id moves the scheduled evaluation later, so multi-turn sessions are evaluated after they settle.
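
The rescheduling behavior can be pictured as a debounce timer. The sketch below is not Reflexio's scheduler; `InactivityScheduler` and its methods are hypothetical names that only model the rule described above, where the evaluation time is the latest request time plus the delay.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical model of the inactivity rule; not Reflexio's implementation.
INACTIVITY_DELAY = timedelta(minutes=10)  # default delay stated above


class InactivityScheduler:
    """Tracks when each session becomes eligible for evaluation."""

    def __init__(self, delay: timedelta = INACTIVITY_DELAY) -> None:
        self.delay = delay
        self.due_at = {}  # session_id -> datetime when evaluation may run

    def record_request(self, session_id: str, published_at: datetime) -> datetime:
        # A newer request replaces the due time, pushing evaluation later.
        due = published_at + self.delay
        current = self.due_at.get(session_id)
        if current is None or due > current:
            self.due_at[session_id] = due
        return self.due_at[session_id]

    def ready_sessions(self, now: datetime) -> list:
        # Sessions whose latest request is at least `delay` old.
        return sorted(s for s, due in self.due_at.items() if due <= now)


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
scheduler = InactivityScheduler()
scheduler.record_request("session_001", start)
scheduler.record_request("session_001", start + timedelta(minutes=4))

print(scheduler.ready_sessions(start + timedelta(minutes=10)))  # []
print(scheduler.ready_sessions(start + timedelta(minutes=14)))  # ['session_001']
```

The second request at minute 4 moves the due time from minute 10 to minute 14, which is why the session is not ready at minute 10.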

Evaluation sampling happens once per session. The default sampling rate is 0.05, so only about 5% of sessions are evaluated automatically unless you raise it.
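
A deterministic gate means the same session always gets the same decision, however many requests it contains. The sketch below shows one common way to build such a gate by hashing the session_id. The hashing scheme and the `is_sampled` helper are assumptions for illustration; Reflexio's actual gate may work differently.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate: float) -> bool:
    """Hypothetical gate: map session_id to a stable value in [0, 1)."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


session_ids = [f"session_{i:05d}" for i in range(20_000)]
sampled = sum(is_sampled(s, 0.05) for s in session_ids)
print(f"{sampled} of {len(session_ids)} sessions sampled")
```

With a rate of 0.05 the count should land near 1,000 of 20,000, and repeating the call for any one session always returns the same answer.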

Configure Evaluation

Use agent_success_config to define the success rubric and sampling rate. Use root-level tool_can_use to tell the evaluator what tools the agent had available; this same tool context is also shared with playbook extraction.

from reflexio import ReflexioClient
from reflexio.models.config_schema import AgentSuccessConfig, ToolUseConfig

client = ReflexioClient()
config = client.get_config()

config.agent_success_config = AgentSuccessConfig(
    success_definition_prompt="""
Evaluate whether the agent successfully resolved the user's task.

Success means:
- The agent understood the user's goal.
- The answer or action directly addressed that goal.
- Any required next step was clear.
- The user did not need to correct, repeat, or escalate the request.
""",
    request_sources_enabled=["prod_with_reflexio", "prod_without_reflexio"],
    sampling_rate=1.0,  # useful during launch or audit windows
)

config.tool_can_use = [
    ToolUseConfig(
        tool_name="search_docs",
        tool_description="Search product documentation for grounded answers.",
    ),
    ToolUseConfig(
        tool_name="create_ticket",
        tool_description="Create a support ticket when the user needs follow-up.",
    ),
]

client.set_config(config)

curl -X GET "${REFLEXIO_URL:-https://www.reflexio.ai}/api/get_config" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY"

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/set_config" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "...": "updated full config object"
}
JSON

Key fields:

Field	Purpose
`success_definition_prompt`	The rubric the LLM judge uses to decide whether the session succeeded.
`sampling_rate`	Fraction of sessions to evaluate automatically, from `0.0` to `1.0`.
`request_sources_enabled`	Optional allowlist of `source` values eligible for evaluation.
`metadata_definition_prompt`	Optional categories for evaluation metadata.
`tool_can_use`	Root config list describing tools available to the agent.

Set agent_success_config=None to disable automatic agent success evaluation.

Publish Sessions for Normal Evaluation

For ordinary production traffic, publish the same interactions you use for learning. These requests can contribute to profiles, playbooks, reflection, aggregation, and evaluation.

from reflexio import InteractionData, ReflexioClient, UserActionType

client = ReflexioClient()

client.publish_interaction(
    user_id="user_123",
    session_id="session_001",
    source="prod_with_reflexio",
    agent_version="v2.1.0",
    interactions=[
        InteractionData(
            role="User",
            content="Can you help me reset my account password?",
            user_action=UserActionType.NONE,
        ),
        InteractionData(
            role="Agent",
            content="Yes. Open Account Settings, choose Security, then select Reset Password.",
            user_action=UserActionType.NONE,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "user_123",
  "session_id": "session_001",
  "source": "prod_with_reflexio",
  "agent_version": "v2.1.0",
  "interaction_data_list": [
    {
      "role": "User",
      "content": "Can you help me reset my account password?",
      "user_action": "none"
    },
    {
      "role": "Agent",
      "content": "Yes. Open Account Settings, choose Security, then select Reset Password.",
      "user_action": "none"
    }
  ]
}
JSON

Use stable agent_version values when comparing releases. Use stable source values when comparing cohorts.

Publish Evaluation-Only Sessions

Use evaluation_only=True for traffic you want Reflexio to grade but not learn from. The common comparison workflow is a no-Reflexio baseline arm: your agent skips Reflexio retrieval for that session, then still publishes the transcript with a baseline source so the Evaluation page can compare it against Reflexio-enabled sessions.

from reflexio import InteractionData, ReflexioClient, UserActionType

client = ReflexioClient()

# Baseline/control arm: the agent did not use Reflexio context, but the
# transcript is still published so it can be evaluated and compared.
client.publish_interaction(
    user_id="user_123",
    session_id="session_001",
    source="prod_without_reflexio",
    agent_version="v2.1.0",
    evaluation_only=True,
    interactions=[
        InteractionData(
            role="User",
            content="Can you help me reset my account password?",
            user_action=UserActionType.NONE,
        ),
        InteractionData(
            role="Agent",
            content="Open Account Settings, choose Security, then select Reset Password.",
            user_action=UserActionType.NONE,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "user_123",
  "session_id": "session_001",
  "source": "prod_without_reflexio",
  "agent_version": "v2.1.0",
  "evaluation_only": true,
  "interaction_data_list": [
    {
      "role": "User",
      "content": "Can you help me reset my account password?",
      "user_action": "none"
    },
    {
      "role": "Agent",
      "content": "Open Account Settings, choose Security, then select Reset Password.",
      "user_action": "none"
    }
  ]
}
JSON

evaluation_only=True means:

The request and interactions are stored.
The session can be evaluated if it passes sampling_rate.
The session is still grouped by source, such as prod_without_reflexio.
The request is excluded from profile extraction, playbook extraction, reflection, and aggregation.
The request still waits for the normal session-inactivity delay before evaluation.
The flag requires a non-empty session_id.
The flag cannot be combined with force_extraction=True.
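
The last two constraints are easy to check before a request leaves your service. The helper below is a hypothetical client-side pre-flight check, not part of the Reflexio SDK; it only encodes the two rules listed above.

```python
from typing import List, Optional


def check_evaluation_only(
    session_id: Optional[str],
    evaluation_only: bool = False,
    force_extraction: bool = False,
) -> List[str]:
    """Return a list of problems; an empty list means the flags are consistent."""
    problems: List[str] = []
    if evaluation_only:
        if not session_id or not session_id.strip():
            problems.append("evaluation_only=True requires a non-empty session_id")
        if force_extraction:
            problems.append(
                "evaluation_only=True cannot be combined with force_extraction=True"
            )
    return problems


print(check_evaluation_only("session_001", evaluation_only=True))  # []
print(len(check_evaluation_only("", evaluation_only=True, force_extraction=True)))  # 2
```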

Good uses for evaluation_only=True:

baseline/control sessions where the agent did not use Reflexio
offline eval sets that should not teach Reflexio
candidate model or prompt runs
replayed historical sessions that should not teach Reflexio new rules
holdout or shadow traffic that you want to grade but keep out of learning windows

Compare Source Sets

The Evaluation page and overview API compare cohorts by Request.source. Reflexio assigns the whole session to the source on the first request in that session. For a Reflexio lift measurement, publish the no-Reflexio baseline with source="prod_without_reflexio" and evaluation_only=True, then publish the Reflexio-enabled test arm with source="prod_with_reflexio".

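import random

from reflexio import InteractionData, ReflexioClient

client = ReflexioClient()

# Draw the arm once per session (about 10% baseline) and reuse it for every
# request in that session. user_id, session_id, user_message, and run_agent
# come from your application.
use_reflexio = random.random() >= 0.10
source = "prod_with_reflexio" if use_reflexio else "prod_without_reflexio"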

if use_reflexio:
    reflexio_context = client.search(user_message)
    agent_response = run_agent(user_message, context=reflexio_context)
else:
    agent_response = run_agent(user_message, context=[])

client.publish_interaction(
    user_id=user_id,
    session_id=session_id,
    source=source,
    agent_version="v2.1.0",
    evaluation_only=not use_reflexio,
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(role="Agent", content=agent_response),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/search" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "query": "<user_message>"
}
JSON

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "<user_id>",
  "session_id": "<session_id>",
  "source": "<'prod_with_reflexio' if use_reflexio else 'prod_without_reflexio'>",
  "agent_version": "v2.1.0",
  "evaluation_only": "<not use_reflexio>",
  "interaction_data_list": [
    {
      "role": "User",
      "content": "<user_message>"
    },
    {
      "role": "Agent",
      "content": "<agent_response>"
    }
  ]
}
JSON

Use random assignment if you want a causal measurement. If the source sets are chosen by user type, geography, time of day, or any other non-random rule, treat the comparison as observational.
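
Because a session takes its source from its first request, the arm should be fixed per session rather than re-drawn on every turn. One way to get that without shared state, sketched below, is to seed a random generator from the session_id. The `assign_source` helper, the salt, and the 10% baseline share are illustrative assumptions.

```python
import random

BASELINE_SHARE = 0.10  # assumed share of sessions routed to the baseline arm


def assign_source(session_id: str, salt: str = "lift-test-1") -> str:
    """Hypothetical helper: the same session_id always yields the same arm."""
    # Seeding with a string is deterministic across processes and restarts.
    rng = random.Random(f"{salt}:{session_id}")
    if rng.random() < BASELINE_SHARE:
        return "prod_without_reflexio"
    return "prod_with_reflexio"


source = assign_source("session_001")
use_reflexio = source == "prod_with_reflexio"
evaluation_only = not use_reflexio  # baseline sessions are graded, not learned from
print(source, evaluation_only)
```

Changing the salt reshuffles the assignment, which is useful when starting a new experiment. The assignment stays independent of user attributes only if session IDs themselves carry no such information.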

evaluation_only is not a cohort label. Use source for source-set comparison and evaluation_only=True only to control whether the request can teach Reflexio.
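
Once both arms have graded sessions, the comparison reduces to two success rates and their difference. The sketch below works on plain lists of booleans (one `is_success` value per graded session); how you join results back to their source is left to your pipeline. The outcomes are made up for illustration, and the interval uses a simple normal approximation.

```python
import math


def success_rate(outcomes):
    return sum(outcomes) / len(outcomes)


def lift_with_interval(test, baseline, z=1.96):
    """Difference in success rates with a normal-approximation interval."""
    if not test or not baseline:
        raise ValueError("both arms need at least one graded session")
    p_t, p_b = success_rate(test), success_rate(baseline)
    se = math.sqrt(p_t * (1 - p_t) / len(test) + p_b * (1 - p_b) / len(baseline))
    diff = p_t - p_b
    return diff, (diff - z * se, diff + z * se)


# Made-up outcomes for illustration only.
with_reflexio = [True] * 84 + [False] * 16
without_reflexio = [True] * 70 + [False] * 30

diff, (low, high) = lift_with_interval(with_reflexio, without_reflexio)
print(f"lift: {diff:+.1%} (95% interval {low:+.1%} to {high:+.1%})")
```

An interval that excludes zero suggests a real difference between the arms; with small cohorts the interval will be wide, so raise sampling_rate or collect more sessions before drawing conclusions.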

Publish Shadow Responses

For per-turn comparison, publish the response served to the user in content and the alternate response in shadow_content. Reflexio judges the two responses against the same user message and stores per-turn shadow comparison verdicts.

from reflexio import InteractionData, ReflexioClient

client = ReflexioClient()

regular_response = run_agent_with_reflexio(user_message)
shadow_response = run_agent_without_reflexio(user_message)

client.publish_interaction(
    user_id=user_id,
    session_id=session_id,
    source="prod_with_reflexio",
    agent_version="v2.1.0",
    interactions=[
        InteractionData(role="User", content=user_message),
        InteractionData(
            role="Agent",
            content=regular_response,
            shadow_content=shadow_response,
        ),
    ],
)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/publish_interaction" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "user_id": "<user_id>",
  "session_id": "<session_id>",
  "source": "prod_with_reflexio",
  "agent_version": "v2.1.0",
  "interaction_data_list": [
    {
      "role": "User",
      "content": "<user_message>"
    },
    {
      "role": "Agent",
      "content": "<run_agent_with_reflexio_result>",
      "shadow_content": "<run_agent_without_reflexio_result>"
    }
  ]
}
JSON

Shadow comparison is per turn. Agent success evaluation is still session-level.

Grade or Regenerate Explicitly

Automatic evaluation waits for the session inactivity delay. Two API surfaces are available when you need explicit evaluation work:

Mode	API surface	Behavior
On-demand grade (single session)	`POST /api/evaluations/grade_on_demand`. Best for UI actions, ops tools, and launch checklists.	Grades one session now. Skips the inactivity delay and already-evaluated checks; caches results for 24 hours per session/version.
Regenerate window (time range)	`POST /api/evaluations/regenerate`, `GET /api/evaluations/regenerate/{job_id}`. Best for historical re-scoring after rubric, model, or prompt changes.	Replays evaluation across a time window. Returns a job id, then exposes progress and completion status through polling.

Example on-demand grade:

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/evaluations/grade_on_demand" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "session_id": "session_001",
  "agent_version": "v2.1.0"
}
JSON

Example regenerate job:

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/evaluations/regenerate" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "from_ts": 1764547200,
  "to_ts": 1765152000
}
JSON
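
The from_ts and to_ts values in this request look like Unix epoch seconds; decoded as UTC they correspond to 2025-12-01 and 2025-12-08. Assuming that is the expected unit, a small helper (hypothetical, not part of the SDK) can build the window from dates:

```python
from datetime import datetime, timedelta, timezone


def regenerate_window(start: datetime, days: int) -> dict:
    """Build a from_ts/to_ts payload, assuming epoch seconds in UTC."""
    if start.tzinfo is None:
        start = start.replace(tzinfo=timezone.utc)  # treat naive input as UTC
    end = start + timedelta(days=days)
    return {"from_ts": int(start.timestamp()), "to_ts": int(end.timestamp())}


payload = regenerate_window(datetime(2025, 12, 1, tzinfo=timezone.utc), days=7)
print(payload)  # {'from_ts': 1764547200, 'to_ts': 1765152000}
```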

See the Evaluation Models reference for job status and response shapes.
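
Polling the job can follow a generic wait loop. The sketch below takes the status fetch as a callable so it stays independent of any HTTP client. The "status" key and the "running", "completed", and "failed" values are hypothetical placeholders; substitute the real field and values from the Evaluation Models reference.

```python
import time
from typing import Callable

# Hypothetical terminal states; replace with the documented status values.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the job reaches a terminal state or times out."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still running after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Stand-in for GET /api/evaluations/regenerate/{job_id}: canned responses.
responses = iter(
    [{"status": "running"}, {"status": "running"}, {"status": "completed"}]
)
result = wait_for_job(lambda: next(responses), interval_s=0.0, sleep=lambda _: None)
print(result["status"])  # completed
```

In real use, `fetch_status` would wrap the GET request for your job id and return the parsed JSON body.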

Read Results

Use get_agent_success_evaluation_results for raw session-level results.

from reflexio import ReflexioClient

client = ReflexioClient()

response = client.get_agent_success_evaluation_results(
    agent_version="v2.1.0",
    limit=100,
)

results = response.agent_success_evaluation_results
total = len(results)
successful = sum(1 for result in results if result.is_success)
success_rate = successful / total if total else 0

print(f"Success rate: {success_rate:.1%} ({successful}/{total})")

for result in results:
    if not result.is_success:
        print(result.session_id, result.failure_type, result.failure_reason)

curl -X POST "${REFLEXIO_URL:-https://www.reflexio.ai}/api/get_agent_success_evaluation_results" \
  -H "User-Agent: my-agent-reflexio" \
  -H "Authorization: Bearer $REFLEXIO_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<'JSON'
{
  "agent_version": "v2.1.0",
  "limit": 100
}
JSON

Result fields include:

is_success
failure_type
failure_reason
number_of_correction_per_session
user_turns_to_resolution
is_escalated
agent_version
user_id
session_id

Choose the Right Evaluation Path

Always-on (sampling_rate)

Production monitoring

Continuously monitor production quality with normal publish traffic.

Path: Normal publish
Use when: You want a steady quality signal while controlling evaluator cost.

Release audit (agent_version)

Launch or migration check

Temporarily raise evaluation coverage and compare results by version.

Path: Raise sampling, then filter by version
Use when: You are validating a prompt, model, or agent release.

Cohort lift (source + evaluation_only)

Reflexio vs no Reflexio

Publish the no-Reflexio baseline and Reflexio-enabled test arm for comparison.

Path: Baseline source-set comparison
Use when: You need to measure Reflexio's impact across matched traffic.

Baseline arm (evaluation_only=True)

Grade without learning

Publish no-Reflexio baseline traffic while excluding it from learning.

Path: Evaluation-only baseline publish
Use when: You need baseline sessions in evaluation metrics but not in Reflexio memory.

Turn A/B (shadow_content)

Compare two responses

Attach the alternate response to the same user turn for per-turn judging.

Path: Shadow response comparison
Use when: You want to compare served and alternate answers side by side.

Backfill (regenerate)

Re-score historical sessions

Start a regenerate job after changing the rubric or evaluation window.

Path: Regenerate job
Use when: You need fresh scores for existing sessions.

Immediate (grade_on_demand)

Grade one session now

Trigger a single-session grade from the UI or an operations tool.

Path: Grade on demand
Use when: You need a result before the inactivity delay completes.