# Agent Tracing

Capture agent conversations and tool activity so they can be evaluated with `AgentEvaluation`.
## Trace contents
Floeval's tracing helpers turn agent execution into an `AgentTrace` with:
- ordered messages
- optional tool calls and tool results
- a final user-facing response
- optional metadata collected during execution
The resulting trace can then be:
- saved as a full agent dataset and scored later
- collected on the fly from a Python callable
- wrapped around LangChain or LangGraph style agents
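As a rough mental model, such a trace can be pictured as a small container of ordered messages plus a final response. The sketch below is illustrative only; the field names are hypothetical and the real schema lives in `floeval.config.schemas.io.agent_dataset`:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical sketch of a trace container; field names are illustrative,
# not Floeval's actual AgentTrace schema.
@dataclass
class Message:
    role: str                                             # "human", "ai", or "tool"
    content: str
    tool_calls: list[dict] = field(default_factory=list)  # only on "ai" turns
    tool_name: Optional[str] = None                       # only on "tool" turns

@dataclass
class Trace:
    messages: list[Message] = field(default_factory=list)
    final_response: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

trace = Trace(
    messages=[
        Message(role="human", content="What's the weather?"),
        Message(role="ai", content="18C and rainy."),
    ],
    final_response="18C and rainy.",
)
```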
## Public API

Import tracing helpers from `floeval.utils.agent_trace`:
```python
from floeval.utils.agent_trace import (
    capture_trace,
    create_span,
    log_tool_result,
    log_turn,
    wrap_langchain_agent,
)
```
### Core helpers

| Helper | Purpose |
|---|---|
| `capture_trace` | Decorate an agent function so Floeval creates a trace context automatically |
| `log_turn` | Record an AI response turn inside the active trace |
| `log_tool_result` | Record tool output inside the active trace |
| `create_span` | Alternate helper that records an output turn in a DeepEval-style shape |
| `wrap_langchain_agent` | Adapt a LangChain or LangGraph agent to Floeval's callable interface |

For advanced control, `TraceCollector` is also available.
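As a mental model for how a decorator plus log helpers can share state, the sketch below uses a context variable to hold the active trace. This is an illustrative stand-in, not Floeval's implementation; `capture_trace_sketch` and `log_turn_sketch` are hypothetical names:

```python
import contextvars
from functools import wraps

# A context variable holds the active trace so log helpers can find it
# without the agent passing it around explicitly.
_active_trace = contextvars.ContextVar("_active_trace", default=None)

def capture_trace_sketch(fn):
    """Illustrative stand-in for a capture_trace-style decorator."""
    @wraps(fn)
    def wrapper(user_input):
        token = _active_trace.set([{"role": "human", "content": user_input}])
        try:
            result = fn(user_input)
            return {"messages": _active_trace.get(), "final_response": result}
        finally:
            _active_trace.reset(token)  # always tear down the trace context
    return wrapper

def log_turn_sketch(content):
    """Append an AI turn to the active trace; fails outside a trace context."""
    trace = _active_trace.get()
    if trace is None:
        raise RuntimeError("log_turn called outside an active trace context")
    trace.append({"role": "ai", "content": content})

@capture_trace_sketch
def echo_agent(user_input):
    reply = f"You said: {user_input}"
    log_turn_sketch(reply)
    return reply

result = echo_agent("hi")
```

The `try`/`finally` around the context token is what makes nested or repeated calls safe: each invocation gets a fresh trace and restores the previous one on exit.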
## Basic Python callable

This pattern is suitable for Mode 2 agent evaluation.
```python
from floeval.utils.agent_trace import capture_trace, log_tool_result, log_turn

@capture_trace
def weather_agent(user_input: str) -> str:
    tool_output = "Paris: 18C, light rain"
    log_tool_result("weather_lookup", tool_output)
    response = f"Current weather for Paris: {tool_output}"
    log_turn(response)
    return response
```
`capture_trace` creates the trace context, `log_tool_result(...)` records a tool message, and `log_turn(...)` records the agent response; the returned string becomes the trace's final response.

You can pass this function directly into `AgentEvaluation(agent=...)`.
## Async agent

`capture_trace` also supports async agents:
```python
from floeval.utils.agent_trace import capture_trace, log_turn

@capture_trace
async def async_support_agent(user_input: str) -> str:
    response = f"Handled asynchronously: {user_input}"
    log_turn(response)
    return response
```
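The reason this pattern extends naturally to async code is that Python context variables propagate across `await` boundaries. A self-contained sketch (not Floeval's code; `traced` is a hypothetical helper):

```python
import asyncio
import contextvars

# A context variable set before awaiting is still visible inside the awaited
# coroutine, so a trace context survives async agent calls.
_turns = contextvars.ContextVar("_turns", default=None)

async def traced(agent, user_input):
    token = _turns.set([])
    try:
        response = await agent(user_input)
        return {"turns": _turns.get(), "final_response": response}
    finally:
        _turns.reset(token)

async def async_agent(user_input):
    await asyncio.sleep(0)  # simulate real async work
    reply = f"Handled asynchronously: {user_input}"
    _turns.get().append(reply)  # log a turn into the active trace
    return reply

result = asyncio.run(traced(async_agent, "ping"))
```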
## LangChain or LangGraph agents
If your agent expects a messages-style interface, wrap it before evaluation:
```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.utils.agent_trace import wrap_langchain_agent

wrapped_agent = wrap_langchain_agent(langchain_agent)

evaluation = AgentEvaluation(
    dataset=AgentDataset.from_file("partial_agent_dataset.json"),
    agent=wrapped_agent,
    llm_config=llm_config,
    metrics=["goal_achievement"],
)
```
The wrapper adapts:
- string input into the `{"messages": [...]}` shape many agent frameworks expect
- agent output back into a string result that Floeval can trace

`TraceCollector` also tries to attach LangChain callbacks automatically when the wrapped agent exposes `invoke(...)` or `ainvoke(...)`.
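The two-way adaptation described above can be sketched as a plain closure. This is a hypothetical helper (`wrap_messages_agent`), not `wrap_langchain_agent` itself, and it assumes the wrapped agent returns its full message list:

```python
# Illustrative sketch of the input/output adaptation: string in,
# {"messages": [...]} to the agent, last AI message content back out.
def wrap_messages_agent(agent_invoke):
    def wrapped(user_input: str) -> str:
        result = agent_invoke(
            {"messages": [{"role": "user", "content": user_input}]}
        )
        # Many frameworks return the full message list; take the last message.
        last = result["messages"][-1]
        return last["content"] if isinstance(last, dict) else str(last)
    return wrapped

# Stand-in for a LangChain/LangGraph-style agent that echoes the request.
def fake_agent(payload):
    question = payload["messages"][-1]["content"]
    return {
        "messages": payload["messages"]
        + [{"role": "assistant", "content": f"Answer to: {question}"}]
    }

wrapped = wrap_messages_agent(fake_agent)
```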
## Returning AgentTrace directly

If your agent already produces a full trace, you can return `AgentTrace` yourself instead of relying on helper logging.
```python
from floeval.config.schemas.io.agent_dataset import AgentTrace

def simple_agent(user_input: str) -> AgentTrace:
    return AgentTrace.from_simple_response(
        user_input=user_input,
        response="Done.",
        source="custom-runtime",
    )
```
This works with `TraceCollector` and `AgentEvaluation` as long as the trace is valid.
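As a mental model, a convenience constructor like this typically expands a single request/response pair into a minimal two-message trace. The function below is a hypothetical stand-in, not `AgentTrace.from_simple_response`:

```python
# Hypothetical sketch of what a from_simple_response-style constructor
# might build: a two-message trace with optional source metadata.
def from_simple_response(user_input, response, source=None):
    return {
        "messages": [
            {"role": "human", "content": user_input},
            {"role": "ai", "content": response},
        ],
        "final_response": response,
        "metadata": {"source": source} if source else {},
    }

sketch_trace = from_simple_response("Do the thing", "Done.", source="custom-runtime")
```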
## Evaluation workflows
Two common workflows are supported:
### Collect traces during evaluation

Use a partial agent dataset and pass `agent=` to `AgentEvaluation`. Floeval runs the agent, captures traces, and evaluates the resulting full samples.
### Capture first, score later

Run your agent separately, persist the resulting traces into an agent dataset file, then evaluate that file later with Mode 1.
This approach is useful when:
- trace capture is expensive
- you want reproducible evaluations
- you want to inspect traces manually before scoring
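The capture-first flow amounts to serializing traces now and reloading them for scoring later. A minimal sketch using plain JSON; the file layout here is illustrative, not Floeval's exact dataset format:

```python
import json
import tempfile
from pathlib import Path

# Capture phase: persist traces as a dataset file for later scoring.
traces = [
    {
        "trace": {
            "messages": [
                {"role": "human", "content": "Hi"},
                {"role": "ai", "content": "Hello!"},
            ],
            "final_response": "Hello!",
        }
    },
]

path = Path(tempfile.mkdtemp()) / "agent_dataset.json"
path.write_text(json.dumps({"samples": traces}, indent=2))

# Scoring phase (possibly much later): reload the persisted dataset.
loaded = json.loads(path.read_text())
```

Because the dataset is a plain file, it can be versioned and inspected manually before scoring, which is exactly what makes this workflow reproducible.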
## Trace shape
Full trace datasets use this structure:
```json
{
  "trace": {
    "messages": [
      {"role": "human", "content": "User request"},
      {
        "role": "ai",
        "content": "I'll look that up.",
        "tool_calls": [{"name": "search", "args": {"q": "User request"}}]
      },
      {
        "role": "tool",
        "tool_name": "search",
        "content": "Tool result"
      },
      {"role": "ai", "content": "Final answer"}
    ],
    "final_response": "Final answer"
  }
}
```
Roles in saved datasets should be:
- `human`
- `ai`
- `tool`
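A quick sanity check for saved traces can enforce the role constraint above. `check_trace` is an illustrative helper, not Floeval's validator:

```python
# Minimal validity check for the saved trace shape: non-empty messages,
# each using one of the allowed roles.
ALLOWED_ROLES = {"human", "ai", "tool"}

def check_trace(trace: dict) -> bool:
    messages = trace.get("messages", [])
    return bool(messages) and all(
        m.get("role") in ALLOWED_ROLES for m in messages
    )

sample = {
    "messages": [
        {"role": "human", "content": "User request"},
        {"role": "ai", "content": "Final answer"},
    ],
    "final_response": "Final answer",
}
```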
## Notes
- Prefer `@capture_trace` for custom Python agents. It is the simplest path.
- Use `log_turn()` only inside an active trace context.
- Use `reference_tool_calls` in your dataset when you want to score `ragas:tool_call_accuracy`.
- Use `reference_outcome` when you want stronger outcome-based scoring such as `goal_achievement` or `ragas:agent_goal_accuracy`.
- If your agent already exposes a stable trace format, returning `AgentTrace` directly can be cleaner than manual logging.