# Agentic Workflow Evaluation
Use this guide to evaluate pipelines in which multiple agents collaborate in a directed acyclic graph (DAG). Each node in the DAG is an independent agent; nodes execute in topological order, with dependencies resolved automatically.
## When to use agentic workflow vs single-agent evaluation
| Use case | Evaluation type |
|---|---|
| One agent answers questions | Agent Evaluation |
| Multiple agents collaborate in sequence or parallel | Agentic Workflow |
| Each agent has a specific role (retrieval, reasoning, formatting) | Agentic Workflow |
## Prerequisites

Agentic workflow evaluation requires the optional FloTorch integration, installed with `pip install "floeval[flotorch]"`. This installs google-adk and related packages needed by `WorkflowRunner`.
## How it works

- Floeval loads the DAG config defining nodes and edges.
- For each sample in the dataset, `WorkflowRunner` executes the DAG.
- Each agent node runs in order; its output feeds into downstream nodes.
- After all nodes complete, the final output is captured as an `AgentSample` with:
  - `trace`: the combined last-node trace
  - `agent_traces`: one `AgentTrace` per completed agent node
  - `metadata.workflow_execution`: per-agent summaries
- `AgentEvaluation` scores each completed `AgentSample` using the configured metrics.
## DAG config structure

The DAG is defined as a JSON object with `uid`, `name`, `nodes`, and `edges`.
```json
{
  "uid": "eval-workflow-001",
  "name": "Support Triage Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "classify", "type": "AGENT", "label": "Classify", "agentName": "classifier-agent:latest"},
    {"id": "respond", "type": "AGENT", "label": "Respond", "agentName": "responder-agent:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "classify"},
    {"sourceNodeId": "classify", "targetNodeId": "respond"},
    {"sourceNodeId": "respond", "targetNodeId": "end"}
  ]
}
```
### Node types

| Type | Description |
|---|---|
| `START` | Entry point; receives the user input |
| `AGENT` | An agent that processes the incoming data; must have `agentName` |
| `END` | Marks the end of the workflow |
### Parallel nodes
To run two agents in parallel, give them both the same upstream node as a parent:
```json
{
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "retriever"},
    {"sourceNodeId": "start", "targetNodeId": "classifier"},
    {"sourceNodeId": "retriever", "targetNodeId": "synthesizer"},
    {"sourceNodeId": "classifier", "targetNodeId": "synthesizer"},
    {"sourceNodeId": "synthesizer", "targetNodeId": "end"}
  ]
}
```
`retriever` and `classifier` run in parallel; `synthesizer` runs after both complete.
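The execution order implied by an edge list can be sketched with a quick topological layering (Kahn's algorithm); `execution_layers` is a hypothetical helper for illustration, not a Floeval API. Nodes in the same layer have no mutual dependencies, so their agents may run in parallel:

```python
from collections import defaultdict

def execution_layers(edges):
    """Group node ids into layers; nodes in the same layer have no
    mutual dependencies and may run in parallel (Kahn's algorithm)."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for e in edges:
        src, dst = e["sourceNodeId"], e["targetNodeId"]
        children[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))
    layer = sorted(n for n in nodes if indegree[n] == 0)
    layers = []
    while layer:
        layers.append(layer)
        nxt = []
        for n in layer:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        layer = sorted(nxt)
    return layers

edges = [
    {"sourceNodeId": "start", "targetNodeId": "retriever"},
    {"sourceNodeId": "start", "targetNodeId": "classifier"},
    {"sourceNodeId": "retriever", "targetNodeId": "synthesizer"},
    {"sourceNodeId": "classifier", "targetNodeId": "synthesizer"},
    {"sourceNodeId": "synthesizer", "targetNodeId": "end"},
]
print(execution_layers(edges))
# → [['start'], ['classifier', 'retriever'], ['synthesizer'], ['end']]
```

For the parallel example above, `retriever` and `classifier` land in the same layer, confirming they can execute concurrently.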
## CLI workflow

### Config
```yaml
# workflow_config.yaml
llm_config:
  base_url: "https://gateway.example/openai/v1"
  api_key: "your-gateway-key"
  chat_model: "gpt-4o-mini"
evaluation_config:
  metrics:
    - "goal_achievement"
    - "ragas:agent_goal_accuracy"
agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "eval-workflow-001"
    name: "Support Triage Workflow"
    nodes:
      - {id: "start", type: "START", label: "Start"}
      - {id: "classify", type: "AGENT", label: "Classify", agentName: "classifier-agent:latest"}
      - {id: "respond", type: "AGENT", label: "Respond", agentName: "responder-agent:latest"}
      - {id: "end", type: "END", label: "End"}
    edges:
      - {sourceNodeId: "start", targetNodeId: "classify"}
      - {sourceNodeId: "classify", targetNodeId: "respond"}
      - {sourceNodeId: "respond", targetNodeId: "end"}
```
### Command
### Dataset

Use a standard agent dataset. Partial samples (no `trace`) work best because Floeval will execute the workflow and fill in the traces.

#### Partial agent dataset (recommended)
```json
{
  "samples": [
    {
      "user_input": "My order has not arrived after two weeks.",
      "reference_outcome": "An apology and a case escalation to the shipping team."
    },
    {
      "user_input": "How do I reset my password?",
      "reference_outcome": "Step-by-step password reset instructions."
    }
  ]
}
```
Dataset field aliases are supported: `question` maps to `user_input`, and `answer` maps to `reference_outcome`.
#### JSON array format

The dataset can also be a JSON array at the root level (no `samples` wrapper):
```json
[
  {"question": "My order has not arrived.", "answer": "An apology and escalation."},
  {"question": "How do I reset my password?", "answer": "Step-by-step instructions."}
]
```
#### JSONL format

```jsonl
{"user_input": "My order has not arrived.", "reference_outcome": "An apology and escalation."}
{"user_input": "How do I reset my password?", "reference_outcome": "Step-by-step instructions."}
```
## Python API
```python
import json

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.example/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

with open("workflow_config.json") as f:
    dag_config = json.load(f)

runner = WorkflowRunner(
    dag_config=dag_config,
    llm_config=llm_config,
    app_name="floeval-workflow",
)

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="My order has not arrived after two weeks.",
            reference_outcome="An apology and a case escalation to the shipping team.",
        ),
        PartialAgentSample(
            user_input="How do I reset my password?",
            reference_outcome="Step-by-step password reset instructions.",
        ),
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=[
        "goal_achievement",
        "response_coherence",
        "ragas:agent_goal_accuracy",
    ],
)

results = evaluation.run()
print(results.summary)
```
### Load dataset from file
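A minimal sketch, assuming the dataset file uses one of the JSON formats shown above. `load_partial_samples` is a hypothetical helper, not a Floeval API; it normalizes the `question`/`answer` aliases into plain dicts that can then be wrapped in `PartialAgentSample` objects:

```python
import json

# Hypothetical helper (not part of Floeval): load a dataset file in either
# the {"samples": [...]} format or the root-level-array format, and map the
# question/answer aliases onto user_input/reference_outcome.
def load_partial_samples(path):
    with open(path) as f:
        data = json.load(f)
    rows = data["samples"] if isinstance(data, dict) else data
    return [
        {
            "user_input": row.get("user_input", row.get("question")),
            "reference_outcome": row.get("reference_outcome", row.get("answer")),
        }
        for row in rows
    ]
```

Each resulting dict can be expanded into the constructor from the example above, e.g. `PartialAgentSample(**sample)`, before building the `AgentDataset`.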
### Async execution
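`WorkflowRunner.run_on_dataset` is a coroutine (note the `await` in the reference below), so from synchronous code it can be driven with `asyncio.run`. The sketch uses a stand-in runner so it executes without a gateway; substitute a real `WorkflowRunner` instance in practice:

```python
import asyncio

# Stand-in for WorkflowRunner, used only so this snippet runs offline.
# The real runner executes the DAG once per sample and returns full
# AgentSamples with traces attached.
class StandInRunner:
    async def run_on_dataset(self, samples):
        return [dict(s, trace="<filled by workflow>") for s in samples]

async def main():
    runner = StandInRunner()  # replace with WorkflowRunner(...)
    partials = [{"user_input": "How do I reset my password?"}]
    return await runner.run_on_dataset(partials)

full_samples = asyncio.run(main())
print(full_samples[0]["trace"])
```

Inside an already-running event loop (e.g. a notebook), `await runner.run_on_dataset(...)` directly instead of calling `asyncio.run`.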
## WorkflowRunner reference

```python
from floeval.flotorch import WorkflowRunner

runner = WorkflowRunner(
    dag_config=dag_config,        # dict: the DAG JSON object
    llm_config=llm_config,        # OpenAIProviderConfig
    app_name="floeval-workflow",  # optional label shown in logs
)

# Run the workflow on a list of partial samples (here just one);
# run_on_dataset is a coroutine and must be awaited.
full_samples = await runner.run_on_dataset([partial_sample])
```
| Parameter | Required | Description |
|---|---|---|
| `dag_config` | Yes | DAG definition dict with `uid`, `name`, `nodes`, `edges` |
| `llm_config` | Yes | LLM credentials for each agent in the DAG |
| `app_name` | No | Log label; defaults to `"floeval-workflow"` |
## Metrics

Agentic workflow evaluation supports all four agent metrics:

| Provider | Metric | Notes |
|---|---|---|
| builtin | `goal_achievement` | LLM judge; scores whether the workflow's final output satisfies the user request |
| builtin | `response_coherence` | Checks whether the final response is coherent with the full multi-agent trace |
| ragas | `agent_goal_accuracy` | Compares the workflow output against `reference_outcome` |
| ragas | `tool_call_accuracy` | Checks tool calls against `reference_tool_calls`; skips samples that lack `reference_tool_calls` |
## Results

```python
results = evaluation.run()

# Per-sample results
for row in results.sample_results:
    print(row["user_input"])
    print(row["final_response"])
    print(row["metrics"])
    # Each metric: {score, passed, reason, metadata}

    # Agent traces per node
    agent_traces = row.get("agent_traces", [])
    for trace in agent_traces:
        print(trace)

    # Per-agent summaries
    workflow_exec = row.get("workflow_execution", {})

# Aggregate summary (mean scores per metric)
print(results.summary)
```
The `agent_traces` list in each sample contains one `AgentTrace` per completed `AGENT` node, in execution order. The `workflow_execution` dict contains per-agent `input`, `output`, `tool_calls`, `turn_count`, and `tool_call_count`.
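An illustrative sketch of a `workflow_execution` entry for the triage example. The values and the keyed-by-node-id nesting are invented for illustration; only the per-agent key names come from the description above, and the exact shape may differ:

```json
{
  "classify": {
    "input": "My order has not arrived after two weeks.",
    "output": "category: shipping_delay",
    "tool_calls": [],
    "turn_count": 1,
    "tool_call_count": 0
  },
  "respond": {
    "input": "category: shipping_delay",
    "output": "I'm sorry for the delay. I've escalated your case to the shipping team.",
    "tool_calls": [{"name": "escalate_case", "args": {"team": "shipping"}}],
    "turn_count": 1,
    "tool_call_count": 1
  }
}
```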
## Common pitfalls

| Problem | Cause | Fix |
|---|---|---|
| `FloTorch extras required` error | `floeval[flotorch]` not installed | `pip install "floeval[flotorch]"` |
| Agent not found | `agentName` in DAG node does not exist in the gateway | Check the agent name and your gateway connection |
| Workflow hangs | Circular edge in DAG | Check the `edges` list for cycles |
| Empty `agent_traces` | DAG has no `AGENT` nodes | Ensure at least one `AGENT` type node is in `nodes` |
| `tool_call_accuracy` is null | Samples have no `reference_tool_calls` | Either add `reference_tool_calls` to samples or omit this metric |
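Cycles, missing `AGENT` nodes, and missing `agentName` fields can all be caught before running the workflow. A minimal pre-flight check over the DAG config dict shown earlier (`validate_dag` is a hypothetical helper, not a Floeval API):

```python
def validate_dag(config):
    """Return a list of problems found in a DAG config dict
    (hypothetical pre-flight helper, not part of Floeval)."""
    problems = []
    nodes = config.get("nodes", [])
    edges = config.get("edges", [])

    agent_nodes = [n for n in nodes if n.get("type") == "AGENT"]
    if not agent_nodes:
        problems.append("no AGENT nodes: agent_traces will be empty")
    for n in agent_nodes:
        if not n.get("agentName"):
            problems.append(f"AGENT node {n.get('id')!r} is missing agentName")

    # Cycle check: repeatedly remove nodes with no incoming edges from the
    # remaining set; any survivors form a cycle and would hang the workflow.
    remaining = {n["id"] for n in nodes}
    deps = {(e["sourceNodeId"], e["targetNodeId"]) for e in edges}
    while True:
        free = {n for n in remaining
                if not any(src in remaining and dst == n for src, dst in deps)}
        if not free:
            break
        remaining -= free
    if remaining:
        problems.append(f"cycle involving nodes: {sorted(remaining)}")
    return problems
```

Running this on a config before handing it to `WorkflowRunner` turns a hang or an empty-trace run into an immediate, readable error list.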
## Related references

- Agent Evaluation for single-agent evaluation
- Agent Tracing for trace capture when building your own agent runner
- Metrics for the full metric catalog
- API Reference for `WorkflowRunner`, `AgentEvaluation`, and config shapes