# API Reference
Reference for Floeval's user-facing config, CLI, and Python API surface.
## Public Python entry points

Top-level imports from floeval:

```python
from floeval import Evaluation, DatasetLoader
```

Additional public imports commonly used:

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.api.metrics.custom import custom_metric, criteria
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import capture_trace, log_turn, log_tool_result, wrap_langchain_agent

# Agentic workflow (requires floeval[flotorch])
from floeval.flotorch import WorkflowRunner, create_flotorch_runner
```
## llm_config

Use OpenAIProviderConfig for OpenAI-compatible providers.

```python
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    chat_endpoint="chat/completions",
    embedding_model="text-embedding-3-small",
    embedding_endpoint="embeddings",
    system_prompt="You are a helpful assistant.",  # optional
    extra_kwargs=None,  # optional
)
```
### Fields

| Field | Required | Notes |
|---|---|---|
| `provider_type` | No | Defaults to `"openai"` |
| `base_url` | Yes | Base API URL |
| `api_key` | Yes | Provider credential |
| `chat_model` | Yes | Model used for chat/completions |
| `chat_endpoint` | No | Defaults to `chat/completions` |
| `embedding_model` | No | Optional in the schema, but commonly needed for provider-backed evaluation metrics |
| `embedding_endpoint` | No | Optional embedding endpoint |
| `system_prompt` | No | Used during response generation when provided |
| `extra_kwargs` | No | Provider-specific keyword args |
## Config file structure

Floeval's CLI reads YAML, YML, or JSON config files.

### evaluation_config

| Field | Required | Notes |
|---|---|---|
| `metrics` | Yes | List of metric specs |
| `default_provider` | No | Used when metric IDs are ambiguous |
| `metric_params` | No | Mapping of metric name or `provider:metric` to params |
| `prompts_file` | No | Prompt file used for prompt-aware partial generation |
| `agent_name` | No | Required for CLI partial agent evaluation |
| `dataset_generator_model` | No | Fallback location for the partial-dataset generation model |
### dataset_generation_config

| Field | Required | Notes |
|---|---|---|
| `generator_model` | Yes for generation flows | Model used to populate missing `llm_response` values |
| `batch_size` | No | Number of samples per generation batch; default 20 |
| `max_concurrency` | No | Max concurrent async generation calls; default 10 |
### agent_workflow_config

Use this section instead of `agent_name` when running agentic workflow evaluation.

| Field | Required | Notes |
|---|---|---|
| `dataset_url` | Yes | URL or path to the agent dataset file |
| `config` | Yes | The DAG config object with `uid`, `name`, `nodes`, `edges` |
#### DAG config shape

| Field | Type | Description |
|---|---|---|
| `uid` | string | Unique workflow ID |
| `name` | string | Human-readable workflow name |
| `nodes` | list | Each node: `{id, type, label}`, plus `agentName` for AGENT nodes |
| `edges` | list | Each edge: `{sourceNodeId, targetNodeId}` |

Node types: START, AGENT, END.
```yaml
agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "workflow-001"
    name: "Triage Workflow"
    nodes:
      - {id: "start", type: "START", label: "Start"}
      - {id: "agent_a", type: "AGENT", label: "Agent A", agentName: "my-agent:latest"}
      - {id: "end", type: "END", label: "End"}
    edges:
      - {sourceNodeId: "start", targetNodeId: "agent_a"}
      - {sourceNodeId: "agent_a", targetNodeId: "end"}
```
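The DAG shape above can be sanity-checked before submitting a config. This is a minimal sketch in plain Python; `validate_dag` is a hypothetical helper, not part of floeval's API:

```python
# Hypothetical helper: check a {uid, name, nodes, edges} DAG config for
# the structural rules described above (START/END present, AGENT nodes
# carry agentName, edges reference declared node ids).
def validate_dag(config: dict) -> list:
    problems = []
    node_ids = {n["id"] for n in config.get("nodes", [])}
    node_types = {n["type"] for n in config.get("nodes", [])}
    for required in ("START", "END"):
        if required not in node_types:
            problems.append(f"missing {required} node")
    for n in config.get("nodes", []):
        if n["type"] == "AGENT" and "agentName" not in n:
            problems.append(f"AGENT node {n['id']!r} lacks agentName")
    for e in config.get("edges", []):
        for key in ("sourceNodeId", "targetNodeId"):
            if e.get(key) not in node_ids:
                problems.append(f"edge references unknown node {e.get(key)!r}")
    return problems

dag = {
    "uid": "workflow-001",
    "name": "Triage Workflow",
    "nodes": [
        {"id": "start", "type": "START", "label": "Start"},
        {"id": "agent_a", "type": "AGENT", "label": "Agent A", "agentName": "my-agent:latest"},
        {"id": "end", "type": "END", "label": "End"},
    ],
    "edges": [
        {"sourceNodeId": "start", "targetNodeId": "agent_a"},
        {"sourceNodeId": "agent_a", "targetNodeId": "end"},
    ],
}
print(validate_dag(dag))  # []
```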
## Example config

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  chat_endpoint: "chat/completions"
  embedding_model: "text-embedding-3-small"
  embedding_endpoint: "embeddings"
  system_prompt: "You are a concise assistant."

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
    - "faithfulness"
  metric_params:
    answer_relevancy:
      threshold: 0.7
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"
  batch_size: 20
  max_concurrency: 10
```
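Since the CLI also accepts JSON config files, the same example can be written as:

```json
{
  "llm_config": {
    "base_url": "https://api.openai.com/v1",
    "api_key": "your-api-key",
    "chat_model": "gpt-4o-mini",
    "chat_endpoint": "chat/completions",
    "embedding_model": "text-embedding-3-small",
    "embedding_endpoint": "embeddings",
    "system_prompt": "You are a concise assistant."
  },
  "evaluation_config": {
    "default_provider": "ragas",
    "metrics": ["answer_relevancy", "faithfulness"],
    "metric_params": {"answer_relevancy": {"threshold": 0.7}},
    "prompts_file": "prompts.yaml"
  },
  "dataset_generation_config": {
    "generator_model": "gpt-4o-mini",
    "batch_size": 20,
    "max_concurrency": 10
  }
}
```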
## CLI reference

```shell
floeval --version
```

### floeval evaluate

| Option | Required | Notes |
|---|---|---|
| `-c, --config` | Yes | YAML, YML, or JSON config file |
| `-d, --dataset` | Yes | `.json` or `.jsonl` dataset |
| `-o, --output` | No | Saves results to JSON |
| `--agent` | No | Switches into agent-evaluation mode |
Behavior:

- Standard mode auto-detects partial datasets by checking whether samples are missing `llm_response`.
- Agent mode loads agent datasets and uses `AgentEvaluation`.
- Partial agent datasets in CLI mode require `evaluation_config.agent_name`.
### floeval generate

| Option | Required | Notes |
|---|---|---|
| `-c, --config` | Yes | Must include `llm_config` and `dataset_generation_config` |
| `-d, --dataset` | Yes | Partial standard dataset in `.json` or `.jsonl` |
| `-o, --output` | Yes | Output must use `.json` or `.jsonl` |
Behavior:

- Fills missing `llm_response` values.
- Supports prompt-driven expansion when samples contain `prompt_ids`.
- Chooses the export format from the output file extension.
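The extension-based export choice can be sketched in plain Python. `export_results` is a hypothetical helper for illustration, not floeval's implementation:

```python
# Hypothetical helper: write samples as a JSON array (.json) or one
# JSON object per line (.jsonl), chosen from the output file extension.
import json
from pathlib import Path

def export_results(samples: list, output_path: str) -> str:
    path = Path(output_path)
    if path.suffix == ".jsonl":
        text = "\n".join(json.dumps(s) for s in samples)
    elif path.suffix == ".json":
        text = json.dumps(samples, indent=2)
    else:
        raise ValueError(f"unsupported extension: {path.suffix!r}")
    path.write_text(text)
    return path.suffix
```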
## Evaluation

Use Evaluation for standard LLM and RAG evaluation.

### Constructor

| Parameter | Required | Notes |
|---|---|---|
| `dataset` | Yes | Dataset or PartialDataset |
| `metrics` | Yes | Metric specs in string, dict, or instance form |
| `default_provider` | No | Helps resolve ambiguous metric names |
| `llm_config` | No | Needed for provider-backed metrics and partial generation |
| `metric_params` | No | Shared params keyed by metric name |
| `dataset_generator_model` | No | Required when dataset is partial |
| `prompts_file` | No | Prompt file path used for prompt-aware generation |
### Example

```python
from floeval import Evaluation, DatasetLoader

dataset = DatasetLoader.from_file("dataset.json", partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy", "faithfulness"],
    metric_params={"answer_relevancy": {"threshold": 0.8}},
)

results = evaluation.run()
```

### Async
## WorkflowRunner

Use WorkflowRunner for multi-agent DAG evaluation. Requires `pip install "floeval[flotorch]"`.

```python
from floeval.flotorch import WorkflowRunner

runner = WorkflowRunner(
    dag_config=dag_config,        # dict: full DAG JSON object
    llm_config=llm_config,        # OpenAIProviderConfig
    app_name="floeval-workflow",  # optional
)
```

| Parameter | Required | Notes |
|---|---|---|
| `dag_config` | Yes | Dict with `uid`, `name`, `nodes`, `edges` |
| `llm_config` | Yes | LLM credentials for all agent nodes |
| `app_name` | No | Log label; defaults to `"floeval-workflow"` |
Pass the runner to AgentEvaluation via its `agent_runner=` parameter.
## create_flotorch_runner

Convenience factory for single-agent FloTorch evaluation:

```python
from floeval.flotorch import create_flotorch_runner

runner = create_flotorch_runner(
    agent_name="support-agent",
    llm_config=llm_config,  # optional; falls back to FLOTORCH_BASE_URL env var
)
```
## AgentEvaluation

Use AgentEvaluation for agent traces and partial agent datasets.

### Constructor

| Parameter | Required | Notes |
|---|---|---|
| `dataset` | Yes | AgentDataset |
| `metrics` | Yes | Builtin and provider-qualified agent metrics |
| `llm_config` | No | Needed for LLM-judged agent metrics |
| `agent` | No | Python callable for Mode 2 |
| `agent_runner` | No | Runner object for Mode 4 |
| `default_provider` | No | Defaults to `"builtin"` |
| `metric_params` | No | Shared params keyed by metric name |
### Example

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset

dataset = AgentDataset.from_file("agent_dataset.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "ragas:tool_call_accuracy"],
)

results = evaluation.run()
```

### Async
## DatasetLoader

Load standard evaluation datasets from files, lists, or dicts.

```python
from floeval import DatasetLoader

full_dataset = DatasetLoader.from_file("dataset.json", partial_dataset=False)
partial_dataset = DatasetLoader.from_file("partial.json", partial_dataset=True)

dataset = DatasetLoader.from_samples(
    [{"user_input": "Q?", "llm_response": "A."}],
    partial_dataset=False,
)
```
### Common methods

| Method | Returns | Notes |
|---|---|---|
| `from_file(path, partial_dataset=...)` | Dataset or PartialDataset | Supports `.json` and `.jsonl` |
| `from_json(path, partial_dataset=...)` | Dataset or PartialDataset | JSON only |
| `from_samples(samples, partial_dataset=...)` | Dataset or PartialDataset | Build from Python objects |
| `from_dict(data, partial_dataset=...)` | Dataset or PartialDataset | Build from a dict with samples |
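The file-loading behavior in the table can be sketched in plain Python. `load_samples` is a hypothetical helper for illustration, not floeval's implementation, and the `"samples"` key for dict roots is an assumption based on the `from_dict` row above:

```python
# Hypothetical helper: load raw sample dicts from .json or .jsonl files,
# dispatching on the file extension as from_file does.
import json
from pathlib import Path

def load_samples(path: str) -> list:
    p = Path(path)
    text = p.read_text()
    if p.suffix == ".jsonl":
        # one JSON object per non-blank line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if p.suffix == ".json":
        data = json.loads(text)
        # accept either a JSON array root or a dict with a "samples" key
        return data if isinstance(data, list) else data["samples"]
    raise ValueError(f"unsupported extension: {p.suffix!r}")
```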
## Dataset models

### Standard datasets

#### Sample

| Field | Notes |
|---|---|
| `user_input` | Required. Also accepted as `question` (alias). |
| `llm_response` | Required for full datasets |
| `contexts` | Optional; used by grounding and retrieval metrics |
| `ground_truth` | Optional; used by some recall and precision metrics. Also accepted as `answer` (alias). |
| `metadata` | Optional |
| `prompt_id` | Present after prompt-driven generation |
#### PartialSample

Same as Sample, but `llm_response` can be empty or omitted. Partial samples also support:

| Field | Notes |
|---|---|
| `prompt_ids` | Optional list of prompt IDs that expands one sample into multiple generated outputs |
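A minimal partial dataset in this shape might look like the following; the question text and prompt IDs are illustrative:

```json
[
  {"user_input": "What is Floeval?", "llm_response": ""},
  {"user_input": "Summarize the release notes.", "prompt_ids": ["formal", "casual"]}
]
```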
### Agent datasets

| Model | Notes |
|---|---|
| `AgentDataset` | Collection of full or partial agent samples |
| `AgentSample` | Full sample with trace |
| `PartialAgentSample` | Sample without a trace yet |
| `AgentTrace` | Trace with `messages`, `final_response`, and optional `metadata` |
Agent trace message roles in saved datasets must be `human`, `ai`, or `tool`.

Agent dataset field aliases: `question` maps to `user_input`, and `answer` maps to `reference_outcome`. Both formats are accepted in `.json`, `.jsonl`, and JSON-array-root files.
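The alias mapping just described can be sketched as a small normalization step. `normalize_sample` is a hypothetical helper for illustration, not part of floeval's API:

```python
# Hypothetical helper: rename aliased agent-dataset keys to their
# canonical names (question -> user_input, answer -> reference_outcome).
ALIASES = {"question": "user_input", "answer": "reference_outcome"}

def normalize_sample(raw: dict) -> dict:
    out = {}
    for key, value in raw.items():
        out[ALIASES.get(key, key)] = value
    return out

print(normalize_sample({"question": "Reset my password", "answer": "ticket closed"}))
# {'user_input': 'Reset my password', 'reference_outcome': 'ticket closed'}
```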
## Results objects

### EvaluationResult

`summary` includes:

- `total_samples`
- `providers_used`
- `pass_rates`
- `aggregate_scores`
### AgentEvaluationResult

Agent summaries contain per-metric averages for successful metric runs.

For agentic workflow results, each sample in `sample_results` also contains:

| Field | Notes |
|---|---|
| `agent_traces` | List of AgentTrace objects, one per completed AGENT node in the DAG |
| `workflow_execution` | Dict of per-agent summaries: `{agent_name, input, output, tool_calls, turn_count, tool_call_count}` |
## MetricRegistry

Use MetricRegistry to inspect what is currently registered.

```python
from floeval.api.metrics.registry import MetricRegistry

print(MetricRegistry.list_providers())
print(MetricRegistry.list_metrics("ragas"))
print(MetricRegistry.list_metrics("deepeval"))
print(MetricRegistry.list_metrics("builtin"))
print(MetricRegistry.list_all_metrics())
```
### Common providers

| Provider | Notes |
|---|---|
| `ragas` | Standard and agent metrics |
| `deepeval` | Standard LLM and retrieval metrics |
| `builtin` | Agent judge metrics |
| `custom` | Appears after custom metrics are defined |