API Reference

Reference for Floeval's user-facing config, CLI, and Python API surface.


Public Python entry points

Top-level imports from floeval:

from floeval import Evaluation, Dataset, DatasetLoader, MetricRegistry

Other commonly used public imports:

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.api.metrics.custom import custom_metric, criteria
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import capture_trace, log_turn, log_tool_result, wrap_langchain_agent

# Agentic workflow (requires floeval[flotorch])
from floeval.flotorch import WorkflowRunner, create_flotorch_runner

llm_config

Use OpenAIProviderConfig for OpenAI-compatible providers.

from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    chat_endpoint="chat/completions",
    embedding_model="text-embedding-3-small",
    embedding_endpoint="embeddings",
    system_prompt="You are a helpful assistant.",  # optional
    extra_kwargs=None,                             # optional
)

Fields

Field Required Notes
provider_type No Defaults to "openai"
base_url Yes Base API URL
api_key Yes Provider credential
chat_model Yes Model used for chat/completions
chat_endpoint No Defaults to chat/completions
embedding_model No Optional in the schema, but commonly needed for provider-backed evaluation metrics
embedding_endpoint No Optional embedding endpoint
system_prompt No Used during response generation when provided
extra_kwargs No Provider-specific keyword args

Config file structure

Floeval's CLI reads config files in YAML (.yaml or .yml) or JSON format.

evaluation_config

Field Required Notes
metrics Yes List of metric specs
default_provider No Used when metric IDs are ambiguous
metric_params No Mapping of metric name or provider:metric to params
prompts_file No Prompt file used for prompt-aware partial generation
agent_name No Required for CLI partial agent evaluation
dataset_generator_model No Fallback location for partial-dataset generation model
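
For example, metric_params accepts both bare metric names and provider-qualified keys; the fragment below is an illustrative sketch (metric choices and threshold values are assumptions, not defaults):

```yaml
evaluation_config:
  metrics:
    - "answer_relevancy"
    - "deepeval:faithfulness"
  metric_params:
    # bare metric name
    answer_relevancy:
      threshold: 0.7
    # provider-qualified form, useful when two providers expose the same metric name
    "deepeval:faithfulness":
      threshold: 0.8
```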

dataset_generation_config

Field Required Notes
generator_model Yes for generation flows Model used to populate missing llm_response values
batch_size No Number of samples per generation batch; default 20
max_concurrency No Max concurrent async generation calls; default 10

agent_workflow_config

Use this section instead of agent_name when running agentic workflow evaluation.

Field Required Notes
dataset_url Yes URL or path to the agent dataset file
config Yes The DAG config object with uid, name, nodes, edges

DAG config shape

Field Type Description
uid string Unique workflow ID
name string Human-readable workflow name
nodes list Each node: {id, type, label} + agentName for AGENT nodes
edges list Each edge: {sourceNodeId, targetNodeId}

Node types: START, AGENT, END.

agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "workflow-001"
    name: "Triage Workflow"
    nodes:
      - {id: "start",   type: "START", label: "Start"}
      - {id: "agent_a", type: "AGENT", label: "Agent A", agentName: "my-agent:latest"}
      - {id: "end",     type: "END",   label: "End"}
    edges:
      - {sourceNodeId: "start",   targetNodeId: "agent_a"}
      - {sourceNodeId: "agent_a", targetNodeId: "end"}
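
Before handing a DAG config to the workflow runner, a quick structural check can catch wiring mistakes. The helper below is an illustrative sketch, not part of the Floeval API; the dict mirrors the YAML example above.

```python
# Illustrative sanity check for a DAG config dict (not a Floeval API).
def check_dag_config(cfg: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the shape looks sane."""
    problems = []
    for key in ("uid", "name", "nodes", "edges"):
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")
    node_ids = {n["id"] for n in cfg.get("nodes", [])}
    types = [n["type"] for n in cfg.get("nodes", [])]
    if types.count("START") != 1:
        problems.append("expected exactly one START node")
    if types.count("END") != 1:
        problems.append("expected exactly one END node")
    for n in cfg.get("nodes", []):
        if n["type"] == "AGENT" and "agentName" not in n:
            problems.append(f"AGENT node {n['id']!r} is missing agentName")
    for e in cfg.get("edges", []):
        for end in ("sourceNodeId", "targetNodeId"):
            if e.get(end) not in node_ids:
                problems.append(f"edge references unknown node: {e.get(end)!r}")
    return problems

dag_config = {
    "uid": "workflow-001",
    "name": "Triage Workflow",
    "nodes": [
        {"id": "start", "type": "START", "label": "Start"},
        {"id": "agent_a", "type": "AGENT", "label": "Agent A", "agentName": "my-agent:latest"},
        {"id": "end", "type": "END", "label": "End"},
    ],
    "edges": [
        {"sourceNodeId": "start", "targetNodeId": "agent_a"},
        {"sourceNodeId": "agent_a", "targetNodeId": "end"},
    ],
}
assert check_dag_config(dag_config) == []
```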

Example config

llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  chat_endpoint: "chat/completions"
  embedding_model: "text-embedding-3-small"
  embedding_endpoint: "embeddings"
  system_prompt: "You are a concise assistant."

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
    - "faithfulness"
  metric_params:
    answer_relevancy:
      threshold: 0.7
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"
  batch_size: 20
  max_concurrency: 10

CLI reference

floeval --version

floeval --version

floeval evaluate

floeval evaluate -c CONFIG -d DATASET [-o OUTPUT] [--agent]
Option Required Notes
-c, --config Yes Config file (.yaml, .yml, or .json)
-d, --dataset Yes .json or .jsonl dataset
-o, --output No Saves results to JSON
--agent No Switches into agent-evaluation mode

Behavior:

  • standard mode auto-detects partial datasets by checking whether samples are missing llm_response
  • agent mode loads agent datasets and uses AgentEvaluation
  • partial agent datasets in CLI mode require evaluation_config.agent_name
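
The auto-detection rule for standard mode can be sketched as follows (an illustrative reimplementation of the documented behavior, not the actual Floeval code):

```python
# Illustrative sketch: a dataset is treated as partial when any sample
# lacks a non-empty llm_response. The real implementation may differ in detail.
def looks_partial(samples: list[dict]) -> bool:
    return any(not s.get("llm_response") for s in samples)

assert looks_partial([{"user_input": "Q?", "llm_response": "A."}]) is False
assert looks_partial([{"user_input": "Q?"}]) is True
assert looks_partial([{"user_input": "Q?", "llm_response": ""}]) is True
```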

floeval generate

floeval generate -c CONFIG -d DATASET -o OUTPUT
Option Required Notes
-c, --config Yes Must include llm_config and dataset_generation_config
-d, --dataset Yes Partial standard dataset in .json or .jsonl
-o, --output Yes Output must use .json or .jsonl

Behavior:

  • fills missing llm_response values
  • supports prompt-driven expansion when samples contain prompt_ids
  • chooses export format from the output file extension

Evaluation

Use Evaluation for standard LLM and RAG evaluation.

Constructor

Parameter Required Notes
dataset Yes Dataset or PartialDataset
metrics Yes Metric specs in string, dict, or instance form
default_provider No Helps resolve ambiguous metric names
llm_config No Needed for provider-backed metrics and partial generation
metric_params No Shared params keyed by metric name
dataset_generator_model No Required when dataset is partial
prompts_file No Prompt file path used for prompt-aware generation

Example

from floeval import Evaluation, DatasetLoader

dataset = DatasetLoader.from_file("dataset.json", partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy", "faithfulness"],
    metric_params={"answer_relevancy": {"threshold": 0.8}},
)

results = evaluation.run()

Async

results = await evaluation.arun()

WorkflowRunner

Use WorkflowRunner for multi-agent DAG evaluation. Requires pip install "floeval[flotorch]".

from floeval.flotorch import WorkflowRunner

runner = WorkflowRunner(
    dag_config=dag_config,       # dict: full DAG JSON object
    llm_config=llm_config,       # OpenAIProviderConfig
    app_name="floeval-workflow", # optional
)
Parameter Required Notes
dag_config Yes Dict with uid, name, nodes, edges
llm_config Yes LLM credentials for all agent nodes
app_name No Log label; defaults to "floeval-workflow"

Pass the runner to AgentEvaluation via the agent_runner= parameter.

create_flotorch_runner

Convenience factory for single-agent FloTorch evaluation:

from floeval.flotorch import create_flotorch_runner

runner = create_flotorch_runner(
    agent_name="support-agent",
    llm_config=llm_config,   # optional; falls back to FLOTORCH_BASE_URL env var
)

AgentEvaluation

Use AgentEvaluation for agent traces and partial agent datasets.

Constructor

Parameter Required Notes
dataset Yes AgentDataset
metrics Yes Builtin and provider-qualified agent metrics
llm_config No Needed for LLM-judged agent metrics
agent No Python callable for Mode 2
agent_runner No Runner object for Mode 4
default_provider No Defaults to "builtin"
metric_params No Shared params keyed by metric name

Example

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset

dataset = AgentDataset.from_file("agent_dataset.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "ragas:tool_call_accuracy"],
)

results = evaluation.run()

Async

results = await evaluation.arun()

DatasetLoader

Load standard evaluation datasets from files, lists, or dicts.

from floeval import DatasetLoader

full_dataset = DatasetLoader.from_file("dataset.json", partial_dataset=False)
partial_dataset = DatasetLoader.from_file("partial.json", partial_dataset=True)

dataset = DatasetLoader.from_samples(
    [{"user_input": "Q?", "llm_response": "A."}],
    partial_dataset=False,
)

Common methods

Method Returns Notes
from_file(path, partial_dataset=...) Dataset or PartialDataset Supports .json and .jsonl
from_json(path, partial_dataset=...) Dataset or PartialDataset JSON only
from_samples(samples, partial_dataset=...) Dataset or PartialDataset Build from Python objects
from_dict(data, partial_dataset=...) Dataset or PartialDataset Build from a dict with samples

Dataset models

Standard datasets

Sample

Field Notes
user_input Required. Also accepted as question (alias).
llm_response Required for full datasets
contexts Optional, used by grounding and retrieval metrics
ground_truth Optional, used by some recall and precision metrics. Also accepted as answer (alias).
metadata Optional
prompt_id Present after prompt-driven generation
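
A full standard sample in a .json dataset file might look like the sketch below (field values are illustrative; only user_input and llm_response are required):

```json
{
  "user_input": "What is the capital of France?",
  "llm_response": "Paris is the capital of France.",
  "contexts": ["France's capital city is Paris."],
  "ground_truth": "Paris",
  "metadata": {"source": "geo-faq"}
}
```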

PartialSample

Same as Sample, but llm_response can be empty or omitted. Partial samples also support:

Field Notes
prompt_ids Optional list of prompt IDs that expands one sample into multiple generated outputs
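
A partial sample using prompt_ids might look like this (illustrative; llm_response is omitted and will be generated once per listed prompt, with the IDs assumed to reference entries in the configured prompts_file):

```json
{
  "user_input": "Summarize the refund policy.",
  "prompt_ids": ["concise", "detailed"]
}
```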

Agent datasets

Model Notes
AgentDataset Collection of full or partial agent samples
AgentSample Full sample with trace
PartialAgentSample Sample without trace yet
AgentTrace Trace with messages, final_response, and optional metadata

Agent trace message roles in saved datasets must be:

  • human
  • ai
  • tool

Agent dataset field aliases: question maps to user_input, and answer maps to reference_outcome. Both alias forms are accepted in .json files, .jsonl files, and files whose root is a JSON array.
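
Putting these pieces together, a full agent sample might look like the sketch below. The key names trace, role, and content are assumptions inferred from the model descriptions above, and all values are illustrative:

```json
{
  "user_input": "Book a table for two at 7pm.",
  "reference_outcome": "A reservation for two at 19:00 is confirmed.",
  "trace": {
    "messages": [
      {"role": "human", "content": "Book a table for two at 7pm."},
      {"role": "ai", "content": "Calling the booking tool."},
      {"role": "tool", "content": "{\"status\": \"confirmed\"}"},
      {"role": "ai", "content": "Your table for two at 7pm is booked."}
    ],
    "final_response": "Your table for two at 7pm is booked.",
    "metadata": {"agent": "booking-agent"}
  }
}
```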


Results objects

EvaluationResult

results = evaluation.run()

results.sample_results
results.aggregate_scores
results.summary

summary includes:

  • total_samples
  • providers_used
  • pass_rates
  • aggregate_scores

AgentEvaluationResult

agent_results = agent_evaluation.run()

agent_results.sample_results
agent_results.summary

Agent summaries contain per-metric averages for successful metric runs.

For agentic workflow results, each sample in sample_results also contains:

Field Notes
agent_traces List of AgentTrace objects — one per completed AGENT node in the DAG
workflow_execution Dict of per-agent summaries: {agent_name, input, output, tool_calls, turn_count, tool_call_count}
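
For instance, per-agent tool usage can be tallied from workflow_execution. The snippet below assumes a mapping from agent name to summary dict with the documented fields; both the shape beyond those fields and all values are illustrative:

```python
# Tally tool calls across agents from workflow_execution summaries.
# Assumed shape: {agent_name: {agent_name, input, output, tool_calls,
# turn_count, tool_call_count}}; values are illustrative.
workflow_execution = {
    "agent_a": {"agent_name": "agent_a", "input": "q", "output": "a",
                "tool_calls": ["search"], "turn_count": 2, "tool_call_count": 1},
    "agent_b": {"agent_name": "agent_b", "input": "a", "output": "final",
                "tool_calls": [], "turn_count": 1, "tool_call_count": 0},
}
total_tool_calls = sum(s["tool_call_count"] for s in workflow_execution.values())
assert total_tool_calls == 1
```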

MetricRegistry

Use MetricRegistry to inspect what is currently registered.

from floeval.api.metrics.registry import MetricRegistry

print(MetricRegistry.list_providers())
print(MetricRegistry.list_metrics("ragas"))
print(MetricRegistry.list_metrics("deepeval"))
print(MetricRegistry.list_metrics("builtin"))
print(MetricRegistry.list_all_metrics())

Common providers

Provider Notes
ragas Standard and agent metrics
deepeval Standard LLM and retrieval metrics
builtin Agent judge metrics
custom Appears after custom metrics are defined