# Floeval User Guide
Floeval is a multi-backend evaluation framework for LLM, RAG, prompt, and agent systems.
## Supported evaluation types
Floeval supports five distinct evaluation types. Each maps to a different dataset shape and set of metrics.
| Eval type | What you are scoring | Typical dataset |
|---|---|---|
| LLM | Direct question-answer quality without retrieval | user_input + llm_response |
| RAG | Answer quality and retrieval performance with context | user_input + llm_response + contexts |
| Prompt | One or more system prompts against the same dataset | Partial dataset + prompts_file or inline prompts |
| Agent | Single-agent trace quality, tool use, and goal achievement | AgentDataset with full or partial samples |
| Agentic Workflow | Multi-agent DAG: each agent node is a step in the pipeline | AgentDataset + DAG config |
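The dataset shapes named in the table above can be illustrated with minimal records. The field names (`user_input`, `llm_response`, `contexts`) come from the table itself; real datasets may carry additional fields, so treat this as a sketch and check the API Reference for the full schema. The first record is an LLM-shaped sample, the second a RAG-shaped sample:

```json
[
  {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris."
  },
  {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."]
  }
]
```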
## Supported workflows
| Workflow | Best for | Main entry point |
|---|---|---|
| Standard CLI evaluation | Evaluation from config and dataset files | floeval evaluate |
| Standard Python evaluation | App integration and programmatic control | Evaluation(...) |
| Prompt evaluation | Compare system-prompt variants at scale | Evaluation(..., prompts_file=...) |
| Agent evaluation | Scoring traces, tool use, and final outcomes | AgentEvaluation(...) or floeval evaluate --agent |
| Agentic workflow evaluation | Multi-agent DAG scoring | AgentEvaluation(..., agent_runner=WorkflowRunner(...)) |
| Trace capture | Turning agent runs into evaluation-ready traces | floeval.utils.agent_trace |
## Core capabilities

- Run `ragas`, `deepeval`, `builtin`, and `custom` metrics in the same project
- Evaluate full datasets that already contain `llm_response`
- Evaluate partial datasets and let Floeval generate missing responses automatically
- Expand partial datasets across prompt variants with `prompt_ids` and `prompts_file`
- Score agent traces with agent-specific metrics (goal achievement, tool accuracy, coherence)
- Run multi-agent DAG workflows and evaluate the combined output
- Capture traces from Python callables, LangChain-style agents, or FloTorch runners
- Define custom scoring logic using function decorators or LLM-as-judge criteria
## Metrics by eval type (quick reference)

| Eval type | Recommended metrics |
|---|---|
| `llm` | `ragas:answer_relevancy`, `deepeval:answer_relevancy` |
| `rag` | `ragas:faithfulness`, `ragas:context_precision`, `ragas:context_recall`, `deepeval:contextual_precision`, `deepeval:contextual_relevancy` (plus others) |
| `prompt` (no RAG) | `ragas:answer_relevancy`, `deepeval:answer_relevancy` |
| `prompt` (with RAG) | Same as `rag` |
| `agent` | `builtin:goal_achievement`, `builtin:response_coherence`, `ragas:agent_goal_accuracy`, `ragas:tool_call_accuracy` |
| `agentic_workflow` | Same four metrics as `agent` |
See Metrics for the complete catalog.
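Metric identifiers in the table follow a `provider:metric_name` convention. A small helper to split such an ID into its parts (this function is a hypothetical illustration, not part of Floeval):

```python
def parse_metric_id(metric_id: str) -> tuple[str, str]:
    """Split a 'provider:metric_name' identifier into (provider, name)."""
    provider, _, name = metric_id.partition(":")
    return provider, name

print(parse_metric_id("ragas:answer_relevancy"))    # ('ragas', 'answer_relevancy')
print(parse_metric_id("builtin:goal_achievement"))  # ('builtin', 'goal_achievement')
```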
## Documentation map
| Task | Reference |
|---|---|
| Install and set up Floeval | Prerequisites & Setup |
| Run a quick example | Minimal Examples |
| Run evaluation from files | Examples |
| Evaluate with system prompts | Prompt Evaluation |
| Evaluate an agent trace dataset | Agent Evaluation |
| Evaluate a multi-agent DAG workflow | Agentic Workflow |
| Add trace capture to an agent | Agent Tracing |
| Compare providers or metric IDs | Metrics |
| Define custom scoring logic | Custom Metrics |
| Check config fields and constructors | API Reference |
| Diagnose errors and failures | Troubleshooting |
## Basic workflow

- Install Floeval. See Prerequisites & Setup for version and environment details.
- Create a config file with `llm_config` and `evaluation_config`.
- Create a dataset file in `.json` or `.jsonl`.
- Run `floeval evaluate -c config.yaml -d dataset.json`.
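A config file for the second step might look like the sketch below. Only the top-level keys `llm_config` and `evaluation_config` are named in this guide; every field inside them is an assumption for illustration, so check the API Reference for the real keys:

```yaml
# Hypothetical sketch; only llm_config / evaluation_config are documented keys.
llm_config:
  provider: openai          # assumed field name
  model: gpt-4o-mini        # assumed field name
  api_key: ${OPENAI_API_KEY}

evaluation_config:
  eval_type: rag            # assumed field name
  metrics:
    - ragas:faithfulness
    - ragas:answer_relevancy
```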
## Credential handling
Examples use placeholder API keys for clarity. In real usage, prefer loading secrets from environment variables or a secrets manager and injecting them into your config or Python code at runtime.
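For example, a key can be read from the environment at runtime rather than hard-coded. This is plain standard-library Python; the `load_api_key` helper and the config shape in the comment are illustrative, not part of Floeval:

```python
import os

def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read a secret from the environment, failing fast if it is missing."""
    api_key = os.environ.get(var_name)
    if api_key is None:
        raise RuntimeError(f"Set {var_name} before running the evaluation")
    return api_key

# Inject the secret into your config at runtime instead of writing it to disk:
# llm_config = {"api_key": load_api_key()}
```

Failing fast with a clear error beats sending an empty key and debugging an opaque authentication failure later.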
## References
| Section | What it covers |
|---|---|
| Prerequisites & Setup | Installation, beta versioning, credentials, optional FloTorch setup |
| Minimal Examples | Short copy-paste examples for all eval types |
| Examples | CLI, Python, prompt expansion, mixed providers, agent, and workflow flows |
| Prompt Evaluation | Single and multi-prompt eval, prompts_file, prompt_ids, RAG vs no-RAG |
| Agent Evaluation | Single-agent dataset shapes, CLI --agent, Python callable, FloTorch runner |
| Agentic Workflow | Multi-agent DAG config, WorkflowRunner, traces per node |
| Agent Tracing | capture_trace, log_turn, log_tool_result, wrap_langchain_agent |
| Metrics | Full metric catalog by provider and eval type |
| Custom Metrics | @custom_metric decorator, criteria() LLM-as-judge |
| API Reference | Config keys, dataset models, CLI signatures, and public APIs |
| Troubleshooting | Install, config, dataset, generation, prompt, and agent-eval fixes |