# Available Metrics

Floeval registers metrics across the `ragas`, `deepeval`, and `builtin` providers, plus any custom metrics you define at runtime.
## Metrics by eval type

Use this table as a quick reference to pick metrics for each evaluation type.
| Eval type | Provider | Metrics |
|---|---|---|
| `llm` | `ragas` | `answer_relevancy` |
| `llm` | `deepeval` | `answer_relevancy` |
| `rag` | `ragas` | `answer_relevancy`, `faithfulness`, `context_precision`, `context_recall`, `context_entity_recall`, `noise_sensitivity` |
| `rag` | `deepeval` | `answer_relevancy`, `faithfulness`, `contextual_precision`, `contextual_recall`, `contextual_relevancy` |
| `prompt` (no RAG) | `ragas` | `answer_relevancy` |
| `prompt` (no RAG) | `deepeval` | `answer_relevancy` |
| `prompt` (with RAG) | `ragas` | Same as `rag` |
| `prompt` (with RAG) | `deepeval` | Same as `rag` |
| `agent` | `builtin` | `goal_achievement`, `response_coherence` |
| `agent` | `ragas` | `agent_goal_accuracy`, `tool_call_accuracy` |
| `agentic_workflow` | `builtin` | `goal_achievement`, `response_coherence` |
| `agentic_workflow` | `ragas` | `agent_goal_accuracy`, `tool_call_accuracy` |
## How to specify metrics

You can reference metrics in four ways:
| Format | Example | When to use it |
|---|---|---|
| Plain string | `"answer_relevancy"` | Use with `default_provider` |
| Provider-qualified string | `"deepeval:faithfulness"` | Use when the metric exists in multiple providers |
| Dict spec | `{"id": "faithfulness", "provider": "deepeval", "params": {"threshold": 0.8}}` | Use when you need per-metric params |
| Metric instance | `my_metric` | Use with custom programmatic metrics |
If the same metric ID exists in more than one provider and you do not set `default_provider`, use the `provider:metric` form.
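The difference between the string forms comes down to how the provider is resolved. A minimal sketch using plain Python values (the `parse_metric_ref` helper is illustrative, not part of the floeval API):

```python
def parse_metric_ref(ref):
    """Split a metric reference into (provider, metric_id).

    Provider is None for plain strings, mirroring the default_provider
    fallback described above. Dict specs carry the provider explicitly.
    """
    if isinstance(ref, dict):
        return ref.get("provider"), ref["id"]
    provider, sep, metric_id = ref.partition(":")
    return (provider, metric_id) if sep else (None, ref)

# The three data-driven reference formats from the table above.
metrics = [
    "answer_relevancy",           # plain string: resolved via default_provider
    "deepeval:faithfulness",      # provider-qualified string
    {"id": "faithfulness", "provider": "deepeval",
     "params": {"threshold": 0.8}},  # dict spec with per-metric params
]

for ref in metrics:
    print(parse_metric_ref(ref))
```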
## Standard LLM and RAG metrics

### RAGAS
| Metric | Best for | Required fields |
|---|---|---|
| `answer_relevancy` | General LLM answer quality | `user_input`, `llm_response` |
| `faithfulness` | Grounding against retrieved context | `user_input`, `llm_response`, `contexts` |
| `context_precision` | Whether relevant contexts are ranked first | `user_input`, `llm_response`, `contexts`, `ground_truth` |
| `context_recall` | Coverage of reference information by retrieved contexts | `user_input`, `llm_response`, `contexts`, `ground_truth` |
| `context_entity_recall` | Entity-level coverage in retrieved contexts | `user_input`, `llm_response`, `contexts`, `ground_truth` |
| `noise_sensitivity` | How often the model produces incorrect claims from noisy context | `user_input`, `llm_response`, `contexts`, `ground_truth` |
### DeepEval
| Metric | Best for | Required fields |
|---|---|---|
| `answer_relevancy` | General LLM answer quality | `user_input`, `llm_response` |
| `faithfulness` | Grounding against context | `user_input`, `llm_response`, `contexts` |
| `contextual_precision` | Retrieval ranking precision | `user_input`, `llm_response`, `contexts`, `ground_truth` |
| `contextual_recall` | Retrieval recall against ground truth | `user_input`, `llm_response`, `contexts`, `ground_truth` |
| `contextual_relevancy` | Whether retrieved context is relevant to the question | `user_input`, `llm_response`, `contexts` |
| `hallucination` | Factual contradictions versus the provided context | `user_input`, `llm_response`, `contexts` |
| `toxicity` | Toxic, harmful, or offensive content | `user_input`, `llm_response` |
| `exact_match` | Exact string match against ground truth (no LLM required) | `llm_response`, `ground_truth` |
| `pattern_match` | Regex pattern match against the response (no LLM required) | `llm_response` |
| `json_correctness` | JSON schema validation of the response | `llm_response` |
## Agent metrics

These metrics are for `AgentEvaluation` only. Do not use them with standard `Evaluation`.

### Builtin
| Metric | Best for | Required fields | Notes |
|---|---|---|---|
| `goal_achievement` | Did the agent satisfy the user request? | `trace`, `user_input` | Uses an LLM judge; `reference_outcome` helps when available |
| `response_coherence` | Is the final answer consistent with the conversation trace? | `trace` | No `reference_outcome` needed; scores coherence of the trace itself |
### RAGAS
| Metric | Best for | Required fields | Notes |
|---|---|---|---|
| `agent_goal_accuracy` | Compare the agent's final result to an expected outcome | `trace`, `user_input`, `reference_outcome` | Best when `reference_outcome` is present |
| `tool_call_accuracy` | Check whether actual tool calls match expected tool calls | `trace`, `reference_tool_calls` | Silently skips samples that have no `reference_tool_calls`; score is null for those samples |
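Because skipped samples carry a null score, any aggregation you compute yourself should drop them rather than count them as zero. A minimal sketch (the score list is illustrative, not a floeval result type):

```python
# Per-sample tool_call_accuracy scores; None marks samples that were
# skipped because they had no reference_tool_calls.
scores = [0.9, None, 1.0, None, 0.6]

# Average only the samples that were actually scored.
scored = [s for s in scores if s is not None]
aggregate = sum(scored) / len(scored) if scored else None
print(aggregate)
```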
## Field guidance

### Standard evaluation datasets
| Field | Commonly used by |
|---|---|
| `user_input` | All standard metrics |
| `llm_response` | All standard metrics |
| `contexts` | `faithfulness`, retrieval-focused metrics, contextual metrics |
| `ground_truth` | Recall and precision style metrics, especially DeepEval contextual metrics |
| `prompt_id` | Generated datasets that came from prompt expansion |
Dataset field aliases are also accepted: `question` is treated as `user_input`, and `answer` is treated as `ground_truth`.
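What this aliasing amounts to can be sketched with a plain dict sample (the `normalize_sample` helper is hypothetical; floeval applies the mapping internally):

```python
# Alias map for standard evaluation datasets, per the note above.
ALIASES = {"question": "user_input", "answer": "ground_truth"}

def normalize_sample(sample):
    """Return a copy of the sample with alias keys renamed to canonical ones."""
    return {ALIASES.get(key, key): value for key, value in sample.items()}

sample = {
    "question": "What is the capital of France?",  # alias for user_input
    "llm_response": "Paris.",
    "answer": "Paris",                             # alias for ground_truth
}
print(normalize_sample(sample))
```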
### Agent evaluation datasets
| Field | Commonly used by |
|---|---|
| `trace.messages` | All agent metrics |
| `trace.final_response` | All agent metrics |
| `reference_outcome` | `goal_achievement`, `agent_goal_accuracy` |
| `reference_tool_calls` | `tool_call_accuracy` |
Agent dataset field aliases: `question` maps to `user_input`, and `answer` maps to `reference_outcome`.

If a metric needs fields that are missing from your samples, the failure is recorded in that sample's metric metadata and the overall run continues.
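If you would rather catch missing fields before a run than discover them in per-sample metadata, a simple pre-flight check can be written against the required-fields tables above (the helper and the `REQUIRED` map are illustrative, not a floeval API):

```python
# Required fields per agent metric, copied from the tables above.
REQUIRED = {
    "goal_achievement": {"trace", "user_input"},
    "response_coherence": {"trace"},
    "agent_goal_accuracy": {"trace", "user_input", "reference_outcome"},
    "tool_call_accuracy": {"trace", "reference_tool_calls"},
}

def missing_fields(sample, metric_id):
    """Fields the metric needs that the sample does not provide."""
    return sorted(REQUIRED[metric_id] - sample.keys())

sample = {
    "trace": {"messages": [], "final_response": ""},
    "user_input": "Book a flight to Oslo",
}
print(missing_fields(sample, "agent_goal_accuracy"))  # ['reference_outcome']
```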
## Choosing a provider

- Use `ragas` when you want the default provider for common answer and retrieval metrics.
- Use `deepeval` when you want its contextual metrics or prefer its scoring behavior.
- Use `builtin` for agent-specific judge metrics.
- Use `custom` for domain-specific checks you write yourself.
## Example configurations

```python
evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=[
        {"id": "faithfulness", "provider": "deepeval", "params": {"threshold": 0.8}},
        "ragas:answer_relevancy",
    ],
)
```
## Thresholds and params

Set thresholds through `metric_params` or through metric dict specs.
```yaml
evaluation_config:
  metrics:
    - "answer_relevancy"
    - "faithfulness"
  metric_params:
    answer_relevancy:
      threshold: 0.7
    faithfulness:
      threshold: 0.8
```
Provider-qualified keys also work:
```yaml
evaluation_config:
  metrics:
    - "ragas:answer_relevancy"
    - "deepeval:faithfulness"
  metric_params:
    ragas:answer_relevancy:
      threshold: 0.7
    deepeval:faithfulness:
      threshold: 0.8
```
## Understanding results

All built-in provider scores are normalized to the 0.0 to 1.0 range.

| Score range | Typical interpretation |
|---|---|
| 0.9 - 1.0 | Strong |
| 0.7 - 0.9 | Good |
| 0.5 - 0.7 | Mixed |
| 0.0 - 0.5 | Weak |
Use:
- aggregate scores for overall system quality
- per-sample scores to inspect failure patterns
- metric metadata to understand thresholds, provider, and recorded errors
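The interpretation bands above can be turned into a small helper for labelling aggregate scores in your own reports. A sketch (the function is illustrative; boundary values are placed in the higher band):

```python
def interpret(score):
    """Map a normalized 0.0-1.0 score to the band names from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("scores are normalized to the 0.0-1.0 range")
    if score >= 0.9:
        return "Strong"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Mixed"
    return "Weak"

print(interpret(0.82))  # Good
```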
## Discover metrics programmatically

```python
from floeval.api.metrics.registry import MetricRegistry

print(MetricRegistry.list_providers())
print(MetricRegistry.list_metrics("ragas"))
print(MetricRegistry.list_metrics("deepeval"))
print(MetricRegistry.list_metrics("builtin"))
```

Custom metrics appear in the registry after you define them.
## Next steps

- Examples for provider-routing configurations
- Prompt Evaluation for prompt-specific metric choices
- Agent Evaluation for agent-only metrics
- Agentic Workflow for multi-agent DAG evaluation
- Custom Metrics for user-defined scoring