# Custom Metrics
Create your own evaluation logic when built-in metrics don't cover your needs.
## Two types of custom metrics
### 1. `@custom_metric` decorator

Use `@custom_metric` for function-based metrics. Supported parameters include `response`, `question`, `contexts`, `context` (`MetricContext`), `llm` (`SimpleLLMHelper`), and `sample`.
```python
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.5)
def response_length(response: str) -> float:
    return min(len(response) / 100.0, 1.0)
```
### 2. `criteria()` — LLM-as-judge
Use criteria() to define an evaluation criterion in natural language and have an LLM score responses against it.
```python
from floeval.api.metrics.custom import criteria

empathy = criteria(
    name="empathy",
    description="Rate empathy 0-1. Consider acknowledgment, understanding, support.",
    threshold=0.7
)
```
## `@custom_metric` options
| Parameter | Description | Example |
|---|---|---|
| `threshold` | Pass/fail threshold (0–1). Default: `0.5` | `@custom_metric(threshold=0.7)` |
| `execute_via` | Route to the `"ragas"` or `"deepeval"` provider. Default: standalone (no provider needed) | `@custom_metric(execute_via="ragas")` |
| `name` | Custom metric name. Default: function name | `@custom_metric(name="my_metric")` |
### `execute_via`
By default, custom metrics run standalone and do not depend on provider execution. Use execute_via to route a custom metric through RAGAS or DeepEval:
```python
@custom_metric(execute_via="ragas", threshold=0.6)
def semantic_consistency(response: str, llm) -> float:
    """Uses RAGAS provider context and LLM."""
    # This metric can now access provider features
    return 0.8

@custom_metric(execute_via="deepeval", threshold=0.7)
def tone_check(response: str) -> float:
    """Runs through DeepEval provider."""
    return 0.9
```
## Available parameters in `@custom_metric`
Custom metric functions can access these parameters (pass only what you need):
| Parameter | Type | Description |
|---|---|---|
| `response` | `str` | The LLM's response (aka `llm_response`) |
| `question` | `str` | The user's question (aka `user_input`) |
| `contexts` | `list[str]` | Retrieved documents (if using RAG) |
| `context` | `MetricContext` | Full evaluation context object |
| `llm` | `SimpleLLMHelper` | LLM interface for calling the model |
| `sample` | `dict` | Full sample data |
### Parameter examples
```python
from floeval.api.metrics.custom import custom_metric

# Simple: just response
@custom_metric()
def response_length(response: str) -> float:
    return min(len(response) / 100.0, 1.0)

# Multiple params: response + question
@custom_metric()
def covers_topic(response: str, question: str) -> float:
    return 1.0 if len(response) > 0 and len(question) > 0 else 0.0

# With contexts (RAG)
@custom_metric()
def uses_context(response: str, contexts: list) -> float:
    if not contexts:
        return 0.5  # No contexts provided
    # Check if the response reuses any words from the contexts
    # (lowercase both sides so the comparison is case-insensitive)
    ctx_text = " ".join(contexts).lower()
    return 0.8 if any(word in response.lower() for word in ctx_text.split()) else 0.2

# With LLM helper (needs llm_config)
@custom_metric()
def tone_assessment(response: str, llm) -> float:
    result = llm.generate(f"Is this professional? {response}")
    return 1.0 if "yes" in result.lower() else 0.0
```
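The table above also lists `sample`, the full sample dict. As a sketch, a metric could score directly off a sample field; the field name `llm_response` matches the `DatasetLoader.from_samples` examples below, and the function is shown undecorated so the scoring logic is easy to follow (in practice you would apply `@custom_metric()` as above):

```python
# Sketch: a metric that reads from the full sample dict.
# The "llm_response" key is an assumption based on the dataset
# examples in this page; apply @custom_metric() to register it.
def has_nonempty_answer(sample: dict) -> float:
    response = sample.get("llm_response", "")
    # Score 1.0 for a substantive answer, 0.0 for empty/whitespace
    return 1.0 if response.strip() else 0.0
```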
Your custom metric can return:

- `float` — Simple score 0–1
- `dict` — With `"score"` and optional `"metadata"`
- `MetricResult` — Full result object
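As a sketch of the `dict` shape, a metric can attach metadata alongside the score. The keyword list is illustrative, and the function is shown undecorated so the return shape is easy to inspect; in practice you would apply `@custom_metric()`:

```python
# Sketch: returning a dict with "score" plus optional "metadata".
# In practice this function would be decorated with @custom_metric().
def keyword_coverage(response: str) -> dict:
    keywords = ["refund", "policy", "receipt"]  # illustrative keywords
    hits = [k for k in keywords if k in response.lower()]
    return {
        "score": len(hits) / len(keywords),
        "metadata": {"matched_keywords": hits},
    }
```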
## Examples
### Simple function (no API key needed)
```python
from floeval import Evaluation, DatasetLoader
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.3)
def response_length(response: str) -> float:
    return min(len(response) / 100.0, 1.0)

@custom_metric(threshold=0.5)
def politeness(response: str) -> float:
    polite_words = ["please", "thank you", "thanks"]
    count = sum(1 for w in polite_words if w in response.lower())
    return min(count / 3.0, 1.0)

dataset = DatasetLoader.from_samples([
    {"user_input": "Help?", "llm_response": "Please let me help. Thank you for asking."},
    {"user_input": "What?", "llm_response": "No idea."},
])

evaluation = Evaluation(dataset=dataset, metrics=["politeness", "response_length"])
results = evaluation.run()
```
### Criteria-based (needs `llm_config`)
```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.api.metrics.custom import criteria

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

empathy = criteria(
    name="empathy",
    description="Rate empathy 0-1. Consider acknowledgment, understanding, support.",
    threshold=0.6
)

dataset = DatasetLoader.from_samples([
    {"user_input": "I'm stressed.", "llm_response": "I understand. Would you like to talk?"},
    {"user_input": "I'm stressed.", "llm_response": "Just work harder."},
])

evaluation = Evaluation(dataset=dataset, llm_config=llm_config, metrics=[empathy])
results = evaluation.run()
```
### Custom with `llm.generate()`
```python
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.5)
def is_professional(response: str, llm) -> float:
    answer = llm.generate(f"""
    Is this professional for business? Answer only 'yes' or 'no'.
    Response: {response}
    """)
    return 1.0 if "yes" in answer.lower() else 0.0
```
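A bare substring check can misread a verbose judge (e.g. "I would not say yes" would score 1.0). A slightly more defensive parse, as a plain-Python sketch rather than part of floeval's API, checks the first token before falling back to the substring search:

```python
# Sketch: parse an LLM judge's yes/no verdict more defensively than
# a bare substring check. Plain Python; not part of floeval's API.
def parse_yes_no(answer: str) -> float:
    tokens = answer.strip().lower().split()
    if not tokens:
        return 0.0
    first = tokens[0].strip(".,!:;\"'")
    if first == "yes":
        return 1.0
    if first == "no":
        return 0.0
    # Fall back to a substring search over the whole answer
    return 1.0 if "yes" in answer.lower() else 0.0
```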
### Custom with multiple params (question, contexts)
```python
@custom_metric(threshold=0.4)
def context_relevance(response: str, question: str, contexts: list) -> float:
    if not response or not question:
        return 0.0
    q_words = set(question.lower().split())
    r_words = set(response.lower().split())
    q_match = len(q_words & r_words) / max(len(q_words), 1)
    ctx_usage = 0.0
    if contexts:
        ctx_text = " ".join(contexts).lower()
        ctx_words = set(ctx_text.split())
        ctx_usage = min(len(ctx_words & r_words) / max(len(r_words), 1) * 2, 1.0)
    return min(q_match * 0.6 + ctx_usage * 0.4, 1.0)
```
### Mix custom with built-in
```python
evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["response_length", "answer_relevancy", "faithfulness"],
    default_provider="ragas"
)
```
## Next steps
- Minimal Examples — Compact runnable examples
- Metrics — Built-in metrics reference
- Examples — All usage patterns