# Prompt Evaluation
Use this guide to evaluate one or more system prompts against the same dataset. Prompt evaluation lets you compare prompt variants at scale before shipping.
## What prompt evaluation is
Prompt evaluation is a specialized form of LLM evaluation where you:
- Write a partial dataset with questions (no pre-generated responses)
- Define one or more system prompts
- Have Floeval generate responses for each prompt variant
- Score all responses using the same metrics
This answers the question: which prompt works best for my use case?
## Sub-modes
| Sub-mode | Description | When to use |
|---|---|---|
| Prompt (no RAG) | Single or multi-prompt against questions only | No retrieval context needed |
| Prompt (with RAG) | Single or multi-prompt with a vector store or pre-fetched context | Retrieval is part of the pipeline |
The distinction matters for metrics: no-RAG prompts support only answer relevancy metrics, while RAG prompts support the full set of retrieval metrics.
## Inline single prompt (CLI)
The simplest form: a single `system_prompt` defined directly in the config. The dataset is a standard partial dataset with `user_input` values (`question` is accepted as an alias); each `user_input` becomes the user prompt.
### Config
```yaml
# config.yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
  prompts_file: null  # not needed for inline prompt

dataset_generation_config:
  generator_model: "gpt-4o-mini"
```
To use an inline system prompt without a prompts file, set `system_prompt` inside `llm_config`:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  system_prompt: "You are a concise assistant. Keep answers under two sentences."

evaluation_config:
  metrics:
    - "answer_relevancy"

dataset_generation_config:
  generator_model: "gpt-4o-mini"
```
### Dataset
You can also use the `question` alias instead of `user_input`:
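The dataset example was trimmed here; a minimal sketch using the `question` alias, following the sample shape shown later in this guide:

```json
{
  "samples": [
    {"question": "What is RAG?"},
    {"question": "How does vector search work?"}
  ]
}
```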
### Run
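The exact command was trimmed here. Assuming the same two-step flow shown later in this guide, with the dataset saved as `dataset.json` (filenames are illustrative):

```shell
floeval generate -c config.yaml -d dataset.json -o generated.json
floeval evaluate -c config.yaml -d generated.json -o results.json
```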
## Multi-prompt evaluation with a prompts file
A prompts file defines named prompt templates. Each sample in the dataset carries a list of `prompt_ids` that references the prompts to use. Floeval generates one response per `prompt_id` and scores each one separately.
### Prompts file (YAML)
```yaml
# prompts.yaml
prompts:
  concise:
    template: "Answer in one short paragraph."
  detailed:
    template: "Answer in detail with at least three points."
  formal:
    template: "Respond in formal business English."
```
JSON format is also accepted:
```json
{
  "prompts": {
    "concise": {"template": "Answer in one short paragraph."},
    "detailed": {"template": "Answer in detail with at least three points."},
    "formal": {"template": "Respond in formal business English."}
  }
}
```
### Dataset with prompt_ids
Each sample lists which prompts should generate responses for it:
```json
{
  "samples": [
    {
      "user_input": "Summarize this customer support ticket.",
      "prompt_ids": ["concise", "detailed", "formal"]
    },
    {
      "user_input": "Explain the refund policy.",
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}
```
The above dataset generates five total responses (3 prompt IDs for the first question + 2 for the second), each tagged with a `prompt_id` in the output.
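The fan-out can be sketched in plain Python (an illustration of how samples expand, not Floeval's internal code):

```python
# Each sample expands into one generated response per (sample, prompt_id) pair.
samples = [
    {"user_input": "Summarize this customer support ticket.",
     "prompt_ids": ["concise", "detailed", "formal"]},
    {"user_input": "Explain the refund policy.",
     "prompt_ids": ["concise", "detailed"]},
]

expanded = [
    {"user_input": s["user_input"], "prompt_id": pid}
    for s in samples
    for pid in s["prompt_ids"]
]

print(len(expanded))  # 5
```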
### Config
```yaml
# config.yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"
```
### Run
Or generate first and inspect before scoring:
```shell
floeval generate -c config.yaml -d partial_dataset.json -o generated.json
floeval evaluate -c config.yaml -d generated.json -o results.json
```
## Prompt with RAG (vectorstore)
When prompts run against a vectorstore, the config includes a `vectorstore_id` and Floeval fetches context for each sample before generating responses.
```yaml
llm_config:
  base_url: "https://gateway.example/openai/v1"
  api_key: "your-gateway-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
    - "faithfulness"
    - "context_precision"
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"
```
The dataset also specifies `vectorstore_id` at the data level:
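The dataset example was trimmed here; a sketch of the assumed shape, with a hypothetical `vectorstore_id` value:

```json
{
  "vectorstore_id": "vs_example_123",
  "samples": [
    {
      "user_input": "What is RAG?",
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}
```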
> **Note:** When using a vectorstore, Floeval automatically fetches context for each sample and injects it during generation. Use RAG metrics such as `faithfulness` and `context_precision` in this case.
## Prompt with pre-fetched context
If the context is already in the dataset (no vectorstore fetch needed), include it in each sample:
```json
{
  "samples": [
    {
      "user_input": "What is RAG?",
      "contexts": ["RAG combines retrieval with generation."],
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}
```
Floeval skips the vectorstore fetch when `contexts` is present.
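The skip rule can be illustrated with a small predicate (a hypothetical helper for illustration, not part of Floeval's API; the internal check may differ):

```python
def needs_vectorstore_fetch(sample: dict) -> bool:
    """Return True when a sample has no usable pre-fetched context.

    Hypothetical sketch of the rule above: an absent or empty
    "contexts" list means Floeval must fetch from the vectorstore.
    """
    return not sample.get("contexts")

print(needs_vectorstore_fetch({"user_input": "What is RAG?"}))  # True
print(needs_vectorstore_fetch(
    {"user_input": "What is RAG?", "contexts": ["RAG combines retrieval with generation."]}
))  # False
```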
## Python API
### Single prompt (inline)
```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    system_prompt="You are a concise assistant. Keep answers under two sentences.",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {"user_input": "What is RAG?"},
        {"user_input": "How does vector search work?"},
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy"],
    dataset_generator_model="gpt-4o-mini",
)

results = evaluation.run()
print(results.aggregate_scores)
```
### Multi-prompt with prompts file
```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "Summarize this customer support ticket.",
            "prompt_ids": ["concise", "detailed"],
        }
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy"],
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()

# One result row per (sample, prompt_id) combination
for row in results.sample_results:
    print(row["prompt_id"], row["metrics"])
```
### Async execution
## Metrics to use
### Prompt (no RAG)
| Provider | Metric |
|---|---|
| ragas | answer_relevancy |
| deepeval | answer_relevancy |
### Prompt (with RAG)
| Provider | Metric |
|---|---|
| ragas | answer_relevancy, faithfulness, context_precision, context_recall, context_entity_recall, noise_sensitivity |
| deepeval | answer_relevancy, faithfulness, contextual_precision, contextual_recall, contextual_relevancy |
## Understanding multi-prompt results
When `prompt_ids` is used, each row in `sample_results` carries a `prompt_id` field. You can group by `prompt_id` to compare which prompt performed best:
```python
from collections import defaultdict

by_prompt = defaultdict(list)
for row in results.sample_results:
    pid = row.get("prompt_id", "default")
    by_prompt[pid].append(row["metrics"].get("answer_relevancy", {}).get("score"))

for pid, scores in by_prompt.items():
    valid = [s for s in scores if s is not None]
    avg = sum(valid) / len(valid) if valid else None
    print(f"{pid}: avg answer_relevancy = {avg}")
```
## Common pitfalls
| Problem | Cause | Fix |
|---|---|---|
| No responses generated | `dataset_generator_model` or `generator_model` is missing | Add `dataset_generation_config.generator_model` to config |
| All rows have the same `prompt_id` | Samples missing `prompt_ids` field | Add `prompt_ids` to each sample |
| Prompt template not applied | `prompts_file` path is wrong or prompt IDs do not match | Check file path and that all IDs in `prompt_ids` exist in the prompts file |
| RAG metrics fail | No context in samples and no vectorstore | Either add `contexts` to samples or configure a `vectorstore_id` |
## Related references
- Examples for prompt expansion dataset shapes
- Metrics for the full metric catalog
- API Reference for the `Evaluation` constructor and `prompts_file` field
- Troubleshooting for generation issues