Prompt Evaluation

Use this guide to evaluate one or more system prompts against the same dataset. Prompt evaluation lets you compare prompt variants at scale before shipping.


What prompt evaluation is

Prompt evaluation is a specialized form of LLM evaluation where you:

  1. Write a partial dataset with questions (no pre-generated responses)
  2. Define one or more system prompts
  3. Have Floeval generate responses for each prompt variant
  4. Score all responses using the same metrics

This answers the question: which prompt works best for my use case?


Sub-modes

| Sub-mode | Description | When to use |
|---|---|---|
| Prompt (no RAG) | Single or multi-prompt against questions only | No retrieval context needed |
| Prompt (with RAG) | Single or multi-prompt with a vector store or pre-fetched context | Retrieval is part of the pipeline |

The distinction matters for metrics: no-RAG prompts only support answer relevancy metrics, while RAG prompts support the full set of retrieval metrics.


Inline single prompt (CLI)

The simplest form: a single system_prompt defined directly in the config. The dataset is a standard partial dataset with user_input values (or question as an alias); each user_input becomes the user message sent to the model.

Config

# config.yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
  prompts_file: null  # not needed for inline prompt

dataset_generation_config:
  generator_model: "gpt-4o-mini"

To use an inline system prompt without a prompts file, set system_prompt inside llm_config:

llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  system_prompt: "You are a concise assistant. Keep answers under two sentences."

evaluation_config:
  metrics:
    - "answer_relevancy"

dataset_generation_config:
  generator_model: "gpt-4o-mini"

Dataset

{
  "samples": [
    {"user_input": "What is RAG?"},
    {"user_input": "How does vector search work?"}
  ]
}

You can also use the question alias instead of user_input:

{
  "samples": [
    {"question": "What is RAG?"},
    {"question": "How does vector search work?"}
  ]
}
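The alias handling can be sketched as a small normalization step. This is illustrative only — Floeval performs the equivalent mapping internally:

```python
def normalize_sample(sample: dict) -> dict:
    """Map the `question` alias onto `user_input` (illustrative sketch)."""
    sample = dict(sample)  # copy so the caller's dict is not mutated
    if "user_input" not in sample and "question" in sample:
        sample["user_input"] = sample.pop("question")
    return sample

print(normalize_sample({"question": "What is RAG?"}))
# {'user_input': 'What is RAG?'}
```

Either key produces the same sample after loading, so the two dataset files above are interchangeable.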

Run

floeval evaluate -c config.yaml -d partial_dataset.json -o results.json

Multi-prompt evaluation with a prompts file

A prompts file defines named prompt templates. Each sample in the dataset carries a list of prompt_ids that references the prompts to use. Floeval generates one response per prompt_id and scores each one separately.

Prompts file (YAML)

# prompts.yaml
prompts:
  concise:
    template: "Answer in one short paragraph."
  detailed:
    template: "Answer in detail with at least three points."
  formal:
    template: "Respond in formal business English."

JSON format is also accepted:

{
  "prompts": {
    "concise": {"template": "Answer in one short paragraph."},
    "detailed": {"template": "Answer in detail with at least three points."},
    "formal": {"template": "Respond in formal business English."}
  }
}
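Since both formats share the same `prompts` → `template` shape, a loader only needs to branch on the file extension. A minimal sketch (the function name `load_prompts` is hypothetical, not Floeval's API; the YAML branch assumes PyYAML is installed):

```python
import json
from pathlib import Path


def load_prompts(path: str) -> dict:
    """Load a prompts file in YAML or JSON format (illustrative sketch).

    Returns a mapping of prompt_id -> template string.
    """
    text = Path(path).read_text()
    if path.endswith((".yaml", ".yml")):
        import yaml  # requires PyYAML
        data = yaml.safe_load(text)
    else:
        data = json.loads(text)
    return {pid: spec["template"] for pid, spec in data["prompts"].items()}
```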

Dataset with prompt_ids

Each sample lists which prompts should generate responses for it:

{
  "samples": [
    {
      "user_input": "Summarize this customer support ticket.",
      "prompt_ids": ["concise", "detailed", "formal"]
    },
    {
      "user_input": "Explain the refund policy.",
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}

The above dataset generates five total samples (3 prompt IDs for the first question + 2 for the second = 5 responses), each tagged with a prompt_id in the output.
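The fan-out works roughly like this — an illustrative sketch of the expansion, not Floeval's actual internals:

```python
def expand_samples(samples: list[dict]) -> list[dict]:
    """Expand each sample into one generation task per prompt_id (sketch)."""
    tasks = []
    for sample in samples:
        # Samples without prompt_ids fall back to a single default task.
        for pid in sample.get("prompt_ids", [None]):
            tasks.append({"user_input": sample["user_input"], "prompt_id": pid})
    return tasks


samples = [
    {"user_input": "Summarize this customer support ticket.",
     "prompt_ids": ["concise", "detailed", "formal"]},
    {"user_input": "Explain the refund policy.",
     "prompt_ids": ["concise", "detailed"]},
]
print(len(expand_samples(samples)))  # 5
```

Each task is then generated and scored independently, which is why the results contain one row per (sample, prompt_id) pair.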

Config

# config.yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"

Run

floeval evaluate -c config.yaml -d partial_dataset.json -o results.json

Or generate first and inspect before scoring:

floeval generate -c config.yaml -d partial_dataset.json -o generated.json
floeval evaluate -c config.yaml -d generated.json -o results.json

Prompt with RAG (vectorstore)

When prompts run against a vectorstore, the dataset carries a vectorstore_id and Floeval fetches context for each sample before generating responses.

llm_config:
  base_url: "https://gateway.example/openai/v1"
  api_key: "your-gateway-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"

evaluation_config:
  default_provider: "ragas"
  metrics:
    - "answer_relevancy"
    - "faithfulness"
    - "context_precision"
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: "gpt-4o-mini"

The dataset specifies vectorstore_id at the data level:

data:
  vectorstore_id: "my-vectorstore-id"

Note

When using a vectorstore, Floeval automatically fetches context for each sample and injects it during generation. Use RAG metrics such as faithfulness and context_precision in this case.


Prompt with pre-fetched context

If you have context in the dataset already (no vectorstore fetch needed), include it in each sample:

{
  "samples": [
    {
      "user_input": "What is RAG?",
      "contexts": ["RAG combines retrieval with generation."],
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}

Floeval skips the vectorstore fetch when contexts is present.
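The per-sample decision can be sketched as follows. Here `fetch_fn` is a hypothetical stand-in for the vectorstore retrieval call; the actual internals are Floeval's:

```python
def resolve_contexts(sample: dict, fetch_fn) -> list[str]:
    """Use pre-fetched contexts when present, else query the vectorstore (sketch)."""
    if sample.get("contexts"):
        return sample["contexts"]
    return fetch_fn(sample["user_input"])


# A sample with inline contexts never hits the vectorstore:
sample = {"user_input": "What is RAG?",
          "contexts": ["RAG combines retrieval with generation."]}
print(resolve_contexts(sample, fetch_fn=lambda q: ["<fetched>"]))
# ['RAG combines retrieval with generation.']
```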


Python API

Single prompt (inline)

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    system_prompt="You are a concise assistant. Keep answers under two sentences.",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {"user_input": "What is RAG?"},
        {"user_input": "How does vector search work?"},
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy"],
    dataset_generator_model="gpt-4o-mini",
)

results = evaluation.run()
print(results.aggregate_scores)

Multi-prompt with prompts file

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "Summarize this customer support ticket.",
            "prompt_ids": ["concise", "detailed"],
        }
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    default_provider="ragas",
    metrics=["answer_relevancy"],
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()

# One result row per (sample, prompt_id) combination
for row in results.sample_results:
    print(row["prompt_id"], row["metrics"])

Async execution

results = await evaluation.arun()

Metrics to use

Prompt (no RAG)

| Provider | Metric |
|---|---|
| ragas | answer_relevancy |
| deepeval | answer_relevancy |

Prompt (with RAG)

| Provider | Metric |
|---|---|
| ragas | answer_relevancy, faithfulness, context_precision, context_recall, context_entity_recall, noise_sensitivity |
| deepeval | answer_relevancy, faithfulness, contextual_precision, contextual_recall, contextual_relevancy |

Understanding multi-prompt results

When prompt_ids is used, each row in sample_results carries a prompt_id field. You can group by prompt_id to compare which prompt performed best:

from collections import defaultdict

by_prompt = defaultdict(list)
for row in results.sample_results:
    pid = row.get("prompt_id", "default")
    by_prompt[pid].append(row["metrics"].get("answer_relevancy", {}).get("score"))

for pid, scores in by_prompt.items():
    valid = [s for s in scores if s is not None]
    avg = sum(valid) / len(valid) if valid else None
    print(f"{pid}: avg answer_relevancy = {avg}")

Common pitfalls

| Problem | Cause | Fix |
|---|---|---|
| No responses generated | dataset_generator_model or generator_model is missing | Add dataset_generation_config.generator_model to the config |
| All rows have the same prompt_id | Samples are missing the prompt_ids field | Add prompt_ids to each sample |
| Prompt template not applied | prompts_file path is wrong or prompt IDs do not match | Check the file path and that all IDs in prompt_ids exist in the prompts file |
| RAG metrics fail | No context in samples and no vectorstore | Either add contexts to samples or configure a vectorstore_id |
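The third pitfall — prompt IDs that do not match the prompts file — is easy to catch before running an evaluation. A minimal pre-flight check (the helper name `check_prompt_ids` is hypothetical, not part of Floeval):

```python
def check_prompt_ids(samples: list[dict], prompts: dict) -> list[str]:
    """Return prompt_ids referenced by samples but missing from the prompts file."""
    known = set(prompts)
    missing = []
    for sample in samples:
        for pid in sample.get("prompt_ids", []):
            if pid not in known and pid not in missing:
                missing.append(pid)
    return missing


prompts = {"concise": "Answer in one short paragraph.",
           "detailed": "Answer in detail with at least three points."}
samples = [{"user_input": "Q?", "prompt_ids": ["concise", "formal"]}]
print(check_prompt_ids(samples, prompts))  # ['formal']
```

Running a check like this against your loaded prompts file before evaluation turns a silent "template not applied" failure into an explicit error.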