# Troubleshooting

Solutions to common installation, configuration, dataset, and runtime issues.
## Installation issues

### Python version is too old

Symptom:

- install errors mentioning `requires-python >=3.11`

Fix:

Upgrade to Python 3.11 or newer, then reinstall.
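To confirm which interpreter you are actually running, a quick stdlib check (run it with the same interpreter that pip uses):

```python
import sys

# floeval declares requires-python >=3.11; check the active interpreter.
ok = sys.version_info >= (3, 11)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      + ("OK" if ok else "too old, upgrade to 3.11+"))
```

A newer Python elsewhere on the system does not help if `pip` is bound to an older one, so prefer `python -m pip install ...` over a bare `pip`.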
### Package install fails
Try upgrading pip first with `python -m pip install --upgrade pip`, then retry the install. If you are installing from a local checkout of the repo, reinstall from the repo root.
## Config issues

### Authentication failed

Check that `llm_config.api_key` and `llm_config.base_url` are correct for your provider:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: "gpt-4o-mini"
  embedding_model: "text-embedding-3-small"
```
If you load credentials from environment variables, print or validate them before constructing `OpenAIProviderConfig`.
### Missing llm_config

Both `floeval evaluate` and `floeval generate` expect a top-level `llm_config` section.
### Partial dataset but no generator model

Symptom:

- `dataset_generator_model is required for partial datasets`
- `No generator_model in dataset_generation_config`

Fix:

Set `dataset_generation_config.generator_model`. You can also provide `evaluation_config.dataset_generator_model`, but `dataset_generation_config.generator_model` is the clearest option.
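A minimal sketch of the generation section (the model name is a placeholder; use any chat model your provider serves):

```yaml
dataset_generation_config:
  generator_model: "gpt-4o-mini"
```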
### Agent CLI mode missing agent_name

Symptom:

- partial agent evaluation fails with a message that `agent_name` is required

Fix:

Set `agent_name` in your config. This is required for CLI partial agent evaluation with `--agent`.
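As a sketch, assuming `agent_name` sits under `evaluation_config` (the exact placement may differ in your version, so check your config schema):

```yaml
evaluation_config:
  agent_name: "my-agent"  # assumed location; the name is a placeholder
```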
## Dataset issues

### faithfulness or retrieval metrics fail

Make sure your samples include the fields those metrics need:

- `contexts` for grounding and retrieval checks
- `ground_truth` for recall and precision style metrics, where applicable
Minimal example:
```json
{
  "samples": [
    {
      "user_input": "What is RAG?",
      "llm_response": "RAG stands for Retrieval-Augmented Generation.",
      "contexts": ["RAG combines retrieval with generation."],
      "ground_truth": "Retrieval-Augmented Generation"
    }
  ]
}
```
### Standard dataset missing llm_response

If a standard dataset sample has no `llm_response`, treat it as a partial dataset and add generation config.
### Agent dataset role errors

In saved agent trace datasets, message roles must be one of:

- `human`
- `ai`
- `tool`
### Invalid JSON

Validate the file before running Floeval:
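The standard library is enough for a quick syntax check; the path in the example comment is a placeholder:

```python
import json
import sys

def validate_json(path):
    """Parse a JSON file, reporting the exact location of any syntax error."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except json.JSONDecodeError as err:
        sys.exit(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")

# Example: doc = validate_json("dataset.json")
```

From the shell, `python -m json.tool dataset.json` performs the same syntax check.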
## Generate command issues

### Missing output path

`floeval generate` always requires an output path via `-o` or `--output`.
### Wrong output extension

`floeval generate` currently exports only:

- `.json`
- `.jsonl`
### Dataset already has llm_response

If the input dataset already contains responses, use `floeval evaluate` instead of `floeval generate`.
### Prompt expansion not happening

Check all of the following:

- samples include `prompt_ids`
- `evaluation_config.prompts_file` points to a real YAML or JSON prompt file
- prompt IDs used in the dataset exist in that prompt file
## Agent evaluation issues

### FloTorch import errors

Symptom:

- agent CLI or runner setup fails because FloTorch modules are missing

Fix:

Install the FloTorch dependencies into the same environment that runs Floeval, then retry.
### FloTorch credentials missing

Provide gateway credentials through either:

- `llm_config.base_url` and `llm_config.api_key`, or
- the `FLOTORCH_BASE_URL` and `FLOTORCH_API_KEY` environment variables
### tool_call_accuracy returns an error

This metric needs `reference_tool_calls` in each agent sample.

Example:
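An agent sample with the expected tool calls attached (the tool name and arguments are illustrative):

```json
{
  "user_input": "What is the order status?",
  "reference_tool_calls": [
    {"name": "get_order_status", "args": {"order_id": "12345"}}
  ]
}
```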
## Runtime issues

### Rate limits or slow runs

Try:

- using fewer samples while iterating
- using a smaller model such as `gpt-4o-mini`
- lowering generation fan-out with `dataset_generation_config.max_concurrency`
### Unexpectedly low scores
Check:
- whether the model is answering the intended question
- whether contexts are relevant and clean
- whether thresholds are too strict for your stage of development
- whether you selected the right provider and metric for the task
## Validation

### Verify the installation
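One way to confirm the package is importable by the current interpreter (this assumes the import name is `floeval`):

```python
import importlib.util

# find_spec returns None when the package is not installed for this interpreter.
spec = importlib.util.find_spec("floeval")
print("floeval is importable" if spec else "floeval not found for this interpreter")
```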
### Verify a standard dataset loads

```python
from floeval import DatasetLoader

dataset = DatasetLoader.from_file("dataset.json", partial_dataset=False)
print(len(dataset.samples))
```
### Verify an agent dataset loads

```python
from floeval.config.schemas.io.agent_dataset import AgentDataset

dataset = AgentDataset.from_file("agent_dataset.json")
print(len(dataset.samples))
print(dataset.is_partial)
```
### Smoke-test your own files

Replace `your_config.yaml` and `your_dataset.json` with real files from your project.
## Prompt evaluation issues

### No responses generated for prompt variants

Symptom:

- evaluation completes but `prompt_id` is missing from results
- only one result row per sample instead of one per prompt
Check all of the following:

- samples in the dataset include `prompt_ids` (a list of IDs, not a single string)
- `evaluation_config.prompts_file` is set and points to a real file
- prompt IDs in `prompt_ids` all exist in the prompts file
- `dataset_generation_config.generator_model` is set (required for partial datasets)
Minimal working dataset:
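A sketch of a partial dataset whose samples request two prompt variants (the IDs are placeholders and must exist in your prompts file):

```json
{
  "samples": [
    {
      "user_input": "What is RAG?",
      "prompt_ids": ["concise", "detailed"]
    }
  ]
}
```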
Minimal prompts file:
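The layout below is an assumption (a flat map from prompt ID to template); check the exact schema your version expects:

```yaml
concise: "Answer in one short sentence: {question}"
detailed: "Answer with full reasoning, citing the provided context: {question}"
```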
### RAG metrics fail in prompt evaluation

Prompt evaluation with RAG needs either:

- `contexts` in each sample, or
- a configured `vectorstore_id` (passed through `data.vectorstore_id` in the worker config)

If neither is present, use no-RAG metrics only (for example, `answer_relevancy`).
## Agentic workflow issues

### FloTorch extras required

Symptom:

- `ImportError` or a runtime error mentioning `google-adk` or FloTorch

Fix:

Install the optional agentic dependencies into the environment that runs Floeval, then retry.
### Agent node not found

Symptom:

- workflow execution fails with a message about the agent not being found

Check:

- `agentName` in the DAG node matches the name of an agent deployed in your FloTorch gateway
- `llm_config.base_url` points to the correct gateway
### DAG never reaches the end

Symptom:

- workflow hangs or times out

Check:

- there are no cycles in the `edges` list
- every non-START node has at least one incoming edge
- the DAG has exactly one `END` node
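The three checks above can be scripted with the standard library. The edge shape (`source`/`target` keys) is an assumption; adjust it to match your DAG config:

```python
from collections import defaultdict

def check_dag(nodes, edges):
    """Return a list of problems with a workflow DAG; an empty list means it passes.

    Assumes each node is a dict with 'id' and 'type', and each edge a dict
    with 'source' and 'target' (the edge shape is an assumption).
    """
    problems = []

    # The DAG has exactly one END node.
    ends = [n for n in nodes if n["type"] == "END"]
    if len(ends) != 1:
        problems.append(f"expected exactly one END node, found {len(ends)}")

    incoming = defaultdict(int)
    adjacency = defaultdict(list)
    for e in edges:
        incoming[e["target"]] += 1
        adjacency[e["source"]].append(e["target"])

    # Every non-START node has at least one incoming edge.
    for n in nodes:
        if n["type"] != "START" and incoming[n["id"]] == 0:
            problems.append(f"node {n['id']!r} has no incoming edge")

    # Kahn's algorithm: if not every node can be topologically ordered, there is a cycle.
    degree = {n["id"]: incoming[n["id"]] for n in nodes}
    queue = [nid for nid, d in degree.items() if d == 0]
    seen = 0
    while queue:
        nid = queue.pop()
        seen += 1
        for nxt in adjacency[nid]:
            degree[nxt] -= 1
            if degree[nxt] == 0:
                queue.append(nxt)
    if seen != len(nodes):
        problems.append("cycle detected in edges")

    return problems
```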
### Empty agent_traces in results

Symptom:

- `agent_traces` is empty or missing from sample results

Check that the DAG config has at least one AGENT type node:
```yaml
nodes:
  - {id: "start", type: "START", label: "Start"}
  - {id: "agent_a", type: "AGENT", label: "Agent A", agentName: "my-agent:latest"}
  - {id: "end", type: "END", label: "End"}
```
A DAG with only START and END nodes will produce no agent traces.
### tool_call_accuracy is null for all workflow samples

This metric silently skips samples that have no `reference_tool_calls`. Add expected tool calls to your dataset:
```json
{
  "user_input": "What is the order status?",
  "reference_outcome": "The order is shipped.",
  "reference_tool_calls": [
    {"name": "get_order_status", "args": {"order_id": "12345"}}
  ]
}
```