# Evaluation Metrics

Prompt Amplifier includes a comprehensive evaluation module for measuring prompt quality and retrieval accuracy, and for comparing different configurations.
## Prompt Quality Metrics

Measure the quality of expanded prompts with `calculate_expansion_quality`:
```python
from prompt_amplifier.evaluation import calculate_expansion_quality

original = "Summarize the sales data"

expanded = """**GOAL:** Generate a comprehensive summary of sales data.

**SECTIONS:**
1. Executive Overview
2. Key Performance Indicators
3. Trend Analysis
4. Recommendations

**INSTRUCTIONS:**
- Include quarterly comparisons
- Highlight top-performing products
- Format numbers with proper currency symbols
"""

metrics = calculate_expansion_quality(original, expanded)

print(f"Expansion Ratio: {metrics.expansion_ratio:.1f}x")
print(f"Structure Score: {metrics.structure_score:.2f}")
print(f"Specificity Score: {metrics.specificity_score:.2f}")
print(f"Completeness Score: {metrics.completeness_score:.2f}")
print(f"Readability Score: {metrics.readability_score:.2f}")
print(f"Overall Score: {metrics.overall_score:.2f}")
```
### What Each Metric Measures
| Metric | Description | Range |
|---|---|---|
| Expansion Ratio | Length increase over the original prompt | 1.0x and up |
| Structure Score | Headers, bullets, numbered lists | 0.0 - 1.0 |
| Specificity Score | Action verbs, constraints, examples | 0.0 - 1.0 |
| Completeness Score | Goal, sections, instructions present | 0.0 - 1.0 |
| Readability Score | Sentence length appropriateness | 0.0 - 1.0 |
| Overall Score | Weighted combination | 0.0 - 1.0 |
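For intuition, the overall score can be read as a weighted blend of the four sub-scores. The sketch below is purely illustrative; the weights (and the exact formula the library uses) are assumptions, so rely on the values in the returned metrics object rather than on this calculation.

```python
# Purely illustrative: hypothetical weights, not the library's actual formula.
def overall_score_sketch(structure, specificity, completeness, readability):
    weights = {"structure": 0.3, "specificity": 0.3,
               "completeness": 0.25, "readability": 0.15}
    return (weights["structure"] * structure
            + weights["specificity"] * specificity
            + weights["completeness"] * completeness
            + weights["readability"] * readability)

print(overall_score_sketch(0.9, 0.8, 1.0, 0.7))  # weighted blend of the four sub-scores
```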
## Retrieval Metrics

Evaluate retrieval quality with standard information-retrieval (IR) metrics:
```python
from prompt_amplifier.evaluation import calculate_retrieval_metrics

# Similarity scores from retrieval, in ranked order
scores = [0.92, 0.85, 0.71, 0.58, 0.42]

# Ground truth: indices 0, 1, 4 are truly relevant
relevant = [0, 1, 4]

metrics = calculate_retrieval_metrics(
    retrieved_scores=scores,
    relevant_indices=relevant,
    k=5,
)

print(f"Precision@5: {metrics.precision_at_k:.2f}")
print(f"Recall@5: {metrics.recall_at_k:.2f}")
print(f"MRR: {metrics.mrr:.2f}")
print(f"NDCG: {metrics.ndcg:.2f}")
print(f"Average Score: {metrics.average_score:.2f}")
```
### Retrieval Metrics Explained
| Metric | Description |
|---|---|
| Precision@k | Fraction of retrieved docs that are relevant |
| Recall@k | Fraction of relevant docs that were retrieved |
| MRR | Mean Reciprocal Rank: the reciprocal of the rank of the first relevant result |
| NDCG | Normalized Discounted Cumulative Gain |
| Average Score | Mean similarity score of retrieved docs |
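Applied to the example above (five retrieved documents in ranked order, with indices 0, 1, and 4 relevant), the standard definitions give Precision@5 = 3/5 = 0.60, Recall@5 = 3/3 = 1.00, and MRR = 1.00 because the top-ranked document is already relevant. A quick sketch of those standard definitions (not the library's internals):

```python
# Standard IR definitions, worked on the example above (not the library's code).
retrieved = [0, 1, 2, 3, 4]          # document indices in ranked order
relevant = {0, 1, 4}                 # ground-truth relevant indices

hits = [idx for idx in retrieved if idx in relevant]
precision_at_5 = len(hits) / len(retrieved)    # 3/5 = 0.60
recall_at_5 = len(hits) / len(relevant)        # 3/3 = 1.00
first_hit_rank = next(r for r, idx in enumerate(retrieved, 1) if idx in relevant)
mrr = 1 / first_hit_rank                       # first hit at rank 1 -> 1.00
```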
## Diversity Score

Measure how diverse your retrieved results are:
```python
from prompt_amplifier.evaluation import calculate_diversity_score

# Embeddings of the retrieved documents
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

diversity = calculate_diversity_score(embeddings)
print(f"Diversity: {diversity:.2f}")  # higher = more diverse results
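```

The exact formula isn't documented here, but a common way to define this kind of score is one minus the mean pairwise cosine similarity of the embeddings. The sketch below illustrates that assumed definition, not the package's internals:

```python
# Assumed definition for illustration: 1 - mean pairwise cosine similarity.
import numpy as np

def diversity_sketch(embeddings):
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize each row
    sims = X @ X.T                                      # pairwise cosine similarities
    n = len(X)
    mean_offdiag = (sims.sum() - n) / (n * (n - 1))     # drop the self-similarity diagonal
    return 1.0 - mean_offdiag
```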
## Coherence Score

Measure how well the expanded prompt uses the context:
```python
from prompt_amplifier.evaluation import calculate_coherence_score

prompt = "Analyze quarterly sales for North America region..."
context_chunks = [
    "Q1 sales in North America reached $1.2M",
    "North American market shows 15% growth",
]

coherence = calculate_coherence_score(prompt, context_chunks)
print(f"Coherence: {coherence:.2f}")  # higher = prompt better incorporates the context
```
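Again, the exact formula isn't spelled out above; one simple proxy is how much of each context chunk's vocabulary reappears in the expanded prompt. The sketch below shows that assumed, purely lexical version for intuition only:

```python
# Assumed lexical proxy for coherence, for intuition only.
def coherence_sketch(prompt, context_chunks):
    prompt_terms = set(prompt.lower().split())
    overlaps = []
    for chunk in context_chunks:
        chunk_terms = set(chunk.lower().split())
        overlaps.append(len(chunk_terms & prompt_terms) / len(chunk_terms))
    return sum(overlaps) / len(overlaps)   # average overlap across chunks
```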
## Compare Embedders

Benchmark different embedders on your data:
```python
from prompt_amplifier.evaluation import compare_embedders
from prompt_amplifier.embedders import (
    TFIDFEmbedder,
    SentenceTransformerEmbedder,
)

texts = [
    "Machine learning fundamentals",
    "Deep neural networks",
    "Natural language processing",
    "Computer vision applications",
]

queries = [
    "How does NLP work?",
    "Explain deep learning",
]

results = compare_embedders(
    texts=texts,
    queries=queries,
    embedders=[TFIDFEmbedder(), SentenceTransformerEmbedder()],
    embedder_names=["TF-IDF", "Sentence Transformers"],
)

for name, data in results.items():
    print(f"\n{name}:")
    print(f"  Dimension: {data['dimension']}")
    print(f"  Embedding time: {data['embedding_time_ms']:.1f}ms")
    print(f"  Query time: {data['query_time_ms']:.1f}ms")
    print(f"  Avg scores: {data['avg_query_scores']}")
```
### Example Output
```
TF-IDF:
  Dimension: 4
  Embedding time: 2.3ms
  Query time: 0.1ms
  Avg scores: [0.15, 0.22]

Sentence Transformers:
  Dimension: 384
  Embedding time: 125.4ms
  Query time: 12.3ms
  Avg scores: [0.78, 0.85]
```
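On top of these results you can pick a winner programmatically, for example by mean query score. This sketch only uses the result fields shown above; fold in `embedding_time_ms` or `query_time_ms` if latency matters more to you than relevance:

```python
# Pick the embedder with the highest mean query score (sketch; trade this off
# against embedding_time_ms / query_time_ms if latency matters).
best_name, best_data = max(
    results.items(),
    key=lambda kv: sum(kv[1]["avg_query_scores"]) / len(kv[1]["avg_query_scores"]),
)
print(f"Best by relevance: {best_name}")
```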
## Benchmark Generators

Compare different LLMs for prompt expansion:
```python
from prompt_amplifier.evaluation import benchmark_generators
from prompt_amplifier.generators import (
    OpenAIGenerator,
    AnthropicGenerator,
    GoogleGenerator,
)

results = benchmark_generators(
    prompt="Summarize Q4 performance",
    context="Q4 revenue was $5.2M with 23% growth...",
    generators=[
        OpenAIGenerator(),
        AnthropicGenerator(),
        GoogleGenerator(),
    ],
    generator_names=["GPT-4", "Claude", "Gemini"],
    num_runs=3,  # average over multiple runs
)

for name, data in results.items():
    print(f"\n{name}:")
    print(f"  Avg time: {data['avg_time_ms']:.0f}ms")
    print(f"  Avg expansion: {data['avg_expansion_ratio']:.1f}x")
    print(f"  Avg quality: {data['avg_quality_score']:.2f}")
```
## Evaluation Suite

Run comprehensive evaluations with the `EvaluationSuite`:
```python
from prompt_amplifier import PromptForge
from prompt_amplifier.evaluation import EvaluationSuite

# Set up PromptForge with your data
forge = PromptForge()
forge.add_texts([
    "POC Health: Healthy means all milestones on track.",
    "Winscore ranges from 0-100, higher is better.",
    "Feature fit percentage indicates product match.",
])

# Create the evaluation suite
suite = EvaluationSuite()

# Add test cases
suite.add_test_case(
    name="Deal Status",
    prompt="How's the deal going?",
    expected_keywords=["POC", "health", "milestone"],
)
suite.add_test_case(
    name="Metrics Check",
    prompt="What metrics should I track?",
    expected_keywords=["Winscore", "feature fit"],
)
suite.add_test_case(
    name="Product Fit",
    prompt="Is our product a good fit?",
    expected_keywords=["feature", "fit", "percentage"],
)

# Run all tests
results = suite.run(forge)

# Print a formatted report
suite.print_report(results)
```
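The keyword percentage in the report below is presumably the share of each test case's `expected_keywords` that show up in the expanded prompt. A rough, case-insensitive sketch of that kind of check (assumed, not the suite's exact logic):

```python
# Rough sketch of a keyword-coverage check (assumed, not the suite's exact logic).
def keyword_coverage(expanded_prompt, expected_keywords):
    text = expanded_prompt.lower()
    found = [kw for kw in expected_keywords if kw.lower() in text]
    return len(found) / len(expected_keywords)   # e.g. 2 of 3 keywords -> 0.67
```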
### Sample Report Output
```
======================================================================
EVALUATION REPORT
======================================================================

📝 Test: Deal Status
   Prompt: How's the deal going?...
   ✅ Success
   ⏱️ Time: 1523ms
   📊 Expansion: 8.2x
   🎯 Quality: 0.75
   🔑 Keywords: 100%

📝 Test: Metrics Check
   Prompt: What metrics should I track?...
   ✅ Success
   ⏱️ Time: 1456ms
   📊 Expansion: 7.5x
   🎯 Quality: 0.82
   🔑 Keywords: 100%

📝 Test: Product Fit
   Prompt: Is our product a good fit?...
   ✅ Success
   ⏱️ Time: 1389ms
   📊 Expansion: 6.9x
   🎯 Quality: 0.71
   🔑 Keywords: 67%

======================================================================
Summary: 3/3 tests passed
Average Quality: 0.76
Average Expansion: 7.5x
```
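If you would rather gate a CI job on these numbers than read the formatted report, the per-test results can be checked directly. This sketch uses the result fields referenced in the snippets further down; the 0.6 threshold is just an example:

```python
# Fail the run if any test case misses an (arbitrary) quality floor.
MIN_QUALITY = 0.6
for r in results:
    assert r["success"], "prompt expansion failed"
    assert r["quality_metrics"]["overall_score"] >= MIN_QUALITY, "quality below threshold"
```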
## CLI Evaluation

Run evaluations from the command line:
```bash
# Compare embedders
prompt-amplifier compare-embedders --docs ./docs/

# Run the evaluation suite
prompt-amplifier evaluate --docs ./docs/ --prompts "How's the deal?" "Check metrics"
```
## Best Practices

### 1. Create Representative Test Cases
```python
suite.add_test_case(
    name="Edge case: Very short",
    prompt="Hi",
)
suite.add_test_case(
    name="Edge case: Technical query",
    prompt="What's the MTTR for critical incidents?",
)
```
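Note that `expected_keywords` is omitted in the first case above, so it appears to be optional; such cases presumably still report timing, expansion, and quality, just without a keyword check.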
### 2. Track Metrics Over Time
```python
import json
from datetime import datetime

results = suite.run(forge)

# Average quality over the test cases that succeeded
successful = [r for r in results if r["success"]]
metrics = {
    "timestamp": datetime.now().isoformat(),
    "version": "0.2.0",
    "avg_quality": sum(r["quality_metrics"]["overall_score"] for r in successful)
    / len(successful),
}

with open("metrics_history.jsonl", "a") as f:
    f.write(json.dumps(metrics) + "\n")
```
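You can then read the history back to spot regressions between runs; a minimal sketch against the JSONL file written above (the 0.05 tolerance is arbitrary):

```python
# Compare the latest run against the previous one (tolerance is arbitrary).
with open("metrics_history.jsonl") as f:
    history = [json.loads(line) for line in f]

if len(history) >= 2 and history[-1]["avg_quality"] < history[-2]["avg_quality"] - 0.05:
    print("Warning: average quality dropped versus the previous run")
```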
### 3. A/B Test Configurations
```python
configs = [
    {"embedder": TFIDFEmbedder(), "name": "TF-IDF"},
    {"embedder": SentenceTransformerEmbedder(), "name": "ST"},
]

for config in configs:
    forge = PromptForge(embedder=config["embedder"])
    forge.add_texts(texts)
    results = suite.run(forge)
    # Compare average quality across all test cases, not just the first one
    avg_quality = sum(r["quality_metrics"]["overall_score"] for r in results) / len(results)
    print(f"{config['name']}: {avg_quality:.2f}")
```