Performance Evaluation

Measure extraction quality by comparing results against human-labeled data. Get precision, recall, and F1 scores for each field in your schema.

What is Performance Evaluation?

estimate_performance() runs your extraction pipeline on sample data and compares the results to expected outputs. It calculates field-level metrics to show you where your extraction is accurate and where it needs improvement.

Warning: This function makes API calls and will incur costs based on the sample size you specify.

Data Requirements

You need two datasets:

  1. Input data: The raw text data (CSV, parquet, DataFrame, directory of files, etc.)
  2. Expected output data: A DataFrame with human-labeled extraction results

Both datasets must have a matching ID column to link input records to their expected outputs.

Matching ID for Different Input Types

  • DataFrame/CSV/Parquet: Use any existing ID column in your data (e.g., id, report_id, doc_id)
  • Directory of files (PDFs, text files, etc.): Use delm_file_name - this column is automatically created with the filename for each file loaded
  • Single file: Not typically used for evaluation (since you'd only have one record)
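
For example, a minimal matched pair of datasets might look like the sketch below (the column names id, text, and expected_extraction are only illustrative; any ID column shared by both datasets works):

import pandas as pd

# Input data: raw text plus an ID column
input_df = pd.DataFrame({
    "id": [1, 2],
    "text": ["Oil traded at $75.", "Gold reached $1950."],
})

# Expected output data: the same IDs plus the human-labeled extractions
expected_df = pd.DataFrame({
    "id": [1, 2],
    "expected_extraction": [
        {"price": 75.0, "currency": "USD", "commodity": "oil"},
        {"price": 1950.0, "currency": "USD", "commodity": "gold"},
    ],
})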

Expected Output Format

Your expected output DataFrame needs:

  • Matching ID column: Links to input data
      • For DataFrames/CSV: any ID column (e.g., id, report_id)
      • For file directories: a column with filenames that will match delm_file_name
  • Expected results column: Contains the correct extraction as a dict or JSON string

The expected results must match your schema structure:

# For a simple schema
expected_dict = {
    "price": 75.0,
    "currency": "USD",
    "commodity": "oil"
}

# For a nested schema
expected_dict = {
    "prices": [
        {"price": 75.0, "currency": "USD"},
        {"price": 80.0, "currency": "EUR"}
    ]
}
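
If you keep your labels in a CSV (as in Example 1 below), the expected-results column typically holds these dicts serialized as JSON strings. Here is a minimal sketch of writing such a file using only pandas and the standard library (not DELM-specific):

import json
import pandas as pd

# Serialize each labeled extraction to a JSON string and write the labels file
labels = pd.DataFrame({
    "id": [1, 2],
    "expected_extraction": [
        json.dumps({"price": 75.0, "currency": "USD", "commodity": "oil"}),
        json.dumps({"price": 80.0, "currency": "EUR", "commodity": "gold"}),
    ],
})
labels.to_csv("data/human_labels.csv", index=False)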

Complete Examples

Example 1: DataFrame Input

import pandas as pd
from delm import DELM, Schema, ExtractionVariable
from delm.utils.performance_estimation import estimate_performance

# 1. Load input data (raw text)
input_df = pd.read_csv("data/raw_texts.csv")
# Columns: id, text

# 2. Load expected output data (human labels)
expected_df = pd.read_csv("data/human_labels.csv")
# Columns: id, expected_extraction (JSON string per row)

# 3. Define your schema
schema = Schema.simple([
    ExtractionVariable(
        name="price",
        description="Price value mentioned",
        data_type="number"
    ),
    ExtractionVariable(
        name="currency",
        description="Currency (USD, EUR, etc.)",
        data_type="string"
    ),
    ExtractionVariable(
        name="commodity",
        description="Commodity type (oil, gold, etc.)",
        data_type="string"
    )
])

# 4. Create DELM configuration
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    temperature=0.0
)

# 5. Run performance evaluation
metrics, comparison_df = estimate_performance(
    config=delm,  # Can also pass DELMConfig, dict, or YAML path
    data_source=input_df,
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="id",
    record_sample_size=50  # Process 50 records (-1 for all)
)

# 6. Display results
print(f"{'Field':<20} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 52)
for field, m in metrics.items():
    print(f"{field:<20} {m['precision']:10.3f} {m['recall']:10.3f} {m['f1']:10.3f}")
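
To summarize quality across fields, you can aggregate the per-field scores yourself. A small sketch, assuming the metrics dictionary shape documented under "Metrics Dictionary Structure" below:

# Macro-average F1 across all schema fields
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
print(f"Macro-average F1: {macro_f1:.3f}")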

Example 2: Directory of PDFs

import pandas as pd
from delm import DELM, Schema, ExtractionVariable
from delm.utils.performance_estimation import estimate_performance

# 1. Prepare expected output data with filenames
# Name the column 'delm_file_name' so it matches the matching_id_column below
expected_df = pd.DataFrame({
    'delm_file_name': ['report1.pdf', 'report2.pdf', 'report3.pdf'],
    'expected_extraction': [
        {"price": 75.0, "currency": "USD", "commodity": "oil"},
        {"price": 1950.0, "currency": "USD", "commodity": "gold"},
        {"price": 3.50, "currency": "USD", "commodity": "gas"}
    ]
})

# 2. Define schema and DELM config
schema = Schema.simple([
    ExtractionVariable(name="price", description="Price value", data_type="number"),
    ExtractionVariable(name="currency", description="Currency", data_type="string"),
    ExtractionVariable(name="commodity", description="Commodity type", data_type="string")
])

delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini", temperature=0.0)

# 3. Run evaluation on directory
# The system creates a 'delm_file_name' column automatically for each PDF
metrics, comparison_df = estimate_performance(
    config=delm,
    data_source="data/pdfs/",  # Directory of PDF files
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="delm_file_name",  # Use the auto-generated filename column
    record_sample_size=-1  # Process all files
)

# 4. Display results
for field, m in metrics.items():
    print(f"{field:<20} {m['precision']:10.3f} {m['recall']:10.3f} {m['f1']:10.3f}")

Important: When using a directory of files, your expected_extraction_output_df must have a column that matches the filenames in the directory. Use delm_file_name as the matching_id_column.
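
One way to build the filename column is to list the directory itself, so the values are guaranteed to line up with the auto-generated delm_file_name values. A hedged sketch: it assumes delm_file_name holds the filename with its extension (as in the example above), and labels_in_same_order is a placeholder for your own list of label dicts:

from pathlib import Path
import pandas as pd

# List the PDF filenames on disk (with extension) in a stable order
pdf_names = sorted(p.name for p in Path("data/pdfs/").glob("*.pdf"))

# Attach your labels in the same order as the filenames
expected_df = pd.DataFrame({
    "delm_file_name": pdf_names,
    "expected_extraction": labels_in_same_order,  # placeholder: one label dict per file
})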

Output Example

Field                Precision     Recall         F1
----------------------------------------------------
price                    0.938      0.900      0.918
currency                 0.875      0.933      0.903
commodity                1.000      0.850      0.919

Understanding Metrics

Precision: Of the items your pipeline extracted, what percentage were correct?

  • Formula: TP / (TP + FP)
  • High precision = few false positives (the pipeline rarely extracts things that should not be there)

Recall: Of the items that should have been extracted, what percentage did your pipeline find?

  • Formula: TP / (TP + FN)
  • High recall = few false negatives (the pipeline rarely misses things that should be extracted)

F1 Score: The harmonic mean of precision and recall

  • Formula: 2 × (Precision × Recall) / (Precision + Recall)
  • A balanced measure of overall extraction quality
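
For reference, here are the same formulas in code (a standalone sketch, not part of the DELM API):

def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(prf1(tp=45, fp=3, fn=5))  # ≈ (0.938, 0.900, 0.918)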

Analyzing Results

Inspect the Comparison DataFrame

The comparison_df returned by estimate_performance() contains the expected vs. extracted results for each record:

# View columns
print(comparison_df.columns)
# Output: ['id', 'expected_dict', 'extracted_dict']

# Examine first few results
print(comparison_df.head())

# Find discrepancies
for idx, row in comparison_df.iterrows():
    if row['expected_dict'] != row['extracted_dict']:
        print(f"Record {row['id']}:")
        print(f"  Expected: {row['expected_dict']}")
        print(f"  Extracted: {row['extracted_dict']}")
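
To dig into errors more systematically, you can filter and export only the records that differ, using the column names shown above:

# Keep only the records where expected and extracted results disagree
mismatch_mask = comparison_df["expected_dict"] != comparison_df["extracted_dict"]
print(f"{mismatch_mask.sum()} of {len(comparison_df)} records differ")
comparison_df[mismatch_mask].to_csv("mismatches.csv", index=False)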

Metrics Dictionary Structure

Each field in your schema has its own metrics:

# Example metrics structure
{
    "price": {
        "precision": 0.938,  # = 45 / (45 + 3)
        "recall": 0.90,      # = 45 / (45 + 5)
        "f1": 0.918,
        "tp": 45,   # True positives
        "fp": 3,    # False positives
        "fn": 5     # False negatives
    },
    "currency": {
        "precision": 0.875,
        "recall": 0.933,
        "f1": 0.903,
        "tp": 42,
        "fp": 6,
        "fn": 3
    }
}
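
Because this is a plain dict of dicts, it converts directly to a pandas DataFrame for sorting, rounding, or saving:

import pandas as pd

# Rows = schema fields, columns = metric names
metrics_df = pd.DataFrame(metrics).T
print(metrics_df[["precision", "recall", "f1"]].round(3))
metrics_df.to_csv("field_metrics.csv")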