
Performance Evaluation

Evaluate extraction quality against ground truth data.

estimate_performance()

Calculate precision, recall, and F1 scores for extraction results.

from delm.utils.performance_estimation import estimate_performance

estimate_performance(
    config: DELM | DELMConfig | str | Path,
    data_source: str | Path | pd.DataFrame,
    expected_extraction_output_df: pd.DataFrame,
    true_json_column: str,
    matching_id_column: str,
    record_sample_size: int = -1,
    save_file_log: bool = False,
    log_dir: str | Path | None = ".delm/logs/performance_estimation",
    console_log_level: str = "INFO",
    file_log_level: str = "DEBUG"
) -> tuple[dict, pd.DataFrame]

Parameters:

- config: DELM instance, DELMConfig, or path to a config YAML file (YAML usage sketched below)
- data_source: Input data (path to a file or directory, or a pandas DataFrame)
- expected_extraction_output_df: DataFrame containing the ground truth extractions
- true_json_column: Column in expected_extraction_output_df containing the ground truth JSON
- matching_id_column: Column used to match records (e.g., "id" or "delm_file_name")
- record_sample_size: Number of records to evaluate (-1 = all)
- save_file_log, log_dir, console_log_level, file_log_level: Logging settings
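
Passing a config YAML file instead of a live DELM instance looks like this (a minimal sketch; "delm_config.yaml" and "data.csv" are placeholder paths, and expected_df is the ground truth DataFrame from the example further down):

from delm.utils.performance_estimation import estimate_performance

metrics, comparison_df = estimate_performance(
    config="delm_config.yaml",   # path to a DELM config YAML (placeholder name)
    data_source="data.csv",      # placeholder input file
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="id",
)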

Returns: Tuple of:

  1. Metrics dictionary - field-level metrics:

{
    "price": {
        "precision": 0.95,
        "recall": 0.90,
        "f1": 0.92,
        "tp": 38, "fp": 2, "fn": 4
    },
    "company": {...}
}

  2. Comparison DataFrame - row-by-row comparisons with columns [matching_id_column, "expected_dict", "extracted_dict"]

Warning: Makes real API calls (costs apply).
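
Because every evaluated record triggers a real extraction call, it can be worth evaluating a small sample first and enabling file logging for debugging. A sketch reusing the delm instance and data from the example below; the sample size is arbitrary:

metrics, comparison = estimate_performance(
    config=delm,
    data_source="data.csv",
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="id",
    record_sample_size=10,   # evaluate only 10 records before a full (-1) run
    save_file_log=True,      # write a debug log to the default log_dir
)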

Example

from delm import DELM, Schema, ExtractionVariable
from delm.utils.performance_estimation import estimate_performance
import pandas as pd

# Prepare ground truth data
expected_df = pd.DataFrame({
    "id": [1, 2, 3],
    "expected_extraction": [
        {"price": 10.5, "company": "Apple"},
        {"price": 20.0, "company": "Microsoft"},
        {"price": 15.0, "company": "Google"}
    ]
})

# Configure the DELM instance (schema is a Schema built from your
# ExtractionVariable definitions; see the Schema documentation)
delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini")

# Run performance estimation
metrics, comparison = estimate_performance(
    config=delm,
    data_source="data.csv",
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="id"
)

# Analyze results
for field, scores in metrics.items():
    print(f"{field}: Precision={scores['precision']:.2f}, Recall={scores['recall']:.2f}")

# Inspect individual mismatches
mismatches = comparison[comparison["extracted_dict"] != comparison["expected_dict"]]
print(mismatches)
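
Because each field's entry also carries raw tp/fp/fn counts, the per-field dictionary can be flattened and micro-averaged with plain pandas (nothing DELM-specific in this sketch):

# Rows = fields, columns = precision / recall / f1 / tp / fp / fn
metrics_df = pd.DataFrame(metrics).T
print(metrics_df.sort_values("f1"))

# Micro-averaged overall scores from the pooled counts
tp, fp, fn = metrics_df["tp"].sum(), metrics_df["fp"].sum(), metrics_df["fn"].sum()
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"Overall: precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")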

Matching Records

For DataFrames with IDs

Use any existing ID column:

matching_id_column="record_id"
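
For example, with a DataFrame data_source the ID column just needs to appear in both the input data and the ground truth. A sketch (the column names and the "text" column are illustrative; which text column DELM reads from depends on your config):

input_df = pd.DataFrame({
    "record_id": [101, 102],
    "text": ["Apple shares rose to $10.50.", "Microsoft traded at $20.00."],
})

expected_df = pd.DataFrame({
    "record_id": [101, 102],
    "expected_extraction": [
        {"price": 10.5, "company": "Apple"},
        {"price": 20.0, "company": "Microsoft"},
    ],
})

metrics, comparison = estimate_performance(
    config=delm,
    data_source=input_df,
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="record_id",
)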

For Directories of Files

Use delm_file_name (automatically generated):

expected_df = pd.DataFrame({
    "delm_file_name": ["doc1.pdf", "doc2.pdf"],
    "expected_extraction": [...]
})

estimate_performance(
    config=delm,
    data_source="pdfs/",  # Directory
    expected_extraction_output_df=expected_df,
    true_json_column="expected_extraction",
    matching_id_column="delm_file_name"  # Match by filename
)
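
Since records are matched by filename, a quick check that the delm_file_name values line up with the files on disk helps catch typos before spending on API calls (plain pathlib/pandas sketch, using the "pdfs/" directory from above):

from pathlib import Path

files_on_disk = {p.name for p in Path("pdfs/").glob("*.pdf")}
files_in_truth = set(expected_df["delm_file_name"])

print("Files with no ground truth row:", files_on_disk - files_in_truth)
print("Ground truth rows with no file:", files_in_truth - files_on_disk)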