# DELM
Data Extraction with Language Models – A Python toolkit for extracting structured data from unstructured text using LLMs.
## Why DELM?
Extracting structured data from documents at scale is harder than it should be. You need consistent prompts, validation logic, retry handling, cost tracking, and robust file processing—before you even get to your actual research questions.
DELM provides the infrastructure layer so you can focus on defining what to extract, not how to extract it:
- **Declare your schema, not your prompts** – Specify fields with types, validation rules, and descriptions. DELM generates prompts, validates outputs, and handles malformed responses.
- **Test before you spend** – Estimate costs on sample data, set hard budget limits, and automatically cache results to avoid paying for the same extraction twice.
- **Scale without breaking** – Process 100K+ documents with automatic checkpointing, concurrent batching, and built-in text preprocessing (splitting, relevance filtering).
- **Model independence** – Switch between OpenAI, Anthropic, Google, or any provider Instructor supports without rewriting code.
- **Measure quality** – Built-in precision/recall evaluation against ground truth, with field-level metrics for debugging.
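To make the last point concrete, field-level precision/recall compares each extracted field against ground truth record by record. The sketch below illustrates the idea in plain Python; the function name and matching logic are illustrative, not DELM's actual evaluation API.

```python
# Conceptual sketch of field-level precision/recall, the kind of
# per-field metric DELM's evaluation reports. Illustrative only —
# this is not DELM's implementation.

def field_metrics(predicted: list[dict], truth: list[dict], field: str) -> dict:
    """Score one extracted field against ground truth, record by record."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted, truth):
        p, g = pred.get(field), gold.get(field)
        if p is not None and p == g:
            tp += 1  # extracted and correct
        elif p is not None:
            fp += 1  # extracted but wrong or spurious
        elif g is not None:
            fn += 1  # present in the truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

A per-field breakdown like this makes it obvious whether, say, `company` extracts cleanly while `price` is the field dragging quality down.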
## Quick Example
```python
from delm import DELM, Schema, ExtractionVariable

# Define what to extract
schema = Schema.simple(
    ExtractionVariable("company", "Company name", "string"),
    ExtractionVariable("price", "Stock price", "number"),
)

# Configure extraction
delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
)

# Extract from data
results = delm.extract("financial_reports.csv")
```
## Getting Started
→ Installation & First Extraction
Install DELM, set up API keys, and run your first extraction in under 5 minutes.
## Documentation
### User Guide
Core concepts and common workflows:
- Defining Schemas – Simple, nested, and multiple extraction structures
- Customizing Prompts – Control prompt templates and system messages
- Loading Data – Supported file formats and input methods
- Preprocessing Text – Splitting and relevance scoring strategies
- Cost Management – Estimate, track, and limit API costs
- Caching – Reduce costs with automatic result caching
- Evaluation – Measure extraction quality with precision/recall
- Output Data – Understanding and transforming results
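As a taste of what the preprocessing guide covers: a splitting strategy breaks long documents into chunks that fit an LLM call, preferring natural boundaries. The sketch below shows the general idea in plain Python, splitting on paragraph breaks up to a character budget; it is a conceptual illustration, not one of DELM's shipped strategies.

```python
# Conceptual illustration of a splitting strategy: chunk text at
# paragraph boundaries, keeping each chunk under max_chars.
# Not DELM's implementation — see the Preprocessing Text guide.

def split_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Pairing a splitter with a relevance scorer means only chunks likely to contain your target fields are sent to the model, which cuts cost on long documents.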
### Advanced Topics
Power user features for large-scale deployments:
- Large Jobs & Checkpointing – Robust extraction for 100K+ records
- Configuration Files – YAML-based configuration for reproducibility
- Logging & Debugging – Control logging output and verbosity
- Two-Stage Processing – Separate preprocessing from extraction
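A YAML configuration captures an entire run so it can be versioned and reproduced. The fragment below is a hypothetical layout whose key names are guesses for illustration only; see the Configuration Files guide for the actual format.

```yaml
# Hypothetical layout — key names are illustrative guesses;
# consult the Configuration Files guide for the real options.
provider: openai
model: gpt-4o-mini
schema:
  variables:
    - name: company
      description: Company name
      type: string
    - name: price
      description: Stock price
      type: number
```

Checking a file like this into version control alongside your analysis code makes an extraction run reproducible by anyone with an API key.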
### API Reference
Complete technical documentation:
- DELM – Main pipeline class
- Schema – Schema factory methods
- ExtractionVariable – Field definitions
- Cost Estimation – Cost utilities
- Performance Evaluation – Evaluation metrics
- Post-Processing – Result transformation
- Splitting Strategies – Text chunking
- Relevance Scorers – Relevance scoring
- System Constants – Column names and defaults
## Support
- GitHub: Center-for-Applied-AI/delm
- Issues: Report bugs or request features on GitHub
- PyPI: pypi.org/project/delm