Configuration Files¶
Use YAML configuration files to define reusable, version-controlled extraction pipelines.
When to Use Config Files¶
**Primary API:** We recommend the Python API (`DELM()`) for most use cases; it's more intuitive and has better IDE support.
Use YAML configs when:

- You want to version-control extraction configurations alongside code
- You need to share configs across different scripts or team members
- You're running experiments with many configuration variations
- You want to separate configuration from code logic
Python API (Recommended)¶
```python
from delm import DELM, Schema, ExtractionVariable

schema = Schema.simple([
    ExtractionVariable(name="price", description="Price value", data_type="number")
])

delm = DELM(
    schema=schema,
    provider="openai",
    model="gpt-4o-mini",
    temperature=0.0,
)

results = delm.extract("data.csv")
```
YAML Config API¶
Loading from YAML¶
```python
from delm import DELM

# Load config from YAML
delm = DELM.from_config("config.yaml")
results = delm.extract("data.csv")
```
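Because configs are plain files, sweeping over many configuration variations becomes a simple loop. A minimal sketch (the experiment file names here are hypothetical, and saving assumes `extract` returns a pandas DataFrame):

```python
from delm import DELM

# Hypothetical configs that differ only in model, scorer, etc.
for config_path in ["experiments/base.yaml", "experiments/fuzzy_scorer.yaml"]:
    delm = DELM.from_config(config_path)
    results = delm.extract("data.csv")
    # Assumption: results is a pandas DataFrame
    results.to_csv(config_path.replace(".yaml", "_results.csv"), index=False)
```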
Config File Structure¶
DELM config files use a flat structure with all parameters at the top level:
```yaml
# config.yaml

# Schema (REQUIRED) - can be inline dict or path to schema file
schema:
  schema_type: "simple"
  variables:
    - name: "price"
      description: "Price value mentioned"
      data_type: "number"
# OR reference an external schema file:
# schema: "schema.yaml"

# LLM Settings
provider: "openai"          # REQUIRED: "openai", "anthropic", "google", "groq", etc.
model: "gpt-4o-mini"        # REQUIRED: model identifier
temperature: 0.0            # Default: 0.0, range: 0.0-2.0

# Processing Settings
batch_size: 10              # Default: 10, chunks per batch
max_workers: 1              # Default: 1, concurrent workers per batch
max_retries: 3              # Default: 3, API retry attempts
base_delay: 1.0             # Default: 1.0, seconds between retries
tokens_per_minute: null     # Default: null, max tokens per minute
requests_per_minute: null   # Default: null, max requests per minute
max_completion_tokens: 4096 # Default: 4096, max completion tokens per request

# Cost Management
track_cost: true            # Default: true
max_budget: null            # Default: null, max spend in dollars (requires track_cost: true)
model_input_cost_per_1M_tokens: null  # Default: auto-detected from model database
model_output_cost_per_1M_tokens: null # Default: auto-detected from model database

# Data Preprocessing
target_column: "text"       # Default: "text", input text column name
drop_target_column: false   # Default: false, drop target column after processing
score_filter: null          # Default: null, pandas query like "delm_score >= 0.7"

# Text Splitting (optional)
splitting_strategy:
  type: "ParagraphSplit"    # Options: "ParagraphSplit", "FixedWindowSplit", "RegexSplit", null
  # window: 5               # For FixedWindowSplit only
  # stride: 5               # For FixedWindowSplit only
  # pattern: "\n\n"         # For RegexSplit only

# Relevance Scoring (optional)
relevance_scorer:
  type: "KeywordScorer"     # Options: "KeywordScorer", "FuzzyScorer", null
  keywords: ["price", "forecast"] # For KeywordScorer/FuzzyScorer

# Prompt Customization (optional)
prompt_template: |
  Extract the following information from the text:
  {variables}
  Text to analyze:
  {text}
system_prompt: "You are a precise data-extraction assistant."

# Caching Settings
cache_backend: "sqlite"     # Default: "sqlite", options: "sqlite", "lmdb", "filesystem"
cache_path: ".delm/cache"   # Default: ".delm/cache"
cache_max_size_mb: 512      # Default: 512
cache_synchronous: "normal" # Default: "normal", options: "normal", "full" (SQLite only)
```
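Because the config is plain YAML, you can sanity-check a file before handing it to `DELM.from_config`. A minimal sketch using PyYAML (the required-key set mirrors the reference tables below; DELM performs its own validation, so this is purely illustrative):

```python
import yaml

REQUIRED_KEYS = {"schema", "provider", "model"}

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Report any required top-level keys that are missing
missing = REQUIRED_KEYS - config.keys()
if missing:
    raise ValueError(f"config.yaml is missing required keys: {sorted(missing)}")
```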
Separate Schema Files¶
You can define schemas in separate YAML files:
config.yaml:

```yaml
schema: "schema.yaml"  # Path to schema file
provider: "openai"
model: "gpt-4o-mini"
# ... other settings
```
schema.yaml:

```yaml
schema_type: "simple"
variables:
  - name: "price"
    description: "Price value mentioned in text"
    data_type: "number"
    required: false
  - name: "company"
    description: "Company name if mentioned"
    data_type: "string"
    required: false
    validate_in_text: true
```
See the Schemas documentation for complete schema specification details.
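An external schema file is also easy to inspect on its own. A quick sketch with PyYAML (illustrative only; DELM loads and validates the file for you):

```python
import yaml

with open("schema.yaml") as f:
    schema_def = yaml.safe_load(f)

# List each variable with its type and description
for var in schema_def["variables"]:
    print(f"{var['name']} ({var['data_type']}): {var['description']}")
```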
Complete Example¶
config.yaml:

```yaml
# Schema definition
schema:
  schema_type: "nested"
  container_name: "commodities"
  variables:
    - name: "commodity_type"
      description: "Type of commodity mentioned"
      data_type: "string"
      required: true
      allowed_values: ["oil", "gas", "gold", "copper"]
      validate_in_text: true
    - name: "price"
      description: "Price value if mentioned"
      data_type: "number"
      required: false
    - name: "unit"
      description: "Unit of measurement (barrel, ounce, ton)"
      data_type: "string"
      required: false

# LLM configuration
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.0
batch_size: 20
max_workers: 4
max_retries: 3
base_delay: 1.0
tokens_per_minute: 500000
requests_per_minute: 500

# Cost tracking
track_cost: true
max_budget: 50.0

# Preprocessing
target_column: "text"
drop_target_column: false
splitting_strategy:
  type: "ParagraphSplit"
relevance_scorer:
  type: "KeywordScorer"
  keywords: ["price", "forecast", "guidance", "commodity"]
score_filter: "delm_score >= 0.5"

# Custom prompts
prompt_template: |
  Extract commodity price information from the following text.
  {variables}
  IMPORTANT: Only extract information explicitly mentioned in the text.
  Text:
  {text}
system_prompt: "You are a commodity price extraction specialist. Extract only factual information explicitly stated in the text."

# Caching
cache_backend: "sqlite"
cache_path: ".delm/cache"
cache_max_size_mb: 1024
```
Usage:

```python
from delm import DELM

# Load and run
delm = DELM.from_config("config.yaml")
results = delm.extract("data/reports.csv")

# Get cost summary
cost_summary = delm.get_cost_summary()
print(f"Total cost: ${cost_summary['total_cost']:.2f}")
```
Configuration Reference¶
Schema (Required)¶
| Parameter | Type | Description |
|---|---|---|
| `schema` | dict or string | Schema definition (inline dict or path to YAML file) |
LLM Extraction Config¶
Contains all LLM-related settings including provider, model, prompts, processing, and cost tracking.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `provider` | string | REQUIRED | LLM provider (`"openai"`, `"anthropic"`, `"google"`, `"groq"`, `"together"`, `"fireworks"`) |
| `model` | string | REQUIRED | Model identifier (e.g., `"gpt-4o-mini"`, `"claude-3-sonnet"`) |
| `temperature` | float | 0.0 | Sampling temperature (0.0-2.0) |
| `prompt_template` | string | (default) | User prompt template with `{variables}` and `{text}` placeholders |
| `system_prompt` | string | "You are a precise data-extraction assistant." | System prompt sent to the LLM |
| `max_retries` | int | 3 | Number of retry attempts on API failure |
| `batch_size` | int | 10 | Number of chunks processed per batch |
| `max_workers` | int | 1 | Concurrent workers (within each batch) |
| `base_delay` | float | 1.0 | Seconds between retry attempts |
| `tokens_per_minute` | int | null | Maximum tokens per minute (no limit if null) |
| `requests_per_minute` | int | null | Maximum requests per minute (no limit if null) |
| `max_completion_tokens` | int | 4096 | Maximum completion tokens per request |
| `track_cost` | bool | true | Enable cost tracking |
| `max_budget` | float | null | Maximum budget in dollars (requires `track_cost: true`) |
| `model_input_cost_per_1M_tokens` | float | null | Custom input token cost (auto-detected from model database if null) |
| `model_output_cost_per_1M_tokens` | float | null | Custom output token cost (auto-detected from model database if null) |
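The per-1M-token overrides feed straightforward cost arithmetic. A worked sketch of the calculation (token counts and rates here are invented for illustration):

```python
# Hypothetical usage and rates ($ per 1M tokens)
input_tokens, output_tokens = 120_000, 30_000
input_rate, output_rate = 0.15, 0.60  # illustrative values

cost = input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate
print(f"${cost:.4f}")  # $0.0360
```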
Data Preprocessing Config¶
Controls text splitting, relevance scoring, and filtering.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `target_column` | string | "text" | Input text column name |
| `drop_target_column` | bool | false | Drop target column after splitting |
| `splitting_strategy` | dict | null | Text splitting configuration (e.g., `{"type": "ParagraphSplit"}`) |
| `relevance_scorer` | dict | null | Relevance scoring configuration (e.g., `{"type": "KeywordScorer", "keywords": [...]}`) |
| `score_filter` | string | null | Pandas query to filter chunks (e.g., `"delm_score >= 0.7"`) |
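Since `score_filter` is a pandas query string, its effect is easy to reproduce outside DELM. A minimal sketch of the filtering semantics (the chunk data here is invented):

```python
import pandas as pd

# Toy chunk table with relevance scores, as a relevance scorer would produce
chunks = pd.DataFrame({
    "text": ["Oil rose to $80 a barrel.", "The meeting opened at 9 a.m."],
    "delm_score": [0.9, 0.1],
})

# Same semantics as score_filter: "delm_score >= 0.7"
kept = chunks.query("delm_score >= 0.7")
print(kept)
```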
Semantic Cache Config¶
Controls caching of LLM responses.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `cache_backend` | string | "sqlite" | Cache backend (`"sqlite"`, `"lmdb"`, `"filesystem"`) |
| `cache_path` | string | ".delm/cache" | Cache storage path |
| `cache_max_size_mb` | int | 512 | Maximum cache size in MB before pruning |
| `cache_synchronous` | string | "normal" | SQLite synchronous mode (`"normal"`, `"full"`) |