Pipeline Configuration

Each run is driven by a YAML configuration that describes how to prepare documents, which model to call, and how to manage costs. Keep extraction logic declarative so you can version and review changes alongside your code.

Minimal Configuration

llm_extraction:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.0
  batch_size: 10
  track_cost: true
  max_budget: 50.0

data_preprocessing:
  target_column: "text"
  splitting:
    type: "ParagraphSplit"
  scoring:
    type: "KeywordScorer"
    keywords: ["price", "forecast", "guidance"]

schema:
  spec_path: "schema_spec.yaml"

Key Sections

llm_extraction

Controls which provider and model to use, as well as sampling and batching behavior.

  • provider: Any supported provider slug (openai, anthropic, google, groq, together, fireworks).
  • name: Model identifier as expected by the provider.
  • temperature: Sampling temperature; keep at 0.0 so extraction is as deterministic as the provider allows.
  • batch_size: Number of chunks processed concurrently. Tune based on provider rate limits.
  • track_cost: Enable token and currency tracking.
  • max_budget: Optional spending ceiling; processing stops once accumulated cost exceeds it. Treat it as a soft limit, since a batch already in flight can overshoot slightly.
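
For example, switching to a different provider typically only changes the provider and name fields. A minimal sketch; the Anthropic model identifier and batch size below are illustrative, not recommendations:

llm_extraction:
  provider: "anthropic"
  name: "claude-3-5-haiku-latest"  # illustrative model identifier
  temperature: 0.0
  batch_size: 4        # start small, then raise toward the provider's rate limit
  track_cost: true
  max_budget: 10.0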

data_preprocessing

Defines how input documents are transformed before prompting.

  • target_column: DataFrame column used as the primary text input.
  • splitting: Strategy for splitting large documents (ParagraphSplit, SentenceSplit, custom classes).
  • scoring: Optional filter to select relevant chunks (e.g., KeywordScorer).
  • Additional preprocessing components such as cleaning or enrichment can be added as nested blocks.
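
As a sketch, a preprocessing block that splits by sentence and filters with the keyword scorer might look like the following; the cleaning component shown is hypothetical and only illustrates where a nested block would go:

data_preprocessing:
  target_column: "text"
  splitting:
    type: "SentenceSplit"
  scoring:
    type: "KeywordScorer"
    keywords: ["revenue", "margin", "guidance"]
  cleaning:              # hypothetical component; check which components your version ships
    type: "WhitespaceNormalizer"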

schema

Points to the schema specification file and holds optional prompt customization.

  • spec_path: Path to a schema_spec.yaml file that describes the extraction schema.
  • system_prompt: Override the default system prompt for all runs.
  • prompt_template: Format string used to render each prompt with {variables}, {text}, and optional {context} placeholders.
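
For example, a schema block that customizes both prompts could look like this; the keys and placeholders come from the list above, while the prompt wording itself is illustrative:

schema:
  spec_path: "schema_spec.yaml"
  system_prompt: "You are a careful assistant that extracts structured data from financial text."
  prompt_template: |
    Extract the following variables: {variables}

    Passage:
    {text}

    Context (optional): {context}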

Experiment Management

When you instantiate a pipeline with DELM.from_yaml, pass an experiment_name and an experiment_directory (see the sketch below). DELM creates a subdirectory that stores:

  • Input checkpoints and chunk metadata.
  • Cached responses and retry logs.
  • Cost summaries (cost_summary.json).
  • Final extraction outputs.

Keeping experiments isolated makes it easy to resume failed runs or compare different configurations.
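
A minimal instantiation sketch; the import path and the exact call signature (keyword vs. positional arguments) are assumptions:

from delm import DELM  # import path is an assumption

pipeline = DELM.from_yaml(
    "config.yaml",
    experiment_name="earnings_2024_gpt4o_mini",  # dataset + model, per the naming advice below
    experiment_directory="experiments",
)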

Cost Tracking

If track_cost is enabled, the pipeline records token usage and provider fees. After a run, call:

summary = pipeline.get_cost_summary()
print(summary["total_cost"])

The summary includes per-provider totals, cached token counts, and budget status. Use this data to optimize prompts before running at production scale.
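
Beyond total_cost, the exact key names in the summary are not documented here, so the sketch below inspects whatever the dictionary contains rather than assuming specific fields:

summary = pipeline.get_cost_summary()
print(f"Total spend: ${summary['total_cost']:.2f}")
# Per-provider totals, cached token counts, and budget status live in this
# dict too; print everything to see the key names your version uses.
for key, value in summary.items():
    print(f"{key}: {value}")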

Config Best Practices

  • Commit configuration files to version control and review changes like any other code.
  • Start with smaller batch sizes until you confirm provider rate limits.
  • Specify max_budget when iterating on new schemas to avoid unexpected spending.
  • Use descriptive experiment names (e.g., include dataset and model) to keep folders organized.