Pipeline API

The DELM class coordinates configuration validation, experiment setup, preprocessing, and batched extraction. Use this page to review constructor arguments and helper methods.

delm.delm.DELM

Extraction pipeline with pluggable strategies.

Attributes:

`config` – DELMConfig instance for this pipeline.
`experiment_name` – Name of the experiment.
`experiment_directory` – Directory for experiment outputs.
`overwrite_experiment` – Whether to overwrite existing experiment data.
`auto_checkpoint_and_resume_experiment` – Whether to auto-resume experiments.

__init__

__init__(
    *,
    config: DELMConfig,
    experiment_name: str,
    experiment_directory: Path,
    overwrite_experiment: bool = False,
    auto_checkpoint_and_resume_experiment: bool = True,
    use_disk_storage: bool = True,
    save_file_log: bool = True,
    log_dir: Optional[Union[str, Path]] = None,
    console_log_level: str = DEFAULT_CONSOLE_LOG_LEVEL,
    file_log_level: str = DEFAULT_FILE_LOG_LEVEL,
    override_logging: bool = True
) -> None

Initialize the DELM extraction pipeline.

Parameters:

config (DELMConfig) –

DELM configuration for this pipeline.
experiment_name (str) –

Name of the experiment.
experiment_directory (Path) –

Base directory for experiment outputs.
overwrite_experiment (bool, default: False ) –

Whether to overwrite existing experiment data.
auto_checkpoint_and_resume_experiment (bool, default: True ) –

Whether to automatically resume from checkpoints.
use_disk_storage (bool, default: True ) –

If True, use the disk-based experiment manager; otherwise use the in-memory one.
save_file_log (bool, default: True ) –

If True, write a rotating log file under log_dir.
log_dir (Optional[Union[str, Path]], default: None ) –

Directory for log files. If None and save_file_log is True, defaults to DEFAULT_LOG_DIR/<experiment_name>.
console_log_level (str, default: DEFAULT_CONSOLE_LOG_LEVEL ) –

Log level for console output.
file_log_level (str, default: DEFAULT_FILE_LOG_LEVEL ) –

Log level for file output.
override_logging (bool, default: True ) –

If True, force reconfiguration of logging for the process.

Raises:	`ValueError` – If the provided `config` is invalid.
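
Because every constructor argument is keyword-only, it can help to collect the non-config arguments in one place. The following is a minimal sketch, not code from the library: the helper names `delm_constructor_kwargs` and `build_pipeline` are hypothetical, and the choice of `base_dir / "logs"` for `log_dir` is only an illustration. The import path follows the `delm.delm.DELM` name shown above.

```python
from pathlib import Path
from typing import Any, Dict


def delm_constructor_kwargs(experiment_name: str, base_dir: Path) -> Dict[str, Any]:
    """Collect the documented keyword-only arguments (everything except config)."""
    return {
        "experiment_name": experiment_name,
        "experiment_directory": base_dir,
        "overwrite_experiment": False,                  # documented default
        "auto_checkpoint_and_resume_experiment": True,  # documented default
        "use_disk_storage": True,                       # documented default
        "save_file_log": True,                          # documented default
        "log_dir": base_dir / "logs",                   # illustrative choice, not a library default
        "override_logging": True,                       # documented default
    }


def build_pipeline(config: Any, experiment_name: str, base_dir: Path) -> Any:
    """Construct a DELM pipeline from a ready-made DELMConfig."""
    # Imported lazily so the helper above can be used without DELM installed.
    from delm.delm import DELM

    return DELM(config=config, **delm_constructor_kwargs(experiment_name, base_dir))
```

The log-level arguments are left out so that the library defaults (`DEFAULT_CONSOLE_LOG_LEVEL`, `DEFAULT_FILE_LOG_LEVEL`) apply.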

from_yaml `classmethod`

from_yaml(
    config_path: Union[str, Path],
    experiment_name: str,
    experiment_directory: Path,
    **kwargs: Any
) -> "DELM"

Create a DELM instance from a YAML configuration file.

Parameters:

`config_path` (`Union[str, Path]`) – Path to YAML configuration file.
`experiment_name` (`str`) – Name of the experiment.
`experiment_directory` (`Path`) – Base directory for experiment outputs.
`**kwargs` (`Any`, default: `{}` ) – Additional keyword arguments for the DELM constructor.

Returns:	`DELM` – Configured DELM instance.

from_dict `classmethod`

from_dict(
    config_dict: Dict[str, Any],
    experiment_name: str,
    experiment_directory: Path,
    **kwargs: Any
) -> "DELM"

Create a DELM instance from a configuration dictionary.

Parameters:

`config_dict` (`Dict[str, Any]`) – Configuration dictionary.
`experiment_name` (`str`) – Name of the experiment.
`experiment_directory` (`Path`) – Base directory for experiment outputs.
`**kwargs` (`Any`, default: `{}` ) – Additional keyword arguments for the DELM constructor.

Returns:	`DELM` – Configured DELM instance.
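
Since `from_yaml` and `from_dict` share the same trailing arguments, a small dispatcher can pick between them based on the type of the configuration source. This is a sketch under assumptions: `make_pipeline` is a hypothetical helper, and the class is passed in as `delm_cls` so the example does not depend on DELM being importable.

```python
from pathlib import Path
from typing import Any, Dict, Union


def make_pipeline(
    delm_cls: Any,
    source: Union[str, Path, Dict[str, Any]],
    experiment_name: str,
    experiment_directory: Path,
    **kwargs: Any,
) -> Any:
    """Build a pipeline from a config dict or from a path to a YAML file."""
    if isinstance(source, dict):
        # In-memory configuration: use the dictionary constructor.
        return delm_cls.from_dict(
            config_dict=source,
            experiment_name=experiment_name,
            experiment_directory=experiment_directory,
            **kwargs,
        )
    # Anything else is treated as a path to a YAML configuration file.
    return delm_cls.from_yaml(
        config_path=Path(source),
        experiment_name=experiment_name,
        experiment_directory=experiment_directory,
        **kwargs,
    )
```

With DELM installed, the call would presumably look like `make_pipeline(DELM, "config.yaml", "demo", Path("runs"), overwrite_experiment=True)`, where the file name and experiment name are placeholders.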

prep_data

prep_data(
    data: Union[str, Path] | DataFrame,
    sample_size: int = -1,
) -> pd.DataFrame

Preprocess data using the instance configuration and always save the result to the experiment manager.

Parameters:

`data` (`Union[str, Path] | DataFrame`) – Input data as a string path, `Path`, or `DataFrame`.
`sample_size` (`int`, default: `-1` ) – Optional number of records to sample before processing. `-1` (default) processes all rows; a positive value samples deterministically using `SYSTEM_RANDOM_SEED`.

Returns:	`DataFrame` – A DataFrame containing chunked (and optionally scored) data ready for extraction.
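
The `sample_size` semantics described above can be illustrated with a standard-library sketch. It is not the library's implementation: `sample_records` and `ASSUMED_SEED` are hypothetical, the seed value is a stand-in for `SYSTEM_RANDOM_SEED`, and the behavior for oversized or invalid sample sizes is an assumption.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

ASSUMED_SEED = 42  # hypothetical stand-in for SYSTEM_RANDOM_SEED


def sample_records(
    records: Sequence[T], sample_size: int = -1, seed: int = ASSUMED_SEED
) -> List[T]:
    """Return all records for -1, or a reproducible sample for a positive size."""
    if sample_size == -1:
        return list(records)
    if sample_size <= 0:
        # Assumption: values other than -1 and positive integers are rejected.
        raise ValueError("sample_size must be -1 or a positive integer")
    # Assumption: a request larger than the data set returns every record.
    k = min(sample_size, len(records))
    # A dedicated, seeded generator makes the selection repeatable across calls.
    rng = random.Random(seed)
    return rng.sample(list(records), k)
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps the sample stable regardless of other random calls in the process.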

process_via_llm

process_via_llm(
    preprocessed_file_path: Optional[Path] = None,
) -> pd.DataFrame

Run LLM extraction on the preprocessed data using the configuration passed to the constructor, with batch-level checkpointing and resuming.

Parameters:	`preprocessed_file_path` (`Optional[Path]`, default: `None` ) – Path to the preprocessed data. If `None`, the preprocessed data is loaded from the experiment manager.

Returns:	`DataFrame` – A DataFrame containing the extracted data.
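
Because `prep_data` saves its output to the experiment manager and `process_via_llm` reads from there when no path is given, the two calls can be chained without passing a file around. The sketch below assumes only the method signatures documented on this page; `run_extraction` is a hypothetical helper that accepts any object exposing those two methods.

```python
from pathlib import Path
from typing import Any, Optional, Union


def run_extraction(
    pipeline: Any,
    data: Optional[Union[str, Path]] = None,
    preprocessed_file_path: Optional[Path] = None,
    sample_size: int = -1,
) -> Any:
    """Preprocess raw data if needed, then run LLM extraction."""
    if preprocessed_file_path is not None:
        # Data was preprocessed earlier: skip prep_data and point at the file.
        return pipeline.process_via_llm(preprocessed_file_path=preprocessed_file_path)
    if data is None:
        raise ValueError("provide either data or preprocessed_file_path")
    # prep_data stores its result in the experiment manager ...
    pipeline.prep_data(data, sample_size=sample_size)
    # ... so process_via_llm can be called without a path.
    return pipeline.process_via_llm()
```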

get_extraction_results

get_extraction_results() -> pd.DataFrame

Get the extraction results from the experiment manager.

Returns:	`DataFrame` – A DataFrame containing the extraction results.

get_cost_summary

get_cost_summary() -> dict[str, Any]

Get the cost summary from the cost tracker.

Returns:	`dict[str, Any]` – A dictionary containing the cost summary.

Raises:	`ValueError` – If cost tracking is not enabled in the configuration.
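
Since `get_cost_summary` raises `ValueError` when cost tracking is disabled, callers that treat cost reporting as optional may want to guard the call. A minimal sketch, assuming only the documented behavior; `cost_summary_or_none` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def cost_summary_or_none(pipeline: Any) -> Optional[Dict[str, Any]]:
    """Return the cost summary, or None when cost tracking is not enabled."""
    try:
        return pipeline.get_cost_summary()
    except ValueError:
        # Documented failure mode: cost tracking is off in the configuration.
        return None
```

Note that this guard also hides any other `ValueError` raised inside the call, so it suits reporting code better than code that must distinguish failure causes.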