Pipeline API

The DELM class coordinates configuration validation, experiment setup, preprocessing, and batched extraction. Use this page to review constructor arguments and helper methods.

delm.delm.DELM

Extraction pipeline with pluggable strategies.

Attributes:
  • config

    DELMConfig instance for this pipeline.

  • experiment_name

    Name of the experiment.

  • experiment_directory

    Directory for experiment outputs.

  • overwrite_experiment

    Whether to overwrite existing experiment data.

  • auto_checkpoint_and_resume_experiment

    Whether to checkpoint automatically and resume from existing checkpoints.

__init__

__init__(
    *,
    config: DELMConfig,
    experiment_name: str,
    experiment_directory: Path,
    overwrite_experiment: bool = False,
    auto_checkpoint_and_resume_experiment: bool = True,
    use_disk_storage: bool = True,
    save_file_log: bool = True,
    log_dir: Optional[Union[str, Path]] = None,
    console_log_level: str = DEFAULT_CONSOLE_LOG_LEVEL,
    file_log_level: str = DEFAULT_FILE_LOG_LEVEL,
    override_logging: bool = True
) -> None

Initialize the DELM extraction pipeline.

Parameters:
  • config (DELMConfig) –

    DELM configuration for this pipeline.

  • experiment_name (str) –

    Name of the experiment.

  • experiment_directory (Path) –

    Base directory for experiment outputs.

  • overwrite_experiment (bool, default: False ) –

    Whether to overwrite existing experiment data.

  • auto_checkpoint_and_resume_experiment (bool, default: True ) –

    Whether to auto‑resume from checkpoints.

  • use_disk_storage (bool, default: True ) –

    If True, use the disk-based experiment manager; otherwise use the in-memory one.

  • save_file_log (bool, default: True ) –

    If True, write a rotating log file under log_dir.

  • log_dir (Optional[Union[str, Path]], default: None ) –

    Directory for log files. If None and save_file_log is True, defaults to DEFAULT_LOG_DIR/<experiment_name>.

  • console_log_level (str, default: DEFAULT_CONSOLE_LOG_LEVEL ) –

    Log level for console output.

  • file_log_level (str, default: DEFAULT_FILE_LOG_LEVEL ) –

    Log level for file output.

  • override_logging (bool, default: True ) –

    If True, force reconfiguration of logging for the process.

Raises:
  • ValueError

    If the provided config is invalid.
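
The log_dir fallback described in the parameter list can be sketched as a small helper. This illustrates the documented rule and is not DELM's own code: resolve_log_dir is a hypothetical name, default_log_dir stands in for DEFAULT_LOG_DIR (whose actual value is not given on this page), and returning None when save_file_log is False is an assumption.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_log_dir(
    log_dir: Optional[Union[str, Path]],
    save_file_log: bool,
    experiment_name: str,
    default_log_dir: Path,  # stand-in for DEFAULT_LOG_DIR (value assumed)
) -> Optional[Path]:
    """Hypothetical helper mirroring the documented log_dir fallback."""
    if not save_file_log:
        # Assumption: without a file log, no directory is needed.
        return None
    if log_dir is None:
        # Documented default: DEFAULT_LOG_DIR/<experiment_name>.
        return default_log_dir / experiment_name
    return Path(log_dir)
```

For example, resolve_log_dir(None, True, "exp1", Path("logs")) yields a path ending in logs/exp1, while an explicit log_dir is returned unchanged as a Path.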

from_yaml classmethod

from_yaml(
    config_path: Union[str, Path],
    experiment_name: str,
    experiment_directory: Path,
    **kwargs: Any
) -> "DELM"

Create a DELM instance from a YAML configuration file.

Parameters:
  • config_path (Union[str, Path]) –

    Path to YAML configuration file.

  • experiment_name (str) –

    Name of the experiment.

  • experiment_directory (Path) –

    Base directory for experiment outputs.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to the DELM constructor.

Returns:
  • DELM

    Configured DELM instance.

from_dict classmethod

from_dict(
    config_dict: Dict[str, Any],
    experiment_name: str,
    experiment_directory: Path,
    **kwargs: Any
) -> "DELM"

Create a DELM instance from a configuration dictionary.

Parameters:
  • config_dict (Dict[str, Any]) –

    Configuration dictionary.

  • experiment_name (str) –

    Name of the experiment.

  • experiment_directory (Path) –

    Base directory for experiment outputs.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to the DELM constructor.

Returns:
  • DELM

    Configured DELM instance.
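
Because from_yaml and from_dict share the same trailing arguments, one wrapper can choose between them based on the type of the configuration. The sketch below is a hypothetical convenience, not part of DELM: build_pipeline is an invented name, and it only calls the two classmethods with the positional arguments and **kwargs documented above. The class is passed in as pipeline_cls so the helper itself does not depend on any import.

```python
from pathlib import Path
from typing import Any, Dict, Union


def build_pipeline(
    pipeline_cls: Any,
    config: Union[str, Path, Dict[str, Any]],
    experiment_name: str,
    experiment_directory: Union[str, Path],
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: dispatch to from_dict or from_yaml by config type."""
    # Both classmethods document experiment_directory as a Path.
    base_dir = Path(experiment_directory)
    if isinstance(config, dict):
        return pipeline_cls.from_dict(config, experiment_name, base_dir, **kwargs)
    # Anything else is treated as a path to a YAML file.
    return pipeline_cls.from_yaml(config, experiment_name, base_dir, **kwargs)
```

With the real class this might look like build_pipeline(DELM, "config.yaml", "demo", "experiments", overwrite_experiment=True), where the file name, experiment name, and directory are placeholders.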

prep_data

prep_data(
    data: Union[str, Path, pd.DataFrame],
    sample_size: int = -1,
) -> pd.DataFrame

Preprocess data using the instance config. The result is always saved to the experiment manager.

Parameters:
  • data (Union[str, Path, DataFrame]) –

    Input data as a string path, Path, or DataFrame.

  • sample_size (int, default: -1 ) –

    Optional number of records to sample before processing. -1 (default) processes all rows; a positive value samples deterministically using SYSTEM_RANDOM_SEED.

Returns:
  • DataFrame

    A DataFrame containing chunked (and optionally scored) data ready for extraction.
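
The sample_size semantics can be illustrated with a standard-library sketch. This is not DELM's sampling code: sample_records is a hypothetical name, it works on plain sequences rather than DataFrames, the seed default of 42 is a placeholder for SYSTEM_RANDOM_SEED (whose value is not given here), and the handling of zero, other negative values, and oversized requests is assumed.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_records(records: Sequence[T], sample_size: int = -1, seed: int = 42) -> List[T]:
    """Illustrative sampling: -1 keeps every record, a positive value draws a seeded sample."""
    if sample_size == -1:
        return list(records)
    if sample_size <= 0:
        # Assumption: only -1 or a positive integer is accepted.
        raise ValueError("sample_size must be -1 or a positive integer")
    # Assumption: requests larger than the data are capped at the full length.
    k = min(sample_size, len(records))
    # A private Random instance makes the draw reproducible without touching global state.
    return random.Random(seed).sample(list(records), k)
```

Calling the helper twice with the same records, sample_size, and seed returns the same sample, which is the property the documentation describes as deterministic sampling.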

process_via_llm

process_via_llm(
    preprocessed_file_path: Optional[Path] = None,
) -> pd.DataFrame

Run LLM extraction on the preprocessed data using the configuration passed to the constructor, with batch checkpointing and resuming.

Parameters:
  • preprocessed_file_path (Optional[Path], default: None ) –

    Path to the preprocessed data. If None, the preprocessed data is loaded from the experiment manager.

Returns:
  • DataFrame

    A DataFrame containing the extracted data.
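
Taken together, prep_data and process_via_llm suggest a two-step workflow. The wrapper below is a minimal sketch under that reading: run_extraction is a hypothetical name, and it relies on the documented behavior that prep_data saves to the experiment manager and that process_via_llm loads from it when preprocessed_file_path is None. It accepts any object exposing those two methods.

```python
from pathlib import Path
from typing import Any, Union


def run_extraction(pipeline: Any, data: Union[str, Path, Any], sample_size: int = -1) -> Any:
    """Hypothetical wrapper: preprocess, then extract, using only documented methods."""
    # prep_data stores its output in the experiment manager, so the
    # extraction step is called without a preprocessed_file_path.
    pipeline.prep_data(data, sample_size=sample_size)
    return pipeline.process_via_llm()
```

The return value is whatever process_via_llm returns, which this page documents as a DataFrame of extracted data.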

get_extraction_results

get_extraction_results() -> pd.DataFrame

Get the results from the experiment manager.

Returns:
  • DataFrame

    A DataFrame containing the extraction results.

get_cost_summary

get_cost_summary() -> dict[str, Any]

Get the cost summary from the cost tracker.

Returns:
  • dict[str, Any]

    A dictionary containing the cost summary.

Raises:
  • ValueError

    If cost tracking is not enabled in the configuration.
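
Since get_cost_summary raises ValueError when cost tracking is disabled, callers that want to report costs opportunistically can guard the call. The helper below is a hypothetical sketch (cost_summary_or_none is not part of DELM) that works with any object exposing get_cost_summary.

```python
from typing import Any, Dict, Optional


def cost_summary_or_none(pipeline: Any) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the cost summary, or None if tracking is off."""
    try:
        return pipeline.get_cost_summary()
    except ValueError:
        # Documented failure mode: cost tracking is not enabled in the configuration.
        return None
```

Returning None keeps reporting code simple, at the price of hiding the reason for the missing summary; callers that need to distinguish the cases should catch ValueError themselves.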