Loading Data¶

DELM accepts several input formats for text extraction. You can load a single file, an entire directory, or an in-memory pandas DataFrame.

Supported File Types¶

DELM supports the following file formats out of the box:

| Format | Extension | Notes |
|---|---|---|
| CSV | `.csv` | Requires target column |
| Parquet | `.parquet` | Requires target column |
| Feather | `.feather` | Requires target column |
| Text | `.txt` | |
| Markdown | `.md` | |
| Word | `.docx` | |
| HTML | `.html`, `.htm` | Automatically strips tags to extract text |

Optional Formats¶

The following formats require extra dependencies, which are installed with the `extras` option:

| Format | Extension | Installation | Notes |
|---|---|---|---|
| PDF | `.pdf` | `pip install delm[extras]` | |
| Excel | `.xlsx`, `.xls` | `pip install delm[extras]` | Requires target column |
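
The two tables above can be condensed into a small lookup, which is handy for checking a file before handing it to DELM. This is a minimal sketch built only from the tables on this page: `FORMATS` and `describe_format` are hypothetical names, not part of DELM's API, and DELM's own format detection may behave differently (for example, with upper-case extensions).

```python
from pathlib import Path

# extension -> (format name, needs target column, needs `delm[extras]`)
# Values are copied from the tables on this page; this is not DELM's internal registry.
FORMATS = {
    ".csv": ("CSV", True, False),
    ".parquet": ("Parquet", True, False),
    ".feather": ("Feather", True, False),
    ".txt": ("Text", False, False),
    ".md": ("Markdown", False, False),
    ".docx": ("Word", False, False),
    ".html": ("HTML", False, False),
    ".htm": ("HTML", False, False),
    ".pdf": ("PDF", False, True),
    ".xlsx": ("Excel", True, True),
    ".xls": ("Excel", True, True),
}


def describe_format(path):
    """Return a dict describing the file's format, or None if the extension is not listed."""
    # Lower-casing the suffix is a choice made in this sketch, not documented DELM behavior.
    entry = FORMATS.get(Path(path).suffix.lower())
    if entry is None:
        return None
    name, needs_column, needs_extras = entry
    return {
        "format": name,
        "needs_target_column": needs_column,
        "needs_extras": needs_extras,
    }


print(describe_format("documents/report_2024.pdf"))
print(describe_format("data/survey_responses.csv"))
```

A `None` result means the extension appears in neither table, so the file is probably worth converting or excluding first.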

Input Methods¶

You can pass data to the `delm.prep_data()` method in three ways: as a single file path, as a directory path, or as a pandas DataFrame.

1. Single File¶

Pass the path to a single file. DELM will detect the format and load it.

delm = DELM(
    ...
)

# Load a single document
delm.prep_data("documents/report_2024.pdf")

Specifying Text Columns: For tabular files (CSV, Parquet, Feather, or Excel), set `target_column` to the column that holds the text, for example "comments":

delm = DELM(
    # ...
    target_column="comments"
)
delm.prep_data("data/survey_responses.csv")
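
Because tabular formats need a target column, it can help to confirm that the column exists before loading the file. The following is a minimal sketch using only Python's standard `csv` module; `check_target_column` is a hypothetical helper, not a DELM function, and it assumes a UTF-8 CSV whose first row is the header.

```python
import csv


def check_target_column(csv_path, target_column):
    """Return the CSV header, raising ValueError if target_column is not in it."""
    # Only the first row is read, so this stays cheap even for large files.
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    if target_column not in header:
        raise ValueError(
            f"Column {target_column!r} not found; available columns: {header}"
        )
    return header
```

Calling `check_target_column("data/survey_responses.csv", "comments")` before `delm.prep_data(...)` would surface a misspelled column name early, with the list of available columns in the error message.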

2. Directory of Files¶

Pass a directory path to load all supported files within it. DELM finds and loads every supported file.

# Load all supported files in a directory
delm.prep_data("data/financial_reports/")

This is useful for processing a mixed collection of PDFs, Word docs, and text files in one go.
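
Before loading a mixed directory, it can be useful to preview which files have a supported extension. The sketch below is an illustration under stated assumptions: `SUPPORTED` is copied from the format tables on this page, `preview_directory` is a hypothetical helper, and it lists only the top level of the directory because this page does not say whether DELM descends into subdirectories.

```python
from pathlib import Path

# Extensions from the format tables on this page (core and optional formats).
SUPPORTED = {
    ".csv", ".parquet", ".feather", ".txt", ".md", ".docx",
    ".html", ".htm", ".pdf", ".xlsx", ".xls",
}


def preview_directory(directory):
    """Split the files directly inside `directory` into (supported, skipped) name lists."""
    supported, skipped = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        if path.suffix.lower() in SUPPORTED:
            supported.append(path.name)
        else:
            skipped.append(path.name)
    return supported, skipped
```

For example, `supported, skipped = preview_directory("data/financial_reports/")` gives two lists of file names that can be reviewed before calling `delm.prep_data` on the same path.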

3. Pandas DataFrame¶

If you already have data in memory, you can pass a pandas DataFrame directly.

import pandas as pd

df = pd.DataFrame({
    "text": [
        "Company A reported $10M revenue.",
        "Company B reported $5M revenue."
    ],
    "meta_id": [101, 102]
})

delm.prep_data(df)