Output Data

Understand the structure of DELM's extraction results and how to transform them for analysis.

Output Columns

From `delm.extract()`

Calling delm.extract() returns a DataFrame that contains your original columns (from the input data) plus the DELM system columns added during processing:

Column	Description
`delm_chunk_id`	Unique ID for each text chunk processed
`delm_record_id`	Links chunks back to original records
`delm_text_chunk`	The actual text chunk sent to the LLM
`delm_score`	Relevance score (if scorer was used)
`delm_batch_id`	Batch number for processing
`delm_errors`	Error messages (if extraction failed)
`delm_extracted_data_json`	JSON string of extracted data

Example:

results_df = delm.extract("data.csv")
print(list(results_df.columns))
# Output: ['id', 'company', 'text', 'delm_chunk_id', 'delm_record_id',
#          'delm_text_chunk', 'delm_score', 'delm_batch_id', 'delm_errors',
#          'delm_extracted_data_json']
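Because `delm_extracted_data_json` holds a JSON string rather than a parsed object, each row has to be decoded before use, and `delm_record_id` can serve as a grouping key when one record was split into several chunks. The following is a minimal sketch on hand-written rows (plain dicts with invented values standing in for DataFrame rows). It assumes that a successful chunk has `delm_errors` set to `None` and that a failed chunk has no usable JSON; the source does not specify how DELM represents either case. The helper `parse_row` is hypothetical and not part of DELM.

```python
import json
from collections import defaultdict

# Invented rows standing in for the DataFrame returned by delm.extract().
rows = [
    {"delm_chunk_id": 0, "delm_record_id": 0, "delm_errors": None,
     "delm_extracted_data_json": '{"company": "Apple", "price": 150}'},
    {"delm_chunk_id": 1, "delm_record_id": 0, "delm_errors": "timeout",
     "delm_extracted_data_json": None},
    {"delm_chunk_id": 2, "delm_record_id": 1, "delm_errors": None,
     "delm_extracted_data_json": '{"company": "Google", "price": 2800}'},
]


def parse_row(row):
    """Hypothetical helper: decode a row's JSON, or return None if it failed."""
    # Assumption: a failed chunk has a non-empty delm_errors or no JSON at all.
    if row["delm_errors"] or not row["delm_extracted_data_json"]:
        return None
    return json.loads(row["delm_extracted_data_json"])


# Group the decoded extractions by the record each chunk came from.
extractions_by_record = defaultdict(list)
for row in rows:
    parsed = parse_row(row)
    if parsed is not None:
        extractions_by_record[row["delm_record_id"]].append(parsed)

print(dict(extractions_by_record))
```

If the results are a pandas DataFrame, `results_df.to_dict("records")` yields rows in this list-of-dicts shape. Missing values may then appear as NaN rather than `None`, so the error check would need adjusting.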

From `delm.get_extraction_results()`

This method returns only the core extraction columns (no original data, no delm_record_id, no delm_score):

delm_chunk_id
delm_batch_id
delm_text_chunk
delm_errors
delm_extracted_data_json

Use case: When you've saved results to disk with use_disk_storage=True and want to reload just the extraction data later.

Note: delm_record_id and delm_score are metadata that are merged in after loading, so they're only available from extract(), not from get_extraction_results().

from delm import DELM

# schema is a Schema object, defined as in the schema examples below
delm = DELM(
    schema=schema,
    use_disk_storage=True,
    experiment_path="experiments/my_run"
)

# Run extraction
results_df = delm.extract("data.csv")  # Returns all columns

# Later, reload just extraction data
extraction_only = delm.get_extraction_results()  # Returns only the core extraction columns
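If `delm_record_id` or `delm_score` is needed after a later reload, one possible workaround is to keep the chunk-level metadata from the `extract()` output and reattach it by `delm_chunk_id`. The sketch below uses plain dicts with invented values in place of the two DataFrames. It assumes that `delm_chunk_id` values match between the two outputs, which the description above implies but does not state outright.

```python
# Invented stand-ins for rows of the DataFrame returned by extract().
extract_rows = [
    {"delm_chunk_id": 0, "delm_record_id": 10, "delm_score": 0.9},
    {"delm_chunk_id": 1, "delm_record_id": 10, "delm_score": 0.4},
]

# Invented stand-ins for rows returned by get_extraction_results().
reloaded_rows = [
    {"delm_chunk_id": 0, "delm_extracted_data_json": '{"company": "Apple"}'},
    {"delm_chunk_id": 1, "delm_extracted_data_json": '{"company": "Google"}'},
]

# Lookup from chunk ID to the metadata that only extract() provides.
metadata = {
    r["delm_chunk_id"]: {
        "delm_record_id": r["delm_record_id"],
        "delm_score": r["delm_score"],
    }
    for r in extract_rows
}

# Reattach the metadata; rows without a match are left unchanged.
merged = [{**r, **metadata.get(r["delm_chunk_id"], {})} for r in reloaded_rows]

print(merged[0])
```

With pandas DataFrames, the same join can be expressed with `DataFrame.merge(..., on="delm_chunk_id")`.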

Transforming Results with `explode_json_results()`

The explode_json_results() function converts nested JSON into flat, tabular format for analysis. How it works depends on your schema type.

Simple Schema

For simple schemas, each row represents one chunk with all extracted fields as columns.

from delm import DELM, Schema, ExtractionVariable
from delm.utils.post_processing import explode_json_results

# Define simple schema
schema = Schema.simple([
    ExtractionVariable(name="company", data_type="string"),
    ExtractionVariable(name="price", data_type="number"),
    ExtractionVariable(name="currency", data_type="string")
])

delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini")
results = delm.extract("data.csv")

# Explode JSON
exploded = explode_json_results(results, schema)

Input JSON (in delm_extracted_data_json):

{"company": "Apple", "price": 150, "currency": "USD"}

Output Table:

delm_chunk_id	company	price	currency
0	Apple	150	USD
1	Google	2800	USD
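To make the simple-schema transformation concrete, here is a standalone sketch of the same reshaping using only the standard library. It is not DELM's implementation: `flatten_simple` is a hypothetical helper, and the rows are plain dicts built from the example values above.

```python
import json

# Rows mimicking the relevant columns of the extraction results.
rows = [
    {"delm_chunk_id": 0,
     "delm_extracted_data_json": '{"company": "Apple", "price": 150, "currency": "USD"}'},
    {"delm_chunk_id": 1,
     "delm_extracted_data_json": '{"company": "Google", "price": 2800, "currency": "USD"}'},
]


def flatten_simple(rows):
    """Hypothetical helper: one output row per chunk, one key per extracted field."""
    return [
        {"delm_chunk_id": r["delm_chunk_id"],
         **json.loads(r["delm_extracted_data_json"])}
        for r in rows
    ]


flat = flatten_simple(rows)
for row in flat:
    print(row)
```

The number of output rows equals the number of input chunks, which is the property that distinguishes the simple case from the nested one.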

Nested Schema

For nested schemas, each item in the list becomes its own row. Multiple items from the same chunk will create multiple rows.

schema = Schema.nested(
    container_name="commodities",
    variables_list=[
        ExtractionVariable(name="commodity", data_type="string"),
        ExtractionVariable(name="price", data_type="number"),
        ExtractionVariable(name="unit", data_type="string")
    ]
)

delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini")
results = delm.extract("data.csv")

# Explode JSON
exploded = explode_json_results(results, schema)

Input JSON (in delm_extracted_data_json):

{
  "commodities": [
    {"commodity": "oil", "price": 75, "unit": "barrel"},
    {"commodity": "gold", "price": 1950, "unit": "ounce"}
  ]
}

Output Table:

delm_chunk_id	commodity	price	unit
0	oil	75	barrel
0	gold	1950	ounce
1	silver	24	ounce

Note: Both "oil" and "gold" have the same delm_chunk_id (0) because they came from the same chunk.
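The one-row-per-item behavior can be illustrated with a standalone sketch. This is an approximation of the reshaping, not DELM's actual code: `explode_nested` is a hypothetical helper, and the rows are plain dicts built from the example values above.

```python
import json

# Rows mimicking the relevant columns of the extraction results.
rows = [
    {"delm_chunk_id": 0, "delm_extracted_data_json": json.dumps({
        "commodities": [
            {"commodity": "oil", "price": 75, "unit": "barrel"},
            {"commodity": "gold", "price": 1950, "unit": "ounce"},
        ]})},
    {"delm_chunk_id": 1, "delm_extracted_data_json": json.dumps({
        "commodities": [
            {"commodity": "silver", "price": 24, "unit": "ounce"},
        ]})},
]


def explode_nested(rows, container_name):
    """Hypothetical helper: emit one output row per item in the container list."""
    out = []
    for r in rows:
        data = json.loads(r["delm_extracted_data_json"])
        # A chunk with no items under the container contributes no rows.
        for item in data.get(container_name, []):
            out.append({"delm_chunk_id": r["delm_chunk_id"], **item})
    return out


exploded = explode_nested(rows, "commodities")
for row in exploded:
    print(row)
```

Two chunks produce three rows here, so `delm_chunk_id` is no longer unique after exploding and should not be used as a row key on its own.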

Multiple Schema

For multiple schemas, each sub-schema is exploded separately, and a schema_name column identifies which schema each row belongs to.

schema = Schema.multiple({
    "commodities": Schema.nested(
        container_name="items",
        variables_list=[
            ExtractionVariable(name="name", data_type="string"),
            ExtractionVariable(name="price", data_type="number")
        ]
    ),
    "companies": Schema.nested(
        container_name="items",
        variables_list=[
            ExtractionVariable(name="name", data_type="string"),
            ExtractionVariable(name="sector", data_type="string")
        ]
    )
})

delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini")
results = delm.extract("data.csv")

# Explode JSON
exploded = explode_json_results(results, schema)

Input JSON (in delm_extracted_data_json):

{
  "commodities": [{"name": "oil", "price": 75}],
  "companies": [{"name": "Exxon", "sector": "energy"}]
}

Output Table:

delm_chunk_id	schema_name	name	price	sector
0	commodities	oil	75	None
0	companies	Exxon	None	energy
1	commodities	gold	1950	None
1	companies	Shell	None	energy

Note: Fields that don't exist in a schema are filled with None (e.g., "sector" is None for commodities rows).
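The `schema_name` column and the `None` filling can be sketched in the same standalone style. Again, this is an illustration rather than DELM's implementation: `explode_multiple` is a hypothetical helper, the input follows the example JSON above, and the list of output columns is passed in by hand.

```python
import json

# One row mimicking the relevant columns of the extraction results.
rows = [
    {"delm_chunk_id": 0, "delm_extracted_data_json": json.dumps({
        "commodities": [{"name": "oil", "price": 75}],
        "companies": [{"name": "Exxon", "sector": "energy"}],
    })},
]

# Union of the field names across both sub-schemas.
columns = ["name", "price", "sector"]


def explode_multiple(rows, columns):
    """Hypothetical helper: explode each sub-schema and tag rows with its name."""
    out = []
    for r in rows:
        data = json.loads(r["delm_extracted_data_json"])
        for schema_name, items in data.items():
            for item in items:
                record = {"delm_chunk_id": r["delm_chunk_id"],
                          "schema_name": schema_name}
                # Fields missing from this sub-schema become None.
                record.update({col: item.get(col) for col in columns})
                out.append(record)
    return out


exploded = explode_multiple(rows, columns)
for row in exploded:
    print(row)
```

Filtering on `schema_name` and dropping the all-`None` columns recovers a clean table per sub-schema.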
