Post-Processing
Transform and flatten extraction results.
explode_json_results()
Flatten nested JSON extraction results into tabular format.
from delm.utils.post_processing import explode_json_results
exploded_df = explode_json_results(
    input_df: pd.DataFrame,
    schema: Schema,
    json_column: str = "delm_extracted_data_json"
) -> pd.DataFrame
Parameters:
- input_df: DataFrame with JSON extraction results
- schema: Schema used for extraction
- json_column: Column containing JSON data
Returns: Exploded DataFrame where each extracted item gets its own row
Behavior by Schema Type
Simple Schema
Each record becomes one row with columns for each variable.
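The flattening can be sketched in plain Python, using the same data as the Input/Output example that follows. This is an illustrative sketch, not the library's implementation: `flatten_simple` is a hypothetical helper, and it assumes the JSON column holds a serialized string.

```python
import json

JSON_COLUMN = "delm_extracted_data_json"

def flatten_simple(row, json_column=JSON_COLUMN):
    """Hypothetical helper: replace the JSON column with one column per variable."""
    extracted = json.loads(row[json_column])  # assumes a JSON string, not a dict
    flat = {key: value for key, value in row.items() if key != json_column}
    flat.update(extracted)
    return flat

row = {
    "delm_file_name": "doc1.txt",
    "delm_extracted_data_json": '{"company": "Apple", "price": 150.0}',
}
flat_row = flatten_simple(row)
print(flat_row)
```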
# Input
delm_file_name | delm_extracted_data_json
"doc1.txt" | {"company": "Apple", "price": 150.0}
# Output
delm_file_name | company | price
"doc1.txt" | "Apple" | 150.0
Nested Schema
Each item in the container list becomes its own row.
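As a rough sketch of this row-per-item behavior, the snippet below expands a container list while repeating the other columns on every row. `explode_container` is a hypothetical name used only for illustration, and the JSON column is assumed to hold a serialized string; the data mirrors the Input/Output example that follows.

```python
import json

JSON_COLUMN = "delm_extracted_data_json"

def explode_container(row, container, json_column=JSON_COLUMN):
    """Hypothetical helper: emit one flat row per item in the container list."""
    extracted = json.loads(row[json_column])
    base = {key: value for key, value in row.items() if key != json_column}
    # Each item is a dict of variables; the shared columns are copied onto it.
    return [{**base, **item} for item in extracted.get(container, [])]

row = {
    "delm_file_name": "doc1.txt",
    "delm_extracted_data_json": json.dumps(
        {"products": [{"name": "Widget", "price": 10.0},
                      {"name": "Gadget", "price": 20.0}]}
    ),
}
product_rows = explode_container(row, "products")
for product_row in product_rows:
    print(product_row)
```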
# Input
delm_file_name | delm_extracted_data_json
"doc1.txt" | {"products": [{"name": "Widget", "price": 10.0}, {"name": "Gadget", "price": 20.0}]}
# Output
delm_file_name | name | price
"doc1.txt" | "Widget" | 10.0
"doc1.txt" | "Gadget" | 20.0
Multiple Schema
Each sub-schema is exploded separately, and a schema_name column records which sub-schema produced each row.
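One way to picture the combined output is the sketch below, which tags every row with its sub-schema name and pads columns that a sub-schema does not define. `explode_multiple` is a hypothetical helper, not the library's code, and the payloads are made-up values chosen for illustration, since the example that follows elides the actual JSON contents.

```python
import json

JSON_COLUMN = "delm_extracted_data_json"

def explode_multiple(row, json_column=JSON_COLUMN):
    """Hypothetical helper: explode each sub-schema and tag rows with schema_name."""
    extracted = json.loads(row[json_column])
    base = {key: value for key, value in row.items() if key != json_column}
    rows = []
    for schema_name, payload in extracted.items():
        # A sub-schema may hold a list of items or a single record.
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            rows.append({**base, "schema_name": schema_name, **item})
    # Use the union of all columns so every row has the same keys.
    columns = []
    for flat in rows:
        for key in flat:
            if key not in columns:
                columns.append(key)
    return [{column: flat.get(column) for column in columns} for flat in rows]

row = {
    "delm_file_name": "doc1.txt",
    "delm_extracted_data_json": json.dumps({
        "products": [{"name": "Widget", "price": 10.0},
                     {"name": "Gadget", "price": 20.0}],
        "companies": {"name": "Apple"},  # illustrative payload with no price
    }),
}
multi_rows = explode_multiple(row)
for multi_row in multi_rows:
    print(multi_row)
```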
# Input
delm_file_name | delm_extracted_data_json
"doc1.txt" | {"products": [...], "companies": {...}}
# Output
delm_file_name | schema_name | name | price
"doc1.txt" | "products" | "Widget" | 10.0
"doc1.txt" | "products" | "Gadget" | 20.0
"doc1.txt" | "companies" | "Apple" | None
Missing Fields
Missing fields appear as None:
# Input JSON: {"company": "Apple"} (price missing)
# Output row: {"company": "Apple", "price": None}
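The None-filling shown above amounts to looking up every schema field in the extracted record and defaulting to None. A minimal sketch, assuming the schema's field names are available as a plain list (`fill_missing` and `field_names` are hypothetical names):

```python
def fill_missing(record, field_names):
    """Hypothetical helper: return a row with every schema field, None if absent."""
    return {name: record.get(name) for name in field_names}

field_names = ["company", "price"]  # assumed to come from the schema
filled = fill_missing({"company": "Apple"}, field_names)
print(filled)
```

Because `dict.get` returns None for absent keys, downstream code can rely on a fixed set of columns regardless of what the model extracted.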
Example
from delm import DELM, Schema, ExtractionVariable
from delm.utils.post_processing import explode_json_results
schema = Schema.nested(
    "products",
    ExtractionVariable("name", "Product name", "string"),
    ExtractionVariable("price", "Product price", "number")
)
delm = DELM(schema=schema, provider="openai", model="gpt-4o-mini")
results = delm.extract("data.csv")
# Flatten nested results
exploded = explode_json_results(results, schema)
# Now each product is a separate row
print(exploded[["delm_file_name", "name", "price"]])
merge_jsons_for_record()
Merge multiple JSON extractions for the same record (used internally).
from delm.utils.post_processing import merge_jsons_for_record
merged_json = merge_jsons_for_record(
    json_list: List[dict],
    schema: ExtractionSchema
) -> dict
Merging rules:
- Scalars: Majority vote (ties → first value)
- Lists: Concatenate all values
Example:
jsons = [
    {"price": 10.0, "tags": ["tech"]},
    {"price": 10.0, "tags": ["gadgets"]},
    {"price": 15.0, "tags": ["home"]}
]
merged = merge_jsons_for_record(jsons, schema)
# {"price": 10.0, "tags": ["tech", "gadgets", "home"]}
# price: 10.0 wins (2 votes vs 1)
# tags: all concatenated
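The two merging rules can be sketched independently of the library. The function below is an illustrative approximation, not the actual `merge_jsons_for_record` implementation: `merge_records_sketch` is a hypothetical name, it ignores the schema argument, and it assumes scalar values are hashable.

```python
from collections import Counter

def merge_records_sketch(json_list):
    """Hypothetical sketch: majority vote for scalars, concatenation for lists."""
    # Collect keys in first-seen order.
    keys = []
    for record in json_list:
        for key in record:
            if key not in keys:
                keys.append(key)

    merged = {}
    for key in keys:
        values = [record[key] for record in json_list if key in record]
        if any(isinstance(value, list) for value in values):
            # Lists: concatenate in input order.
            combined = []
            for value in values:
                combined.extend(value if isinstance(value, list) else [value])
            merged[key] = combined
        else:
            # Scalars: the most frequent value wins; on a tie, the value
            # that appears first in the input is kept.
            counts = Counter(values)
            top = max(counts.values())
            merged[key] = next(value for value in values if counts[value] == top)
    return merged

records = [
    {"price": 10.0, "tags": ["tech"]},
    {"price": 10.0, "tags": ["gadgets"]},
    {"price": 15.0, "tags": ["home"]},
]
merged_sketch = merge_records_sketch(records)
print(merged_sketch)
```

Scanning `values` in input order, rather than relying on the counter's ordering, is what makes the tie-breaking rule explicit.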