Schema Reference

Schemas define the structured outputs that DELM extracts from your documents. The schema system supports three levels of increasing complexity, from simple key‑value extraction to several nested structures combined in one schema.

Table of Contents

Simple Schema (Level 1)
Nested Schema (Level 2)
Multiple Schemas (Level 3)
Variable Configuration

Imports

All schema classes are available directly from the main package:

from delm import Schema, ExtractionVariable

Schema Types

DELM supports three levels of schema complexity, each building on the previous level.

Simple Schema (Level 1)

The simplest form of extraction: a flat set of key‑value pairs, extracted once per chunk. Each value is a scalar or a list, depending on the variable's data_type.

schema = Schema.simple(
    ExtractionVariable(
        name="company_names",
        description="Company names mentioned in the text",
        data_type="[string]",
        required=False
    ),
    ExtractionVariable(
        name="revenue_numbers",
        description="Revenue figures mentioned",
        data_type="[number]",
        required=False
    ),
    ExtractionVariable(
        name="forecast_year",
        description="Year for which forecast is made",
        data_type="integer",
        required=True,
        validate_in_text=True
    )
)

Output Format:

{
  "company_names": ["Apple", "Microsoft"],
  "revenue_numbers": [1500000000, 2000000000],
  "forecast_year": 2024
}
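When consuming results downstream, it can be useful to confirm that required variables are actually present. The following is a minimal sketch in plain Python, independent of DELM: the helper `missing_required` and the `required_flags` mapping are hypothetical names that simply mirror the `required` settings of the schema above, not part of the library.

```python
# Hypothetical helper, not part of DELM: list the required keys that are
# absent or null in one extracted record.
def missing_required(record, required_flags):
    return [
        name
        for name, is_required in required_flags.items()
        if is_required and record.get(name) is None
    ]

# Mirrors the `required` settings of the simple schema above.
required_flags = {
    "company_names": False,
    "revenue_numbers": False,
    "forecast_year": True,
}

complete = {
    "company_names": ["Apple", "Microsoft"],
    "revenue_numbers": [1500000000, 2000000000],
    "forecast_year": 2024,
}
incomplete = {"company_names": ["Apple"]}

print(missing_required(complete, required_flags))    # []
print(missing_required(incomplete, required_flags))  # ['forecast_year']
```
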

Nested Schema (Level 2)

Extract structured objects with multiple related fields (a list of dictionaries).

schema = Schema.nested(
    container_name="companies",
    variables_list=[
        ExtractionVariable(
            name="name",
            description="Company name",
            data_type="string",
            required=True
        ),
        ExtractionVariable(
            name="revenue",
            description="Revenue figure in USD",
            data_type="number",
            required=False
        ),
        ExtractionVariable(
            name="sector",
            description="Business sector",
            data_type="string",
            required=False,
            allowed_values=["technology", "finance", "healthcare", "energy", "retail"]
        ),
        ExtractionVariable(
            name="growth_rate",
            description="Annual growth rate percentage",
            data_type="number",
            required=False,
            validate_in_text=True  # Only extract if explicitly mentioned
        ),
        ExtractionVariable(
            name="products",
            description="List of products offered by the company",
            data_type="[string]",
            required=False
        )
    ]
)

Output Format:

{
  "companies": [
    {
      "name": "Apple",
      "revenue": 1500000000,
      "sector": "technology",
      "growth_rate": 12.5,
      "products": ["iPhone", "MacBook", "iPad"]
    },
    {
      "name": "Microsoft",
      "revenue": 2000000000,
      "sector": "technology",
      "growth_rate": null,
      "products": ["Windows", "Office", "Azure"]
    }
  ]
}
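Because the nested output is a list of dictionaries under the container name, it can be flattened into rows with ordinary Python. The sketch below uses only the standard library and the example output shown above. It assumes that optional fields may be null or missing, which is why it reads them with `.get`.

```python
import json

raw = """
{
  "companies": [
    {"name": "Apple", "revenue": 1500000000, "sector": "technology",
     "growth_rate": 12.5, "products": ["iPhone", "MacBook", "iPad"]},
    {"name": "Microsoft", "revenue": 2000000000, "sector": "technology",
     "growth_rate": null, "products": ["Windows", "Office", "Azure"]}
  ]
}
"""

result = json.loads(raw)

# One row per extracted company; JSON null becomes Python None.
rows = [
    (c["name"], c.get("revenue"), c.get("growth_rate"), len(c.get("products") or []))
    for c in result["companies"]
]

for row in rows:
    print(row)
```
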

Multiple Schemas (Level 3)

Extract multiple independent structured objects simultaneously. Each sub‑schema can be simple, nested, or itself a multiple schema.

# Define sub-schemas first
companies_schema = Schema.nested(
    container_name="companies",
    variables_list=[
        ExtractionVariable(name="name", description="Company name", data_type="string", required=True),
        ExtractionVariable(name="revenue", description="Revenue figure", data_type="number", required=False)
    ]
)

products_schema = Schema.nested(
    container_name="products",
    variables_list=[
        ExtractionVariable(name="name", description="Product name", data_type="string", required=True),
        ExtractionVariable(name="price", description="Product price in USD", data_type="number", required=False),
        ExtractionVariable(
            name="category", 
            description="Product category", 
            data_type="string", 
            allowed_values=["software", "hardware", "service", "consulting"]
        )
    ]
)

trends_schema = Schema.nested(
    container_name="trends",
    variables_list=[
        ExtractionVariable(name="trend_name", description="Market trend description", data_type="string", required=True),
        ExtractionVariable(
            name="impact", 
            description="Expected impact", 
            data_type="string", 
            allowed_values=["positive", "negative", "neutral"]
        )
    ]
)

# Combine into a multiple schema; each keyword matches the sub-schema's container_name
schema = Schema.multiple(
    companies=companies_schema,
    products=products_schema,
    trends=trends_schema
)

Output Format:

{
  "companies": [
    { "name": "Apple", "revenue": 1500000000 }
  ],
  "products": [
    { "name": "iPhone 15", "price": 999, "category": "hardware" }
  ],
  "trends": [
    { "trend_name": "AI adoption acceleration", "impact": "positive" }
  ]
}
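Each top-level key in this output holds its own independent list of records, so the sub-results can be processed separately. The following sketch works on the example output above as a plain dictionary. The variable names are illustrative and nothing here calls DELM.

```python
result = {
    "companies": [{"name": "Apple", "revenue": 1500000000}],
    "products": [{"name": "iPhone 15", "price": 999, "category": "hardware"}],
    "trends": [{"trend_name": "AI adoption acceleration", "impact": "positive"}],
}

# Count the records extracted for each sub-schema.
counts = {key: len(records) for key, records in result.items()}
print(counts)  # {'companies': 1, 'products': 1, 'trends': 1}

# Read one field from one sub-schema without touching the others.
positive_trends = [
    t["trend_name"] for t in result["trends"] if t.get("impact") == "positive"
]
print(positive_trends)  # ['AI adoption acceleration']
```
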

Variable Configuration

Each ExtractionVariable can be configured with these arguments.

Required Arguments

Argument	Type	Description
`name`	string	Variable name (used as JSON key)
`description`	string	Human‑readable description for LLM
`data_type`	string	Data type (see supported types below)

Optional Arguments

Argument	Type	Default	Description
`required`	boolean	`False`	Whether field must be present
`allowed_values`	list	`None`	List of valid string values (enums)
`validate_in_text`	boolean	`False`	Only extract if value literally appears in text
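To make the intent of `allowed_values` and `validate_in_text` concrete, here is a rough sketch of equivalent checks written as standalone Python. Both helper functions are hypothetical, and the case-insensitive substring match is an assumption made for illustration. DELM's internal validation may behave differently.

```python
# Hypothetical post-processing helpers; DELM's internal checks may differ.
def passes_allowed_values(value, allowed_values):
    # No restriction configured, or nothing extracted: nothing to check.
    if allowed_values is None or value is None:
        return True
    return value in allowed_values

def appears_in_text(value, text):
    # Assumed rule: case-insensitive substring match on the value's string form.
    return str(value).lower() in text.lower()

text = "Apple, a technology company, grew 12.5% last year."
sectors = ["technology", "finance", "healthcare", "energy", "retail"]

print(passes_allowed_values("technology", sectors))  # True
print(passes_allowed_values("automotive", sectors))  # False
print(appears_in_text(12.5, text))                   # True
print(appears_in_text(8.0, text))                    # False
```
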

Supported Data Types

Type String	Description	Example Values
`"string"`	Text values	"Apple", "technology"
`"number"`	Floating‑point numbers	1500000000, 12.5
`"integer"`	Whole numbers	2024, 100
`"boolean"`	True/false values	`True`, `False`
`"date"`	Date strings	"2025-09-15"
`"[string]"`	List of strings	["Apple", "Google"]
`"[number]"`	List of numbers	[12.5, 42, 100]
`"[integer]"`	List of integers	[2024, 100, 7]
`"[boolean]"`	List of booleans	[True, False, True]
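As an illustration of how the type strings in this table could be interpreted, the sketch below maps them to plain Python checks. The mapping and the `matches_type` helper are assumptions for demonstration only, not DELM's own type handling. `"date"` is left out because the table gives only an example value, not the accepted formats.

```python
# Assumed mapping from the type strings above to Python checks.
SCALAR_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

def matches_type(value, data_type):
    # Type strings such as "[string]" denote a list of the inner scalar type.
    if data_type.startswith("[") and data_type.endswith("]"):
        inner = SCALAR_CHECKS[data_type[1:-1]]
        return isinstance(value, list) and all(inner(v) for v in value)
    return SCALAR_CHECKS[data_type](value)

print(matches_type(2024, "integer"))                  # True
print(matches_type(12.5, "integer"))                  # False
print(matches_type(["Apple", "Google"], "[string]"))  # True
print(matches_type([12.5, "42"], "[number]"))         # False
```
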