Schema Reference¶
Schemas define the structured outputs that DELM extracts from your documents. The schema system supports progressive complexity levels, from simple key‑value extraction to complex nested structures.
Table of Contents¶
- Schema Types
- Simple Schema (Level 1)
- Nested Schema (Level 2)
- Multiple Schemas (Level 3)
- Variable Configuration
- Prompt Customization
- Schema Examples
Schema Types¶
DELM supports three levels of schema complexity, each building on the previous level.
Simple Schema (Level 1)¶
The simplest form of extraction: individual key‑value pairs.
variables:
- name: "company_names"
description: "Company names mentioned in the text"
data_type: "[string]"
required: false
- name: "revenue_numbers"
description: "Revenue figures mentioned"
data_type: "[number]"
required: false
- name: "forecast_year"
description: "Year for which forecast is made"
data_type: "integer"
required: true
validate_in_text: true
Output Format:
{
"company_names": ["Apple", "Microsoft"],
"revenue_numbers": [1500000000, 2000000000],
"forecast_year": 2024
}
Nested Schema (Level 2)¶
Extract structured objects with multiple related fields.
schema_type: "nested"
container_name: "companies"
variables:
- name: "name"
description: "Company name"
data_type: "string"
required: true
- name: "revenue"
description: "Revenue figure in USD"
data_type: "number"
required: false
- name: "sector"
description: "Business sector"
data_type: "string"
required: false
allowed_values: ["technology", "finance", "healthcare", "energy", "retail"]
- name: "growth_rate"
description: "Annual growth rate percentage"
data_type: "number"
required: false
validate_in_text: true # Only extract if explicitly mentioned
- name: "products"
description: "List of products offered by the company"
data_type: "[string]"
required: false
Output Format:
{
"companies": [
{
"name": "Apple",
"revenue": 1500000000,
"sector": "technology",
"growth_rate": 12.5,
"products": ["iPhone", "MacBook", "iPad"]
},
{
"name": "Microsoft",
"revenue": 2000000000,
"sector": "technology",
"growth_rate": null,
"products": ["Windows", "Office", "Azure"]
}
]
}
Multiple Schemas (Level 3)¶
Extract multiple independent structured objects simultaneously. These can be simple, nested, or even deep multi‑schemas.
schema_type: "multiple"
# Companies schema
companies:
schema_type: "nested"
container_name: "companies"
variables:
- name: "name"
description: "Company name"
data_type: "string"
required: true
- name: "revenue"
description: "Revenue figure"
data_type: "number"
required: false
# Products schema
products:
schema_type: "nested"
container_name: "products"
variables:
- name: "name"
description: "Product name"
data_type: "string"
required: true
- name: "price"
description: "Product price in USD"
data_type: "number"
required: false
- name: "category"
description: "Product category"
data_type: "string"
allowed_values: ["software", "hardware", "service", "consulting"]
required: false
# Market trends schema
market_trends:
schema_type: "nested"
container_name: "trends"
variables:
- name: "trend_name"
description: "Market trend description"
data_type: "string"
required: true
- name: "impact"
description: "Expected impact (positive/negative/neutral)"
data_type: "string"
allowed_values: ["positive", "negative", "neutral"]
required: false
Output Format:
{
"companies": [
{ "name": "Apple", "revenue": 1500000000 }
],
"products": [
{ "name": "iPhone 15", "price": 999, "category": "hardware" }
],
"trends": [
{ "trend_name": "AI adoption acceleration", "impact": "positive" }
]
}
Variable Configuration¶
Each variable in your schema can be configured with these options.
Required Fields¶
Field | Type | Required | Description |
---|---|---|---|
name |
string | Yes | Variable name (used as JSON key) |
description |
string | Yes | Human‑readable description for LLM |
data_type |
string | Yes | Data type (see supported types below) |
Optional Fields¶
Field | Type | Default | Description |
---|---|---|---|
required |
boolean | false | Whether field must be present |
allowed_values |
array | null | List of valid values |
validate_in_text |
boolean | false | Only extract if explicitly mentioned |
Supported Data Types¶
Type | Description | Example Values |
---|---|---|
string |
Text values | "Apple", "technology" |
number |
Floating‑point numbers | 1500000000, 12.5 |
integer |
Whole numbers | 2024, 100 |
boolean |
True/false values | true, false |
date |
Date strings | "2025-09-15" |
[string] |
List of strings | ["Apple", "Google"] |
[number] |
List of numbers | [12.5, 42, 100] |
[integer] |
List of integers | [2024, 100, 7] |
[boolean] |
List of booleans | [true, false, true] |
Note: List types must be surrounded by quotes in .yaml
files. For example "[string]"
, not [string]
.
Schema spec files are YAML (.yml
/.yaml
).
Prompt Customization¶
DELM renders the prompt using two configurable strings from your pipeline config:
schema.system_prompt
: Injected as the system role messageschema.prompt_template
: A Pythonstr.format
‑style template rendered per chunk, with placeholders:{variables}
: A human‑readable list of variables with types and allowed values{text}
: The current text chunk{context}
: Optional extra key‑values (if provided by advanced flows)
Examples:
System: {schema.system_prompt}
User: {schema.prompt_template.format(variables=..., text=..., context=...)}
Notes: - For Multiple schemas, the prompt is built by concatenating sub‑schema prompts under headings. - Token estimation uses these same prompts, so edits affect cost estimates.
Variable Examples¶
# Simple string field
- name: "company_name"
description: "Name of the company"
data_type: "string"
required: true
# Number with validation
- name: "revenue"
description: "Revenue in USD"
data_type: "number"
required: false
validate_in_text: true
# String field with allowed values (essentially an enum)
- name: "sector"
description: "Business sector"
data_type: "string"
allowed_values: ["technology", "finance", "healthcare"]
required: false
# Boolean field
- name: "is_public"
description: "Whether company is publicly traded"
data_type: "boolean"
required: false
# List of numbers with allowed values
- name: "quarterly_growth_rates"
description: "Quarterly revenue growth rates in percent"
data_type: "[number]"
allowed_values: [0, 5, 10, 15, 20, 25, 30]
required: false
Validation Features¶
Text Validation¶
- name: "commodity_type"
description: "Type of commodity mentioned"
data_type: "string"
validate_in_text: true # Only extract if explicitly mentioned in text
Allowed Values¶
- name: "sentiment"
description: "Overall sentiment"
data_type: "string"
allowed_values: ["positive", "negative", "neutral"]
Cleaning & Validation Semantics¶
- Required fields: If a required field has no valid value, the item is dropped.
- Simple schema: the whole response for a chunk is discarded.
- Nested schema: the specific object is discarded; the chunk may still yield other objects.
- Null‑like strings in string fields (e.g., "none", "null", "unknown", "n/a", "") are filtered unless explicitly listed in
allowed_values
. validate_in_text: true
keeps only string values that literally appear in the source text (case‑insensitive).- For Multiple schemas, nested sub‑schemas are unwrapped in outputs (e.g.,
books: [...]
, notbooks: {books: [...]}
). - For Nested schemas, if
container_name
is omitted, it defaults to "instances".
Schema Examples¶
Financial Report Analysis¶
schema_type: "nested"
container_name: "financial_metrics"
variables:
- name: "metric_name"
description: "Name of the financial metric"
data_type: "string"
required: true
- name: "value"
description: "Numeric value of the metric"
data_type: "number"
required: true
- name: "currency"
description: "Currency of the value"
data_type: "string"
allowed_values: ["USD", "EUR", "GBP"]
required: false
- name: "period"
description: "Time period for the metric"
data_type: "string"
required: false
Commodity Price Extraction¶
variables:
- name: "commodity_type"
description: "Type of commodity mentioned"
data_type: "string"
allowed_values: ["oil", "gas", "gold", "silver", "copper"]
validate_in_text: true
- name: "price_value"
description: "Price value mentioned"
data_type: "number"
required: false
- name: "price_mention"
description: "Whether a price is mentioned"
data_type: "boolean"
required: false
- name: "forecast_period"
description: "Time period for price forecast"
data_type: "string"
required: false
Customer Feedback Analysis¶
schema_type: "multiple"
sentiment:
schema_type: "nested"
container_name: "sentiments"
variables:
- name: "aspect"
description: "Product/service aspect mentioned"
data_type: "string"
required: true
- name: "sentiment"
description: "Sentiment toward the aspect"
data_type: "string"
allowed_values: ["positive", "negative", "neutral"]
required: true
- name: "intensity"
description: "Intensity of the sentiment"
data_type: "string"
allowed_values: ["low", "medium", "high"]
required: false
suggestions:
schema_type: "nested"
container_name: "suggestions"
variables:
- name: "suggestion"
description: "Improvement suggestion"
data_type: "string"
required: true
- name: "category"
description: "Category of suggestion"
data_type: "string"
allowed_values: ["feature", "bug", "ui", "performance"]
required: false
For more help, see the main README.md
or open an issue on GitHub.