Schema Reference¶

Schemas define the structured outputs that DELM extracts from your documents. The schema system supports progressive complexity levels, from simple key‑value extraction to complex nested structures.

Table of Contents¶

Schema Types
Simple Schema (Level 1)
Nested Schema (Level 2)
Multiple Schemas (Level 3)
Variable Configuration
Prompt Customization
Schema Examples

Schema Types¶

DELM supports three levels of schema complexity, each building on the previous level.

Simple Schema (Level 1)¶

The simplest form of extraction: individual key‑value pairs.

variables:
  - name: "company_names"
    description: "Company names mentioned in the text"
    data_type: "[string]"
    required: false

  - name: "revenue_numbers"
    description: "Revenue figures mentioned"
    data_type: "[number]"
    required: false

  - name: "forecast_year"
    description: "Year for which forecast is made"
    data_type: "integer"
    required: true
    validate_in_text: true

Output Format:

{
  "company_names": ["Apple", "Microsoft"],
  "revenue_numbers": [1500000000, 2000000000],
  "forecast_year": 2024
}

Nested Schema (Level 2)¶

Extract structured objects with multiple related fields.

schema_type: "nested"
container_name: "companies"
variables:
  - name: "name"
    description: "Company name"
    data_type: "string"
    required: true

  - name: "revenue"
    description: "Revenue figure in USD"
    data_type: "number"
    required: false

  - name: "sector"
    description: "Business sector"
    data_type: "string"
    required: false
    allowed_values: ["technology", "finance", "healthcare", "energy", "retail"]

  - name: "growth_rate"
    description: "Annual growth rate percentage"
    data_type: "number"
    required: false
    validate_in_text: true  # Only extract if explicitly mentioned

  - name: "products"
    description: "List of products offered by the company"
    data_type: "[string]"
    required: false

Output Format:

{
  "companies": [
    {
      "name": "Apple",
      "revenue": 1500000000,
      "sector": "technology",
      "growth_rate": 12.5,
      "products": ["iPhone", "MacBook", "iPad"]
    },
    {
      "name": "Microsoft",
      "revenue": 2000000000,
      "sector": "technology",
      "growth_rate": null,
      "products": ["Windows", "Office", "Azure"]
    }
  ]
}

Multiple Schemas (Level 3)¶

Extract multiple independent structured objects simultaneously. These can be simple, nested, or even deep multi‑schemas.

schema_type: "multiple"

# Companies schema
companies:
  schema_type: "nested"
  container_name: "companies"
  variables:
    - name: "name"
      description: "Company name"
      data_type: "string"
      required: true
    - name: "revenue"
      description: "Revenue figure"
      data_type: "number"
      required: false

# Products schema
products:
  schema_type: "nested"
  container_name: "products"
  variables:
    - name: "name"
      description: "Product name"
      data_type: "string"
      required: true
    - name: "price"
      description: "Product price in USD"
      data_type: "number"
      required: false
    - name: "category"
      description: "Product category"
      data_type: "string"
      allowed_values: ["software", "hardware", "service", "consulting"]
      required: false

# Market trends schema
market_trends:
  schema_type: "nested"
  container_name: "trends"
  variables:
    - name: "trend_name"
      description: "Market trend description"
      data_type: "string"
      required: true
    - name: "impact"
      description: "Expected impact (positive/negative/neutral)"
      data_type: "string"
      allowed_values: ["positive", "negative", "neutral"]
      required: false

Output Format:

{
  "companies": [
    { "name": "Apple", "revenue": 1500000000 }
  ],
  "products": [
    { "name": "iPhone 15", "price": 999, "category": "hardware" }
  ],
  "trends": [
    { "trend_name": "AI adoption acceleration", "impact": "positive" }
  ]
}

Variable Configuration¶

Each variable in your schema can be configured with these options.

Required Fields¶

Field	Type	Required	Description
`name`	string	Yes	Variable name (used as JSON key)
`description`	string	Yes	Human‑readable description for LLM
`data_type`	string	Yes	Data type (see supported types below)

Optional Fields¶

Field	Type	Default	Description
`required`	boolean	false	Whether field must be present
`allowed_values`	array	null	List of valid values
`validate_in_text`	boolean	false	Only extract if explicitly mentioned

Supported Data Types¶

Type	Description	Example Values
`string`	Text values	"Apple", "technology"
`number`	Floating‑point numbers	1500000000, 12.5
`integer`	Whole numbers	2024, 100
`boolean`	True/false values	true, false
`date`	Date strings	"2025-09-15"
`[string]`	List of strings	["Apple", "Google"]
`[number]`	List of numbers	[12.5, 42, 100]
`[integer]`	List of integers	[2024, 100, 7]
`[boolean]`	List of booleans	[true, false, true]

Note: List types must be surrounded by quotes in .yaml files. For example "[string]", not [string].

Schema spec files are YAML (.yml/.yaml).

Prompt Customization¶

DELM renders the prompt using two configurable strings from your pipeline config:

schema.system_prompt: Injected as the system role message
schema.prompt_template: A Python str.format‑style template rendered per chunk, with placeholders:
{variables}: A human‑readable list of variables with types and allowed values
{text}: The current text chunk
{context}: Optional extra key‑values (if provided by advanced flows)

Examples:

System: {schema.system_prompt}
User: {schema.prompt_template.format(variables=..., text=..., context=...)}

Notes: - For Multiple schemas, the prompt is built by concatenating sub‑schema prompts under headings. - Token estimation uses these same prompts, so edits affect cost estimates.

Variable Examples¶

# Simple string field
- name: "company_name"
  description: "Name of the company"
  data_type: "string"
  required: true

# Number with validation
- name: "revenue"
  description: "Revenue in USD"
  data_type: "number"
  required: false
  validate_in_text: true

# String field with allowed values (essentially an enum)
- name: "sector"
  description: "Business sector"
  data_type: "string"
  allowed_values: ["technology", "finance", "healthcare"]
  required: false

# Boolean field
- name: "is_public"
  description: "Whether company is publicly traded"
  data_type: "boolean"
  required: false

# List of numbers with allowed values
- name: "quarterly_growth_rates"
  description: "Quarterly revenue growth rates in percent"
  data_type: "[number]"
  allowed_values: [0, 5, 10, 15, 20, 25, 30]
  required: false

Validation Features¶

Text Validation¶

- name: "commodity_type"
  description: "Type of commodity mentioned"
  data_type: "string"
  validate_in_text: true  # Only extract if explicitly mentioned in text

Allowed Values¶

- name: "sentiment"
  description: "Overall sentiment"
  data_type: "string"
  allowed_values: ["positive", "negative", "neutral"]

Cleaning & Validation Semantics¶

Required fields: If a required field has no valid value, the item is dropped.
Simple schema: the whole response for a chunk is discarded.
Nested schema: the specific object is discarded; the chunk may still yield other objects.
Null‑like strings in string fields (e.g., "none", "null", "unknown", "n/a", "") are filtered unless explicitly listed in allowed_values.
validate_in_text: true keeps only string values that literally appear in the source text (case‑insensitive).
For Multiple schemas, nested sub‑schemas are unwrapped in outputs (e.g., books: [...], not books: {books: [...]}).
For Nested schemas, if container_name is omitted, it defaults to "instances".

Schema Examples¶

Financial Report Analysis¶

schema_type: "nested"
container_name: "financial_metrics"
variables:
  - name: "metric_name"
    description: "Name of the financial metric"
    data_type: "string"
    required: true
  - name: "value"
    description: "Numeric value of the metric"
    data_type: "number"
    required: true
  - name: "currency"
    description: "Currency of the value"
    data_type: "string"
    allowed_values: ["USD", "EUR", "GBP"]
    required: false
  - name: "period"
    description: "Time period for the metric"
    data_type: "string"
    required: false

Commodity Price Extraction¶

variables:
  - name: "commodity_type"
    description: "Type of commodity mentioned"
    data_type: "string"
    allowed_values: ["oil", "gas", "gold", "silver", "copper"]
    validate_in_text: true
  - name: "price_value"
    description: "Price value mentioned"
    data_type: "number"
    required: false
  - name: "price_mention"
    description: "Whether a price is mentioned"
    data_type: "boolean"
    required: false
  - name: "forecast_period"
    description: "Time period for price forecast"
    data_type: "string"
    required: false

Customer Feedback Analysis¶

schema_type: "multiple"

sentiment:
  schema_type: "nested"
  container_name: "sentiments"
  variables:
    - name: "aspect"
      description: "Product/service aspect mentioned"
      data_type: "string"
      required: true
    - name: "sentiment"
      description: "Sentiment toward the aspect"
      data_type: "string"
      allowed_values: ["positive", "negative", "neutral"]
      required: true
    - name: "intensity"
      description: "Intensity of the sentiment"
      data_type: "string"
      allowed_values: ["low", "medium", "high"]
      required: false

suggestions:
  schema_type: "nested"
  container_name: "suggestions"
  variables:
    - name: "suggestion"
      description: "Improvement suggestion"
      data_type: "string"
      required: true
    - name: "category"
      description: "Category of suggestion"
      data_type: "string"
      allowed_values: ["feature", "bug", "ui", "performance"]
      required: false

For more help, see the main README.md or open an issue on GitHub.