Schema Reference

Schemas define the structured outputs that DELM extracts from your documents. The schema system supports progressive complexity levels, from simple key‑value extraction to complex nested structures.

Table of Contents

Schema Types

DELM supports three levels of schema complexity, each building on the previous level.

Simple Schema (Level 1)

The simplest form of extraction: individual key‑value pairs.

variables:
  - name: "company_names"
    description: "Company names mentioned in the text"
    data_type: "[string]"
    required: false

  - name: "revenue_numbers"
    description: "Revenue figures mentioned"
    data_type: "[number]"
    required: false

  - name: "forecast_year"
    description: "Year for which forecast is made"
    data_type: "integer"
    required: true
    validate_in_text: true

Output Format:

{
  "company_names": ["Apple", "Microsoft"],
  "revenue_numbers": [1500000000, 2000000000],
  "forecast_year": 2024
}

Nested Schema (Level 2)

Extract structured objects with multiple related fields.

schema_type: "nested"
container_name: "companies"
variables:
  - name: "name"
    description: "Company name"
    data_type: "string"
    required: true

  - name: "revenue"
    description: "Revenue figure in USD"
    data_type: "number"
    required: false

  - name: "sector"
    description: "Business sector"
    data_type: "string"
    required: false
    allowed_values: ["technology", "finance", "healthcare", "energy", "retail"]

  - name: "growth_rate"
    description: "Annual growth rate percentage"
    data_type: "number"
    required: false
    validate_in_text: true  # Only extract if explicitly mentioned

  - name: "products"
    description: "List of products offered by the company"
    data_type: "[string]"
    required: false

Output Format:

{
  "companies": [
    {
      "name": "Apple",
      "revenue": 1500000000,
      "sector": "technology",
      "growth_rate": 12.5,
      "products": ["iPhone", "MacBook", "iPad"]
    },
    {
      "name": "Microsoft",
      "revenue": 2000000000,
      "sector": "technology",
      "growth_rate": null,
      "products": ["Windows", "Office", "Azure"]
    }
  ]
}

Multiple Schemas (Level 3)

Extract multiple independent structured objects simultaneously. These can be simple, nested, or even deep multi‑schemas.

schema_type: "multiple"

# Companies schema
companies:
  schema_type: "nested"
  container_name: "companies"
  variables:
    - name: "name"
      description: "Company name"
      data_type: "string"
      required: true
    - name: "revenue"
      description: "Revenue figure"
      data_type: "number"
      required: false

# Products schema
products:
  schema_type: "nested"
  container_name: "products"
  variables:
    - name: "name"
      description: "Product name"
      data_type: "string"
      required: true
    - name: "price"
      description: "Product price in USD"
      data_type: "number"
      required: false
    - name: "category"
      description: "Product category"
      data_type: "string"
      allowed_values: ["software", "hardware", "service", "consulting"]
      required: false

# Market trends schema
market_trends:
  schema_type: "nested"
  container_name: "trends"
  variables:
    - name: "trend_name"
      description: "Market trend description"
      data_type: "string"
      required: true
    - name: "impact"
      description: "Expected impact (positive/negative/neutral)"
      data_type: "string"
      allowed_values: ["positive", "negative", "neutral"]
      required: false

Output Format:

{
  "companies": [
    { "name": "Apple", "revenue": 1500000000 }
  ],
  "products": [
    { "name": "iPhone 15", "price": 999, "category": "hardware" }
  ],
  "trends": [
    { "trend_name": "AI adoption acceleration", "impact": "positive" }
  ]
}

Variable Configuration

Each variable in your schema can be configured with these options.

Required Fields

Field Type Required Description
name string Yes Variable name (used as JSON key)
description string Yes Human‑readable description for LLM
data_type string Yes Data type (see supported types below)

Optional Fields

Field Type Default Description
required boolean false Whether field must be present
allowed_values array null List of valid values
validate_in_text boolean false Only extract if explicitly mentioned

Supported Data Types

Type Description Example Values
string Text values "Apple", "technology"
number Floating‑point numbers 1500000000, 12.5
integer Whole numbers 2024, 100
boolean True/false values true, false
date Date strings "2025-09-15"
[string] List of strings ["Apple", "Google"]
[number] List of numbers [12.5, 42, 100]
[integer] List of integers [2024, 100, 7]
[boolean] List of booleans [true, false, true]

Note: List types must be surrounded by quotes in .yaml files. For example "[string]", not [string].

Schema spec files are YAML (.yml/.yaml).

Prompt Customization

DELM renders the prompt using two configurable strings from your pipeline config:

  • schema.system_prompt: Injected as the system role message
  • schema.prompt_template: A Python str.format‑style template rendered per chunk, with placeholders:
  • {variables}: A human‑readable list of variables with types and allowed values
  • {text}: The current text chunk
  • {context}: Optional extra key‑values (if provided by advanced flows)

Examples:

System: {schema.system_prompt}
User: {schema.prompt_template.format(variables=..., text=..., context=...)}

Notes: - For Multiple schemas, the prompt is built by concatenating sub‑schema prompts under headings. - Token estimation uses these same prompts, so edits affect cost estimates.

Variable Examples

# Simple string field
- name: "company_name"
  description: "Name of the company"
  data_type: "string"
  required: true

# Number with validation
- name: "revenue"
  description: "Revenue in USD"
  data_type: "number"
  required: false
  validate_in_text: true

# String field with allowed values (essentially an enum)
- name: "sector"
  description: "Business sector"
  data_type: "string"
  allowed_values: ["technology", "finance", "healthcare"]
  required: false

# Boolean field
- name: "is_public"
  description: "Whether company is publicly traded"
  data_type: "boolean"
  required: false

# List of numbers with allowed values
- name: "quarterly_growth_rates"
  description: "Quarterly revenue growth rates in percent"
  data_type: "[number]"
  allowed_values: [0, 5, 10, 15, 20, 25, 30]
  required: false

Validation Features

Text Validation

- name: "commodity_type"
  description: "Type of commodity mentioned"
  data_type: "string"
  validate_in_text: true  # Only extract if explicitly mentioned in text

Allowed Values

- name: "sentiment"
  description: "Overall sentiment"
  data_type: "string"
  allowed_values: ["positive", "negative", "neutral"]

Cleaning & Validation Semantics

  • Required fields: If a required field has no valid value, the item is dropped.
  • Simple schema: the whole response for a chunk is discarded.
  • Nested schema: the specific object is discarded; the chunk may still yield other objects.
  • Null‑like strings in string fields (e.g., "none", "null", "unknown", "n/a", "") are filtered unless explicitly listed in allowed_values.
  • validate_in_text: true keeps only string values that literally appear in the source text (case‑insensitive).
  • For Multiple schemas, nested sub‑schemas are unwrapped in outputs (e.g., books: [...], not books: {books: [...]}).
  • For Nested schemas, if container_name is omitted, it defaults to "instances".

Schema Examples

Financial Report Analysis

schema_type: "nested"
container_name: "financial_metrics"
variables:
  - name: "metric_name"
    description: "Name of the financial metric"
    data_type: "string"
    required: true
  - name: "value"
    description: "Numeric value of the metric"
    data_type: "number"
    required: true
  - name: "currency"
    description: "Currency of the value"
    data_type: "string"
    allowed_values: ["USD", "EUR", "GBP"]
    required: false
  - name: "period"
    description: "Time period for the metric"
    data_type: "string"
    required: false

Commodity Price Extraction

variables:
  - name: "commodity_type"
    description: "Type of commodity mentioned"
    data_type: "string"
    allowed_values: ["oil", "gas", "gold", "silver", "copper"]
    validate_in_text: true
  - name: "price_value"
    description: "Price value mentioned"
    data_type: "number"
    required: false
  - name: "price_mention"
    description: "Whether a price is mentioned"
    data_type: "boolean"
    required: false
  - name: "forecast_period"
    description: "Time period for price forecast"
    data_type: "string"
    required: false

Customer Feedback Analysis

schema_type: "multiple"

sentiment:
  schema_type: "nested"
  container_name: "sentiments"
  variables:
    - name: "aspect"
      description: "Product/service aspect mentioned"
      data_type: "string"
      required: true
    - name: "sentiment"
      description: "Sentiment toward the aspect"
      data_type: "string"
      allowed_values: ["positive", "negative", "neutral"]
      required: true
    - name: "intensity"
      description: "Intensity of the sentiment"
      data_type: "string"
      allowed_values: ["low", "medium", "high"]
      required: false

suggestions:
  schema_type: "nested"
  container_name: "suggestions"
  variables:
    - name: "suggestion"
      description: "Improvement suggestion"
      data_type: "string"
      required: true
    - name: "category"
      description: "Category of suggestion"
      data_type: "string"
      allowed_values: ["feature", "bug", "ui", "performance"]
      required: false

For more help, see the main README.md or open an issue on GitHub.