Splitting Strategies¶
Text chunking strategies for preprocessing.
Base Class¶
from typing import List

from delm.strategies import SplitStrategy

class CustomSplitter(SplitStrategy):
    def split(self, text: str) -> List[str]:
        # Return list of text chunks
        return chunks

    def to_dict(self) -> dict:
        return {"type": "CustomSplitter", ...}

    @classmethod
    def from_dict(cls, data: dict) -> "CustomSplitter":
        return cls(...)
Required for disk storage: to persist a custom splitter, implement to_dict() and from_dict(), then register the class:
from delm.strategies import SPLITTER_REGISTRY
SPLITTER_REGISTRY["CustomSplitter"] = CustomSplitter
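The serialization round trip can be sketched end to end as follows. To keep the example runnable without delm installed, it defines local stand-ins for SplitStrategy and SPLITTER_REGISTRY; their exact shape is an assumption based on the snippet above, and LineSplit with its strip parameter is a hypothetical splitter invented for illustration. With delm available, you would import the real base class and registry instead.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class SplitStrategy(ABC):
    """Local stand-in for delm.strategies.SplitStrategy (assumed interface)."""

    @abstractmethod
    def split(self, text: str) -> List[str]: ...

    @abstractmethod
    def to_dict(self) -> dict: ...

    @classmethod
    @abstractmethod
    def from_dict(cls, data: dict) -> "SplitStrategy": ...


# Local stand-in for delm.strategies.SPLITTER_REGISTRY (assumed to be a plain dict).
SPLITTER_REGISTRY: Dict[str, Type[SplitStrategy]] = {}


class LineSplit(SplitStrategy):
    """Hypothetical splitter: one chunk per non-empty line."""

    def __init__(self, strip: bool = True) -> None:
        self.strip = strip

    def split(self, text: str) -> List[str]:
        lines = text.splitlines()
        if self.strip:
            lines = [line.strip() for line in lines]
        return [line for line in lines if line]

    def to_dict(self) -> dict:
        # Store every constructor argument so from_dict() can rebuild the object.
        return {"type": "LineSplit", "strip": self.strip}

    @classmethod
    def from_dict(cls, data: dict) -> "LineSplit":
        return cls(strip=data.get("strip", True))


SPLITTER_REGISTRY["LineSplit"] = LineSplit

# Round trip: serialize, look the class up by its "type" key, rebuild.
saved = LineSplit(strip=True).to_dict()
restored = SPLITTER_REGISTRY[saved["type"]].from_dict(saved)
chunks = restored.split("alpha\n\n  beta  \ngamma")
print(chunks)
```

The sketch registers the class under the same string that to_dict() writes into "type", mirroring the CustomSplitter snippet above; whether DELM performs the lookup exactly this way is not confirmed here.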
Built-in Strategies¶
ParagraphSplit¶
Split by double newlines (paragraphs).
from delm import DELM
delm = DELM(
    schema=schema,
    splitting_strategy={"type": "paragraph"}
)
Output: One chunk per paragraph.
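To make "one chunk per paragraph" concrete, here is a minimal standalone sketch of splitting on blank lines. It is not ParagraphSplit's actual implementation: split_paragraphs is a hypothetical helper, and treating whitespace-only lines as paragraph breaks and dropping empty chunks are assumptions.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks (illustrative, not the DELM API)."""
    # A paragraph break is a newline, optional whitespace, then another newline.
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


doc = (
    "First paragraph, line one.\n"
    "Still the first paragraph.\n"
    "\n"
    "Second paragraph.\n"
    "\n"
    "\n"
    "Third paragraph."
)
paragraphs = split_paragraphs(doc)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, repr(paragraph))
```

Single newlines inside a paragraph are preserved, so the first chunk here still spans two lines.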
FixedWindowSplit¶
Split into sliding windows of N sentences.
delm = DELM(
    schema=schema,
    splitting_strategy={
        "type": "fixed_window",
        "window": 5,  # 5 sentences per chunk
        "stride": 3   # Move 3 sentences (overlap = window - stride)
    }
)
Parameters:
- window (int): Number of sentences per chunk
- stride (int, optional): Step size (default = window, no overlap)
Example:
Text: S1. S2. S3. S4. S5. S6. S7.
window=3, stride=2:
Chunk 1: S1. S2. S3.
Chunk 2: S3. S4. S5.
Chunk 3: S5. S6. S7.
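The window and stride arithmetic in the example above can be reproduced with a short standalone sketch. sliding_windows is a hypothetical helper, not FixedWindowSplit itself; the naive regex sentence splitter and the rule of stopping once a window reaches the last sentence are assumptions chosen to match the three chunks shown.

```python
import re
from typing import List, Optional


def sliding_windows(text: str, window: int, stride: Optional[int] = None) -> List[str]:
    """Group sentences into windows of `window`, advancing `stride` sentences each step."""
    if stride is None:
        stride = window  # default: no overlap between consecutive chunks
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive integers")

    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # this window already covers the final sentence
    return chunks


text = "S1. S2. S3. S4. S5. S6. S7."
overlapping = sliding_windows(text, window=3, stride=2)
for number, chunk in enumerate(overlapping, start=1):
    print(f"Chunk {number}: {chunk}")
```

With window=3 and stride=2, consecutive chunks share window - stride = 1 sentence, which is why S3 and S5 each appear twice.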
RegexSplit¶
Split by custom regex pattern.
delm = DELM(
    schema=schema,
    splitting_strategy={
        "type": "regex",
        "pattern": r"\n\s*---\s*\n"  # Split by "---" separator
    }
)
Parameters:
- pattern (str): Regex pattern to split on
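To preview what a pattern will do before passing it to DELM, you can try it with Python's re module directly. This is a sketch under the assumption that RegexSplit behaves like re.split followed by dropping empty chunks; split_on_pattern is a hypothetical helper, not part of the library.

```python
import re
from typing import List


def split_on_pattern(text: str, pattern: str) -> List[str]:
    """Split text wherever `pattern` matches and drop empty chunks (illustrative only)."""
    parts = re.split(pattern, text)
    return [part.strip() for part in parts if part.strip()]


records = "first record\n---\nsecond record\n  ---  \nthird record"
pieces = split_on_pattern(records, r"\n\s*---\s*\n")
print(pieces)
```

Note that re.split includes the text of any capturing groups in its result, so use non-capturing groups, written (?:...), when testing patterns this way.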
Class-based Definition¶
from delm.strategies import ParagraphSplit, FixedWindowSplit
splitter = ParagraphSplit()
# Or
splitter = FixedWindowSplit(window=5, stride=3)
delm = DELM(
    schema=schema,
    splitting_strategy=splitter
)
Registry¶
Access all available splitters:
from delm.strategies import SPLITTER_REGISTRY
print(SPLITTER_REGISTRY.keys())
# dict_keys(['paragraph', 'fixed_window', 'regex', ...])