Chunking Strategies¶
Chunkin uses LangChain text splitters for all chunking strategies. These are battle-tested components that provide high-quality document splitting.
recursive (Default)¶
RecursiveCharacterTextSplitter from LangChain - Recursively splits text using a hierarchy of separators.
Default separators: ["\n\n", "\n", " ", ""]
Best for: General purpose text chunking with good context preservation.
character¶
CharacterTextSplitter from LangChain - Simple character-based splitting.
Best for: Simple use cases, consistent chunk sizes.
markdown¶
MarkdownTextSplitter from LangChain - Splits markdown preserving structure.
Best for: Markdown documents where header hierarchy should be respected.
markdown_headers¶
MarkdownHeaderTextSplitter from LangChain - Splits by markdown headers only.
Best for: When you want to group content by section headers.
html_headers¶
HTMLHeaderTextSplitter from LangChain - Splits HTML by header tags.
Best for: HTML documents, web page content.
semantic¶
SemanticChunker from LangChain Experimental - Uses embeddings to split text based on semantic similarity.
Best for: When you want chunks that maintain semantic coherence, better for RAG retrieval quality.
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
chunker = DocumentChunker(
strategy="semantic",
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
Semantic Chunking Parameters¶
| Parameter | Default | Description |
|---|---|---|
embeddings |
required | Embedding model (e.g., OpenAIEmbeddings) |
breakpoint_threshold_type |
"percentile" |
Method for determining breakpoints |
breakpoint_threshold_amount |
95 |
Threshold value |
buffer_size |
1 |
Number of sentences to overlap |
min_chunk_size |
0 |
Minimum chunk size in characters |
add_start_index |
False |
Add start index to metadata |
nb_suffix |
1 |
Number suffix for splits |
Breakpoint Threshold Types¶
| Type | Description |
|---|---|
percentile |
Splits at the nth percentile of distances |
standard_deviation |
Splits at n standard deviations above mean |
interquartile |
Splits using interquartile range |
gradient |
Uses gradient of distance array (good for legal/medical texts) |
Choosing a Threshold¶
- Higher threshold (e.g., 95-99): Fewer, larger chunks - less sensitive to meaning shifts
- Lower threshold (e.g., 70-80): More, smaller chunks - more sensitive to meaning shifts
# More sensitive (more chunks)
chunker = DocumentChunker(
strategy="semantic",
embeddings=embeddings,
breakpoint_threshold_amount=80,
)
# Less sensitive (fewer chunks)
chunker = DocumentChunker(
strategy="semantic",
embeddings=embeddings,
breakpoint_threshold_amount=99,
)