Chunking Strategies¶

Chunkin uses LangChain text splitters for all chunking strategies. These are battle-tested components that provide high-quality document splitting.

recursive (Default)¶

RecursiveCharacterTextSplitter from LangChain - Recursively splits text using a hierarchy of separators.

Default separators: ["\n\n", "\n", " ", ""]

Best for: General purpose text chunking with good context preservation.

chunker = DocumentChunker(strategy="recursive")

character¶

CharacterTextSplitter from LangChain - Simple character-based splitting.

Best for: Simple use cases, consistent chunk sizes.

chunker = DocumentChunker(strategy="character", chunk_size=500)

markdown¶

MarkdownTextSplitter from LangChain - Splits markdown preserving structure.

Best for: Markdown documents where header hierarchy should be respected.

chunker = DocumentChunker(strategy="markdown", chunk_size=800)

markdown_headers¶

MarkdownHeaderTextSplitter from LangChain - Splits by markdown headers only.

Best for: When you want to group content by section headers.

chunker = DocumentChunker(strategy="markdown_headers")

html_headers¶

HTMLHeaderTextSplitter from LangChain - Splits HTML by header tags.

Best for: HTML documents, web page content.

chunker = DocumentChunker(strategy="html_headers")

semantic¶

SemanticChunker from LangChain Experimental - Uses embeddings to split text based on semantic similarity.

Best for: When you want chunks that maintain semantic coherence, better for RAG retrieval quality.

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
chunker = DocumentChunker(
    strategy="semantic",
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

Semantic Chunking Parameters¶

Parameter	Default	Description
`embeddings`	required	Embedding model (e.g., OpenAIEmbeddings)
`breakpoint_threshold_type`	`"percentile"`	Method for determining breakpoints
`breakpoint_threshold_amount`	`95`	Threshold value
`buffer_size`	`1`	Number of sentences to overlap
`min_chunk_size`	`0`	Minimum chunk size in characters
`add_start_index`	`False`	Add start index to metadata
`nb_suffix`	`1`	Number suffix for splits

Breakpoint Threshold Types¶

Type	Description
`percentile`	Splits at the nth percentile of distances
`standard_deviation`	Splits at n standard deviations above mean
`interquartile`	Splits using interquartile range
`gradient`	Uses gradient of distance array (good for legal/medical texts)

Choosing a Threshold¶

Higher threshold (e.g., 95-99): Fewer, larger chunks - less sensitive to meaning shifts
Lower threshold (e.g., 70-80): More, smaller chunks - more sensitive to meaning shifts

# More sensitive (more chunks)
chunker = DocumentChunker(
    strategy="semantic",
    embeddings=embeddings,
    breakpoint_threshold_amount=80,
)

# Less sensitive (fewer chunks)
chunker = DocumentChunker(
    strategy="semantic",
    embeddings=embeddings,
    breakpoint_threshold_amount=99,
)