Usage¶
Basic Usage¶
from chunkin import DocumentChunker
chunker = DocumentChunker()
chunks = chunker.create_chunks("path/to/document.pdf")
print(f"Created {len(chunks)} chunks")
for chunk in chunks:
    print(chunk.page_content[:100])
Custom Chunk Size¶
chunker = DocumentChunker(chunk_size=500, chunk_overlap=50)
chunks = chunker.create_chunks("document.pdf")
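To make the two parameters concrete, here is a minimal pure-Python sketch of fixed-width windows that share a few characters with their neighbours. It is an illustration only: `naive_chunks` is a hypothetical helper, not part of Chunkin, and the LangChain splitters prefer to break on separators rather than at exact character offsets, so real chunk boundaries will differ.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


pieces = naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each piece starts with the last character of the previous one, which is the role `chunk_overlap` plays: context at a boundary appears in both neighbouring chunks.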
Choosing a Strategy¶
Chunkin uses LangChain text splitters for all chunking strategies:
# Recursive (default) - uses RecursiveCharacterTextSplitter
chunker = DocumentChunker(strategy="recursive")
# Character-based - uses CharacterTextSplitter
chunker = DocumentChunker(strategy="character", chunk_size=300)
# Markdown-aware - uses MarkdownTextSplitter
chunker = DocumentChunker(strategy="markdown", chunk_size=800)
# Semantic chunking - uses SemanticChunker (LangChain experimental)
from langchain_openai import OpenAIEmbeddings
chunker = DocumentChunker(
    strategy="semantic",
    embeddings=OpenAIEmbeddings(),
)
See Chunking Strategies for details on each strategy.
Output Directory & Saving¶
By default, chunks are saved as JSON files. Set output_dir to specify a directory:
chunker = DocumentChunker(output_dir="chunks")
chunks = chunker.create_chunks("document.pdf")
# Saves to: chunks/document_chunks.json
For a single file processed without output_dir, the JSON file is saved to the working directory.
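The naming pattern described in this section can be sketched with `pathlib`. This is an assumption based on the `chunks/document_chunks.json` example, not Chunkin's actual code, and `chunk_output_path` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def chunk_output_path(source: str, output_dir: Optional[str] = None) -> Path:
    """Build '<stem>_chunks.json' inside output_dir, or the working directory."""
    base = Path(output_dir) if output_dir else Path.cwd()
    return base / f"{Path(source).stem}_chunks.json"


# With an output directory, as in the example above
print(chunk_output_path("document.pdf", "chunks").as_posix())  # chunks/document_chunks.json

# Without one, the file lands in the current working directory
print(chunk_output_path("document.pdf").name)  # document_chunks.json
```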
Retrieving Chunks¶
After processing, access chunks from the internal store:
# Get all chunks
all_chunks = chunker.get_chunks()
# Get chunks for specific file
pdf_chunks = chunker.get_chunks("document.pdf")
# List all processed files and their chunk counts
chunk_summary = chunker.list_chunks()
# {'document.pdf': 15, 'report.docx': 8}
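Assuming `list_chunks()` returns a plain dictionary of file names to chunk counts, as the comment above suggests, the summary can be aggregated with ordinary dictionary operations. The values below are copied from that comment rather than produced by a real run.

```python
# Shape copied from the list_chunks() example; not real output
chunk_summary = {"document.pdf": 15, "report.docx": 8}

total_chunks = sum(chunk_summary.values())
largest_file = max(chunk_summary, key=chunk_summary.get)

print(f"{len(chunk_summary)} files, {total_chunks} chunks in total")
print(f"Most chunks: {largest_file} ({chunk_summary[largest_file]})")
```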
Batch Processing¶
Process all documents in a directory:
chunker = DocumentChunker(output_dir="chunks")
# Get all chunks as a dictionary
all_chunks = chunker.batch_chunks("path/to/documents")
for file_path, chunks in all_chunks.items():
    print(f"{file_path}: {len(chunks)} chunks")
# Saves to: chunks/document_chunks.json, chunks/report_chunks.json, etc.
Filter by specific extensions and process subdirectories recursively:
all_chunks = chunker.batch_chunks(
    "path/to/documents",
    extensions=[".pdf", ".docx", ".txt"],
    recursive=True,
)
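What `extensions` and `recursive` select can be approximated with `pathlib`. The sketch below is not how Chunkin discovers files; `find_documents` is a hypothetical helper, and case-insensitive extension matching is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_documents(
    directory: str,
    extensions: Optional[Iterable[str]] = None,
    recursive: bool = False,
) -> list[Path]:
    """Return files under directory, optionally filtered by extension."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in extensions} if extensions else None
    return sorted(
        p for p in candidates
        if p.is_file() and (wanted is None or p.suffix.lower() in wanted)
    )


# Build a small throwaway tree to show the difference
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").touch()
    (Path(tmp) / "notes.md").touch()
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.docx").touch()
    top_level = [p.name for p in find_documents(tmp, [".pdf", ".docx"])]
    nested = [p.name for p in find_documents(tmp, [".pdf", ".docx"], recursive=True)]

print(top_level)  # ['a.pdf']
print(nested)     # ['a.pdf', 'b.docx']
```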
Stream processing for large directories:
chunker = DocumentChunker(output_dir="chunks")
for file_path, chunks in chunker.batch_chunks_stream("path/to/documents"):
    print(f"Processing: {file_path}")
    # Process chunks here
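The practical difference from `batch_chunks` is that results arrive one file at a time instead of as a complete dictionary. Assuming `batch_chunks_stream` behaves like a Python generator (the loop above suggests this, but it is an assumption), the pattern looks roughly like the sketch below. `stream_results` and `fake_chunker` are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Iterator


def stream_results(
    paths: Iterable[str],
    process: Callable[[str], list[str]],
) -> Iterator[tuple[str, list[str]]]:
    """Yield (path, chunks) pairs one at a time instead of building a dict."""
    for path in paths:
        yield path, process(path)


processed = []


def fake_chunker(path: str) -> list[str]:
    processed.append(path)             # record when the work actually happens
    return [f"{path}#0", f"{path}#1"]  # placeholder chunks


stream = stream_results(["a.pdf", "b.pdf"], fake_chunker)
first_path, first_chunks = next(stream)
print(first_path, len(first_chunks), processed)  # a.pdf 2 ['a.pdf']
```

Only the first file has been processed when the first pair is consumed, so memory use stays bounded by one file's chunks rather than the whole directory's.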
Chunk Metadata¶
Each chunk includes metadata:
{
    "source": "path/to/document.pdf",   # from LangChain loader
    "page": 0,                          # from loader (PDF only)
    "chunk_index": 0,                   # added by DocumentChunker
    "source_file": "document.pdf",      # added by DocumentChunker
    "chunking_strategy": "recursive"    # added by DocumentChunker
}
Access metadata:
chunks = chunker.create_chunks("document.pdf")
for chunk in chunks:
    print(f"Chunk {chunk.metadata['chunk_index']}: {len(chunk.page_content)} chars")
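Because the metadata is an ordinary dictionary, chunks can be grouped or filtered on any of its keys. The sketch below groups chunk indices by page. `FakeChunk` is a hypothetical stand-in that exposes the same `page_content` and `metadata` attributes, and the sample values are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Stand-in exposing the two attributes used in this guide."""
    page_content: str
    metadata: dict = field(default_factory=dict)


chunks = [
    FakeChunk("Intro text", {"page": 0, "chunk_index": 0, "source_file": "document.pdf"}),
    FakeChunk("More intro", {"page": 0, "chunk_index": 1, "source_file": "document.pdf"}),
    FakeChunk("Chapter one", {"page": 1, "chunk_index": 2, "source_file": "document.pdf"}),
]

by_page = defaultdict(list)
for chunk in chunks:
    # .get() because "page" is only present for PDFs
    by_page[chunk.metadata.get("page")].append(chunk.metadata["chunk_index"])

print(dict(by_page))  # {0: [0, 1], 1: [2]}
```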
Supported Formats¶
Chunkin uses LangChain document loaders for format support:
| Format | Extension | LangChain Loader |
|---|---|---|
| PDF | .pdf | PyPDFLoader |
| Word | .docx, .doc | UnstructuredWordDocumentLoader |
| Text | .txt | TextLoader |
| Markdown | .md | UnstructuredMarkdownLoader |
| CSV | .csv | CSVLoader |
| Excel | .xlsx, .xls | UnstructuredExcelLoader |
| PowerPoint | .pptx, .ppt | UnstructuredPowerPointLoader |
For a full list of available document loaders in LangChain, see the LangChain documentation.
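The table above amounts to a lookup from file extension to loader. A minimal sketch of that lookup follows, using loader class names as plain strings so that it runs without LangChain installed. `LOADER_NAMES` and `loader_name_for` are hypothetical names, and the case-insensitive matching is an assumption rather than documented Chunkin behaviour.

```python
from pathlib import Path

# Loader class names copied from the table above, keyed by extension
LOADER_NAMES = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".doc": "UnstructuredWordDocumentLoader",
    ".txt": "TextLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".csv": "CSVLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".xls": "UnstructuredExcelLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".ppt": "UnstructuredPowerPointLoader",
}


def loader_name_for(path: str) -> str:
    """Return the loader name for a file, or raise if the format is not listed."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_NAMES:
        raise ValueError(f"Unsupported file type: {path!r}")
    return LOADER_NAMES[suffix]


print(loader_name_for("Report.PDF"))   # PyPDFLoader
print(loader_name_for("slides.pptx"))  # UnstructuredPowerPointLoader
```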