API Reference
DocumentChunker
Constructor
DocumentChunker(
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    strategy: str = "recursive",
    separators: Optional[List[str]] = None,
    is_separator_regex: bool = False,
    keep_separator: bool = True,
    embeddings: Optional[Embeddings] = None,
    breakpoint_threshold_type: str = "percentile",
    breakpoint_threshold_amount: int = 95,
    min_chunk_size: int = 0,
    buffer_size: int = 1,
    add_start_index: bool = False,
    nb_suffix: int = 1,
    output_dir: Optional[str] = None,
    save_chunks: bool = True,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Target size of each chunk |
| chunk_overlap | int | 200 | Overlap between chunks |
| strategy | str | "recursive" | Chunking strategy |
| separators | List[str] | None | Custom separators |
| is_separator_regex | bool | False | Treat separators as regex |
| keep_separator | bool | True | Include separator in chunks |
| embeddings | Embeddings | None | Required for semantic strategy |
| breakpoint_threshold_type | str | "percentile" | Semantic breakpoint method |
| breakpoint_threshold_amount | int | 95 | Threshold for semantic splits |
| buffer_size | int | 1 | Sentence overlap for semantic |
| min_chunk_size | int | 0 | Minimum chunk size |
| add_start_index | bool | False | Add start index to metadata |
| nb_suffix | int | 1 | Number suffix for splits |
| output_dir | str | None | Directory to save chunk JSON files |
| save_chunks | bool | True | Whether to save chunks to JSON |
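To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not DocumentChunker's implementation: the function name sliding_chunks is hypothetical, and measuring sizes in characters is an assumption.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Hypothetical helper: split text into windows of chunk_size characters,
    each sharing chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so a larger overlap produces more chunks for the same input.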
Methods
create_chunks(file_path: str) -> List[Document]
Loads the document at file_path, splits it into chunks, saves the chunks to JSON (see save_chunks and output_dir), and stores them in the chunker's internal collection.
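get_chunks(file_path: Optional[str] = None) -> Union[List[Document], Dict[str, List[Document]]]
Retrieves chunks from the internal store: every stored chunk when called without arguments, or only the chunks for file_path when one is given.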
# All chunks
all_chunks = chunker.get_chunks()
# Chunks for specific file
pdf_chunks = chunker.get_chunks("document.pdf")
list_chunks() -> Dict[str, int]
Returns a dictionary that maps each processed file to its chunk count.
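The relationship between get_chunks and list_chunks can be pictured as a dictionary of chunk lists keyed by file. The sketch below is an assumption about the shape of the data, not the library's internals: ChunkStore and its add method are hypothetical, plain strings stand in for Document objects, and returning an empty list for an unknown file is a guess.

```python
from typing import Dict, List, Optional, Union


class ChunkStore:
    """Hypothetical in-memory store: file path -> list of chunks."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = {}

    def add(self, file_path: str, chunks: List[str]) -> None:
        self._chunks[file_path] = list(chunks)

    def get_chunks(
        self, file_path: Optional[str] = None
    ) -> Union[List[str], Dict[str, List[str]]]:
        if file_path is None:
            return dict(self._chunks)  # everything, keyed by file
        # Assumption: an unknown file yields an empty list rather than an error.
        return self._chunks.get(file_path, [])

    def list_chunks(self) -> Dict[str, int]:
        return {path: len(chunks) for path, chunks in self._chunks.items()}


store = ChunkStore()
store.add("document.pdf", ["first chunk", "second chunk"])
store.add("notes.txt", ["only chunk"])
print(store.list_chunks())  # {'document.pdf': 2, 'notes.txt': 1}
```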
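batch_chunks(directory: str, extensions: Optional[List[str]] = None, recursive: bool = False) -> Dict[str, List[Document]]
Processes all documents in a directory and returns a dictionary of chunk lists, one entry per file. Use extensions to restrict which file types are processed and recursive to include subdirectories.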
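batch_chunks_stream(directory: str, extensions: Optional[List[str]] = None, recursive: bool = False) -> Iterator[tuple[str, List[Document]]]
Processes the documents in a directory one file at a time, yielding a (file_path, chunks) pair for each, so that the chunks for all files do not have to be held in memory at once.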
for file_path, chunks in chunker.batch_chunks_stream("path/to/documents"):
    print(f"Processed: {file_path} ({len(chunks)} chunks)")
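The memory benefit of streaming comes from using a generator: each file's chunks are produced, handed to the caller, and can be released before the next file is read. The following is a minimal sketch of that pattern, not the library's code. The function stream_files is hypothetical, splitting on lines stands in for real loading and chunking, and matching extensions case-insensitively with a leading dot is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List, Optional, Tuple


def stream_files(
    directory: str,
    extensions: Optional[List[str]] = None,
    recursive: bool = False,
) -> Iterator[Tuple[str, List[str]]]:
    """Hypothetical generator: yield (file_path, chunks) one file at a time."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Sorting holds the paths in memory, but never the file contents.
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if extensions is not None and path.suffix.lower() not in extensions:
            continue
        # Stand-in for real chunking: one chunk per non-empty line.
        text = path.read_text(encoding="utf-8")
        yield str(path), [line for line in text.splitlines() if line]


seen: Dict[str, int] = {}
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("one\ntwo\n", encoding="utf-8")
    Path(tmp, "b.md").write_text("three\n", encoding="utf-8")
    Path(tmp, "sub").mkdir()
    Path(tmp, "sub", "c.txt").write_text("four\n", encoding="utf-8")
    for file_path, chunks in stream_files(tmp, extensions=[".txt"], recursive=True):
        print(f"Processed: {Path(file_path).name} ({len(chunks)} chunks)")
        seen[Path(file_path).name] = len(chunks)
```

Collecting every yielded pair into a dictionary would give the same result as a batch call, at the cost of keeping all chunks in memory.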
supported_formats() -> List[str]
Returns the list of supported file extensions. It can be called on the class, without creating an instance.
formats = DocumentChunker.supported_formats()
# ['.pdf', '.docx', '.doc', '.txt', '.md', '.csv', '.xlsx', '.xls', '.pptx', '.ppt']
supported_strategies() -> List[str]
Returns the list of available chunking strategies. It can be called on the class, without creating an instance.
strategies = DocumentChunker.supported_strategies()
# ['recursive', 'character', 'markdown', 'markdown_headers', 'html_headers', 'semantic']
Chunk Metadata
Each chunk carries the following metadata fields:
From document loaders:
- source - full path to source file
- page - page number (PDF only)
Added by DocumentChunker:
- chunk_index - index of chunk within source file
- source_file - filename only
- chunking_strategy - strategy used
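The fields listed above can be illustrated with a short sketch that attaches them to a list of chunks. This is an assumed reconstruction for illustration only: annotate_chunks is a hypothetical helper, plain dictionaries stand in for Document objects, and the zero-based chunk_index is an assumption.

```python
import os
from typing import Any, Dict, List


def annotate_chunks(chunks: List[str], source: str, strategy: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: pair each chunk with its metadata fields."""
    annotated = []
    for index, content in enumerate(chunks):
        annotated.append({
            "content": content,
            "metadata": {
                "source": source,                           # full path, as set by the loader
                "chunk_index": index,                       # position within the source file
                "source_file": os.path.basename(source),    # filename only
                "chunking_strategy": strategy,
            },
        })
    return annotated


records = annotate_chunks(
    ["Chunk text content...", "Next chunk..."],
    "/path/to/document.pdf",
    "recursive",
)
print(records[1]["metadata"]["source_file"])  # document.pdf
```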
Output Format
Chunks are saved as a JSON array with one object per chunk:
[
  {
    "content": "Chunk text content...",
    "metadata": {
      "source": "/path/to/document.pdf",
      "page": 0,
      "chunk_index": 0,
      "source_file": "document.pdf",
      "chunking_strategy": "recursive"
    }
  },
  {
    "content": "Next chunk...",
    "metadata": {
      "source": "/path/to/document.pdf",
      "page": 0,
      "chunk_index": 1,
      "source_file": "document.pdf",
      "chunking_strategy": "recursive"
    }
  }
]
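Because the output is plain JSON, saved chunks can be read back with the standard library alone. The sketch below parses a string in the format shown above and groups chunk text by page. The inline string stands in for a saved file, whose name and location under output_dir are not specified here.

```python
import json
from collections import defaultdict

# Stand-in for the contents of a saved chunk file.
saved = """
[
  {"content": "Chunk text content...",
   "metadata": {"source": "/path/to/document.pdf", "page": 0, "chunk_index": 0,
                "source_file": "document.pdf", "chunking_strategy": "recursive"}},
  {"content": "Next chunk...",
   "metadata": {"source": "/path/to/document.pdf", "page": 0, "chunk_index": 1,
                "source_file": "document.pdf", "chunking_strategy": "recursive"}}
]
"""

records = json.loads(saved)

# Group chunk text by the page it came from.
by_page = defaultdict(list)
for record in records:
    by_page[record["metadata"]["page"]].append(record["content"])

print(dict(by_page))  # {0: ['Chunk text content...', 'Next chunk...']}
```

For a file on disk, json.load on an open file handle yields the same list of records.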