API Reference

DocumentChunker

class DocumentChunker:

Constructor

DocumentChunker(
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    strategy: str = "recursive",
    separators: Optional[List[str]] = None,
    is_separator_regex: bool = False,
    keep_separator: bool = True,
    embeddings: Optional[Embeddings] = None,
    breakpoint_threshold_type: str = "percentile",
    breakpoint_threshold_amount: int = 95,
    min_chunk_size: int = 0,
    buffer_size: int = 1,
    add_start_index: bool = False,
    nb_suffix: int = 1,
    output_dir: Optional[str] = None,
    save_chunks: bool = True,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `chunk_size` | int | 1000 | Target size of each chunk |
| `chunk_overlap` | int | 200 | Overlap between consecutive chunks |
| `strategy` | str | "recursive" | Chunking strategy; see `supported_strategies()` |
| `separators` | Optional[List[str]] | None | Custom separators |
| `is_separator_regex` | bool | False | Treat separators as regular expressions |
| `keep_separator` | bool | True | Keep the separator in the resulting chunks |
| `embeddings` | Optional[Embeddings] | None | Embedding model; required for the "semantic" strategy |
| `breakpoint_threshold_type` | str | "percentile" | Method for choosing semantic breakpoints |
| `breakpoint_threshold_amount` | int | 95 | Threshold for semantic splits |
| `min_chunk_size` | int | 0 | Minimum chunk size |
| `buffer_size` | int | 1 | Sentence overlap for the semantic strategy |
| `add_start_index` | bool | False | Add each chunk's start index to its metadata |
| `nb_suffix` | int | 1 | Number suffix for splits |
| `output_dir` | Optional[str] | None | Directory where chunk JSON files are saved |
| `save_chunks` | bool | True | Whether to save chunks to JSON |
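
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-size windowing. It assumes sizes are counted in characters and ignores separators entirely; `sliding_chunks` is a hypothetical helper written for this illustration, not part of DocumentChunker, and the actual strategies may split differently.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why the overlap must be smaller than the chunk size.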

Methods

create_chunks(file_path: str) -> List[Document]

Loads a document, splits it into chunks, saves them to JSON (when `save_chunks` is enabled), and stores them in the internal collection.

chunks = chunker.create_chunks("document.pdf")

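get_chunks(file_path: Optional[str] = None) -> Union[List[Document], Dict[str, List[Document]]]

Retrieves chunks from the internal store. With `file_path`, returns the list of chunks for that file; without it, returns a dictionary of all processed files and their chunks.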

# All chunks
all_chunks = chunker.get_chunks()

# Chunks for specific file
pdf_chunks = chunker.get_chunks("document.pdf")

list_chunks() -> Dict[str, int]

Returns a summary of all processed files with their chunk counts.

summary = chunker.list_chunks()
# {'document.pdf': 15, 'report.docx': 8}
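
The reference does not state how the summary is computed. Assuming it is simply the number of chunks held per file in the internal store, the relationship can be sketched as follows; `summarize` and the plain-dict `store` are hypothetical stand-ins, with strings in place of `Document` objects.

```python
from typing import Dict, List


def summarize(store: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each file in the store to the number of chunks recorded for it."""
    return {name: len(chunks) for name, chunks in store.items()}


# Stand-in for the internal collection: file name -> list of chunks.
store = {"document.pdf": ["chunk"] * 15, "report.docx": ["chunk"] * 8}
summary = summarize(store)
print(summary)  # {'document.pdf': 15, 'report.docx': 8}
```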

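batch_chunks(directory: str, extensions: Optional[List[str]] = None, recursive: bool = False) -> Dict[str, List[Document]]

Processes all documents in a directory and returns their chunks keyed by file. Use `extensions` to restrict which file types are processed and `recursive` to include subdirectories.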

all_chunks = chunker.batch_chunks("path/to/documents")

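batch_chunks_stream(directory: str, extensions: Optional[List[str]] = None, recursive: bool = False) -> Iterator[tuple[str, List[Document]]]

Processes the documents in a directory one at a time, yielding a `(file_path, chunks)` pair per file so that results for the whole batch do not have to be held in memory at once.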

for file_path, chunks in chunker.batch_chunks_stream("path/to/documents"):
    print(f"Processed: {file_path} ({len(chunks)} chunks)")
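
The streaming pattern itself can be sketched with a plain generator. This is an illustration of the idea under stated assumptions, not the library's implementation: `stream_directory` and `split_lines` are hypothetical names, the chunking step is passed in as a callable, and extension matching is assumed to be case-insensitive.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Optional, Tuple


def stream_directory(
    directory: str,
    process: Callable[[str], List[str]],
    extensions: Optional[List[str]] = None,
    recursive: bool = False,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (file_path, chunks) one file at a time instead of building a dict."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in extensions} if extensions else None
    # Sorting materializes the list of paths (cheap), but never the chunks.
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        yield str(path), process(str(path))


def split_lines(file_path: str) -> List[str]:
    """Stand-in chunking function: one chunk per line of text."""
    return Path(file_path).read_text(encoding="utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("one\ntwo\n", encoding="utf-8")
    Path(tmp, "b.md").write_text("three\n", encoding="utf-8")
    results = [
        (Path(p).name, len(c))
        for p, c in stream_directory(tmp, split_lines, extensions=[".txt"])
    ]
print(results)  # [('a.txt', 2)]
```

Because each file's chunks are produced only when the consumer asks for the next item, the caller can write them out or index them and let them be garbage-collected before the next file is read.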

supported_formats() -> List[str]

Returns the list of supported file extensions.

formats = DocumentChunker.supported_formats()
# ['.pdf', '.docx', '.doc', '.txt', '.md', '.csv', '.xlsx', '.xls', '.pptx', '.ppt']

supported_strategies() -> List[str]

Returns the list of available chunking strategies.

strategies = DocumentChunker.supported_strategies()
# ['recursive', 'character', 'markdown', 'markdown_headers', 'html_headers', 'semantic']

Chunk Metadata

Each chunk carries the following metadata fields:

From document loaders:

- source - full path to source file
- page - page number (PDF only)

Added by DocumentChunker:

- chunk_index - index of chunk within source file
- source_file - filename only
- chunking_strategy - strategy used
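
The fields added by DocumentChunker can be pictured with a short sketch. It uses plain dictionaries instead of `Document` objects, and `annotate_chunks` is a hypothetical helper; the zero-based `chunk_index` follows the sample output in this reference, while the rest is an assumption about how such fields would typically be derived.

```python
import os
from typing import Any, Dict, List


def annotate_chunks(
    chunks: List[Dict[str, Any]], file_path: str, strategy: str
) -> List[Dict[str, Any]]:
    """Add chunk_index, source_file, and chunking_strategy to each chunk's metadata."""
    annotated = []
    for index, chunk in enumerate(chunks):
        metadata = dict(chunk.get("metadata", {}))  # copy the loader metadata
        metadata["chunk_index"] = index  # position within this source file
        metadata["source_file"] = os.path.basename(file_path)  # filename only
        metadata["chunking_strategy"] = strategy
        annotated.append({"content": chunk["content"], "metadata": metadata})
    return annotated


raw = [
    {"content": "First", "metadata": {"source": "/path/to/document.pdf", "page": 0}},
    {"content": "Second", "metadata": {"source": "/path/to/document.pdf", "page": 0}},
]
annotated = annotate_chunks(raw, "/path/to/document.pdf", "recursive")
print(annotated[1]["metadata"])
```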

Output Format

Chunks are saved as JSON:

[
  {
    "content": "Chunk text content...",
    "metadata": {
      "source": "/path/to/document.pdf",
      "page": 0,
      "chunk_index": 0,
      "source_file": "document.pdf",
      "chunking_strategy": "recursive"
    }
  },
  {
    "content": "Next chunk...",
    "metadata": {
      "source": "/path/to/document.pdf",
      "page": 0,
      "chunk_index": 1,
      "source_file": "document.pdf",
      "chunking_strategy": "recursive"
    }
  }
]
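
Saved chunks can be read back with the standard `json` module. The sketch below parses the sample above from a string and groups chunk text by `source_file` in `chunk_index` order; in practice the data would come from a file under `output_dir` via `json.load`, whose file naming this reference does not specify.

```python
import json
from collections import defaultdict

# The sample output from above, inlined as a string for a self-contained example.
saved = """
[
  {"content": "Chunk text content...",
   "metadata": {"source": "/path/to/document.pdf", "page": 0, "chunk_index": 0,
                "source_file": "document.pdf", "chunking_strategy": "recursive"}},
  {"content": "Next chunk...",
   "metadata": {"source": "/path/to/document.pdf", "page": 0, "chunk_index": 1,
                "source_file": "document.pdf", "chunking_strategy": "recursive"}}
]
"""

records = json.loads(saved)

# Group chunk text by source file, keeping chunks in chunk_index order.
by_file = defaultdict(list)
for record in sorted(records, key=lambda r: r["metadata"]["chunk_index"]):
    by_file[record["metadata"]["source_file"]].append(record["content"])

print(dict(by_file))  # {'document.pdf': ['Chunk text content...', 'Next chunk...']}
```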