Chunkin - Document Chunker & Indexer

A Python library for processing and chunking various document formats, and indexing them into vector stores. Built on LangChain.

Built on LangChain

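Chunkin leverages LangChain for document loading (LangChain document loaders) and for vector store access (LangChain's VectorStore interface).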

Modules

Document Chunker

Process documents and create chunks for vector store indexing.

- 7 formats: PDF, DOCX, TXT, MD, CSV, XLSX, PPTX
- 6 strategies: recursive, character, markdown, markdown_headers, html_headers, semantic
- Batch processing with directory support
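To make the "recursive" strategy concrete, here is a minimal pure-Python sketch of the general idea: split on coarse separators first and fall back to finer ones only when a piece is still too long. The function name `recursive_split` and the separator order are assumptions chosen for illustration. This is not Chunkin's or LangChain's actual implementation, and it leaves out chunk overlap.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 500,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: tries the first separator, and recurses with the
    remaining (finer) separators for any piece that is still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
    for chunk in recursive_split(sample, chunk_size=12):
        print(repr(chunk))
```

Paragraph breaks are preferred as split points, so chunks tend to follow the document's structure, and the hard character cut is used only when no separator helps.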

Doc Indexer

Index chunks into various vector stores and perform similarity search.

- 50+ vector stores: Local, AWS, Azure, Google Cloud, and more
- Unified API for all vector stores using LangChain's VectorStore interface
- Search with metadata filtering
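As a rough illustration of similarity search with metadata filtering, the sketch below ranks stored vectors by cosine similarity after discarding entries whose metadata does not match. The names `cosine`, `search`, and `metadata_filter`, and the in-memory list used as an index, are assumptions for this example. A real vector store such as FAISS uses its own index structures and its own filter syntax.

```python
import math
from typing import Any, Dict, List, Optional, Tuple

Vector = List[float]
Entry = Tuple[Vector, str, Dict[str, Any]]  # (embedding, chunk text, metadata)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: Vector,
    index: List[Entry],
    k: int = 3,
    metadata_filter: Optional[Dict[str, Any]] = None,
) -> List[Tuple[float, str, Dict[str, Any]]]:
    """Return the k most similar entries whose metadata matches the filter."""
    scored = [
        (cosine(query_vec, vec), text, meta)
        for vec, text, meta in index
        # Exact-match filter: every requested key must have the given value.
        if metadata_filter is None
        or all(meta.get(key) == value for key, value in metadata_filter.items())
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings", made up for the example.
    toy_index: List[Entry] = [
        ([1.0, 0.0], "intro", {"source": "a.pdf", "page": 1}),
        ([0.0, 1.0], "appendix", {"source": "a.pdf", "page": 9}),
        ([0.9, 0.1], "summary", {"source": "b.txt"}),
    ]
    for score, text, meta in search([1.0, 0.0], toy_index, k=2,
                                    metadata_filter={"source": "a.pdf"}):
        print(round(score, 3), text, meta)
```

Filtering before scoring keeps the example simple. Production stores differ in whether filters are applied before or after the nearest-neighbour lookup.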

Doc Processor

Unified end-to-end processing combining chunking and indexing.

- Single initialization for both modules
- All chunking + indexing options
- process_file/directory() for easy workflows

Quick Start

from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

# Process file
processor.process_file("document.pdf")

# Search
results = processor.search("your query", k=3)

Supported Formats

Uses LangChain document loaders:

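| Format     | Extensions   | Default Metadata |
|------------|--------------|------------------|
| PDF        | .pdf         | source, page     |
| Word       | .docx, .doc  | source           |
| Text       | .txt         | source           |
| Markdown   | .md          | source           |
| CSV        | .csv         | source           |
| Excel      | .xlsx, .xls  | source           |
| PowerPoint | .pptx, .ppt  | source           |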
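The table above can be read as an extension lookup. The sketch below shows one way such a lookup might be used to pick out supported files before batch processing. The names `EXTENSION_FORMATS`, `detect_format`, and `supported_files` are hypothetical and are not part of Chunkin's API. Only the extension-to-format mapping is taken from the table.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union

# Extension-to-format mapping transcribed from the table above.
EXTENSION_FORMATS: Dict[str, str] = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path: Union[str, Path]) -> Optional[str]:
    """Return the format name for a path, or None if the extension is unknown."""
    return EXTENSION_FORMATS.get(Path(path).suffix.lower())


def supported_files(directory: Union[str, Path]) -> List[Path]:
    """Recursively list files under directory that have a supported extension."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and detect_format(p) is not None
    )


if __name__ == "__main__":
    for name in ("report.PDF", "notes.md", "page.html"):
        print(name, "->", detect_format(name))
```

Lower-casing the suffix means files such as `report.PDF` are matched as well, and unknown extensions return `None` so the caller can decide whether to skip or report them.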

Installation

# Core only
pip install chunkin

# With OpenAI + FAISS (quotes keep shells such as zsh from expanding the brackets)
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores
pip install "chunkin[local]"

# All vector stores
pip install "chunkin[all]"

LangChain Integration

Chunkin is designed to work seamlessly with other LangChain components:

from chunkin import DocumentChunker
from chunkin_indexer import DocIndexer
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# All components work together
embeddings = OpenAIEmbeddings()
chunker = DocumentChunker()
indexer = DocIndexer(vector_store_type="faiss", embeddings=embeddings)

Further Reading