Chunkin - Document Chunker & Indexer

A Python library that loads and chunks documents in several formats and indexes the resulting chunks into vector stores. Built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

- Document Loading: LangChain document loaders for 8 formats
- Text Splitting: LangChain text splitters with 6 strategies
- Vector Storage: LangChain vector store integrations for 50+ stores

Modules

Document Chunker

Process documents and create chunks for vector store indexing.

- 8 formats: PDF, DOCX, TXT, MD, CSV, XLSX, PPT
- 6 strategies: recursive, character, markdown, markdown_headers, html_headers, semantic
- Batch processing with directory support
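To make the idea of size-bounded chunking concrete, here is a minimal pure-Python sketch of a sliding-window splitter. It is not Chunkin's or LangChain's implementation: the function name `chunk_text` and the `chunk_overlap` parameter are assumptions made for illustration, and LangChain's splitters additionally try to break on separators such as paragraphs and sentences rather than at arbitrary character offsets.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters so that content
    cut at a boundary still appears whole in one of the two chunks.
    (Hypothetical helper, not part of Chunkin.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
    # ['abcd', 'defg', 'ghij']
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing and embedding more text.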

Doc Indexer

Index chunks into various vector stores and perform similarity search.

- 50+ vector stores: Local, AWS, Azure, Google Cloud, and more
- Unified API for all vector stores using LangChain's VectorStore interface
- Search with metadata filtering
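Similarity search with metadata filtering can be pictured as two steps: drop entries whose metadata does not match, then rank the rest by vector similarity. The sketch below does this in plain Python over an in-memory list. The `search` function, the entry layout, and the toy two-dimensional vectors are all assumptions for illustration; real vector stores do the same work with embeddings and optimized indexes, and their filter syntax varies by backend.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query_vector, k=3, metadata_filter=None):
    """Return the k entries most similar to query_vector.

    index is a list of dicts with 'text', 'vector' and 'metadata' keys.
    metadata_filter, if given, must match every listed key exactly.
    (Hypothetical helper, not part of Chunkin or LangChain.)
    """
    candidates = [
        entry for entry in index
        if metadata_filter is None
        or all(entry["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda entry: cosine_similarity(entry["vector"], query_vector),
        reverse=True,
    )
    return ranked[:k]


# Toy index; the vectors stand in for real embeddings.
index = [
    {"text": "intro", "vector": [1.0, 0.0], "metadata": {"source": "a.pdf", "page": 0}},
    {"text": "methods", "vector": [0.8, 0.6], "metadata": {"source": "a.pdf", "page": 1}},
    {"text": "notes", "vector": [0.0, 1.0], "metadata": {"source": "b.txt"}},
]

if __name__ == "__main__":
    hits = search(index, [1.0, 0.0], k=2, metadata_filter={"source": "a.pdf"})
    print([hit["text"] for hit in hits])
```

Filtering before ranking keeps the result count predictable: the caller gets up to `k` matches from the filtered set rather than whatever remains of an unfiltered top `k`.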

Doc Processor

Unified end-to-end processing combining chunking and indexing.

- Single initialization for both modules
- All chunking + indexing options
- process_file/directory() for easy workflows

Quick Start

from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

# Process file
processor.process_file("document.pdf")

# Search
results = processor.search("your query", k=3)

Supported Formats

Uses LangChain document loaders:

| Format | Extensions | Default Metadata |
|---|---|---|
| PDF | `.pdf` | `source`, `page` |
| Word | `.docx`, `.doc` | `source` |
| Text | `.txt` | `source` |
| Markdown | `.md` | `source` |
| CSV | `.csv` | `source` |
| Excel | `.xlsx`, `.xls` | `source` |
| PowerPoint | `.pptx`, `.ppt` | `source` |
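When preparing a directory for batch processing, it can help to check up front which files fall under the formats listed in the table. The following sketch encodes the table's extensions in a dictionary; `SUPPORTED_EXTENSIONS`, `detect_format`, and `supported_files` are hypothetical names introduced here, not part of Chunkin's API, and Chunkin's own format detection may work differently.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def supported_files(directory):
    """Recursively list files under directory whose extension is in the table."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "image.png"]:
        print(name, "->", detect_format(name))
```

Lower-casing the suffix makes the check case-insensitive, so `report.PDF` and `report.pdf` are treated alike.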

Installation

# Core only
pip install chunkin

# Extras are quoted so that shells such as zsh do not expand the brackets

# With OpenAI + FAISS
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores
pip install "chunkin[local]"

# All vector stores
pip install "chunkin[all]"

LangChain Integration

Chunkin is designed to work seamlessly with other LangChain components:

from chunkin import DocumentChunker
from chunkin_indexer import DocIndexer
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# All components work together
embeddings = OpenAIEmbeddings()
chunker = DocumentChunker()
indexer = DocIndexer(vector_store_type="faiss", embeddings=embeddings)