Semantic Chunking
AI-powered intelligent document splitting
What is Semantic Chunking?#
Semantic chunking is an AI-powered technique that splits documents based on meaning and context rather than arbitrary character counts.
- Context-Aware: Understands document structure
- Preserves Meaning: Keeps related content together
Traditional vs Semantic Chunking#
Traditional fixed-size chunking splits at arbitrary character offsets:
Chunk 1: "...authentication is configured by set"
Chunk 2: "ting the JWT_SECRET environment variabl"
Chunk 3: "e. Make sure to use a secure random..."
This breaks sentences mid-word, loses context, and produces poor search results. Semantic chunking instead keeps each complete instruction in a single chunk.
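The contrast above can be sketched with two toy chunkers. This is illustrative only: a real semantic chunker uses an AI model to find boundaries, not a sentence regex.

```python
import re

def fixed_size_chunks(text, size):
    """Naive chunking: split every `size` characters, ignoring word boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text, max_size):
    """Boundary-aware chunking: split on sentence ends, then pack whole
    sentences into chunks of up to max_size characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("Authentication is configured by setting the JWT_SECRET "
       "environment variable. Make sure to use a secure random value. "
       "Rotate the secret periodically.")

print(fixed_size_chunks(doc, 40))   # mid-word breaks, like the example above
print(sentence_chunks(doc, 120))    # complete sentences kept together
```

The second chunker never returns a fragment like `"ting the JWT_SECRET..."`, which is the property semantic chunking generalizes with an AI model instead of a regex.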
How It Works#
- Extract: Raw text is extracted from the document
- Analyze: AI model analyzes structure and content
- Identify: Natural boundaries are detected
- Chunk: Content is split at meaningful points
- Embed: Each chunk becomes a vector
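The five steps above can be sketched as a pipeline skeleton. The callables `analyze_structure` and `embed` are hypothetical stand-ins for the AI model and the embedding service, not part of any real API:

```python
def ingest_document(raw_text, analyze_structure, embed):
    """Sketch of the Extract -> Analyze -> Identify -> Chunk -> Embed pipeline.
    `analyze_structure` returns natural boundary offsets; `embed` turns a
    chunk into a vector. Both are injected placeholders."""
    # 1. Extract: assume raw_text was already pulled from the source document
    text = raw_text.strip()
    # 2-3. Analyze + Identify: the model returns natural boundary offsets
    boundaries = analyze_structure(text)
    # 4. Chunk: split at the identified boundaries
    spans = zip([0] + boundaries, boundaries + [len(text)])
    chunks = [text[a:b].strip() for a, b in spans if text[a:b].strip()]
    # 5. Embed: each chunk becomes a vector
    return [(chunk, embed(chunk)) for chunk in chunks]

# Toy stand-ins: split at the paragraph break, "embed" by character count
paragraphs = "Intro paragraph.\n\nDetails paragraph."
result = ingest_document(
    paragraphs,
    analyze_structure=lambda t: [t.index("\n\n")],
    embed=lambda c: [len(c)],
)
```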
AI-Powered Features#
Structure Recognition#
- Headers and sections
- Paragraphs and topics
- Lists and enumerations
- Code blocks
- Tables and data
- Quotes and citations
Context Preservation#
- Complete sentences are kept together
- Related paragraphs are grouped
- Code stays with its explanation
- Tables stay with their headers
Custom Parsing Prompts#
Control how the AI chunks your documents:
Basic Parsing Prompt#
{
  "parsingPrompt": "Extract the content and split into logical sections. Keep code examples with their explanations."
}
Technical Documentation#
{
  "parsingPrompt": "Parse this technical document:\n1. Preserve all code blocks exactly as written\n2. Keep function documentation with the code\n3. Group related concepts together\n4. Extract step-by-step instructions as separate chunks"
}
Legal Documents#
{
  "parsingPrompt": "Parse this legal document:\n1. Keep each clause complete\n2. Preserve section numbering\n3. Extract definitions with their explanations\n4. Identify key obligations and requirements"
}
Research Papers#
{
  "parsingPrompt": "Parse this research paper:\n1. Keep abstracts as single chunks\n2. Preserve methodology descriptions completely\n3. Keep findings with supporting data\n4. Extract citations and references"
}
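In practice you might select a parsing prompt per document type before ingestion. A minimal sketch, assuming only the `parsingPrompt` field shown in the examples above (the payload shape beyond that field is an assumption, not a documented schema):

```python
import json

# The per-type prompts from the examples above, keyed for reuse.
PARSING_PROMPTS = {
    "technical": "Parse this technical document:\n1. Preserve all code blocks exactly as written\n2. Keep function documentation with the code\n3. Group related concepts together\n4. Extract step-by-step instructions as separate chunks",
    "legal": "Parse this legal document:\n1. Keep each clause complete\n2. Preserve section numbering\n3. Extract definitions with their explanations\n4. Identify key obligations and requirements",
    "research": "Parse this research paper:\n1. Keep abstracts as single chunks\n2. Preserve methodology descriptions completely\n3. Keep findings with supporting data\n4. Extract citations and references",
}

def build_ingest_payload(doc_type):
    """Pick the matching prompt, falling back to the basic parsing prompt."""
    default = ("Extract the content and split into logical sections. "
               "Keep code examples with their explanations.")
    return {"parsingPrompt": PARSING_PROMPTS.get(doc_type, default)}

payload = build_ingest_payload("legal")
body = json.dumps(payload)  # valid JSON, ready to send as a request body
```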
Chunk Size Considerations#
Optimal Chunk Sizes#
| Content Type | Recommended Size | Rationale |
|---|---|---|
| General docs | 500-1000 tokens | Balance context and precision |
| Technical docs | 300-500 tokens | More granular for specific queries |
| Legal | 200-400 tokens | Clause-level retrieval |
| Long-form | 800-1200 tokens | Maintain narrative flow |
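The table above can be encoded directly as a lookup, for example to pick a default chunk size in configuration code (the midpoint default is an arbitrary choice for illustration):

```python
# Recommended chunk sizes from the table above, in tokens.
RECOMMENDED_SIZES = {
    "general": (500, 1000),
    "technical": (300, 500),
    "legal": (200, 400),
    "long-form": (800, 1200),
}

def recommended_size(content_type):
    """Return a default chunk size: the midpoint of the recommended range."""
    low, high = RECOMMENDED_SIZES[content_type]
    return (low + high) // 2
```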
Factors Affecting Size#
- Search precision: Smaller = more precise matches
- Context: Larger = more context per result
- Storage: More chunks = more vectors
- Cost: More chunks = more embeddings
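The storage and cost factors follow directly from arithmetic: halving the chunk size roughly doubles the number of vectors to store and embed. A quick estimate:

```python
import math

def estimate_chunks(doc_tokens, chunk_size, overlap=0):
    """Rough chunk count for a document of doc_tokens tokens,
    ignoring boundary effects."""
    step = chunk_size - overlap
    return math.ceil(doc_tokens / step)

# For a 10,000-token document:
print(estimate_chunks(10_000, 1000))  # 10 chunks: more context per result
print(estimate_chunks(10_000, 300))   # 34 chunks: more precise, more vectors
```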
Best Practices#
Use specialized parsing prompts for different document types:
- Technical: Focus on code preservation
- Legal: Focus on clause boundaries
- Marketing: Focus on key messages
- Run test ingestions and review chunks before processing large batches.
- Refine parsing prompts based on search quality results.
Viewing Chunks#
After processing, you can see how documents were chunked:
curl http://localhost:3000/api/v2/knowledgebases/{kbId}/documents/{docId}/chunks \
-H "Authorization: Bearer YOUR_JWT_TOKEN"