IngestIQ

Semantic Chunking

AI-powered intelligent document splitting

What is Semantic Chunking?#

Semantic chunking is an AI-powered technique that splits documents based on meaning and context rather than arbitrary character counts.

  • Context-Aware: understands document structure
  • Preserves Meaning: keeps related content together

Traditional vs Semantic Chunking#

Traditional fixed-size chunking splits at a character count, producing results like:

Chunk 1: "...authentication is configured by set"
Chunk 2: "ting the JWT_SECRET environment variabl"
Chunk 3: "e. Make sure to use a secure random..."

  • Breaks sentences mid-word
  • Loses context
  • Poor search results
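The broken chunks above come from naive fixed-size splitting. A minimal Python sketch of that approach (illustrative only, not IngestIQ code) shows why it breaks words and sentences:

```python
def fixed_size_chunks(text: str, size: int = 40) -> list[str]:
    """Split every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = ("Authentication is configured by setting the JWT_SECRET "
        "environment variable. Make sure to use a secure random value.")
for chunk in fixed_size_chunks(text):
    print(repr(chunk))  # boundaries land mid-word, as in the example above
```

Any boundary that falls inside a word or sentence loses context for both neighboring chunks, which degrades retrieval.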

How It Works#

  1. Extract: Raw text is extracted from the document
  2. Analyze: AI model analyzes structure and content
  3. Identify: Natural boundaries are detected
  4. Chunk: Content is split at meaningful points
  5. Embed: Each chunk becomes a vector
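The five steps above can be sketched in Python. This is a hypothetical illustration: blank lines stand in for AI-detected boundaries, and a toy word count stands in for a real embedding model.

```python
def semantic_ingest(raw_text: str) -> list[dict]:
    # 1. Extract: assume raw_text is already plain text.
    # 2-3. Analyze & identify: blank lines stand in for the
    #      semantic boundaries an AI model would detect.
    sections = [s.strip() for s in raw_text.split("\n\n") if s.strip()]
    # 4. Chunk: each detected section becomes one chunk.
    # 5. Embed: a toy "vector" (word count) stands in for a real embedding.
    return [{"text": s, "vector": [len(s.split())]} for s in sections]
```

In a real pipeline, steps 2-3 are where the AI model (and your parsing prompt) determine the boundaries; the surrounding plumbing looks much like this.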

AI-Powered Features#

Structure Recognition#

  • Headers and sections
  • Paragraphs and topics
  • Lists and enumerations
  • Code blocks
  • Tables and data
  • Quotes and citations

Context Preservation#

  • Complete sentences are kept together
  • Related paragraphs are grouped
  • Code with its explanation
  • Tables with headers
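As a small illustration of the "complete sentences" guarantee, here is a sentence-safe chunker (a simplified sketch, not the production algorithm) that only breaks between sentences:

```python
import re

def sentence_safe_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks, never splitting mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

Every chunk ends at a sentence boundary, so no chunk starts or ends mid-thought.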

Custom Parsing Prompts#

Control how the AI chunks your documents:

Basic Parsing Prompt#

{
  "parsingPrompt": "Extract the content and split into logical sections. Keep code examples with their explanations."
}

Technical Documentation#

{
  "parsingPrompt": "Parse this technical document:\n1. Preserve all code blocks exactly as written\n2. Keep function documentation with the code\n3. Group related concepts together\n4. Extract step-by-step instructions as separate chunks"
}

Legal Documents#

{
  "parsingPrompt": "Parse this legal document:\n1. Keep each clause complete\n2. Preserve section numbering\n3. Extract definitions with their explanations\n4. Identify key obligations and requirements"
}

Research Papers#

{
  "parsingPrompt": "Parse this research paper:\n1. Keep abstracts as single chunks\n2. Preserve methodology descriptions completely\n3. Keep findings with supporting data\n4. Extract citations and references"
}

Chunk Size Considerations#

Optimal Chunk Sizes#

Content Type    Recommended Size    Rationale
General docs    500-1000 tokens     Balance context and precision
Technical docs  300-500 tokens      More granular for specific queries
Legal           200-400 tokens      Clause-level retrieval
Long-form       800-1200 tokens     Maintain narrative flow
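If you want a programmatic starting point from the table above, a simple lookup works (the category names here are illustrative, not IngestIQ API values):

```python
# Recommended token ranges by content type, taken from the table above.
CHUNK_SIZES = {
    "general": (500, 1000),
    "technical": (300, 500),
    "legal": (200, 400),
    "long_form": (800, 1200),
}

def recommended_size(content_type: str) -> int:
    """Midpoint of the recommended range, as a starting chunk size."""
    lo, hi = CHUNK_SIZES[content_type]
    return (lo + hi) // 2
```

Treat the midpoint as a default and tune from there based on search quality.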

Factors Affecting Size#

  • Search precision: Smaller = more precise matches
  • Context: Larger = more context per result
  • Storage: More chunks = more vectors
  • Cost: More chunks = more embeddings
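As a quick sanity check on the storage and cost trade-off, the arithmetic is straightforward (corpus size here is hypothetical):

```python
def chunk_count(doc_tokens: int, chunk_tokens: int) -> int:
    """Number of chunks needed to cover a corpus (ceiling division)."""
    return -(-doc_tokens // chunk_tokens)

corpus = 1_000_000  # total tokens in your corpus (hypothetical figure)
small = chunk_count(corpus, 300)    # smaller chunks -> more vectors to store
large = chunk_count(corpus, 1000)   # larger chunks -> fewer, broader vectors
```

Roughly, halving chunk size doubles the number of vectors you store and query, while embedding cost scales mainly with total tokens.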

Best Practices#

Use specialized parsing prompts for different document types:

  • Technical: Focus on code preservation
  • Legal: Focus on clause boundaries
  • Marketing: Focus on key messages

Run test ingestions and review chunks before processing large batches.

Refine parsing prompts based on search quality results.

Viewing Chunks#

After processing, you can see how documents were chunked:

curl http://localhost:3000/api/v2/knowledgebases/{kbId}/documents/{docId}/chunks \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
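
The same call from Python, using only the standard library (the endpoint path is taken from the curl example above; the response shape depends on your deployment):

```python
import json
import urllib.request

def chunks_url(base: str, kb_id: str, doc_id: str) -> str:
    """Build the chunks endpoint URL for a document."""
    return f"{base}/api/v2/knowledgebases/{kb_id}/documents/{doc_id}/chunks"

def fetch_chunks(base: str, kb_id: str, doc_id: str, token: str):
    """GET a document's chunks with a bearer token."""
    req = urllib.request.Request(
        chunks_url(base, kb_id, doc_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Reviewing the returned chunks after a test ingestion is the fastest way to see whether your parsing prompt is producing sensible boundaries.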