Documents

Document lifecycle, processing, and search

What is a Document?

A Document represents a file that has been ingested into a Knowledge Base. Each Document goes through a processing lifecycle that turns the raw file into searchable content: extracted text, semantic chunks, and vector embeddings.

Document Lifecycle

Document Status

Status	Description
`uploaded`	File received, awaiting processing
`processing`	Currently being parsed and embedded
`processed`	Successfully processed and searchable
`failed`	Processing failed (see `metadata.error` for the reason)
`reprocessing`	Pipeline is being re-run after a reprocess request
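
As a minimal sketch, the status values above can be modeled as an enum so that client code can tell pending states from final ones. The names `DocumentStatus`, `PENDING`, `is_pending`, and `is_searchable` are illustrative and not part of the IngestIQ API; treating `uploaded` as pending is an assumption based on its description ("awaiting processing").

```python
from enum import Enum


class DocumentStatus(str, Enum):
    """Status strings taken from the table above."""

    UPLOADED = "uploaded"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"
    REPROCESSING = "reprocessing"


# Assumption: in these states, work is still queued or running.
PENDING = {
    DocumentStatus.UPLOADED,
    DocumentStatus.PROCESSING,
    DocumentStatus.REPROCESSING,
}


def is_pending(status: str) -> bool:
    """True while the document has not reached `processed` or `failed`."""
    return DocumentStatus(status) in PENDING


def is_searchable(status: str) -> bool:
    """Only `processed` is described as searchable in the table."""
    return DocumentStatus(status) is DocumentStatus.PROCESSED
```

An unknown status string raises `ValueError`, which surfaces any status this sketch does not know about instead of silently misclassifying it.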

Document Properties

Property	Type	Description
`id`	UUID	Unique identifier
`fileId`	String	Storage file identifier
`knowledgebaseId`	UUID	Parent Knowledge Base
`pipelineId`	UUID	Pipeline that created this document
`filename`	String	Original filename
`author`	String	Document author (if extracted)
`size`	Integer	File size in bytes
`pageCount`	Integer	Number of pages (for PDFs)
`status`	Enum	Processing status
`metadata`	Object	Extracted and custom metadata
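
For client code, the properties above map naturally onto a small typed record. The sketch below is one possible shape, not an official SDK type: the class name `Document` and the `from_api` helper are hypothetical, and which fields are always present in a response is an assumption (here `id`, `filename`, `status`, and `size` are treated as required, the rest as optional).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Document:
    # Assumed to be present on every document returned by the API.
    id: str
    filename: str
    status: str
    size: int  # bytes
    # Treated as optional because they may be absent or not applicable.
    file_id: Optional[str] = None
    knowledgebase_id: Optional[str] = None
    pipeline_id: Optional[str] = None
    author: Optional[str] = None
    page_count: Optional[int] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_api(cls, payload: Dict[str, Any]) -> "Document":
        """Build a Document from a camelCase JSON object."""
        return cls(
            id=payload["id"],
            filename=payload["filename"],
            status=payload["status"],
            size=payload["size"],
            file_id=payload.get("fileId"),
            knowledgebase_id=payload.get("knowledgebaseId"),
            pipeline_id=payload.get("pipelineId"),
            author=payload.get("author"),
            page_count=payload.get("pageCount"),
            metadata=payload.get("metadata") or {},
        )
```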

Processing Flow

When a document is processed, IngestIQ runs these steps in order:

1. Upload & Store - The file is uploaded and stored in S3/MinIO.
2. Text Extraction - Content is extracted from the file format (PDF, CSV, etc.).
3. Semantic Chunking - AI splits the content into meaningful chunks.
4. Metadata Extraction - Optional: structured metadata (title, author, dates) is extracted.
5. Embedding Generation - Each chunk is converted to a vector embedding.
6. Vector Storage - Embeddings are stored in pgvector for search.
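
To illustrate why the embedding and storage steps make content searchable: once every chunk has a vector, a query can be embedded the same way and compared against the stored vectors. The sketch below ranks toy three-dimensional vectors by cosine similarity in plain Python. It is a conceptual stand-in only. The similarity metric, vector dimensions, and pgvector query that IngestIQ actually uses are not stated here, and the names `cosine_similarity`, `top_k`, and `toy_chunks` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], chunks: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: real embeddings have hundreds or thousands of dimensions.
toy_chunks = {
    "auth-setup": [0.9, 0.1, 0.0],
    "billing": [0.0, 0.2, 0.9],
    "login-errors": [0.7, 0.6, 0.1],
}
ranked = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
```

In a real deployment this comparison runs inside the database rather than in application code; the point of the sketch is only that ranking happens on vectors, not on keywords.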

Listing Documents

List the documents in a Knowledge Base, replacing `{kbId}` with the Knowledge Base ID:

curl http://localhost:3000/api/v2/knowledgebases/{kbId}/documents \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response

{
  "documents": [
    {
      "id": "doc-uuid",
      "filename": "product-guide.pdf",
      "status": "processed",
      "size": 1048576,
      "pageCount": 25,
      "metadata": {
        "title": "Product Guide",
        "author": "Engineering Team"
      },
      "createdAt": "2024-01-28T12:00:00.000Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 150
  }
}
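
The `pagination` block makes it possible to walk every page of a large Knowledge Base. The following is a minimal sketch, assuming the endpoint accepts `page` and `limit` query parameters that mirror the response fields (this is not confirmed above). `iter_documents` and `make_fetcher` are hypothetical helper names.

```python
import json
import math
import urllib.request
from typing import Any, Callable, Dict, Iterator


def iter_documents(fetch_page: Callable[[int], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every document, requesting pages until `total` is covered."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["documents"]
        pagination = body["pagination"]
        last_page = math.ceil(pagination["total"] / max(pagination["limit"], 1))
        # Stop on the last page, or early if a page comes back empty.
        if page >= last_page or not body["documents"]:
            break
        page += 1


def make_fetcher(base_url: str, kb_id: str, token: str, limit: int = 20):
    """Return a fetch_page(page) callable backed by the list endpoint."""

    def fetch_page(page: int) -> Dict[str, Any]:
        # Assumption: `page` and `limit` are accepted as query parameters.
        url = f"{base_url}/api/v2/knowledgebases/{kb_id}/documents?page={page}&limit={limit}"
        request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch_page
```

Usage would look like `for doc in iter_documents(make_fetcher("http://localhost:3000", kb_id, token)): ...`. Keeping the HTTP call behind a callable lets the paging logic be exercised without a running server.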

Searching Documents

Perform semantic search across your Knowledge Base:

curl -X POST http://localhost:3000/api/v2/documents/search \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "knowledgebaseId": "kb-uuid",
    "query": "How do I configure authentication?",
    "topK": 10
  }'

Search Response

{
  "results": [
    {
      "documentId": "doc-uuid",
      "filename": "auth-guide.pdf",
      "content": "Authentication is configured by setting...",
      "score": 0.89,
      "metadata": {
        "page": 5,
        "section": "Configuration"
      }
    }
  ]
}
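
The same search call can be made from Python with only the standard library. This sketch builds the request shown above and filters the response by score. The helper names `build_search_request` and `hits_above` are hypothetical, and reading a higher `score` as a closer match is an assumption; the score range is not documented here.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_search_request(
    base_url: str, token: str, kb_id: str, query: str, top_k: int = 10
) -> urllib.request.Request:
    """Build the POST request for the search endpoint without sending it."""
    payload = {"knowledgebaseId": kb_id, "query": query, "topK": top_k}
    return urllib.request.Request(
        f"{base_url}/api/v2/documents/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def hits_above(response_body: Dict[str, Any], min_score: float) -> List[Dict[str, Any]]:
    """Keep only results whose score is at least min_score."""
    return [hit for hit in response_body["results"] if hit["score"] >= min_score]
```

To send the request, pass it to `urllib.request.urlopen` and parse the body with `json.load`. A threshold for `hits_above` has to be chosen empirically for each Knowledge Base.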

Reprocessing Documents

Reprocess a document to apply updated parsing or embedding models:

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/documents/{docId}/reprocess \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Reprocessing re-runs the entire pipeline for the document and replaces its previous embeddings. While the pipeline is running, the document's status is `reprocessing`.
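
Because reprocessing takes time, a client usually needs to wait until the document settles into `processed` or `failed`. The sketch below polls a status source until that happens. `get_status` is a hypothetical callable that returns the document's current status string; an endpoint for fetching a single document is not shown on this page, so how it is implemented is left open. The interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

# Assumption: these statuses mean the pipeline has not finished yet.
PENDING_STATUSES = {"uploaded", "processing", "reprocessing"}


def wait_until_settled(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is no longer pending, then return it."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status not in PENDING_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"document still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real delays.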

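Document Metadata

Metadata is stored as a flexible JSON object. It combines fields that are set automatically during processing with fields extracted from the content when metadata extraction is enabled.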

Automatic Metadata

`processingStarted` - When processing began
`processingCompleted` - When processing finished
`error` - Error message (if failed)
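
The two timestamps can be combined to measure how long processing took. This is a small sketch, assuming both fields are ISO 8601 strings with a trailing `Z` as in the examples on this page; `parse_timestamp` and `processing_duration` are hypothetical helper names.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-01-28T12:05:00.000Z."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def processing_duration(metadata: Dict[str, Any]) -> Optional[timedelta]:
    """Return completed minus started, or None if either field is missing."""
    started = metadata.get("processingStarted")
    completed = metadata.get("processingCompleted")
    if not started or not completed:
        return None
    return parse_timestamp(completed) - parse_timestamp(started)
```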

Extracted Metadata (if enabled)

`title` - Document title
`author` - Document author
`keywords` - Extracted keywords
`summary` - AI-generated summary

Example Metadata

{
  "title": "Q4 Financial Report",
  "author": "Finance Team",
  "keywords": ["revenue", "quarterly", "2024"],
  "summary": "This report covers Q4 2024 financial performance...",
  "processingCompleted": "2024-01-28T12:05:00.000Z",
  "chunks": 47,
  "embeddingModel": "text-embedding-3-small"
}

Supported File Types

Type	Extension	Notes
PDF	`.pdf`	Native support with page extraction
CSV	`.csv`	Parsed as structured data
Excel	`.xlsx`, `.xls`	Sheet-by-sheet processing
Word	`.docx`	Converted via Gotenberg
Images	`.png`, `.jpg`	OCR via Gemini Vision
Video	`.mp4`, `.mov`	Audio transcription
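
A client can check a filename against this table before uploading. The sketch below mirrors the extensions listed above. The names `SUPPORTED_EXTENSIONS` and `file_type` are hypothetical, and matching extensions case-insensitively is an assumption about how the server behaves.

```python
from pathlib import Path
from typing import Optional

# Extension -> type name, copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".docx": "Word",
    ".png": "Images",
    ".jpg": "Images",
    ".mp4": "Video",
    ".mov": "Video",
}


def file_type(filename: str) -> Optional[str]:
    """Return the table's type name for a filename, or None if unsupported."""
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())
```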

Best Practices

List failed documents, then check each document's `metadata.error` field for the failure reason:

curl "http://localhost:3000/api/v2/knowledgebases/{kbId}/documents?status=failed" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
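
Once the failed documents have been fetched, the error messages can be collected into a single report. A minimal sketch, assuming the response has the same shape as the list response shown earlier; `failure_reasons` is a hypothetical helper, and the `"unknown"` fallback is a choice made here, not API behavior.

```python
from typing import Any, Dict, Tuple


def failure_reasons(response_body: Dict[str, Any]) -> Dict[str, Tuple[str, str]]:
    """Map document id -> (filename, error message) for failed documents."""
    reasons: Dict[str, Tuple[str, str]] = {}
    for doc in response_body.get("documents", []):
        if doc.get("status") != "failed":
            continue
        metadata = doc.get("metadata") or {}
        # Fall back to "unknown" when no error message was recorded.
        reasons[doc["id"]] = (doc["filename"], metadata.get("error", "unknown"))
    return reasons
```

Keying by `id` rather than `filename` avoids collisions when the same file was uploaded more than once.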

Enable metadata extraction for better search filtering:

{
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, date, category"
}

Related

Search API - Full search API documentation
Semantic Chunking - How documents are chunked
