IngestIQ

Documents

Document lifecycle, processing, and search

What is a Document?#

A Document represents a file that has been ingested into a Knowledge Base. Documents go through a processing lifecycle that transforms them from raw files into searchable, AI-ready content.

Document Lifecycle#

[Diagram: document lifecycle]

Document Status#

| Status | Description |
| --- | --- |
| uploaded | File received, awaiting processing |
| processing | Currently being parsed and embedded |
| processed | Successfully processed and searchable |
| failed | Processing failed (check metadata for error) |
| reprocessing | Being reprocessed |
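
Client code often needs to tell whether a document is still moving through the pipeline. The sketch below models the status values listed above as a Python enum. The helper names (`is_settled`, `is_searchable`) and the idea that `processed` and `failed` are the resting states are assumptions made for illustration, not part of the IngestIQ API.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    """Status values copied from the status table."""

    UPLOADED = "uploaded"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"
    REPROCESSING = "reprocessing"


# Assumption: a document in one of these states stays there
# until a reprocess is requested.
SETTLED = {DocumentStatus.PROCESSED, DocumentStatus.FAILED}


def is_settled(status: str) -> bool:
    """Return True when no processing is pending or in progress."""
    return DocumentStatus(status) in SETTLED


def is_searchable(status: str) -> bool:
    """Only processed documents are described as searchable."""
    return DocumentStatus(status) is DocumentStatus.PROCESSED
```

An unknown status string raises `ValueError`, which surfaces API changes early instead of silently treating a new status as "not settled".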

Document Properties#

| Property | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| fileId | String | Storage file identifier |
| knowledgebaseId | UUID | Parent Knowledge Base |
| pipelineId | UUID | Pipeline that created this document |
| filename | String | Original filename |
| author | String | Document author (if extracted) |
| size | Integer | File size in bytes |
| pageCount | Integer | Number of pages (for PDFs) |
| status | Enum | Processing status |
| metadata | Object | Extracted and custom metadata |
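
If you consume documents from Python, a small typed container can make the properties above easier to work with. This is a minimal sketch: the `Document` class and `from_api` method are hypothetical names, and treating everything except `id`, `filename`, and `status` as optional is an assumption, since the source does not say which fields are always present.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Document:
    """Client-side view of a document, mirroring the properties table."""

    id: str
    filename: str
    status: str
    file_id: Optional[str] = None
    knowledgebase_id: Optional[str] = None
    pipeline_id: Optional[str] = None
    author: Optional[str] = None
    size: Optional[int] = None  # bytes
    page_count: Optional[int] = None  # PDFs only
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_api(cls, raw: dict[str, Any]) -> "Document":
        """Map camelCase API keys onto snake_case attributes."""
        return cls(
            id=raw["id"],
            filename=raw["filename"],
            status=raw["status"],
            file_id=raw.get("fileId"),
            knowledgebase_id=raw.get("knowledgebaseId"),
            pipeline_id=raw.get("pipelineId"),
            author=raw.get("author"),
            size=raw.get("size"),
            page_count=raw.get("pageCount"),
            metadata=raw.get("metadata") or {},
        )
```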

Processing Flow#

When a document is processed, IngestIQ runs the following steps in order:

1. Upload & Store: the file is uploaded and stored in S3/MinIO.
2. Text Extraction: content is extracted from the file format (PDF, CSV, etc.).
3. Semantic Chunking: AI splits the content into meaningful chunks.
4. Metadata Extraction (optional): structured metadata such as title, author, and dates is extracted.
5. Embedding Generation: each chunk is converted to a vector embedding.
6. Vector Storage: embeddings are stored in pgvector for search.
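
To make the relationship between these steps, the document status, and the automatic metadata fields concrete, here is a toy runner. It is not IngestIQ's implementation: `run_pipeline`, the stage names, and the error message format are hypothetical, and each stage is a placeholder callable you supply. Only the status values and the `processingStarted`, `processingCompleted`, and `error` keys come from this page.

```python
from datetime import datetime, timezone
from typing import Callable

# Stages that run after the file has been uploaded and stored.
STAGES = [
    "text_extraction",
    "semantic_chunking",
    "metadata_extraction",
    "embedding_generation",
    "vector_storage",
]
OPTIONAL = {"metadata_extraction"}


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def run_pipeline(doc: dict, stages: dict[str, Callable[[dict], None]]) -> dict:
    """Run each stage in order, recording status and timing on the document."""
    doc["status"] = "processing"
    metadata = doc.setdefault("metadata", {})
    metadata["processingStarted"] = _now()
    for name in STAGES:
        if name in OPTIONAL and name not in stages:
            continue  # optional step not configured
        try:
            stages[name](doc)
        except Exception as exc:
            # Stop at the first failing stage and record why.
            doc["status"] = "failed"
            metadata["error"] = f"{name}: {exc}"
            return doc
    doc["status"] = "processed"
    metadata["processingCompleted"] = _now()
    return doc
```

The point of the sketch is the control flow: a failure in any stage stops the run and leaves the reason in `metadata.error`, while a clean run ends in `processed`.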

Listing Documents#

curl http://localhost:3000/api/v2/knowledgebases/{kbId}/documents \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response#

{
  "documents": [
    {
      "id": "doc-uuid",
      "filename": "product-guide.pdf",
      "status": "processed",
      "size": 1048576,
      "pageCount": 25,
      "metadata": {
        "title": "Product Guide",
        "author": "Engineering Team"
      },
      "createdAt": "2024-01-28T12:00:00.000Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 150
  }
}
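
The `pagination` object gives you enough to walk every page of results. The sketch below shows one way to do that. How a page is requested is not shown on this page, so the fetching is left to a `fetch_page` callable that you provide (a hypothetical name); the helper only relies on the `documents` and `pagination` keys from the response above.

```python
import math
from typing import Callable, Iterator


def total_pages(pagination: dict) -> int:
    """Number of pages implied by the total and limit fields."""
    return math.ceil(pagination["total"] / pagination["limit"])


def iter_documents(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every document, requesting pages until the last one is reached."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["documents"]
        if page >= total_pages(body["pagination"]):
            break
        page += 1
```

With the example response (`total` 150, `limit` 20), `total_pages` returns 8: seven full pages and one page of 10 documents.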

Searching Documents#

Perform semantic search across your Knowledge Base:

curl -X POST http://localhost:3000/api/v2/documents/search \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "knowledgebaseId": "kb-uuid",
    "query": "How do I configure authentication?",
    "topK": 10
  }'
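
The same request can be built from Python with the standard library. This is a sketch rather than an official client: `build_search_request` is a hypothetical helper, and the base URL and token are placeholders exactly as in the curl example. The function only constructs the request, so you can inspect it before sending.

```python
import json
import urllib.request


def build_search_request(
    base_url: str,
    token: str,
    knowledgebase_id: str,
    query: str,
    top_k: int = 10,
) -> urllib.request.Request:
    """Build the POST request for /api/v2/documents/search without sending it."""
    payload = {
        "knowledgebaseId": knowledgebase_id,
        "query": query,
        "topK": top_k,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v2/documents/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the response body with `json.load`.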

Search Response#

{
  "results": [
    {
      "documentId": "doc-uuid",
      "filename": "auth-guide.pdf",
      "content": "Authentication is configured by setting...",
      "score": 0.89,
      "metadata": {
        "page": 5,
        "section": "Configuration"
      }
    }
  ]
}
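
Search results usually need light post-processing before they are shown to a user or passed to a model. The sketch below keeps only strong matches and formats a short source label. It assumes that a higher `score` means a better match; the 0.8 cutoff and the helper names are illustrative choices, not values defined by IngestIQ.

```python
def best_matches(response: dict, min_score: float = 0.8) -> list[dict]:
    """Return results at or above min_score, best first."""
    hits = [r for r in response.get("results", []) if r["score"] >= min_score]
    return sorted(hits, key=lambda r: r["score"], reverse=True)


def format_citation(result: dict) -> str:
    """Build a label such as 'auth-guide.pdf, p. 5' from a single result."""
    page = result.get("metadata", {}).get("page")
    suffix = f", p. {page}" if page is not None else ""
    return f"{result['filename']}{suffix}"
```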

Reprocessing Documents#

Reprocess a document to apply updated parsing or embedding models:

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/documents/{docId}/reprocess \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Reprocessing re-runs the entire pipeline for the document. Previous embeddings are replaced.
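
Because reprocessing runs the full pipeline, a client typically has to wait before the document is searchable again. The polling loop below is one way to do that. How you read a single document's status is not covered on this page, so `get_status` is a callable you supply (hypothetical); the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_settled(
    get_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is 'processed' or 'failed', then return it."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("processed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still '{status}' after {timeout} seconds")
        sleep(interval)
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.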

Document Metadata#

Metadata is stored as a flexible JSON object containing:

Automatic Metadata#

  • processingStarted - When processing began
  • processingCompleted - When processing finished
  • error - Error message (if failed)

Extracted Metadata (if enabled)#

  • title - Document title
  • author - Document author
  • keywords - Extracted keywords
  • summary - AI-generated summary

Example Metadata#

{
  "title": "Q4 Financial Report",
  "author": "Finance Team",
  "keywords": ["revenue", "quarterly", "2024"],
  "summary": "This report covers Q4 2024 financial performance...",
  "processingCompleted": "2024-01-28T12:05:00.000Z",
  "chunks": 47,
  "embeddingModel": "text-embedding-3-small"
}
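
The automatic timestamps can be used to measure how long processing took. A minimal sketch, assuming both `processingStarted` and `processingCompleted` use the ISO 8601 format shown in the example; `processing_duration` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-01-28T12:05:00.000Z."""
    # Older Python versions do not accept a trailing "Z", so spell out UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def processing_duration(metadata: dict) -> Optional[timedelta]:
    """Return completed minus started, or None if either timestamp is missing."""
    started = metadata.get("processingStarted")
    completed = metadata.get("processingCompleted")
    if not started or not completed:
        return None
    return parse_timestamp(completed) - parse_timestamp(started)
```

The example metadata above has no `processingStarted` key, so for that object the function returns `None`.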

Supported File Types#

| Type | Extension | Notes |
| --- | --- | --- |
| PDF | .pdf | Native support with page extraction |
| CSV | .csv | Parsed as structured data |
| Excel | .xlsx, .xls | Sheet-by-sheet processing |
| Word | .docx | Converted via Gotenberg |
| Images | .png, .jpg | OCR via Gemini Vision |
| Video | .mp4, .mov | Audio transcription |
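
A client can check a filename against this table before uploading. The sketch below hard-codes the extensions listed above. Matching case-insensitively is an assumption, and `file_type` is a hypothetical helper; extensions that are not in the table (for example `.jpeg`) are reported as unsupported.

```python
from pathlib import PurePath
from typing import Optional

# Extension to type, copied from the supported file types table.
SUPPORTED = {
    ".pdf": "PDF",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".docx": "Word",
    ".png": "Images",
    ".jpg": "Images",
    ".mp4": "Video",
    ".mov": "Video",
}


def file_type(filename: str) -> Optional[str]:
    """Return the type for a filename, or None if the extension is not listed."""
    return SUPPORTED.get(PurePath(filename).suffix.lower())
```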

Best Practices#

List failed documents, then check each document's metadata.error field for the failure reason:

curl "http://localhost:3000/api/v2/knowledgebases/{kbId}/documents?status=failed" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
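
Once you have the list of failed documents, the failure reasons can be collected in one pass. This sketch assumes the filtered listing returns the same `documents` array shown under Listing Documents, with the error stored at `metadata.error`; `failure_reasons` and the fallback text are hypothetical.

```python
def failure_reasons(response: dict) -> dict[str, str]:
    """Map each failed document's filename to its recorded error message."""
    reasons = {}
    for doc in response.get("documents", []):
        if doc.get("status") != "failed":
            continue  # ignore anything the filter did not exclude
        metadata = doc.get("metadata") or {}
        reasons[doc["filename"]] = metadata.get("error", "no error recorded")
    return reasons
```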

Enable metadata extraction for better search filtering:

{
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, date, category"
}