Documents
Document lifecycle, processing, and search
What is a Document?
A Document represents a file that has been ingested into a Knowledge Base. Documents go through a processing lifecycle that transforms them from raw files into searchable, AI-ready content.
Document Lifecycle
Document Status
| Status | Description |
|---|---|
| uploaded | File received, awaiting processing |
| processing | Currently being parsed and embedded |
| processed | Successfully processed and searchable |
| failed | Processing failed (check metadata for error) |
| reprocessing | Being reprocessed |
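A client that waits on a document usually needs to know which statuses are final. The following is a minimal sketch, assuming that `processed` and `failed` are the only terminal states (the table suggests this but does not state it). The helper names and the injected `fetch_status` callable are hypothetical and not part of IngestIQ.

```python
import time
from typing import Callable

# Status values come from the table above. Treating "processed" and "failed"
# as terminal is an ASSUMPTION, not documented behavior.
TERMINAL_STATUSES = {"processed", "failed"}
IN_PROGRESS_STATUSES = {"uploaded", "processing", "reprocessing"}


def is_terminal(status: str) -> bool:
    """Return True when no further status change is expected."""
    if status not in TERMINAL_STATUSES | IN_PROGRESS_STATUSES:
        raise ValueError(f"unknown document status: {status!r}")
    return status in TERMINAL_STATUSES


def wait_until_terminal(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll fetch_status() until it reports a terminal status.

    fetch_status is a hypothetical callable that returns the current
    status string, for example by reading it from the documents API.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if is_terminal(status):
            return status
        time.sleep(interval)
    raise TimeoutError("document did not reach a terminal status in time")
```

Passing the status lookup in as a callable keeps the polling logic independent of any particular HTTP client.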
Document Properties
| Property | Type | Description |
|---|---|---|
| id | UUID | Unique identifier |
| fileId | String | Storage file identifier |
| knowledgebaseId | UUID | Parent Knowledge Base |
| pipelineId | UUID | Pipeline that created this document |
| filename | String | Original filename |
| author | String | Document author (if extracted) |
| size | Integer | File size in bytes |
| pageCount | Integer | Number of pages (for PDFs) |
| status | Enum | Processing status |
| metadata | Object | Extracted and custom metadata |
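On the client side, these properties map naturally onto a typed record. The sketch below is one possible shape in Python. Which fields may be absent from a response is an assumption: only `id`, `filename`, and `status` are treated as required here. The class and method names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Document:
    """Client-side view of a document, mirroring the properties table."""

    id: str
    filename: str
    status: str
    # ASSUMPTION: everything below may be missing from a given response.
    file_id: Optional[str] = None
    knowledgebase_id: Optional[str] = None
    pipeline_id: Optional[str] = None
    author: Optional[str] = None
    size: Optional[int] = None        # bytes
    page_count: Optional[int] = None  # PDFs only
    metadata: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_api(cls, payload: Dict[str, Any]) -> "Document":
        """Build a Document from a camelCase JSON object; unknown keys are ignored."""
        return cls(
            id=payload["id"],
            filename=payload["filename"],
            status=payload["status"],
            file_id=payload.get("fileId"),
            knowledgebase_id=payload.get("knowledgebaseId"),
            pipeline_id=payload.get("pipelineId"),
            author=payload.get("author"),
            size=payload.get("size"),
            page_count=payload.get("pageCount"),
            metadata=payload.get("metadata") or {},
        )
```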
Processing Flow
When a document is processed, IngestIQ runs the following steps:
1. Upload & Store: The file is uploaded and stored in S3/MinIO.
2. Text Extraction: Content is extracted from the file format (PDF, CSV, etc.).
3. Semantic Chunking: AI splits the content into meaningful chunks.
4. Metadata Extraction (optional): Structured metadata (title, author, dates) is extracted.
5. Embedding Generation: Each chunk is converted to a vector embedding.
6. Vector Storage: Embeddings are stored in pgvector for search.
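To make the data flow concrete, here is a toy illustration of the chunk, embed, store, and search sequence. It is not how IngestIQ implements these steps: splitting on blank lines stands in for semantic chunking, word counts stand in for an embedding model, and an in-memory list stands in for pgvector. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str) -> List[str]:
    """Stand-in for semantic chunking: split on blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def embed(chunk: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(text: str) -> List[Tuple[str, Counter]]:
    """Chunk and embed the text; the returned list plays the role of the vector store."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text)]


def search(store: List[Tuple[str, Counter]], query: str, top_k: int = 1) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The point of the sketch is the ordering: a query is embedded the same way as the chunks, and search compares vectors rather than raw text.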
Listing Documents
curl http://localhost:3000/api/v2/knowledgebases/{kbId}/documents \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
Response
{
  "documents": [
    {
      "id": "doc-uuid",
      "filename": "product-guide.pdf",
      "status": "processed",
      "size": 1048576,
      "pageCount": 25,
      "metadata": {
        "title": "Product Guide",
        "author": "Engineering Team"
      },
      "createdAt": "2024-01-28T12:00:00.000Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 150
  }
}
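The `pagination` object is enough to work out how many pages a listing spans. The sketch below computes that count and builds one URL per page. It assumes the endpoint accepts `page` and `limit` query parameters matching the response fields, which this page does not confirm. The function names are hypothetical.

```python
import math
from typing import Dict, List
from urllib.parse import urlencode

BASE_URL = "http://localhost:3000/api/v2"


def total_pages(pagination: Dict[str, int]) -> int:
    """Number of pages needed to cover `total` items at `limit` per page."""
    return math.ceil(pagination["total"] / pagination["limit"])


def page_urls(kb_id: str, pagination: Dict[str, int]) -> List[str]:
    """Build one listing URL per page.

    ASSUMPTION: `page` and `limit` are accepted as query parameters.
    """
    base = f"{BASE_URL}/knowledgebases/{kb_id}/documents"
    return [
        f"{base}?{urlencode({'page': page, 'limit': pagination['limit']})}"
        for page in range(1, total_pages(pagination) + 1)
    ]
```

With the sample response above (150 documents, 20 per page), the listing spans 8 pages, the last of which holds 10 documents.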
Searching Documents
Perform semantic search across your Knowledge Base:
curl -X POST http://localhost:3000/api/v2/documents/search \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "knowledgebaseId": "kb-uuid",
    "query": "How do I configure authentication?",
    "topK": 10
  }'
Search Response
{
  "results": [
    {
      "documentId": "doc-uuid",
      "filename": "auth-guide.pdf",
      "content": "Authentication is configured by setting...",
      "score": 0.89,
      "metadata": {
        "page": 5,
        "section": "Configuration"
      }
    }
  ]
}
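Search results are chunk-level, so callers often post-process them before display. The sketch below builds the request body shown above and keeps the best-scoring hit per document. It assumes that a higher `score` means a closer match and that several results can share a `documentId`. Neither is stated on this page, and the function names are hypothetical.

```python
from typing import Any, Dict, List


def search_payload(knowledgebase_id: str, query: str, top_k: int = 10) -> Dict[str, Any]:
    """Build the JSON body for POST /api/v2/documents/search."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {"knowledgebaseId": knowledgebase_id, "query": query, "topK": top_k}


def best_hit_per_document(
    response: Dict[str, Any], min_score: float = 0.0
) -> List[Dict[str, Any]]:
    """Keep the highest-scoring result for each document, best first.

    ASSUMPTION: higher score means more relevant.
    """
    best: Dict[str, Dict[str, Any]] = {}
    for hit in response.get("results", []):
        if hit["score"] < min_score:
            continue
        current = best.get(hit["documentId"])
        if current is None or hit["score"] > current["score"]:
            best[hit["documentId"]] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)
```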
Reprocessing Documents
Reprocess a document to apply updated parsing or embedding models:
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/documents/{docId}/reprocess \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
Reprocessing re-runs the entire pipeline for the document. Previous embeddings are replaced.
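For scripted reprocessing, the same call can be prepared with Python's standard library. This sketch only builds the request, mirroring the curl command above. The function name is hypothetical, and the shape of the response is not documented here, so it is not parsed.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:3000/api/v2"


def build_reprocess_request(token: str, kb_id: str, doc_id: str) -> urllib.request.Request:
    """Build (but do not send) the reprocess POST for one document."""
    url = (
        f"{BASE_URL}/knowledgebases/{quote(kb_id, safe='')}"
        f"/documents/{quote(doc_id, safe='')}/reprocess"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Sending is then a matter of `urllib.request.urlopen(build_reprocess_request(token, kb_id, doc_id))`. Because previous embeddings are replaced, a batch job may prefer to reprocess documents one at a time and check each status before moving on.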
Document Metadata
Metadata is stored as a flexible JSON object containing:
Automatic Metadata
- processingStarted: When processing began
- processingCompleted: When processing finished
- error: Error message (if failed)
Extracted Metadata (if enabled)
- title: Document title
- author: Document author
- keywords: Extracted keywords
- summary: AI-generated summary
Example Metadata
{
  "title": "Q4 Financial Report",
  "author": "Finance Team",
  "keywords": ["revenue", "quarterly", "2024"],
  "summary": "This report covers Q4 2024 financial performance...",
  "processingCompleted": "2024-01-28T12:05:00.000Z",
  "chunks": 47,
  "embeddingModel": "text-embedding-3-small"
}
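The automatic timestamps can be used to measure how long processing took. The sketch below assumes both fields, when present, are ISO 8601 strings with a trailing `Z`, as in the example above. The helper names are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-01-28T12:05:00.000Z."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def processing_seconds(metadata: Dict[str, Any]) -> Optional[float]:
    """Seconds between processingStarted and processingCompleted, or None if either is missing."""
    started = metadata.get("processingStarted")
    completed = metadata.get("processingCompleted")
    if not started or not completed:
        return None
    return (parse_timestamp(completed) - parse_timestamp(started)).total_seconds()
```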
Supported File Types
| Type | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native support with page extraction |
| CSV | .csv | Parsed as structured data |
| Excel | .xlsx, .xls | Sheet-by-sheet processing |
| Word | .docx | Converted via Gotenberg |
| Images | .png, .jpg | OCR via Gemini Vision |
| Video | .mp4, .mov | Audio transcription |
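A client can check a filename against this table before uploading. The sketch below encodes only the extensions listed above. Matching case-insensitively is an assumption, and the function name is hypothetical.

```python
from pathlib import PurePath
from typing import Optional

# Extensions and type labels copied from the table above.
EXTENSION_TO_TYPE = {
    ".pdf": "PDF",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".docx": "Word",
    ".png": "Images",
    ".jpg": "Images",
    ".mp4": "Video",
    ".mov": "Video",
}


def supported_type(filename: str) -> Optional[str]:
    """Return the table's type label for a filename, or None if the extension is not listed."""
    # ASSUMPTION: extensions are matched case-insensitively.
    return EXTENSION_TO_TYPE.get(PurePath(filename).suffix.lower())
```

Extensions that are not in the table, such as `.jpeg` or `.txt`, return `None` here even if the server happens to accept them.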
Best Practices
List failed documents and check each one's metadata.error field for the failure reason:
curl "http://localhost:3000/api/v2/knowledgebases/{kbId}/documents?status=failed" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
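Once the failed documents are listed, the error messages can be pulled out in one pass. This sketch assumes the filtered listing has the same shape as the regular list response, with the message under `metadata.error`. The function name is hypothetical.

```python
from typing import Any, Dict


def failure_reasons(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each failed document's filename to its metadata.error message."""
    reasons: Dict[str, str] = {}
    for doc in response.get("documents", []):
        if doc.get("status") != "failed":
            continue
        metadata = doc.get("metadata") or {}
        # Fall back to a placeholder when no error message was recorded.
        reasons[doc["filename"]] = metadata.get("error", "no error message recorded")
    return reasons
```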
Enable metadata extraction for better search filtering:
{
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, date, category"
}