
# Pipelines

Configure automated data ingestion workflows

## What is a Pipeline?

A Pipeline is a configured workflow that defines how documents are ingested, processed, and stored in a Knowledge Base. Each pipeline connects four key components:


## Pipeline Architecture

- Source Connector: where data comes from (files, web, Google Drive, video)
- Parser Model: AI model for semantic chunking (Gemini, OpenAI)
- Embedding Model: generates vector embeddings (OpenAI)
- Destination: where vectors are stored (pgvector)
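The four components are wired together by references to their configuration IDs. A minimal sketch of that relationship (class and field names here are illustrative, not IngestIQ's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Illustrative model: a pipeline is four component config references plus metadata."""
    name: str
    source_connector_config_id: str       # where data comes from
    parser_model_config_id: str           # AI model for semantic chunking
    embedding_model_config_id: str        # model that generates vector embeddings
    destination_connector_config_id: str  # where vectors are stored
    description: str = ""

pipeline = PipelineConfig(
    name="PDF Document Pipeline",
    source_connector_config_id="uuid-of-source-config",
    parser_model_config_id="uuid-of-parser-config",
    embedding_model_config_id="uuid-of-embedding-config",
    destination_connector_config_id="uuid-of-destination-config",
)
```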

## Creating a Pipeline

### Via API

```shell
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "PDF Document Pipeline",
    "description": "Process uploaded PDF files",
    "sourceConnectorConfigId": "uuid-of-source-config",
    "parserModelConfigId": "uuid-of-parser-config",
    "embeddingModelConfigId": "uuid-of-embedding-config",
    "destinationConnectorConfigId": "uuid-of-destination-config"
  }'
```

## Pipeline Configuration

| Field | Required | Description |
| --- | --- | --- |
| name | Required | Pipeline name (max 255 chars) |
| description | Optional | Pipeline description |
| sourceConnectorConfigId | Required | Source connector configuration |
| parserModelConfigId | Required | AI model for parsing |
| embeddingModelConfigId | Required | Model for embeddings |
| destinationConnectorConfigId | Required | Vector storage destination |
| parsingPrompt | Required | Custom parsing instructions |
| metadataParsingPrompt | Optional | Custom metadata extraction prompt |
| isMetadataPrompt | Optional | Enable metadata extraction |
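A client can check a creation payload against the required fields above before POSTing it. A hedged sketch (the server performs its own validation; this helper and its error messages are not part of IngestIQ):

```python
# Required fields per the configuration table above.
REQUIRED_FIELDS = {
    "name",
    "sourceConnectorConfigId",
    "parserModelConfigId",
    "embeddingModelConfigId",
    "destinationConnectorConfigId",
    "parsingPrompt",
}

def validate_pipeline_payload(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the payload looks OK."""
    errors = [f"missing required field: {f}"
              for f in sorted(REQUIRED_FIELDS - payload.keys())]
    if len(payload.get("name", "")) > 255:
        errors.append("name exceeds 255 characters")
    # Enabling metadata extraction without a prompt gives the parser nothing to do.
    if payload.get("isMetadataPrompt") and not payload.get("metadataParsingPrompt"):
        errors.append("isMetadataPrompt is set but metadataParsingPrompt is missing")
    return errors
```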

## Scheduling

Pipelines can be scheduled to run automatically:

```json
{
  "scheduleConfig": {
    "enable_automation": true,
    "interval_type": "Daily",
    "interval_time": "09:00",
    "timezone": "UTC"
  }
}
```

Scheduling requires `ENABLE_SCHEDULER=true` in your environment configuration.
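With a Daily schedule, the pipeline fires once per day at interval_time in the configured timezone. A sketch of the next-run calculation such a scheduler might perform, fixed to UTC for simplicity (illustrative only, not IngestIQ's implementation):

```python
from datetime import datetime, timedelta, timezone

def next_daily_run(now: datetime, interval_time: str) -> datetime:
    """Next run at HH:MM UTC: later today if that time is still ahead, else tomorrow."""
    hh, mm = (int(part) for part in interval_time.split(":"))
    candidate = now.replace(hour=hh, minute=mm, second=0, microsecond=0)
    if candidate <= now:  # today's slot already passed
        candidate += timedelta(days=1)
    return candidate

now = datetime(2024, 5, 1, 10, 30, tzinfo=timezone.utc)
print(next_daily_run(now, "09:00"))  # 09:00 has passed, so the next run is tomorrow at 09:00 UTC
```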

## Executing a Pipeline

### Manual Execution

Execute a pipeline immediately:

```shell
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [/* file data */]
  }'
```

### Execution Status

Each execution creates a tracking record:

| Status | Description |
| --- | --- |
| pending | Execution queued |
| processing | Currently processing |
| completed | Successfully finished |
| partial_failed | Some documents failed |
| failed | Execution failed |

### View Execution History

```shell
curl http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/executions \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
```
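Because an execution moves through the statuses above, a client can poll until it reaches a terminal one. A sketch with the HTTP call injected as a callable so the loop itself is testable (the callable's shape and the retry limits are assumptions, not part of the API):

```python
import time

# Terminal states from the execution status table.
TERMINAL_STATUSES = {"completed", "partial_failed", "failed"}

def poll_execution(fetch_status, interval_s: float = 5.0, max_attempts: int = 60) -> str:
    """Call fetch_status() until it returns a terminal status, then return it.

    fetch_status is any zero-argument callable, e.g. one that GETs the
    executions endpoint and extracts the latest execution's status.
    """
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError("execution did not reach a terminal status")

# Usage with a fake fetcher that finishes on the third call:
statuses = iter(["pending", "processing", "completed"])
result = poll_execution(lambda: next(statuses), interval_s=0.0)
print(result)  # completed
```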

## Custom Prompts

Customize how documents are parsed:

### Parsing Prompt

Control how the AI chunks your documents:

```json
{
  "parsingPrompt": "Extract information focusing on: \n1. Key definitions\n2. Code examples\n3. Step-by-step instructions\nPreserve code blocks exactly as written."
}
```

### Metadata Extraction

Extract structured metadata from documents:

```json
{
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, author, date, keywords, summary (max 200 words)"
}
```

## Pipeline Status

| Status | Description |
| --- | --- |
| active | Pipeline is enabled and can be executed |
| inactive | Pipeline is disabled |

## Best Practices

Tailor your parsing prompt to your document type:

- Legal docs: focus on clauses, definitions, obligations
- Technical docs: preserve code and command examples
- Research papers: extract citations, methodology, findings

Regularly check execution status to catch failures early.

Run manual executions before enabling scheduled runs.
