# Pipelines

Configure automated data ingestion workflows.
## What is a Pipeline?
A Pipeline is a configured workflow that defines how documents are ingested, processed, and stored in a Knowledge Base.

## Pipeline Architecture

Each pipeline connects four key components:

- **Source Connector**: where data comes from (files, web, Google Drive, video)
- **Parser Model**: the AI model used for semantic chunking (Gemini, OpenAI)
- **Embedding Model**: generates vector embeddings (OpenAI)
- **Destination Connector**: where vectors are stored (pgvector)
## Creating a Pipeline

### Via API
```bash
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "PDF Document Pipeline",
    "description": "Process uploaded PDF files",
    "sourceConnectorConfigId": "uuid-of-source-config",
    "parserModelConfigId": "uuid-of-parser-config",
    "embeddingModelConfigId": "uuid-of-embedding-config",
    "destinationConnectorConfigId": "uuid-of-destination-config"
  }'
```
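Later execution and scheduling calls need the new pipeline's ID. A minimal sketch for capturing it, assuming the response body is the created pipeline object with an `id` field (`KB_ID`, `TOKEN`, and `pipeline.json` are placeholders):

```bash
# Create the pipeline and keep its ID for later calls.
# Assumes the response body is the created pipeline object with an
# "id" field; adjust the jq path if your response shape differs.
PIPELINE_ID=$(curl -s -X POST \
  "http://localhost:3000/api/v2/knowledgebases/${KB_ID}/pipelines" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d @pipeline.json | jq -r '.id')
echo "Created pipeline ${PIPELINE_ID}"
```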
## Pipeline Configuration
| Field | Required | Description |
|---|---|---|
| `name` | Required | Pipeline name (max 255 chars) |
| `description` | Optional | Human-readable description |
| `sourceConnectorConfigId` | Required | Source connector configuration |
| `parserModelConfigId` | Required | AI model configuration for parsing |
| `embeddingModelConfigId` | Required | Model configuration for embeddings |
| `destinationConnectorConfigId` | Required | Vector storage destination |
| `parsingPrompt` | Optional | Custom parsing instructions |
| `metadataParsingPrompt` | Optional | Custom metadata extraction prompt |
| `isMetadataPrompt` | Optional | Enable metadata extraction |
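Since the optional prompt fields are part of the same configuration, a creation payload can carry everything at once. A sketch with illustrative values:

```json
{
  "name": "Technical Docs Pipeline",
  "description": "Chunk engineering docs with code-aware parsing",
  "sourceConnectorConfigId": "uuid-of-source-config",
  "parserModelConfigId": "uuid-of-parser-config",
  "embeddingModelConfigId": "uuid-of-embedding-config",
  "destinationConnectorConfigId": "uuid-of-destination-config",
  "parsingPrompt": "Preserve code blocks exactly as written; chunk by section.",
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, author, keywords"
}
```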
## Scheduling
Pipelines can be scheduled to run automatically:
```json
{
  "scheduleConfig": {
    "enable_automation": true,
    "interval_type": "Daily",
    "interval_time": "09:00",
    "timezone": "UTC"
  }
}
```
Scheduling requires `ENABLE_SCHEDULER=true` in your environment configuration.
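As a sketch, assuming the `scheduleConfig` block is accepted inline in the pipeline creation payload (rather than through a separate endpoint), a scheduled pipeline might be created like this:

```bash
# Sketch: create a pipeline that runs daily at 09:00 UTC.
# Assumes scheduleConfig is accepted inline in the creation payload;
# check your API reference if schedules are managed separately.
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly Drive Sync",
    "sourceConnectorConfigId": "uuid-of-source-config",
    "parserModelConfigId": "uuid-of-parser-config",
    "embeddingModelConfigId": "uuid-of-embedding-config",
    "destinationConnectorConfigId": "uuid-of-destination-config",
    "scheduleConfig": {
      "enable_automation": true,
      "interval_type": "Daily",
      "interval_time": "09:00",
      "timezone": "UTC"
    }
  }'
```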
## Executing a Pipeline

### Manual Execution
Execute a pipeline immediately:
```bash
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [/* file data */]
  }'
```
### Execution Status
Each execution creates a tracking record:
| Status | Description |
|---|---|
| `pending` | Execution queued |
| `processing` | Currently processing |
| `completed` | Successfully finished |
| `partial_failed` | Some documents failed |
| `failed` | Execution failed |
### View Execution History
```bash
curl http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/executions \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
```
## Custom Prompts

Customize how documents are parsed.

### Parsing Prompt
Control how the AI chunks your documents:
```json
{
  "parsingPrompt": "Extract information focusing on:\n1. Key definitions\n2. Code examples\n3. Step-by-step instructions\nPreserve code blocks exactly as written."
}
```
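Short, concrete instructions like these tend to chunk more predictably than long prose directives: listing what to preserve (definitions, code, steps) gives the parser clear boundaries to split on.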
### Metadata Extraction
Extract structured metadata from documents:
```json
{
  "isMetadataPrompt": true,
  "metadataParsingPrompt": "Extract: title, author, date, keywords, summary (max 200 words)"
}
```
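With a prompt like that, the metadata attached to a document might look roughly like this (purely illustrative; the exact shape depends on your parser model and prompt):

```json
{
  "title": "Quarterly Security Review",
  "author": "Jane Doe",
  "date": "2024-03-15",
  "keywords": ["security", "audit", "compliance"],
  "summary": "Reviews audit findings from Q1 and lists remediation steps."
}
```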
## Pipeline Status
| Status | Description |
|---|---|
| `active` | Pipeline is enabled and can be executed |
| `inactive` | Pipeline is disabled |
## Best Practices

Tailor your parsing prompt to your document type (see the sketch after this list):

- Legal docs: focus on clauses, definitions, and obligations
- Technical docs: preserve code and command examples
- Research papers: extract citations, methodology, and findings
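For instance, a parsing prompt for legal documents might look like this (the wording is illustrative, not a built-in template):

```json
{
  "parsingPrompt": "Chunk by clause. For each chunk, keep the clause number and heading, note defined terms, and flag obligations (shall/must) and deadlines."
}
```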
Regularly check execution status to catch failures early, and run manual executions to validate a pipeline before enabling scheduled runs.