Connectors
Source and destination connectors for data ingestion
What are Connectors?
Connectors are the integration points that define where data comes from (sources) and where it's stored (destinations). Each pipeline uses one source connector and one destination connector.
Connector Categories
- Source connectors: define where data originates
- Destination connectors: define where processed data is stored
Available Source Connectors
| Connector | Code | Description |
|---|---|---|
| File Upload | file_upload | Direct file uploads (PDF, CSV, Excel) |
| Web Scraping | web_scrape | Crawl4AI web content extraction |
| Google Drive | google_drive | OAuth2 Google Drive integration |
| Video/YouTube | video | Video transcription via Whisper |
| Audio | audio | Audio file transcription |
| Image | image | Image OCR via Gemini Vision |
Available Destination Connectors
| Connector | Code | Description |
|---|---|---|
| pgvector | pgvector | PostgreSQL with pgvector extension |
Connector Configuration
Each connector type has a configuration schema that defines required and optional settings.
Creating a Connector Config
curl -X POST http://localhost:3000/api/v2/connector-configs \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "My File Upload Config",
    "connectorTypeId": 1,
    "config": {
      "maxFileSize": "50MB",
      "allowedFormats": ["pdf", "csv", "xlsx"]
    }
  }'
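The same request can be built from a script. The following is a minimal sketch using only the Python standard library; the helper name `build_create_config_request` is hypothetical, and the endpoint, headers, and payload are copied from the curl example. It builds the request without sending it, and the commented-out send step assumes the server replies with JSON, which this page does not confirm.

```python
import json
import urllib.request


def build_create_config_request(base_url, token, name, connector_type_id, config):
    """Build (but do not send) the POST request that creates a connector config."""
    payload = {
        "name": name,
        "connectorTypeId": connector_type_id,
        "config": config,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v2/connector-configs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_config_request(
    base_url="http://localhost:3000",
    token="YOUR_JWT_TOKEN",
    name="My File Upload Config",
    connector_type_id=1,
    config={"maxFileSize": "50MB", "allowedFormats": ["pdf", "csv", "xlsx"]},
)

# To send it (needs a running server and a valid token; JSON reply is assumed):
# with urllib.request.urlopen(request) as response:
#     created = json.load(response)
```

Keeping the builder separate from the send step makes the payload easy to inspect or log before anything reaches the API.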
Source Connector Details
File Upload (file_upload)
Description: Upload files directly via the API
Supported Formats: PDF, CSV, Excel, Word, Images
Configuration:
{
  "maxFileSize": "50MB",
  "allowedFormats": ["pdf", "csv", "xlsx", "docx"]
}
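To illustrate how a client might pre-check a file against a file upload config before sending it, here is a small sketch. The helpers `parse_size` and `is_upload_allowed` are hypothetical, and the interpretation of "50MB" as binary megabytes (1 MB = 1024 × 1024 bytes) is an assumption; the page does not say how the server parses size strings.

```python
import re
from pathlib import PurePath

# Assumption: size strings look like "50MB" and units are binary multiples.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text):
    """Convert a size string such as "50MB" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|KB|MB|GB)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def is_upload_allowed(filename, size_bytes, config):
    """Return True if the file's extension and size both fit the config."""
    extension = PurePath(filename).suffix.lstrip(".").lower()
    return (
        extension in config["allowedFormats"]
        and size_bytes <= parse_size(config["maxFileSize"])
    )


config = {"maxFileSize": "50MB", "allowedFormats": ["pdf", "csv", "xlsx", "docx"]}
pdf_ok = is_upload_allowed("report.PDF", 1024, config)
txt_ok = is_upload_allowed("notes.txt", 1024, config)
```

A client-side check like this only saves a round trip; the server remains the authority on what it accepts.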
Web Scraping (web_scrape)
Description: Extract content from web pages using Crawl4AI
Configuration:
{
  "url": "https://example.com",
  "depth": 2,
  "maxPages": 100
}
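One plausible reading of `depth` and `maxPages` is that the crawl follows links at most `depth` hops from the start URL and stops once `maxPages` pages have been collected. The sketch below models that reading with a breadth-first walk over an in-memory link map. It is not Crawl4AI's API, and both the semantics and the function `plan_crawl` are assumptions made for illustration.

```python
from collections import deque


def plan_crawl(start_url, links, depth, max_pages):
    """Breadth-first walk over `links`, bounded by hop depth and page count."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level == depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if len(visited) >= max_pages:
                return visited
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, level + 1))
    return visited


# Hypothetical link structure standing in for real pages.
links = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/a/1": ["https://example.com/a/1/x"],
}
pages = plan_crawl("https://example.com", links, depth=2, max_pages=100)
```

With `depth` set to 2, the page three hops away (`/a/1/x`) is never reached; lowering `max_pages` cuts the walk short even earlier.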
Google Drive (google_drive)
Description: Sync documents from Google Drive folders
Requirements: OAuth2 authentication
Configuration:
{
  "folderId": "google-drive-folder-id",
  "includeSubfolders": true
}
Video/YouTube (video)
Description: Transcribe video content using OpenAI Whisper
Supported Sources: YouTube URLs, direct video URLs
Configuration:
{
  "maxDuration": 10800,
  "extractAudio": true,
  "whisperModel": "whisper-1"
}
Destination Connector Details
pgvector
Description: PostgreSQL with the pgvector extension for efficient similarity search
Features:
- HNSW indexing for fast approximate nearest neighbor search
- Metadata filtering
- Hybrid search (vector + keyword)
Configuration:
{
  "tableName": "document_embeddings",
  "dimensions": 1536,
  "indexType": "hnsw"
}
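To make the three settings concrete, the sketch below renders the kind of SQL such a config could translate to. The `CREATE EXTENSION`, `vector(n)`, `USING hnsw`, and `vector_cosine_ops` syntax comes from pgvector itself. The column layout (`id`, `content`, `metadata`, `embedding`), the choice of cosine distance, and the function `render_pgvector_ddl` are assumptions; the schema the connector actually creates is not documented here.

```python
import re


def render_pgvector_ddl(config):
    """Render CREATE statements for an embeddings table and its vector index."""
    table = config["tableName"]
    # Identifiers cannot be bound as query parameters, so validate before interpolating.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    dimensions = int(config["dimensions"])
    index_type = config["indexType"]
    if index_type not in ("hnsw", "ivfflat"):
        raise ValueError(f"unsupported index type: {index_type!r}")
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id bigserial PRIMARY KEY, "
        "content text, "
        "metadata jsonb, "  # assumed column backing metadata filtering
        f"embedding vector({dimensions}));",
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_idx ON {table} "
        f"USING {index_type} (embedding vector_cosine_ops);",
    ]


statements = render_pgvector_ddl(
    {"tableName": "document_embeddings", "dimensions": 1536, "indexType": "hnsw"}
)
```

The `dimensions` value has to match the output size of the embedding model that writes to the table, since pgvector rejects vectors of a different length.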
Connector Type Properties
| Property | Type | Description |
|---|---|---|
| id | Integer | Unique identifier |
| name | String | Connector name |
| category | Enum | source or destination |
| description | String | Connector description |
| configSchema | Object | JSON Schema for config validation |
| uniqueCode | String | Unique connector code |
| isActive | Boolean | Whether connector is enabled |
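On the client side, these properties map naturally onto a typed record. The following is a minimal sketch: the class `ConnectorType` and its methods are hypothetical, the sample record's values are made up, and `missing_required` inspects only the standard JSON Schema `required` keyword rather than performing full schema validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorType:
    """One connector type record, using the properties from the table above."""

    id: int
    name: str
    category: str  # "source" or "destination"
    description: str
    config_schema: dict  # JSON Schema used to validate connector configs
    unique_code: str
    is_active: bool

    @classmethod
    def from_api(cls, record):
        """Map the camelCase API fields onto snake_case attributes."""
        if record["category"] not in ("source", "destination"):
            raise ValueError(f"unknown category: {record['category']!r}")
        return cls(
            id=record["id"],
            name=record["name"],
            category=record["category"],
            description=record["description"],
            config_schema=record["configSchema"],
            unique_code=record["uniqueCode"],
            is_active=record["isActive"],
        )

    def missing_required(self, config):
        """List keys the schema marks as required but the config lacks."""
        required = self.config_schema.get("required", [])
        return [key for key in required if key not in config]


# Illustrative record; every value below is made up for the example.
record = {
    "id": 1,
    "name": "File Upload",
    "category": "source",
    "description": "Direct file uploads",
    "configSchema": {"type": "object", "required": ["maxFileSize"]},
    "uniqueCode": "file_upload",
    "isActive": True,
}
connector_type = ConnectorType.from_api(record)
missing = connector_type.missing_required({"allowedFormats": ["pdf"]})
```

For complete validation against `configSchema`, a dedicated JSON Schema validator would be the better tool; the check here only catches absent required keys.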
Listing Connector Types
curl http://localhost:3000/api/v2/connector-configs/types \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
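A script would typically call this endpoint and then pick out the connector codes it can use. The sketch below builds the request without sending it and filters a sample payload. It assumes the endpoint returns a JSON array of objects carrying the properties listed under Connector Type Properties, which this page does not state explicitly; the helper names and the sample data (including the inactive `audio` entry) are hypothetical.

```python
import urllib.request


def build_list_types_request(base_url, token):
    """Build (but do not send) the GET request that lists connector types."""
    return urllib.request.Request(
        url=f"{base_url}/api/v2/connector-configs/types",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def active_codes(connector_types, category):
    """Return the unique codes of enabled connector types in one category."""
    return [
        item["uniqueCode"]
        for item in connector_types
        if item["category"] == category and item["isActive"]
    ]


request = build_list_types_request("http://localhost:3000", "YOUR_JWT_TOKEN")

# Made-up sample standing in for the (assumed) response body.
sample_response = [
    {"uniqueCode": "file_upload", "category": "source", "isActive": True},
    {"uniqueCode": "audio", "category": "source", "isActive": False},
    {"uniqueCode": "pgvector", "category": "destination", "isActive": True},
]
sources = active_codes(sample_response, "source")
destinations = active_codes(sample_response, "destination")
```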
Best Practices
Create connector configs once and reuse them across multiple pipelines:
- Create a "PDF Upload Config" used by all PDF pipelines
- Create a "Marketing Website Scraper" for marketing content
Name configs descriptively:
- ✅ "Engineering Docs - Google Drive"
- ✅ "Customer Support Videos"
- ❌ "Config 1"