Video & Audio
Transcribe video and audio content
Overview#
The Video connector enables ingestion of video and audio content by extracting audio and transcribing it using OpenAI's Whisper model.
Features#
- Direct transcription from YouTube URLs
- Transcription of any publicly accessible video URL
- Support for MP3, WAV, and other audio formats
- High-accuracy transcription with OpenAI Whisper
Supported Sources#
| Source | Example |
|---|---|
| YouTube | https://youtube.com/watch?v=... |
| Video URL | https://example.com/video.mp4 |
| Audio URL | https://example.com/audio.mp3 |
| Uploaded Audio | MP3, WAV, M4A files |
| Uploaded Video | MP4, MOV, WebM files |
How It Works#
- Download: Video/audio is downloaded from URL
- Extract: FFmpeg extracts audio from video
- Transcribe: OpenAI Whisper converts speech to text
- Process: Transcription is chunked and embedded
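The Extract and Process steps can be sketched as follows. This is a minimal illustration, not the connector's actual implementation: the function names, the mono 16 kHz output settings, and the word-boundary chunking rule are all assumptions. The FFmpeg flags used (`-y`, `-i`, `-vn`, `-ac`, `-ar`) are standard FFmpeg options.

```python
def build_extract_command(video_path: str, audio_path: str,
                          ffmpeg_path: str = "/usr/bin/ffmpeg") -> list:
    """Build an FFmpeg command that strips the video stream and keeps audio.

    Hypothetical helper: the output settings (mono, 16 kHz) are an assumption,
    chosen because they are a common input format for speech recognition.
    """
    return [
        ffmpeg_path,
        "-y",             # overwrite the output file if it exists
        "-i", video_path,
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to one audio channel
        "-ar", "16000",   # resample to 16 kHz
        audio_path,
    ]


def chunk_text(text: str, max_chars: int = 1000) -> list:
    """Split a transcription into chunks of at most max_chars, on word boundaries.

    Hypothetical chunking rule; a single word longer than max_chars is kept
    whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The command list can be passed to `subprocess.run(cmd, check=True, timeout=60)`, where the 60-second timeout mirrors the `AUDIO_EXTRACTION_TIMEOUT_MS=60000` setting.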
Configuration#
Video Connector Config#
{
  "maxDuration": 10800,
  "language": "en",
  "whisperModel": "whisper-1",
  "generateTimestamps": true
}
Configuration Options#
| Option | Type | Default | Description |
|---|---|---|---|
| maxDuration | Integer | 10800 | Maximum duration in seconds (3 hours) |
| language | String | auto | Language hint for transcription |
| whisperModel | String | whisper-1 | OpenAI Whisper model |
| generateTimestamps | Boolean | true | Include timestamps in output |
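As an illustration of how the defaults in the table combine with a partial config, the sketch below merges user overrides onto the documented defaults and rejects unknown keys or wrong types. `resolve_config` is a hypothetical helper, and the validation rules are assumptions rather than the connector's documented behavior.

```python
# Defaults and types copied from the Configuration Options table.
DEFAULTS = {
    "maxDuration": 10800,
    "language": "auto",
    "whisperModel": "whisper-1",
    "generateTimestamps": True,
}
EXPECTED_TYPES = {
    "maxDuration": int,
    "language": str,
    "whisperModel": str,
    "generateTimestamps": bool,
}


def resolve_config(overrides: dict) -> dict:
    """Return the defaults with overrides applied (hypothetical validation)."""
    config = dict(DEFAULTS)
    for key, value in overrides.items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown option: {key}")
        expected = EXPECTED_TYPES[key]
        # bool is a subclass of int in Python, so reject True/False for integers.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
        config[key] = value
    return config
```

For example, `resolve_config({"language": "en"})` keeps the other three options at their defaults and only changes the language hint.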
Environment Variables#
# Video Processing Configuration
VIDEO_MAX_SIZE_MB=500
VIDEO_MAX_DURATION_SECONDS=10800
AUDIO_EXTRACTION_TIMEOUT_MS=60000
FFMPEG_PATH=/usr/bin/ffmpeg
# URL Download Configuration
URL_DOWNLOAD_MAX_SIZE=524288000
URL_DOWNLOAD_TIMEOUT=120000
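The following sketch shows one way a service could read these variables and normalize their units. It assumes that `VIDEO_MAX_SIZE_MB` is counted in units of 1024 × 1024 bytes (which makes 500 MB equal to the `URL_DOWNLOAD_MAX_SIZE` value of 524288000) and that both timeouts are in milliseconds. `load_video_settings` and the returned key names are hypothetical.

```python
import os


def load_video_settings(env=None) -> dict:
    """Read the video-processing variables, falling back to the values shown above."""
    env = os.environ if env is None else env

    def as_int(name: str, default: int) -> int:
        return int(env.get(name, default))

    return {
        # Assumption: 1 MB = 1024 * 1024 bytes.
        "max_size_bytes": as_int("VIDEO_MAX_SIZE_MB", 500) * 1024 * 1024,
        "max_duration_s": as_int("VIDEO_MAX_DURATION_SECONDS", 10800),
        # Assumption: timeouts are given in milliseconds; convert to seconds.
        "extraction_timeout_s": as_int("AUDIO_EXTRACTION_TIMEOUT_MS", 60000) / 1000,
        "ffmpeg_path": env.get("FFMPEG_PATH", "/usr/bin/ffmpeg"),
        "download_max_bytes": as_int("URL_DOWNLOAD_MAX_SIZE", 524288000),
        "download_timeout_s": as_int("URL_DOWNLOAD_TIMEOUT", 120000) / 1000,
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in isolation, for example `load_video_settings({"VIDEO_MAX_SIZE_MB": "100"})`.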
Processing YouTube Videos#
Via Pipeline Execution#
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"]
  }'
Batch YouTube Processing#
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.youtube.com/watch?v=video1",
      "https://www.youtube.com/watch?v=video2",
      "https://www.youtube.com/watch?v=video3"
    ]
  }'
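The same request can be built from Python using only the standard library. This is a sketch under the assumption that the endpoint, headers, and `urls` body are exactly as in the curl commands above; `build_execute_request` is a hypothetical helper, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/api/v2"  # same host as the curl examples


def build_execute_request(kb_id: str, pipeline_id: str,
                          urls: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a pipeline execution request for a list of URLs."""
    endpoint = f"{BASE_URL}/knowledgebases/{kb_id}/pipelines/{pipeline_id}/execute"
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request)`. A single video and a batch use the same call; only the length of `urls` differs.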
Processing Audio Files#
Upload Audio#
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "files=@podcast-episode.mp3"
Transcription Output#
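Each transcription result contains the full text plus a metadata object with the duration in seconds, the language, the source, the video title, and (when generateTimestamps is enabled) timestamped segments:
{
  "text": "Full transcription text...",
  "metadata": {
    "duration": 1800,
    "language": "en",
    "source": "youtube",
    "videoTitle": "Product Demo 2024",
    "timestamps": [
      { "start": 0, "end": 30, "text": "Introduction..." },
      { "start": 30, "end": 60, "text": "Feature overview..." }
    ]
  }
}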
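To show how the `timestamps` array can be used, the sketch below searches the segments for a keyword and returns the matching start times in HH:MM:SS form. It assumes the output shape shown above, with `start` and `end` in seconds; `format_timestamp` and `find_segments` are hypothetical helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def find_segments(transcription: dict, keyword: str) -> list:
    """Return (start time, text) pairs for segments that mention the keyword."""
    needle = keyword.lower()
    segments = transcription.get("metadata", {}).get("timestamps", [])
    return [
        (format_timestamp(segment["start"]), segment["text"])
        for segment in segments
        if needle in segment["text"].lower()
    ]


sample = {
    "text": "Full transcription text...",
    "metadata": {
        "duration": 1800,
        "timestamps": [
            {"start": 0, "end": 30, "text": "Introduction..."},
            {"start": 30, "end": 60, "text": "Feature overview..."},
        ],
    },
}
print(find_segments(sample, "feature"))  # [('00:00:30', 'Feature overview...')]
```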
Requirements#
Video processing requires FFmpeg to be installed. This is included in the Docker image.
System Requirements#
- FFmpeg: For audio extraction
- Memory: ~500MB per hour of video
- Storage: Temporary storage for downloaded files
Docker Setup#
The IngestIQ Docker image installs FFmpeg with the following Dockerfile instruction:
RUN apt-get update && apt-get install -y ffmpeg
Limits#
| Limit | Default | Environment Variable |
|---|---|---|
| Max file size | 500MB | VIDEO_MAX_SIZE_MB |
| Max duration | 3 hours | VIDEO_MAX_DURATION_SECONDS |
| Download timeout | 2 minutes | URL_DOWNLOAD_TIMEOUT |
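A client can check a file against the size and duration limits before submitting it. The sketch below is illustrative only: `check_limits` is a hypothetical helper, the size limit is assumed to use 1024 × 1024 bytes per MB, and the size message is descriptive text rather than a documented error code.

```python
def check_limits(size_bytes: int, duration_seconds: float,
                 max_size_mb: int = 500,
                 max_duration_seconds: int = 10800) -> list:
    """Return a list of limit violations; an empty list means the file passes."""
    problems = []
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if size_bytes > max_size_mb * 1024 * 1024:
        problems.append("file exceeds VIDEO_MAX_SIZE_MB")
    if duration_seconds > max_duration_seconds:
        problems.append("DURATION_EXCEEDED")
    return problems
```

Values exactly at a limit are treated as allowed here; whether the service itself uses an inclusive or exclusive comparison is not stated.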
Best Practices#
Longer videos take more time and resources. Consider:
- Splitting long videos into segments
- Processing key sections only
- Using timestamps to locate relevant content
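As a sketch of the splitting suggestion, the function below computes (start, end) second ranges for fixed-length segments. `segment_ranges` is a hypothetical helper and the 30-minute default is an arbitrary choice, not a documented recommendation.

```python
def segment_ranges(total_seconds: int, segment_seconds: int = 1800) -> list:
    """Split a duration into consecutive (start, end) ranges in seconds."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    return [
        (start, min(start + segment_seconds, total_seconds))
        for start in range(0, total_seconds, segment_seconds)
    ]


print(segment_ranges(4000))  # [(0, 1800), (1800, 3600), (3600, 4000)]
```

Each range can then be cut from the source file, for example with FFmpeg's `-ss` (start offset) and `-t` (duration) options, and submitted as a separate file.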
Specify the language if known for better accuracy:
{ "language": "en" }
Timestamps help users locate information in the original video:
{ "generateTimestamps": true }
Error Handling#
| Error | Cause | Solution |
|---|---|---|
| DOWNLOAD_FAILED | Cannot download video | Check URL accessibility |
| DURATION_EXCEEDED | Video too long | Increase limit or split video |
| AUDIO_EXTRACTION_FAILED | FFmpeg error | Check FFmpeg installation |
| TRANSCRIPTION_FAILED | Whisper API error | Check OpenAI API key and quota |
| UNSUPPORTED_FORMAT | Unknown format | Use supported format |
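Client code can map these error codes to the suggested solutions. In the sketch below, the codes and solutions come from the table, while the split into retryable and non-retryable errors is an assumption (network and API failures are often transient; the others usually need a change on the caller's side). `advise` is a hypothetical helper.

```python
# Solutions copied from the Error Handling table.
SOLUTIONS = {
    "DOWNLOAD_FAILED": "Check URL accessibility",
    "DURATION_EXCEEDED": "Increase limit or split video",
    "AUDIO_EXTRACTION_FAILED": "Check FFmpeg installation",
    "TRANSCRIPTION_FAILED": "Check OpenAI API key and quota",
    "UNSUPPORTED_FORMAT": "Use supported format",
}

# Assumption: only these two are worth retrying without changing the input.
RETRYABLE = {"DOWNLOAD_FAILED", "TRANSCRIPTION_FAILED"}


def advise(error_code: str) -> dict:
    """Return the suggested solution and a retry hint for an error code."""
    return {
        "solution": SOLUTIONS.get(error_code, "Unknown error code"),
        "retry": error_code in RETRYABLE,
    }
```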
Cost Considerations#
OpenAI bills Whisper API transcription by audio duration. Pricing at the time of writing is approximately $0.006 per minute.
A 1-hour video therefore costs approximately $0.36 to transcribe (60 minutes × $0.006).
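The arithmetic can be wrapped in a small estimator. This sketch assumes cost scales linearly with duration at the approximate rate quoted above; the actual billing granularity and current price should be checked against OpenAI's pricing page. `estimate_cost` is a hypothetical helper.

```python
PRICE_PER_MINUTE_USD = 0.006  # approximate rate quoted in this section


def estimate_cost(duration_seconds: float,
                  price_per_minute: float = PRICE_PER_MINUTE_USD) -> float:
    """Estimate the transcription cost in USD, rounded to four decimal places."""
    return round(duration_seconds / 60 * price_per_minute, 4)


print(estimate_cost(3600))   # 0.36  (1 hour)
print(estimate_cost(10800))  # 1.08  (3 hours, the default maximum duration)
```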