IngestIQ

Video & Audio

Transcribe video and audio content

Overview

The Video connector enables ingestion of video and audio content by extracting audio and transcribing it using OpenAI's Whisper model.

Features

  • YouTube Support: Direct YouTube URL transcription
  • Video URLs: Any publicly accessible video URL
  • Audio Files: MP3, WAV, and other audio formats
  • Whisper AI: High-accuracy transcription

Supported Sources

| Source | Example |
| --- | --- |
| YouTube | https://youtube.com/watch?v=... |
| Video URL | https://example.com/video.mp4 |
| Audio URL | https://example.com/audio.mp3 |
| Uploaded Audio | MP3, WAV, M4A files |
| Uploaded Video | MP4, MOV, WebM files |

How It Works

  1. Download: The video or audio file is downloaded from the URL
  2. Extract: FFmpeg extracts audio from video
  3. Transcribe: OpenAI Whisper converts speech to text
  4. Process: Transcription is chunked and embedded
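
The Extract step can be sketched as a single FFmpeg invocation. This is a minimal sketch, not IngestIQ's actual command: the function names, the mono 16 kHz output, and the choice of flags are assumptions made for illustration. The default binary path and the 60-second timeout mirror the `FFMPEG_PATH` and `AUDIO_EXTRACTION_TIMEOUT_MS` values shown under Environment Variables.

```python
import subprocess


def build_extract_command(input_path, output_path, ffmpeg_path="/usr/bin/ffmpeg"):
    """Build an FFmpeg command that drops the video stream and keeps the audio."""
    return [
        ffmpeg_path,
        "-y",            # overwrite the output file if it already exists
        "-i", input_path,
        "-vn",           # discard the video stream
        "-ac", "1",      # downmix to mono (assumed setting)
        "-ar", "16000",  # resample to 16 kHz (assumed setting)
        output_path,
    ]


def extract_audio(input_path, output_path, timeout_seconds=60):
    """Run the extraction; raises if FFmpeg exits non-zero or exceeds the timeout."""
    command = build_extract_command(input_path, output_path)
    subprocess.run(command, check=True, capture_output=True, timeout=timeout_seconds)
```

Building the argument list separately from running it keeps the command easy to inspect and log before any process is started.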

Configuration

Video Connector Config

{
  "maxDuration": 10800,
  "language": "en",
  "whisperModel": "whisper-1",
  "generateTimestamps": true
}

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxDuration | Integer | 10800 | Max duration in seconds (3 hours) |
| language | String | auto | Language hint for transcription |
| whisperModel | String | whisper-1 | OpenAI Whisper model |
| generateTimestamps | Boolean | true | Include timestamps in output |
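
To show how the documented defaults combine with user-supplied values, here is a hedged sketch of a config merge step. The helper `normalize_config` is hypothetical, and its validation rules (rejecting unknown keys, strict type checks, a positive `maxDuration`) are assumptions, not confirmed connector behavior.

```python
# Defaults taken from the configuration options listed above.
DEFAULTS = {
    "maxDuration": 10800,
    "language": "auto",
    "whisperModel": "whisper-1",
    "generateTimestamps": True,
}


def normalize_config(overrides=None):
    """Merge overrides onto the defaults and reject values of the wrong type."""
    config = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise ValueError(f"Unknown option: {key}")
        expected = type(DEFAULTS[key])
        # bool is a subclass of int, so compare exact types to keep True out of maxDuration
        if type(value) is not expected:
            raise TypeError(f"{key} must be of type {expected.__name__}")
        config[key] = value
    if config["maxDuration"] <= 0:
        raise ValueError("maxDuration must be a positive number of seconds")
    return config
```

For example, `normalize_config({"language": "en"})` returns the defaults with only the language hint changed.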

Environment Variables

# Video Processing Configuration
VIDEO_MAX_SIZE_MB=500
VIDEO_MAX_DURATION_SECONDS=10800
AUDIO_EXTRACTION_TIMEOUT_MS=60000
FFMPEG_PATH=/usr/bin/ffmpeg

# URL Download Configuration
URL_DOWNLOAD_MAX_SIZE=524288000
URL_DOWNLOAD_TIMEOUT=120000
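
The variables above mix units, so a small loader that converts them can help when writing tooling around the connector. This is a sketch under stated assumptions: `load_limits` is a hypothetical helper, `URL_DOWNLOAD_MAX_SIZE` is assumed to be in bytes (524288000 = 500 × 1024 × 1024), `URL_DOWNLOAD_TIMEOUT` is assumed to be in milliseconds (120000 ms matches the 2-minute limit), and `VIDEO_MAX_SIZE_MB` is assumed to count units of 1024 × 1024 bytes.

```python
import os

# Defaults copied from the environment variable listing above.
ENV_DEFAULTS = {
    "VIDEO_MAX_SIZE_MB": 500,
    "VIDEO_MAX_DURATION_SECONDS": 10800,
    "AUDIO_EXTRACTION_TIMEOUT_MS": 60000,
    "URL_DOWNLOAD_MAX_SIZE": 524288000,
    "URL_DOWNLOAD_TIMEOUT": 120000,
}


def load_limits(environ=None):
    """Read the limits from the environment and convert them to bytes and seconds."""
    environ = os.environ if environ is None else environ
    raw = {name: int(environ.get(name, default)) for name, default in ENV_DEFAULTS.items()}
    return {
        "max_size_bytes": raw["VIDEO_MAX_SIZE_MB"] * 1024 * 1024,
        "max_duration_seconds": raw["VIDEO_MAX_DURATION_SECONDS"],
        "extraction_timeout_seconds": raw["AUDIO_EXTRACTION_TIMEOUT_MS"] / 1000,
        "download_max_bytes": raw["URL_DOWNLOAD_MAX_SIZE"],
        "download_timeout_seconds": raw["URL_DOWNLOAD_TIMEOUT"] / 1000,
    }
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.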

Processing YouTube Videos

Via Pipeline Execution

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"]
  }'

Batch YouTube Processing

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.youtube.com/watch?v=video1",
      "https://www.youtube.com/watch?v=video2",
      "https://www.youtube.com/watch?v=video3"
    ]
  }'
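
The same request can be issued from a script. The sketch below mirrors the curl call above using only the Python standard library. The function names are hypothetical, and the assumption that the endpoint returns a JSON body is not confirmed by this page.

```python
import json
import urllib.request


def build_execute_request(base_url, kb_id, pipeline_id, token, urls):
    """Build the POST request for the pipeline execute endpoint shown above."""
    endpoint = f"{base_url}/api/v2/knowledgebases/{kb_id}/pipelines/{pipeline_id}/execute"
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def execute_pipeline(request, timeout_seconds=30):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```

A batch run only changes the `urls` argument: pass a list with one entry per video.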

Processing Audio Files

Upload Audio

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "files=@podcast-episode.mp3"

Transcription Output

A transcription result contains the full text plus metadata about the source:

{
  "text": "Full transcription text...",
  "metadata": {
    "duration": 1800,
    "language": "en",
    "source": "youtube",
    "videoTitle": "Product Demo 2024",
    "timestamps": [
      { "start": 0, "end": 30, "text": "Introduction..." },
      { "start": 30, "end": 60, "text": "Feature overview..." }
    ]
  }
}
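
As an illustration of how the `timestamps` array can be used, the sketch below searches the segments for a keyword and reports where each match starts. It assumes the output shape shown above; the helper names `format_timestamp` and `find_segments` are hypothetical.

```python
def format_timestamp(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def find_segments(transcription, keyword):
    """Return (start time, text) for every segment whose text mentions the keyword."""
    needle = keyword.lower()
    segments = transcription["metadata"].get("timestamps", [])
    return [
        (format_timestamp(segment["start"]), segment["text"])
        for segment in segments
        if needle in segment["text"].lower()
    ]


# Sample data copied from the output example above.
sample = {
    "text": "Full transcription text...",
    "metadata": {
        "duration": 1800,
        "timestamps": [
            {"start": 0, "end": 30, "text": "Introduction..."},
            {"start": 30, "end": 60, "text": "Feature overview..."},
        ],
    },
}
```

With this sample, `find_segments(sample, "feature")` matches the second segment, which starts 30 seconds in.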

Requirements

Video processing requires FFmpeg. It is included in the IngestIQ Docker image.

System Requirements

  • FFmpeg: For audio extraction
  • Memory: ~500MB per hour of video
  • Storage: Temporary storage for downloaded files

Docker Setup

FFmpeg is pre-installed in the IngestIQ Docker image:

RUN apt-get update && apt-get install -y ffmpeg

Limits

| Limit | Default | Environment Variable |
| --- | --- | --- |
| Max file size | 500 MB | VIDEO_MAX_SIZE_MB |
| Max duration | 3 hours | VIDEO_MAX_DURATION_SECONDS |
| Download timeout | 2 minutes | URL_DOWNLOAD_TIMEOUT |

Best Practices

Longer videos take more time and resources. Consider:

  • Splitting long videos into segments
  • Processing key sections only
  • Using timestamps to locate relevant content
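
For the first suggestion, segment boundaries can be planned before any cutting is done. The sketch below computes fixed-length segments that each fit under the duration limit. `plan_segments` is a hypothetical helper, and fixed boundaries are an assumption that may cut a sentence in half, so boundaries at natural pauses may work better in practice.

```python
def plan_segments(total_seconds, max_segment_seconds=10800):
    """Return (start, end) pairs in seconds that cover the recording without exceeding the limit."""
    if total_seconds < 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be non-negative and the limit must be positive")
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

A 5-hour recording (18000 seconds) with the default 3-hour limit yields two segments: 0 to 10800 and 10800 to 18000.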

Specify the language if known for better accuracy:

{ "language": "en" }

Timestamps help users locate information in the original video:

{ "generateTimestamps": true }

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| DOWNLOAD_FAILED | Cannot download video | Check URL accessibility |
| DURATION_EXCEEDED | Video too long | Increase limit or split video |
| AUDIO_EXTRACTION_FAILED | FFmpeg error | Check FFmpeg installation |
| TRANSCRIPTION_FAILED | Whisper API error | Check OpenAI API key and quota |
| UNSUPPORTED_FORMAT | Unknown format | Use supported format |
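
Client code can turn these codes into actionable messages with a simple lookup. This sketch assumes the caller has already obtained the error code as a string; how the API reports it (response field, status code) is not specified on this page. The `describe_error` helper and its fallback message are hypothetical.

```python
# Error codes and suggested fixes copied from the error list above.
SOLUTIONS = {
    "DOWNLOAD_FAILED": "Check URL accessibility",
    "DURATION_EXCEEDED": "Increase limit or split video",
    "AUDIO_EXTRACTION_FAILED": "Check FFmpeg installation",
    "TRANSCRIPTION_FAILED": "Check OpenAI API key and quota",
    "UNSUPPORTED_FORMAT": "Use supported format",
}


def describe_error(code):
    """Return the documented fix for a code, or a generic fallback for unknown codes."""
    return SOLUTIONS.get(code, f"Unrecognized error code: {code}")
```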

Cost Considerations

OpenAI bills Whisper transcription by audio duration. Pricing at the time of writing: ~$0.006/minute.

A 1-hour video costs approximately $0.36 to transcribe (60 minutes × $0.006).
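
The same arithmetic can be wrapped in a small estimator for budgeting. This is a sketch that uses the approximate rate quoted above; `estimate_cost` is a hypothetical helper, and actual billing may round or change over time.

```python
from decimal import Decimal

# Approximate rate quoted above, in US dollars per minute of audio.
PRICE_PER_MINUTE = Decimal("0.006")


def estimate_cost(duration_seconds):
    """Estimate the transcription cost in US dollars, rounded to the nearest cent."""
    minutes = Decimal(duration_seconds) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.01"))
```

Using `Decimal` avoids binary floating-point rounding surprises when many small amounts are summed across a batch.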
