IngestIQ

Web Scraping

Extract content from web pages using Crawl4AI

Overview#

The Web Scraping connector uses Crawl4AI to extract content from websites and ingest it into your Knowledge Base.

Features#

  • Smart Crawling: Recursively crawl websites with configurable depth
  • Content Extraction: Extract clean text from HTML pages
  • JavaScript Support: Handle dynamic JavaScript-rendered content
  • Sitemap Support: Efficiently crawl using XML sitemaps

Configuration#

Web Scraping Config#

{
  "url": "https://docs.example.com",
  "depth": 2,
  "maxPages": 100,
  "includePatterns": ["/docs/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"],
  "waitForJs": true,
  "timeout": 30000
}

Configuration Options#

Option           Type     Default   Description
url              String   Required  Start URL for crawling
depth            Integer  2         How many levels deep to crawl
maxPages         Integer  100       Maximum pages to crawl
includePatterns  Array    []        URL patterns to include
excludePatterns  Array    []        URL patterns to exclude
waitForJs        Boolean  false     Wait for JavaScript to render
timeout          Integer  30000     Request timeout in ms
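Applying the defaults above to a partial config is a simple merge. A minimal sketch, assuming the defaults in the table; the DEFAULTS dict and resolve_config helper are illustrative, not part of the connector API:

```python
# Illustrative defaults mirroring the table above; not the actual connector code.
DEFAULTS = {
    "depth": 2,
    "maxPages": 100,
    "includePatterns": [],
    "excludePatterns": [],
    "waitForJs": False,
    "timeout": 30000,
}

def resolve_config(user_config: dict) -> dict:
    """Merge a user-supplied config over the documented defaults."""
    if "url" not in user_config:
        raise ValueError("url is required")
    return {**DEFAULTS, **user_config}

resolved = resolve_config({"url": "https://docs.example.com", "depth": 3})
# resolved["depth"] == 3, all other options fall back to defaults
```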

Creating Web Scrape Config#

curl -X POST http://localhost:3000/api/v2/connector-configs \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation Scraper",
    "connectorTypeId": 2,
    "config": {
      "url": "https://docs.myproduct.com",
      "depth": 3,
      "maxPages": 200,
      "includePatterns": ["/docs/*", "/guides/*"],
      "excludePatterns": ["/api/*"]
    }
  }'

Executing a Web Scrape#

curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

The pipeline will:

  1. Start from the configured URL
  2. Crawl linked pages up to the specified depth
  3. Extract text content from each page
  4. Process and embed the content

Content Extraction#

What Gets Extracted#

  • Main content text
  • Headings and structure
  • Lists and tables
  • Code blocks
  • Image alt text

What Gets Filtered#

  • Navigation menus
  • Footers and headers
  • Advertisements
  • Cookie banners
  • Duplicate content

URL Patterns#

Include Patterns#

Only crawl URLs matching these patterns:

{
  "includePatterns": [
    "/docs/*",      // All docs pages
    "/api/*",       // API reference
    "/guides/v2/*"  // Only v2 guides
  ]
}

Exclude Patterns#

Skip URLs matching these patterns:

{
  "excludePatterns": [
    "/blog/*",       // Skip blog posts
    "/changelog/*",  // Skip changelog
    "*.pdf",         // Skip PDF files (use file upload instead)
    "*?*"            // Skip URLs with query params
  ]
}
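In these patterns, * matches any run of characters while every other character (including ?) is literal. A minimal sketch of such a matcher, assuming those semantics; the connector's exact matching rules may differ:

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern where `*` is a wildcard and all other
    characters (including `?`) are literal."""
    parts = (re.escape(p) for p in pattern.split("*"))
    return re.compile(".*".join(parts) + "$")

def should_crawl(url: str, include: list, exclude: list) -> bool:
    """Exclude patterns win; with no include patterns, everything passes."""
    if any(glob_to_regex(p).search(url) for p in exclude):
        return False
    return not include or any(glob_to_regex(p).search(url) for p in include)

should_crawl("https://x.com/docs/intro", ["/docs/*"], ["/blog/*"])  # True
should_crawl("https://x.com/page?sort=asc", [], ["*?*"])            # False: query params
should_crawl("https://x.com/file.pdf", [], ["*.pdf"])               # False: PDF
```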

Scheduling Web Scrapes#

Schedule regular crawls to keep content up to date:

{
  "scheduleConfig": {
    "enable_automation": true,
    "interval_type": "Weekly",
    "interval_time": "02:00",
    "day_of_week": "Sunday",
    "timezone": "UTC"
  }
}

Schedule web scrapes during off-peak hours to minimize impact on the target website.
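For reference, the next run of the Weekly/Sunday/02:00 UTC schedule above can be computed as in this sketch; the next_weekly_run helper is illustrative, not part of the scheduler:

```python
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def next_weekly_run(now: datetime, day_of_week: str, interval_time: str) -> datetime:
    """Next occurrence of `day_of_week` at `interval_time` (HH:MM), in UTC."""
    hour, minute = map(int, interval_time.split(":"))
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    days_ahead = (WEEKDAYS.index(day_of_week) - now.weekday()) % 7
    target += timedelta(days=days_ahead)
    if target <= now:  # this week's slot already passed -> next week
        target += timedelta(days=7)
    return target

now = datetime(2024, 5, 3, 12, 0, tzinfo=timezone.utc)  # a Friday
run = next_weekly_run(now, "Sunday", "02:00")
# run == 2024-05-05 02:00 UTC
```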

Best Practices#

Always check the website's robots.txt and terms of service before scraping. Configure excludePatterns accordingly.

Start with conservative settings:

  • depth: 2-3 for most sites
  • maxPages: 100-500 initially
  • Increase after verifying results

Use includePatterns to focus on relevant content areas rather than crawling entire sites.

Set waitForJs: true for sites using React, Vue, or other JavaScript frameworks.

Error Handling#

Error          Cause               Solution
CRAWL_TIMEOUT  Page took too long  Increase timeout or skip slow pages
ACCESS_DENIED  403/401 response    Check if the site requires authentication
RATE_LIMITED   Too many requests   Add delays between requests
CONTENT_EMPTY  No text extracted   Check if the site requires JavaScript (enable waitForJs)
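RATE_LIMITED errors are typically handled by retrying with exponentially growing delays. A minimal backoff sketch; the fetch callback and the 429 status code are illustrative assumptions, not connector internals:

```python
import time

def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0):
    """Retry `fetch(url)` with exponential backoff on rate limiting.

    `fetch` should return an object with a `status` attribute; 429
    ("too many requests") triggers a retry after a growing delay.
    """
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response.status != 429:
            return response
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"RATE_LIMITED: gave up after {max_retries} retries")

# Demo with a fake fetch that is rate-limited twice, then succeeds:
class FakeResponse:
    def __init__(self, status):
        self.status = status

attempts = []
def flaky_fetch(url):
    attempts.append(url)
    return FakeResponse(429 if len(attempts) < 3 else 200)

result = fetch_with_backoff(flaky_fetch, "https://example.com", base_delay=0)
# result.status == 200 after two rate-limited attempts
```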