# Web Scraping

Extract content from web pages using Crawl4AI.

## Overview

The Web Scraping connector uses Crawl4AI to extract content from websites and ingest it into your Knowledge Base.
## Features

- Recursively crawl websites with configurable depth
- Extract clean text from HTML pages
- Handle dynamic JavaScript-rendered content
- Crawl efficiently using XML sitemaps
## Configuration

### Web Scraping Config

```json
{
  "url": "https://docs.example.com",
  "depth": 2,
  "maxPages": 100,
  "includePatterns": ["/docs/*"],
  "excludePatterns": ["/blog/*", "/changelog/*"],
  "waitForJs": true,
  "timeout": 30000
}
```
### Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| `url` | String | Required | Start URL for crawling |
| `depth` | Integer | `2` | How many levels of links to follow from the start URL |
| `maxPages` | Integer | `100` | Maximum number of pages to crawl |
| `includePatterns` | Array | `[]` | URL patterns to include |
| `excludePatterns` | Array | `[]` | URL patterns to exclude |
| `waitForJs` | Boolean | `false` | Wait for JavaScript to render before extracting content |
| `timeout` | Integer | `30000` | Request timeout in milliseconds |
### Creating a Web Scrape Config

```bash
curl -X POST http://localhost:3000/api/v2/connector-configs \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation Scraper",
    "connectorTypeId": 2,
    "config": {
      "url": "https://docs.myproduct.com",
      "depth": 3,
      "maxPages": 200,
      "includePatterns": ["/docs/*", "/guides/*"],
      "excludePatterns": ["/api/*"]
    }
  }'
```
## Executing a Web Scrape

```bash
curl -X POST http://localhost:3000/api/v2/knowledgebases/{kbId}/pipelines/{pipelineId}/execute \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'
```
The pipeline will:
- Start from the configured URL
- Crawl linked pages up to the specified depth
- Extract text content from each page
- Process and embed the content
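The crawl step above is essentially a breadth-first walk bounded by `depth` and `maxPages`. Here is a minimal Python sketch of that traversal, with a toy link graph standing in for the real fetch-and-parse step; the connector's actual visit order and deduplication may differ:

```python
from collections import deque

def crawl(start_url, get_links, depth=2, max_pages=100):
    """Breadth-first crawl: visit pages level by level until the depth
    or page limit is reached. get_links(url) returns the URLs linked
    from a page (in the real connector, an HTTP fetch plus HTML parse)."""
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level < depth:                      # only follow links above the depth limit
            for link in get_links(url):
                if link not in seen:           # never enqueue the same URL twice
                    seen.add(link)
                    queue.append((link, level + 1))
    return visited

# Toy link graph standing in for a real site
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/a", "/docs/b"],
    "/blog": [],
    "/docs/a": ["/docs/a/deep"],
    "/docs/b": [],
    "/docs/a/deep": [],
}
print(crawl("/", lambda u: site.get(u, []), depth=2, max_pages=100))
# → ['/', '/docs', '/blog', '/docs/a', '/docs/b']
```

Note that `/docs/a/deep` is never visited: it sits three links from the start URL, one past `depth: 2`.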
## Content Extraction

### What Gets Extracted
- Main content text
- Headings and structure
- Lists and tables
- Code blocks
- Image alt text
### What Gets Filtered
- Navigation menus
- Footers and headers
- Advertisements
- Cookie banners
- Duplicate content
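Boilerplate filtering of this kind can be pictured as skipping any text that lives inside structural containers like `<nav>` or `<footer>`. The following standard-library sketch illustrates the idea only; the tag list and logic are assumptions for illustration, not Crawl4AI's actual heuristics:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect page text while skipping boilerplate containers
    (nav, footer, header, aside) and non-content tags."""
    SKIP = {"nav", "footer", "header", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

html = ("<nav>Home | About</nav>"
        "<main><h1>Guide</h1><p>Setup steps.</p></main>"
        "<footer>Copyright 2024</footer>")
parser = MainTextExtractor()
parser.feed(html)
print(parser.chunks)  # → ['Guide', 'Setup steps.']
```

The navigation and footer text is dropped; only the main content survives.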
## URL Patterns

### Include Patterns

Only crawl URLs matching these patterns:

```json
{
  "includePatterns": [
    "/docs/*",       // All docs pages
    "/api/*",        // API reference
    "/guides/v2/*"   // Only v2 guides
  ]
}
```
### Exclude Patterns

Skip URLs matching these patterns:

```json
{
  "excludePatterns": [
    "/blog/*",       // Skip blog posts
    "/changelog/*",  // Skip the changelog
    "*.pdf",         // Skip PDF files (use file upload instead)
    "*?*"            // Skip URLs with query params
  ]
}
```
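The examples above suggest glob-style matching, with `*` as a wildcard. Assuming those semantics, the combined include/exclude filter can be sketched with Python's `fnmatch`; the connector's exact matching rules may differ:

```python
from fnmatch import fnmatch

def allowed(path, include=(), exclude=()):
    """Return True if a URL path passes the filters: exclude patterns
    win, and a non-empty include list must match at least once."""
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    if include and not any(fnmatch(path, pat) for pat in include):
        return False
    return True

print(allowed("/docs/intro", include=["/docs/*"], exclude=["/blog/*"]))   # → True
print(allowed("/blog/post-1", include=["/docs/*"], exclude=["/blog/*"]))  # → False
print(allowed("/guide.pdf", exclude=["*.pdf"]))                           # → False
```

With an empty include list every URL is a candidate, matching the `[]` defaults in the configuration table.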
## Scheduling Web Scrapes

Schedule regular crawls to keep content up to date:

```json
{
  "scheduleConfig": {
    "enable_automation": true,
    "interval_type": "Weekly",
    "interval_time": "02:00",
    "day_of_week": "Sunday",
    "timezone": "UTC"
  }
}
```
Schedule web scrapes during off-peak hours to minimize impact on the target website.
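For a Weekly schedule like the one above, the next run time works out to the next Sunday at 02:00 UTC. A sketch of plausible scheduler semantics (not the scheduler's actual implementation) in Python:

```python
from datetime import datetime, timedelta, timezone

def next_weekly_run(now, day_of_week="Sunday", interval_time="02:00"):
    """Next occurrence of the given weekday at the given time,
    strictly after `now` (assumed to be timezone-aware UTC)."""
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    hour, minute = map(int, interval_time.split(":"))
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    target += timedelta(days=(days.index(day_of_week) - now.weekday()) % 7)
    if target <= now:                 # already passed this week: roll to next week
        target += timedelta(days=7)
    return target

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # a Wednesday
print(next_weekly_run(now))  # → 2024-01-14 02:00:00+00:00
```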
## Best Practices
Always check the website's `robots.txt` and terms of service before scraping. Configure `excludePatterns` accordingly.
Start with conservative settings:

- `depth`: 2-3 for most sites
- `maxPages`: 100-500 initially
- Increase these limits only after verifying results
Use `includePatterns` to focus the crawl on relevant content areas rather than crawling entire sites.

Set `"waitForJs": true` for sites built with React, Vue, or other JavaScript frameworks that render content client-side.
## Error Handling

| Error | Cause | Solution |
|---|---|---|
| `CRAWL_TIMEOUT` | Page took too long to load | Increase `timeout` or skip slow pages |
| `ACCESS_DENIED` | 403/401 response | Check whether the site requires authentication |
| `RATE_LIMITED` | Too many requests | Add delays between requests |
| `CONTENT_EMPTY` | No text extracted | Check whether the site requires JavaScript (`waitForJs`) |
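For `RATE_LIMITED`, "add delays" usually means retrying with exponential backoff. A sketch of that pattern, where the `fetch` callable and the HTTP 429 check are illustrative assumptions rather than a hook the connector exposes:

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0):
    """Call fetch(url) -> (status, body); on a 429 response, wait
    base_delay * 2**attempt (plus jitter) and retry, up to `retries` tries."""
    for attempt in range(retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body  # give up, return the last response

# Simulated server: rate-limits twice, then succeeds
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = fetch_with_backoff(lambda u: next(responses), "/docs", base_delay=0.01)
print(status, body)  # → 200 ok
```

The jitter term spreads out retries so repeated requests do not arrive in lockstep.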