Web Scraping
OpenSNS uses a multi-layer scraping approach to extract product information from URLs.
Scraping Pipeline
The scraper tries multiple methods in order:
- Playwright (default) - Full browser rendering
- Firecrawl (if configured) - Cloud-based scraping
- Basic HTTP - Simple HTML fetching
- Fallback Data - Minimal placeholder data
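The ordering above amounts to a fallback chain. A minimal sketch, assuming each layer is a function that takes a URL and returns a dict (or `None` on failure); `scrape_with_fallbacks` and the `source` key are hypothetical names for illustration, not OpenSNS's actual API:

```python
from typing import Callable, Optional

# A scraper takes a URL and returns a dict of product data, or None on failure.
Scraper = Callable[[str], Optional[dict]]


def scrape_with_fallbacks(url: str, scrapers: list[tuple[str, Scraper]]) -> dict:
    """Try each scraper in order and return the first non-empty result."""
    for name, scraper in scrapers:
        try:
            data = scraper(url)
        except Exception:
            # A failing layer should not abort the pipeline; try the next one.
            continue
        if data:
            return {**data, "source": name}
    # Last resort: minimal placeholder data so later steps still have input.
    return {"title": url, "source": "fallback"}
```

Passing the layers as an ordered list keeps the priority explicit and makes it easy to leave Firecrawl out when it is not configured.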
Playwright Scraper
The primary scraper uses Playwright for full JavaScript rendering. It handles:
- JavaScript-rendered content
- Dynamic loading
- Single Page Applications
- Anti-bot protections (via a real browser)
Extracted Data
| Field | Description |
|---|---|
| title | Product name |
| description | Product description |
| features | Bullet points / feature list |
| price | Product price |
| images | Product image URLs |
| content | Main page text |
| metadata | Open Graph, JSON-LD data |
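The metadata field can be illustrated with the standard library alone. The following is a rough sketch, not the scraper's real implementation: `MetadataParser` and `extract_metadata` are hypothetical names, and it assumes Open Graph data lives in `<meta property="og:...">` tags and JSON-LD in `<script type="application/ld+json">` blocks.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect Open Graph <meta> tags and JSON-LD <script> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.open_graph: dict[str, str] = {}
        self.json_ld: list = []
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            prop = attributes.get("property") or ""
            if prop.startswith("og:") and attributes.get("content") is not None:
                self.open_graph[prop] = attributes["content"]
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is skipped rather than failing the scrape.


def extract_metadata(html: str) -> dict:
    """Return Open Graph properties and parsed JSON-LD objects from HTML."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"open_graph": parser.open_graph, "json_ld": parser.json_ld}
```

JSON-LD is often the most reliable source for name and price, because it is written for machines rather than for page layout.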
Smart Selectors
The scraper uses multiple selector strategies. Title selectors, tried in order:
- "h1.product-title"
- "h1.product-name"
- "h1[itemprop='name']"
- ".product-title h1"
- "#product-title"
- "h1"
Firecrawl Integration
For sites that block automation, Firecrawl provides cloud-based scraping:
- Get an API key from firecrawl.dev
- Add it to Settings or the environment
- The scraper automatically uses it as a fallback
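The "Settings or environment" step could be resolved along these lines. This is only a sketch: the setting key `firecrawl_api_key`, the variable name `FIRECRAWL_API_KEY`, and the settings-before-environment precedence are all assumptions, so check the application's configuration for the real names.

```python
import os
from typing import Mapping, Optional


def resolve_firecrawl_key(
    settings: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the API key from settings if present, else from the environment."""
    # Both lookup names below are assumed, not taken from the OpenSNS source.
    candidates = (settings.get("firecrawl_api_key"), env.get("FIRECRAWL_API_KEY"))
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate.strip()
    # No key found: the Firecrawl layer is simply skipped.
    return None
```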
Handling Edge Cases
JavaScript-Heavy Sites
Playwright waits for content to load:
- DOM content loaded
- Optional selector wait
- 1 second delay for dynamic content
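The three waits above map onto Playwright's page methods roughly as follows. In this sketch `page` is assumed to be a Playwright sync-API `Page`; `load_page` and `settle_ms` are hypothetical names, and the real scraper may order or time these steps differently.

```python
from typing import Any, Optional


def load_page(page: Any, url: str, selector: Optional[str] = None, settle_ms: int = 1000) -> str:
    """Navigate, wait for the DOM, optionally wait for a selector, then pause."""
    # 1. Return from navigation as soon as DOMContentLoaded fires.
    page.goto(url, wait_until="domcontentloaded")
    # 2. Optionally block until a specific element appears.
    if selector:
        page.wait_for_selector(selector)
    # 3. Fixed delay so late-loading dynamic content can render.
    page.wait_for_timeout(settle_ms)
    # Return the rendered HTML for the extraction step.
    return page.content()
```

With Playwright installed, `page` would typically come from `browser.new_page()` inside a `sync_playwright()` block. A fixed delay is a blunt tool, so passing a product-specific `selector` is usually the more reliable signal.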
Anti-Bot Protection
The scraper uses realistic browser fingerprints:
- Real Chrome user agent
- Standard viewport size
- JavaScript enabled
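In Playwright these three settings are options on the browser context. A minimal sketch, where the user agent string and the 1280x720 viewport are illustrative values and `browser_context_options` is a hypothetical helper:

```python
def browser_context_options() -> dict:
    """Keyword arguments intended for Playwright's browser.new_context()."""
    return {
        # Example desktop Chrome user agent; the version number is illustrative.
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        # A common desktop viewport size.
        "viewport": {"width": 1280, "height": 720},
        # JavaScript is on by default; set explicitly to document the intent.
        "java_script_enabled": True,
    }
```

It would be applied as `browser.new_context(**browser_context_options())`.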
Rate Limiting
Be respectful of target sites:
- One request per campaign
- No rapid successive requests
- Respect robots.txt (manual check)
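Because the robots.txt check is manual, a small helper can make it quicker. This sketch uses Python's standard `urllib.robotparser` against the text of a robots.txt file you have already downloaded; `is_allowed` is a hypothetical name and is not part of the scraper.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a site's robots.txt."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with the rules `User-agent: *` and `Disallow: /checkout/`, a product URL under `/products/` is allowed while anything under `/checkout/` is not. To fetch the file directly, `RobotFileParser` also offers `set_url()` and `read()`.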
Troubleshooting
"Scraping failed" Error
- Check if the URL is accessible in a browser
- Try enabling Firecrawl
- Check for CAPTCHA requirements
- Verify URL is a product page (not homepage)
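The last check can be approximated before a URL is submitted. A rough heuristic, assuming a homepage is any URL with no path and no query string; `looks_like_homepage` is a hypothetical helper, and it cannot tell a product page from a category page:

```python
from urllib.parse import urlparse


def looks_like_homepage(url: str) -> bool:
    """True when the URL has no path beyond the site root and no query."""
    parsed = urlparse(url)
    return parsed.path.strip("/") == "" and not parsed.query
```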
Missing Product Data
Some sites don’t expose structured data:
- Description in images only
- Dynamic pricing
- Login-required content
The scraper extracts what’s available and uses AI to fill gaps.
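Deciding which gaps need filling can be as simple as listing the empty fields. A minimal sketch, assuming the scrape result is a dict keyed by the field names in the Extracted Data table; `EXPECTED_FIELDS` and `missing_fields` are hypothetical names, and how the AI step consumes the list is not shown:

```python
# Field names follow the Extracted Data table.
EXPECTED_FIELDS = ("title", "description", "features", "price", "images", "content", "metadata")


def missing_fields(product: dict) -> list[str]:
    """Return the expected fields that are absent or empty."""
    return [name for name in EXPECTED_FIELDS if not product.get(name)]
```

Empty strings and empty lists count as missing, so a page whose description exists only inside images would report `description` as a gap.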
Best Practices
- Use direct product URLs - Not category or search pages
- E-commerce platforms work best - Shopify, WooCommerce, Amazon
- Public pages only - Login-required pages won’t work
- Check robots.txt - Respect site policies