Web Scraping

OpenSNS uses a multi-layer scraping approach to extract product information from URLs.

The scraper tries the following methods in order, moving to the next one when a method fails:

  1. Playwright (default) - Full browser rendering
  2. Firecrawl (if configured) - Cloud-based scraping
  3. Basic HTTP - Simple HTML fetching
  4. Fallback Data - Minimal placeholder data
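
The ordering above can be sketched as a simple fallback chain. This is a minimal illustration, not the OpenSNS implementation: the `Scraper` type, the `scrape_with_fallbacks` function, and the shape of the placeholder data are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional

# Hypothetical scraper signature: takes a URL, returns extracted data or None.
Scraper = Callable[[str], Optional[dict]]


def scrape_with_fallbacks(url: str, scrapers: list[tuple[str, Scraper]]) -> dict:
    """Try each scraper in order and return the first non-empty result."""
    for name, scraper in scrapers:
        try:
            data = scraper(url)
        except Exception:
            # Any failure (timeout, block, parse error) moves on to the next layer.
            continue
        if data:
            return {**data, "source": name}
    # Last layer: minimal placeholder data so later steps still have input.
    return {"title": url, "source": "fallback"}
```

Each layer would be passed as a `(name, function)` pair, for example `[("playwright", ...), ("firecrawl", ...), ("http", ...)]`, so the result records which layer produced the data.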

The primary scraper uses Playwright for full JavaScript rendering. It handles:

  • JavaScript-rendered content
  • Dynamic loading
  • Single Page Applications
  • Some anti-bot protections (because requests come from a real browser)
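
The scraper extracts the following fields:

| Field | Description |
| --- | --- |
| title | Product name |
| description | Product description |
| features | Bullet points / feature list |
| price | Product price |
| images | Product image URLs |
| content | Main page text |
| metadata | Open Graph, JSON-LD data |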
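
To illustrate what the metadata field might contain, here is a small sketch that pulls Open Graph tags and JSON-LD blocks out of raw HTML using only the Python standard library. The `MetadataParser` class, the `extract_metadata` function, and the returned dictionary layout are assumptions made for this example; the actual scraper may structure this differently.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects Open Graph <meta> tags and JSON-LD <script> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.open_graph: dict[str, str] = {}
        self.json_ld: list = []
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.open_graph[attr["property"]] = attr.get("content") or ""
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD rather than fail the scrape.


def extract_metadata(html: str) -> dict:
    """Return Open Graph properties and parsed JSON-LD objects from HTML."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"open_graph": parser.open_graph, "json_ld": parser.json_ld}
```

On many product pages the JSON-LD block carries a `Product` object with name, price, and images, which makes it a useful source when the visible markup is hard to parse.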

The scraper uses multiple selector strategies:

# Title selectors (tried in order)
"h1.product-title"
"h1.product-name"
"h1[itemprop='name']"
".product-title h1"
"#product-title"
"h1"
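
Trying selectors in order can be sketched as below. This is only an illustration: `first_match` and the `query` callable are hypothetical, with `query` standing in for whatever the scraper uses to look up an element's text for a CSS selector.

```python
from typing import Callable, Optional

# Title selectors from the list above, most specific first.
TITLE_SELECTORS = [
    "h1.product-title",
    "h1.product-name",
    "h1[itemprop='name']",
    ".product-title h1",
    "#product-title",
    "h1",
]


def first_match(
    selectors: list[str], query: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the stripped text of the first selector that yields non-blank text."""
    for selector in selectors:
        text = query(selector)
        if text and text.strip():
            return text.strip()
    return None
```

Putting the generic `h1` last means a specific product selector wins when present, while pages without product markup still get a title.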

For sites that block automation, Firecrawl provides cloud-based scraping:

  1. Get an API key from firecrawl.dev
  2. Add it in Settings or as an environment variable
  3. The scraper automatically uses it as a fallback

Playwright waits for content to load:

  • DOM content loaded
  • Optional selector wait
  • 1 second delay for dynamic content
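
The three wait steps above map onto Playwright's Python page methods roughly as follows. This is a sketch under assumptions: `load_page`, its `wait_for` parameter, and the 10 second selector timeout are hypothetical, and `page` is expected to be a Playwright sync-API `Page` object.

```python
from typing import Any, Optional


def load_page(page: Any, url: str, wait_for: Optional[str] = None) -> str:
    """Navigate, wait for content, and return the rendered HTML."""
    # 1. Return from goto() once the DOMContentLoaded event has fired.
    page.goto(url, wait_until="domcontentloaded")
    # 2. Optionally wait for a selector that marks the product content.
    if wait_for:
        page.wait_for_selector(wait_for, timeout=10_000)  # timeout is an assumption
    # 3. Fixed 1 second pause so late-loading content can settle.
    page.wait_for_timeout(1000)
    return page.content()
```

A fixed delay is simple but coarse; on slow sites, passing a selector for the element you need is usually the more reliable signal.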

The scraper uses realistic browser fingerprints:

  • Real Chrome user agent
  • Standard viewport size
  • JavaScript enabled
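
As an illustration, these settings could be expressed as keyword arguments for Playwright's `browser.new_context()`. The specific user agent string and the 1280x720 viewport below are placeholder values chosen for this sketch, not the values OpenSNS ships with, and `browser_context_options` is a hypothetical helper.

```python
def browser_context_options() -> dict:
    """Keyword arguments intended for Playwright's browser.new_context()."""
    return {
        # Example desktop Chrome user agent; the version is a placeholder.
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        # Assumed "standard" desktop viewport size.
        "viewport": {"width": 1280, "height": 720},
        "java_script_enabled": True,
    }
```

With a launched browser, this would be used as `context = browser.new_context(**browser_context_options())`.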

Be respectful of target sites:

  • One request per campaign
  • No rapid successive requests
  • Respect robots.txt (manual check)
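
Since the robots.txt check is manual, a short script can help with it. The sketch below uses Python's standard `urllib.robotparser`; the `is_allowed` helper is a hypothetical name and is not part of the scraper itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

To check a live site instead of a string, `RobotFileParser` can also fetch the file itself via `set_url()` followed by `read()`.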

If a scrape fails or returns incomplete data:

  1. Check that the URL is accessible in a browser
  2. Try enabling Firecrawl
  3. Check whether the site requires a CAPTCHA
  4. Verify that the URL is a product page (not the homepage)

Some sites don’t expose structured data:

  • Description in images only
  • Dynamic pricing
  • Login-required content

The scraper extracts what’s available and uses AI to fill gaps.

For best results:

  1. Use direct product URLs - Not category or search pages
  2. E-commerce platforms work best - Shopify, WooCommerce, Amazon
  3. Public pages only - Login-required pages won’t work
  4. Check robots.txt - Respect site policies
