Web Scraping vs APIs: The Best Way to Extract Website Metadata
Development · 8 min read

Should you build a scraper or use an API? Compare the pros, cons, costs, and maintenance burden of each approach for metadata extraction.

Katsau Team

December 24, 2025


You need to extract metadata from websites—titles, descriptions, images, Open Graph tags. Should you build your own scraper or use a metadata API? This guide breaks down both approaches with real numbers on cost, complexity, and maintenance.

The Problem: Extracting Website Metadata

Whether you're building link previews, an SEO tool, or a content aggregator, you need to fetch data from external websites. Sounds simple, right?

// Naive approach
const response = await fetch('https://example.com');
const html = await response.text();
// Parse HTML and extract meta tags...

But this simple approach fails in production:

  • CORS blocks browser requests to external domains
  • JavaScript-rendered pages return empty HTML
  • Rate limiting from target sites blocks your requests
  • Edge cases break your parser constantly

You have two options: build a robust scraper or use a metadata API.

Option 1: Build Your Own Scraper

Let's look at what it takes to build a production-grade metadata scraper.

Basic Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Your Application                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Scraper Service                          │
├─────────────────────────────────────────────────────────────┤
│  URL Queue → Rate Limiter → Fetcher → Parser → Cache        │
│                    │                                        │
│                    ▼                                        │
│            Headless Browser                                 │
│         (for JS-rendered pages)                             │
└─────────────────────────────────────────────────────────────┘
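
In practice, the URL queue at the front of that pipeline is a Redis-backed job queue. Here is a rough sketch of what that layer might look like, assuming BullMQ (the queue technology listed in the infrastructure table further down) and a robustScrape function like the one this section builds toward; the file layout, job options, and concurrency values are illustrative, not prescriptive:

// queue.ts: a rough sketch of the queue layer (names and options are illustrative)
import { Queue, Worker } from 'bullmq';
import { robustScrape } from './scraper'; // the scraper built in the rest of this section

const connection = { host: 'localhost', port: 6379 };

// Producers push URLs onto a Redis-backed queue
export const metadataQueue = new Queue('metadata', { connection });

export async function enqueueUrl(url: string): Promise<void> {
  await metadataQueue.add(
    'scrape',
    { url },
    {
      attempts: 3,
      backoff: { type: 'exponential', delay: 1000 }, // retry transient failures
    }
  );
}

// A worker drains the queue and runs the scraper with bounded concurrency
new Worker('metadata', async (job) => robustScrape(job.data.url), {
  connection,
  concurrency: 5,
});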

Implementation: Basic Scraper

Here's a minimal Node.js scraper:

// scraper.ts
import * as cheerio from 'cheerio';

interface Metadata {
  title: string;
  description: string;
  image: string | null;
  favicon: string | null;
}

export async function scrapeMetadata(url: string): Promise<Metadata> {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MetadataBot/1.0)',
    },
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  const html = await response.text();
  const $ = cheerio.load(html);

  return {
    title:
      $('meta[property="og:title"]').attr('content') ||
      $('meta[name="twitter:title"]').attr('content') ||
      $('title').text() ||
      '',
    description:
      $('meta[property="og:description"]').attr('content') ||
      $('meta[name="twitter:description"]').attr('content') ||
      $('meta[name="description"]').attr('content') ||
      '',
    image:
      $('meta[property="og:image"]').attr('content') ||
      $('meta[name="twitter:image"]').attr('content') ||
      null,
    favicon: resolveUrl($('link[rel="icon"]').attr('href'), url),
  };
}

function resolveUrl(path: string | undefined, base: string): string | null {
  if (!path) return null;
  try {
    return new URL(path, base).href;
  } catch {
    return null;
  }
}

The Problems Start

This basic scraper fails on many sites. Let's fix the issues one by one:

Problem 1: JavaScript-Rendered Pages

Many modern sites (React, Vue, Angular) render content client-side:

// scrapeMetadata('https://react-spa.com') returns empty data!

Solution: Add Puppeteer

import puppeteer from 'puppeteer';

async function scrapeWithBrowser(url: string): Promise<string> {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();
    await page.setUserAgent('Mozilla/5.0...');
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}

Cost: Puppeteer uses 50-200MB RAM per instance. Running 10 concurrent scrapes needs 2GB RAM minimum.
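
One common way to keep that memory in check is to launch the browser once and reuse it across scrapes instead of paying the startup cost for every URL. A minimal sketch, simplified from what a real pool (such as puppeteer-cluster) would do:

import puppeteer, { Browser } from 'puppeteer';

// Launch a single shared browser lazily and reuse it for every scrape
let browserPromise: Promise<Browser> | null = null;

function getBrowser(): Promise<Browser> {
  if (!browserPromise) {
    browserPromise = puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    });
  }
  return browserPromise;
}

async function scrapeWithSharedBrowser(url: string): Promise<string> {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
    return await page.content();
  } finally {
    await page.close(); // close the tab, keep the browser alive
  }
}

A single browser running several tabs typically uses far less memory than ten independent browser processes, though you still need to cap concurrency.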

Problem 2: Rate Limiting

Sites block rapid requests:

// After 10 requests to the same domain:
// HTTP 429 Too Many Requests

Solution: Add rate limiting per domain

import Bottleneck from 'bottleneck';

const limiters = new Map<string, Bottleneck>();

function getLimiter(domain: string): Bottleneck {
  if (!limiters.has(domain)) {
    limiters.set(domain, new Bottleneck({
      maxConcurrent: 1,
      minTime: 2000, // 1 request per 2 seconds per domain
    }));
  }
  return limiters.get(domain)!;
}

async function rateLimitedFetch(url: string): Promise<Response> {
  const domain = new URL(url).hostname;
  const limiter = getLimiter(domain);
  return limiter.schedule(() => fetch(url));
}

Problem 3: Caching

Fetching the same URL repeatedly wastes resources:

import { Redis } from 'ioredis';

const redis = new Redis();
const CACHE_TTL = 3600; // 1 hour

async function getCachedMetadata(url: string): Promise<Metadata | null> {
  const cached = await redis.get(`meta:${url}`);
  return cached ? JSON.parse(cached) : null;
}

async function cacheMetadata(url: string, data: Metadata): Promise<void> {
  await redis.set(`meta:${url}`, JSON.stringify(data), 'EX', CACHE_TTL);
}

Problem 4: Error Handling

Real-world URLs fail in many ways:

async function robustScrape(url: string): Promise<Metadata> {
  try {
    // Validate URL
    const parsed = new URL(url);
    if (!['http:', 'https:'].includes(parsed.protocol)) {
      throw new Error('Invalid protocol');
    }

    // Check cache first
    const cached = await getCachedMetadata(url);
    if (cached) return cached;

    // Try a plain fetch first
    let html: string;
    try {
      const response = await rateLimitedFetch(url);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      html = await response.text();
    } catch {
      // Fall back to the headless browser for problematic sites
      html = await scrapeWithBrowser(url);
    }

    // parseHtml wraps the cheerio extraction from scrapeMetadata above
    const metadata = parseHtml(html, url);
    await cacheMetadata(url, metadata);
    return metadata;
  } catch (error) {
    // Return minimal fallback
    return {
      title: new URL(url).hostname,
      description: '',
      image: null,
      favicon: null,
    };
  }
}

Full Scraper Infrastructure

A production scraper needs:

Component        Purpose                     Technology
Queue            Job management              Redis + BullMQ
Browser Pool     JS rendering                Puppeteer cluster
Rate Limiter     Respect site limits         Bottleneck
Cache            Avoid re-fetching           Redis
Proxy Rotation   Avoid IP bans               Proxy service
Monitoring       Track failures              Prometheus/Grafana
Retry Logic      Handle transient errors     Exponential backoff

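The retry logic in that table usually amounts to a small wrapper around the fetch layer. A minimal sketch of exponential backoff (the helper name and defaults are illustrative):

// Retry a flaky operation with exponential backoff
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Wait 500ms, 1s, 2s, ... before the next attempt
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}

// Usage: retry the rate-limited fetch on transient failures
// const response = await withRetry(() => rateLimitedFetch(url));
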
Cost Analysis: DIY Scraper

Item                       Monthly Cost
Server (4GB RAM, 2 CPU)    $40-80
Redis (caching)            $15-30
Proxy service              $50-200
Monitoring                 $20-50
Total infrastructure       $125-360/month
Engineering time (setup)   40-80 hours
Maintenance                5-10 hours/month

Option 2: Use a Metadata API

APIs handle all the complexity for you:

async function getMetadata(url: string): Promise<Metadata> {
  const response = await fetch(
    `https://api.katsau.com/v1/extract?url=${encodeURIComponent(url)}`,
    {
      headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
    }
  );

  const { data } = await response.json();
  return data;
}

That's it. One API call, all the metadata you need.

What the API Handles

Challenge        DIY Solution          API Solution
CORS             Backend service       Handled
JS rendering     Puppeteer cluster     Handled
Rate limiting    Per-domain limiters   Handled
Caching          Redis setup           Handled
Proxy rotation   Proxy service         Handled
Edge cases       Constant fixes        Handled
Uptime           Your responsibility   99.9% SLA

Cost Analysis: API

Usage              Monthly Cost
1,000 requests     Free
10,000 requests    ~$20
50,000 requests    ~$50
100,000 requests   ~$80

No infrastructure. No maintenance. Predictable costs.

When to Build vs Buy

Build Your Own Scraper When:

  1. Extreme customization - You need very specific data extraction
  2. Massive scale - Millions of URLs per day
  3. Sensitive data - Cannot send URLs to third parties
  4. Learning - Educational project

Use an API When:

  1. Time to market - Need it working today
  2. Moderate scale - Thousands to hundreds of thousands of URLs
  3. Reliability - Can't afford downtime
  4. Focus - Want to build product, not infrastructure

Real-World Comparison

Let's compare both approaches for a real use case: building link previews for a chat app.

Scenario: 10,000 link previews/month

DIY Approach:

Setup time: 40 hours × $100/hour = $4,000
Monthly infra: $150
Monthly maintenance: 5 hours × $100 = $500
First year cost: $4,000 + (12 × $650) = $11,800

API Approach:

Setup time: 2 hours × $100/hour = $200
Monthly cost: $20
First year cost: $200 + (12 × $20) = $440

Savings with API: $11,360 in year one

Scenario: 500,000 link previews/month

DIY Approach:

Setup time: 80 hours × $100/hour = $8,000
Monthly infra: $500 (larger servers, more proxies)
Monthly maintenance: 10 hours × $100 = $1,000
First year cost: $8,000 + (12 × $1,500) = $26,000

API Approach:

Setup time: 2 hours × $100/hour = $200
Monthly cost: $200 (enterprise plan)
First year cost: $200 + (12 × $200) = $2,600

Savings with API: $23,400 in year one

Even at high scale, APIs often win on total cost.

Hybrid Approach

Some teams use both:

async function getMetadata(url: string): Promise<Metadata> {
  // Check your cache first
  const cached = await cache.get(url);
  if (cached) return cached;

  // Use API for fresh data
  const data = await apiClient.extract(url);

  // Cache with your own TTL
  await cache.set(url, data, { ttl: 86400 }); // 24 hours

  return data;
}

This gives you:

  • Control over caching strategy
  • Reliability of professional API
  • Cost optimization through local caching
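
The cache and apiClient objects above are deliberately left abstract. One plausible way to fill them in, reusing the ioredis cache from earlier and wrapping the API call shown above (the endpoint and response shape are the ones used in this post; the helper shapes themselves are illustrative):

import { Redis } from 'ioredis';
import type { Metadata } from './scraper'; // the interface defined earlier

const redis = new Redis();

// Thin cache wrapper over Redis with a caller-supplied TTL
const cache = {
  async get(url: string): Promise<Metadata | null> {
    const cached = await redis.get(`meta:${url}`);
    return cached ? JSON.parse(cached) : null;
  },
  async set(url: string, data: Metadata, opts: { ttl: number }): Promise<void> {
    await redis.set(`meta:${url}`, JSON.stringify(data), 'EX', opts.ttl);
  },
};

// Thin client around the metadata API
const apiClient = {
  async extract(url: string): Promise<Metadata> {
    const response = await fetch(
      `https://api.katsau.com/v1/extract?url=${encodeURIComponent(url)}`,
      { headers: { Authorization: 'Bearer YOUR_API_KEY' } }
    );
    if (!response.ok) {
      throw new Error(`API error: HTTP ${response.status}`);
    }
    const { data } = await response.json();
    return data;
  },
};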

Making the Decision

Ask yourself these questions:

Question                                    Build   Buy
Do I need this working in < 1 week?                  ✓
Is metadata extraction my core product?       ✓
Do I have DevOps resources?                   ✓
Is my budget < $500/month?                            ✓
Do I need 99.9% uptime?                               ✓
Am I scraping > 1M URLs/month?                ✓

Conclusion

For most teams, a metadata API is the right choice. The math is clear:

  • Lower total cost (infrastructure + engineering time)
  • Faster time to market (days vs weeks)
  • Better reliability (professional SLA vs DIY monitoring)
  • Focus on your product (not scraping infrastructure)

Build your own scraper only if metadata extraction is your core business or you have very specific requirements that APIs can't meet.


Ready to stop maintaining scrapers? Try Katsau's metadata API free — 1,000 requests/month, no credit card required.
