Web Scraper

Fetch web page content and convert it to clean markdown.

Usage

Run the fetch script to get web content:

python3 scripts/fetch_url.py <url> [options]

Fetch a single URL:

python3 scripts/fetch_url.py "https://example.com/article"

Fetch with a custom timeout:

python3 scripts/fetch_url.py "https://example.com/article" --timeout 60

Fetch multiple URLs in parallel:

for url in "https://url1.com" "https://url2.com"; do
  python3 scripts/fetch_url.py "$url" &
done
wait
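
The same fan-out can be driven from Python when each URL's output needs to stay separate, since backgrounded shell jobs that share a terminal can interleave what they print. The sketch below is an illustration only and not part of this skill. It assumes fetch_url.py prints its result to standard output and exits non-zero on failure, neither of which is confirmed here. The names fetch_one and fetch_all are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumption: the script lives at scripts/fetch_url.py and accepts --timeout,
# as in the usage lines above. It is run with the current interpreter.
FETCH_CMD = [sys.executable, "scripts/fetch_url.py"]


def fetch_one(url, cmd=FETCH_CMD, timeout=60):
    """Run the fetch command for one URL and return (url, exit_code, stdout)."""
    result = subprocess.run(
        [*cmd, url, "--timeout", str(timeout)],
        capture_output=True,
        text=True,
    )
    return url, result.returncode, result.stdout


def fetch_all(urls, cmd=FETCH_CMD, max_workers=4):
    """Fetch several URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_one(u, cmd), urls))
```

Threads are enough here because each worker only waits on a child process. A call such as fetch_all(["https://url1.com", "https://url2.com"]) returns one tuple per URL, so a failed fetch can be spotted by its exit code without scanning mixed output.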

Single URL: Run fetch_url.py with the URL.
Multiple URLs: Run one fetch command per URL in parallel as background processes, then wait for all of them to finish.
Handle errors: If a URL fails, check:
- Network connectivity
- URL validity
- Whether the website blocks automated requests (try a different User-Agent or use browser automation)
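
The three checks above can be told apart programmatically. The following standalone sketch uses only Python's standard urllib and does not describe how fetch_url.py works internally. The User-Agent string and the function names build_request, diagnose, and fetch are hypothetical and chosen for illustration.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent, for illustration only.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def diagnose(exc):
    """Map an exception raised while fetching to one of the checks above."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 403:
            return "blocked: try a different User-Agent or browser automation"
        return f"http error {exc.code}"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout: increase the timeout"
    if isinstance(exc, urllib.error.URLError):
        return "network: check connectivity and the host name"
    if isinstance(exc, ValueError):
        return "invalid url: check the URL"
    return f"unexpected: {exc!r}"


def fetch(url, timeout=30):
    """Return (html, None) on success or (None, diagnosis) on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace"), None
    except (ValueError, OSError) as exc:
        return None, diagnose(exc)
```

Calling fetch("https://example.com/article") would return the page text, while a malformed URL or a 403 response returns a short diagnosis instead of raising.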

The script converts HTML to clean markdown.
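
The conversion rules used by fetch_url.py are not listed here. As a rough illustration of what an HTML-to-markdown pass involves, the sketch below handles only headings, paragraphs, and links, and drops script and style content. It is built on the standard html.parser module. The class MarkdownSketch and the function html_to_markdown are hypothetical names, not the script's API.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy converter: headings, paragraphs, and links only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0      # depth inside script/style elements
        self._href = None   # target of the link currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self.parts.append(f"]({self._href or ''})")
            self._href = None

    def handle_data(self, data):
        if not self._skip:
            # Collapse runs of whitespace the way a browser would.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    """Convert a small HTML fragment to markdown text."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

A real converter also has to deal with lists, tables, images, code blocks, and navigation clutter, which this sketch ignores.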

403 Forbidden: The website blocks automated requests. Consider a different User-Agent or browser automation.

Timeout errors: Increase the timeout, for example with --timeout 60.

Empty content: The website may require JavaScript to render its content.
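
One way to tell a JavaScript-rendered page from a page that is simply short is to compare its visible text with the number of script tags in the raw HTML. The heuristic below is a sketch under that assumption and is not something fetch_url.py is known to do. The 200-character threshold and the name likely_needs_javascript are arbitrary, hypothetical choices.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count visible text characters and script tags in an HTML document."""

    HIDDEN = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN:
            self._hidden_depth = max(0, self._hidden_depth - 1)

    def handle_data(self, data):
        # Only text outside script/style/noscript counts as visible.
        if self._hidden_depth == 0:
            self.text_chars += len(data.strip())


def likely_needs_javascript(html, min_text_chars=200):
    """Heuristic: scripts are present but almost no visible text is."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

When this returns True for the raw HTML of a page whose markdown came back empty, the page content is probably injected by scripts after load, and a plain HTTP fetch will not see it.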