The single biggest reason web scrapers fail isn't bad code — it's getting blocked. Modern websites deploy sophisticated anti-bot systems (Cloudflare, DataDome, PerimeterX, Akamai) that detect and ban automated traffic within seconds. The good news: with the right techniques, you can scrape reliably at scale. This guide walks through the ten most effective methods, from quick wins to production-grade infrastructure.
1. Rotate Your IP Address with Proxies
If hundreds of requests hit a site from a single IP in a short window, that IP gets banned — fast. Rotating proxies distribute your requests across a large pool of IP addresses so no single address looks suspicious.
This is the foundation of reliable scraping. Datacenter proxies are cheap and fast but easier to detect; residential and mobile proxies route through real consumer devices and are far harder to block. See our full breakdown in Residential vs. Datacenter vs. Mobile Proxies.
Recommended: Bright Data offers a 72M+ residential IP pool with automatic rotation and a 7-day free trial, no credit card required. Budget-friendly alternatives include IPRoyal and DataImpulse.
2. Set a Realistic User-Agent
The default User-Agent of libraries like Python's requests (e.g. python-requests/2.31.0) is an instant giveaway. Always send a real browser User-Agent, and rotate between several.
import requests
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
)
}
response = requests.get("https://example.com", headers=headers)
3. Send a Complete Set of Headers
Real browsers send far more than just a User-Agent. Anti-bot systems check for the presence and consistency of headers like Accept, Accept-Language, Accept-Encoding, and Referer. A request missing these looks robotic.
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
"Connection": "keep-alive",
}
4. Throttle Your Request Rate
Humans don't load 50 pages per second. Aggressive request rates are one of the easiest patterns to detect. Add delays between requests — and randomize them so the timing doesn't look mechanical.
import time
import random
for url in urls:
scrape(url)
time.sleep(random.uniform(2, 6)) # random 2–6 second delay
5. Respect robots.txt and Rate Limits
Before scraping, check the site's robots.txt and any published rate limits. Honoring them keeps you ethical, reduces your footprint, and lowers your chances of being flagged. Scraping publicly available data is generally permissible, but always avoid logged-in/private areas and personal data.
6. Handle CAPTCHAs
Once you trigger a CAPTCHA, basic scrapers are stuck. You have three options: avoid triggering them in the first place (the techniques in this guide), use a CAPTCHA-solving service, or use a tool that solves them automatically. Managed solutions like the Bright Data Scraping Browser detect and solve CAPTCHAs as part of their unlocking flow, with no extra integration.
7. Use a Headless Browser for JavaScript-Heavy Sites
Many modern sites render content with JavaScript, so a plain HTTP request returns an empty shell. Tools like Playwright, Puppeteer, and Selenium drive a real browser engine that executes JavaScript just like a user's browser would.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com")
print(page.title())
browser.close()
8. Manage Your Browser Fingerprint
Anti-bot systems build a "fingerprint" from dozens of signals — screen resolution, installed fonts, WebGL renderer, timezone, and navigator properties. Headless browsers leak tell-tale values (like navigator.webdriver = true). Use stealth plugins (e.g. playwright-stealth, puppeteer-extra-plugin-stealth) or a managed browser that handles fingerprinting for you.
9. Maintain Cookies and Sessions
Real users carry cookies across requests. A scraper that drops cookies on every request looks anonymous and suspicious. Use a session object to persist cookies naturally.
import requests
session = requests.Session()
session.headers.update(headers)
session.get("https://example.com") # picks up cookies
session.get("https://example.com/page/2") # reuses them
10. Use a Web Unlocker or Scraping API for Tough Targets
When you're up against the most aggressive anti-bot systems, managing all of the above yourself becomes a full-time job. Web Unlockers and scraping APIs bundle proxy rotation, header management, fingerprinting, CAPTCHA solving, and automatic retries into a single endpoint — you send a URL and get clean HTML back.
Try it free: The Bright Data Web Unlocker handles blocks automatically and comes with a 7-day free trial. Start scraping →
For a simpler pay-as-you-go API, ScraperAPI is also worth a look.
Quick Reference
| Technique | Difficulty | Impact |
|---|---|---|
| Rotate IPs with proxies | Medium | Very High |
| Realistic User-Agent | Easy | High |
| Complete headers | Easy | Medium |
| Throttle request rate | Easy | High |
| Handle CAPTCHAs | Hard | High |
| Headless browser | Medium | High |
| Manage fingerprint | Hard | High |
| Maintain sessions | Easy | Medium |
| Web Unlocker / API | Easy | Very High |
The Bottom Line
Avoiding blocks is about looking like a real user: rotating IPs, sending believable headers, pacing your requests, and rendering JavaScript when needed. For small projects, the manual techniques here go a long way. For scraping protected sites at scale, a managed solution that bundles unblocking infrastructure will save you enormous time.
Start your free 7-day trial of Bright Data →
New to scraping? Start with our Web Scraping with Python guide, or compare providers in our proxy and scraper reviews.