Pagination is one of the first obstacles every web scraper hits. Product listings, search results, news archives — any site with more items than fit on one page uses some form of it. Miss it, and your scraper silently collects a fraction of the data you need without any error to alert you.
This guide covers every pagination pattern you'll encounter in the wild and shows you exactly how to handle each one in Python.
Why Pagination Matters
A single product category on a large e-commerce site can span hundreds of pages. A news site's archive may run thousands. If your scraper stops at page one, you might capture 1% of the data you actually need — and you won't even know it's missing.
The goal: build a loop that keeps following pages until there are no more.
Pattern 1: Query String Pagination (?page=N)
The simplest and most common pattern. Each page is a distinct URL with a numeric parameter:
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
How to detect it: Click "Next" or a page number and watch the URL. If a page=, p=, or start= parameter appears or increments, you're dealing with query string pagination.
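One way to confirm which parameter drives the pagination is to diff the query strings of two consecutive page URLs. The sketch below uses only the standard library. `changed_params` is a hypothetical helper name, and the `sort=price` parameter is invented here to show that unchanged parameters are ignored.

```python
from urllib.parse import urlsplit, parse_qs


def changed_params(url_a, url_b):
    """Return {param: (old, new)} for query parameters that differ between two URLs."""
    query_a = parse_qs(urlsplit(url_a).query)
    query_b = parse_qs(urlsplit(url_b).query)
    diff = {}
    for key in sorted(set(query_a) | set(query_b)):
        old = query_a.get(key, [None])[0]   # None when the parameter is absent
        new = query_b.get(key, [None])[0]
        if old != new:
            diff[key] = (old, new)
    return diff


print(changed_params(
    "https://example.com/products?page=1&sort=price",
    "https://example.com/products?page=2&sort=price",
))
# {'page': ('1', '2')}
```

If the only parameter that changes is a counter such as page, p, or start, the loop-and-increment approach is likely to work.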
How to scrape it:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products"
all_items = []
page = 1

while True:
    response = requests.get(BASE_URL, params={"page": page})
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select(".product-card")
    if not items:
        break  # no more results
    all_items.extend(items)
    page += 1
The key is a termination condition — here we stop when the page returns no items. Alternatives include checking for a disabled "Next" button or comparing the current page number against a total page count you extract from the HTML.
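These stopping rules can be combined in a single loop. The following is a minimal sketch, not a definitive implementation: `fetch_page` is a hypothetical callable that wraps the request and parsing and returns the page's items plus a total page count (or `None` when the site does not expose one), and `max_pages` is an assumed safety cap.

```python
def paginate(fetch_page, max_pages=1000):
    """Yield items page by page until one of the stop conditions is met."""
    page = 1
    while page <= max_pages:              # hard cap guards against endless loops
        items, total_pages = fetch_page(page)
        if not items:                     # empty page: nothing left
            break
        yield from items
        if total_pages is not None and page >= total_pages:
            break                         # reached the advertised last page
        page += 1


# Stand-in for a real site: three pages of two items each, total count known.
FAKE_SITE = {1: ["a", "b"], 2: ["c", "d"], 3: ["e", "f"]}


def fake_fetch(page):
    return FAKE_SITE.get(page, []), len(FAKE_SITE)


print(list(paginate(fake_fetch)))
# ['a', 'b', 'c', 'd', 'e', 'f']
```

Checking the total page count saves the one extra request that the empty-page check needs, while the cap protects against sites that keep returning the last page for out-of-range page numbers.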
Pattern 2: Offset / Cursor Pagination
Many sites (and especially JSON APIs) use an offset rather than a page number:
https://example.com/products?offset=0&limit=24
https://example.com/products?offset=24&limit=24
https://example.com/products?offset=48&limit=24
Handle this by incrementing offset by the page size each iteration:
LIMIT = 24
offset = 0

while True:
    response = requests.get(BASE_URL, params={"limit": LIMIT, "offset": offset})
    data = response.json()
    if not data["items"]:
        break
    process(data["items"])
    offset += LIMIT
Some APIs return a next_cursor token instead of a numeric offset. Pass cursor=<token> in the next request and stop when the response contains no cursor field.
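A cursor loop looks much like the offset loop. This sketch assumes the response is a dict with an `items` list and an optional `next_cursor` field. Real APIs name these fields differently, so check the actual payload first. `fetch` is a hypothetical callable standing in for the HTTP request and JSON decoding.

```python
def iter_cursor_pages(fetch):
    """Yield every item from a cursor-paginated API."""
    cursor = None                          # the first request carries no cursor
    while True:
        data = fetch(cursor)
        yield from data.get("items", [])
        cursor = data.get("next_cursor")
        if not cursor:                     # missing or empty cursor: last page
            break


# Stand-in responses, keyed by the cursor that requests them.
FAKE_API = {
    None: {"items": [1, 2], "next_cursor": "abc"},
    "abc": {"items": [3, 4], "next_cursor": "def"},
    "def": {"items": [5]},
}

print(list(iter_cursor_pages(FAKE_API.get)))
# [1, 2, 3, 4, 5]
```

Unlike a page number or offset, a cursor cannot be guessed ahead of time, so these requests have to run one after another.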
Pattern 3: "Next" Button Crawling
Some sites don't expose the total page count — they just render a "Next" link. Your scraper needs to find and follow it:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select(".product-card")
    process(items)
    next_link = soup.select_one("a.pagination__next")
    if next_link and next_link.get("href"):
        url = urljoin(url, next_link["href"])
    else:
        url = None  # no more pages
urljoin handles relative URLs gracefully — many sites link to /products?page=2 rather than a full absolute URL.
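To see what urljoin does with the kinds of href values a "Next" link typically carries, here is a short standard-library demonstration. The URLs are illustrative only.

```python
from urllib.parse import urljoin

current = "https://example.com/shop/products?page=1"

# Root-relative href: replaces the whole path
print(urljoin(current, "/products?page=2"))
# https://example.com/products?page=2

# Query-only href: keeps the current path, swaps the query string
print(urljoin(current, "?page=2"))
# https://example.com/shop/products?page=2

# Already absolute href: returned unchanged
print(urljoin(current, "https://example.com/p?page=2"))
# https://example.com/p?page=2
```

Passing the URL of the page you just fetched as the base, as the loop does, matters because relative links resolve against the current page and not against the start URL.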
Tip: If you're getting blocked mid-pagination, rotating proxies will help. Distributing requests across different IP addresses prevents any single IP from accumulating a suspicious request count. See Residential vs. Datacenter vs. Mobile Proxies for guidance on which type to use, or check out Bright Data's residential proxy network, which offers 400M+ IPs with automatic rotation.
Pattern 4: Infinite Scroll (AJAX / XHR)
The most deceptive pattern. The page appears to have no pagination — content loads as you scroll. Under the hood, the browser fires XHR requests to load batches of results.
How to detect it: Open DevTools → Network → XHR/Fetch. Scroll the page and watch for new requests. You'll usually see a JSON endpoint being called with page or offset parameters.
Option A: Hit the underlying API directly. This is faster and cleaner than automating a real browser. Copy the XHR URL from DevTools and replicate the request:
# The underlying API endpoint discovered via DevTools
API_URL = "https://example.com/api/products"
headers = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
}
page = 1

while True:
    response = requests.get(API_URL, params={"page": page}, headers=headers)
    data = response.json()
    if not data.get("results"):
        break
    process(data["results"])
    page += 1
Option B: Use Playwright to simulate scrolling. If the underlying API is obfuscated or requires complex auth tokens, drive a real browser:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    prev_height = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # wait for new content to load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == prev_height:
            break  # reached the bottom
        prev_height = new_height
    html = page.content()
    browser.close()
For JavaScript-heavy pagination on tough sites, a managed scraping browser handles rendering, fingerprinting, and proxy rotation for you — no Playwright configuration required. Bright Data's Scraping Browser is purpose-built for exactly this use case.
Pattern 5: Directory + Detail Pages (Two-Pass Crawl)
For directory-style sites — real estate listings, job boards, business directories — you typically need two loops: one for the paginated index, one for each detail page linked from it.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/listings?page={page}"
detail_urls = []

# Pass 1: collect all detail-page URLs from the paginated index
page = 1
while True:
    soup = BeautifulSoup(
        requests.get(INDEX_URL.format(page=page)).text, "html.parser"
    )
    links = [urljoin(INDEX_URL, a["href"]) for a in soup.select("a.listing-link")]
    if not links:
        break
    detail_urls.extend(links)
    page += 1

# Pass 2: scrape each detail page
for url in detail_urls:
    detail = BeautifulSoup(requests.get(url).text, "html.parser")
    process(detail)
Handling Duplicates and Resumability
Long pagination runs can fail halfway through — a network error, a block, or a timeout. Protect yourself with a seen-set so you don't reprocess items, and save progress to disk so you can resume:
import json
import os

STATE_FILE = "scrape_state.json"

# Load the IDs saved by a previous run, if any
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        seen = set(json.load(f))
else:
    seen = set()

for item in scraped_items:
    if item["id"] not in seen:
        save(item)
        seen.add(item["id"])
        # checkpoint regularly so a crash doesn't lose everything
        with open(STATE_FILE, "w") as f:
            json.dump(list(seen), f)
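The seen-set prevents reprocessing, but resuming from the right place also needs the position in the pagination. One possible approach, sketched here under assumptions, is to store the last completed page next to the seen IDs and to write the file atomically. The helper names `save_state` and `load_state` and the `last_page` and `seen` field names are hypothetical, and the IDs are assumed to be strings.

```python
import json
import os
import tempfile


def save_state(path, last_page, seen_ids):
    """Write progress to a temp file, then swap it in, so a crash never leaves a half-written file."""
    state = {"last_page": last_page, "seen": sorted(seen_ids)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem


def load_state(path):
    """Return (last_page, seen_ids); a missing file means a fresh start."""
    if not os.path.exists(path):
        return 0, set()
    with open(path) as f:
        state = json.load(f)
    return state["last_page"], set(state["seen"])


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "scrape_state.json")
    print(load_state(state_file))            # (0, set())
    save_state(state_file, 7, {"sku-1", "sku-2"})
    last_page, seen_ids = load_state(state_file)
    print(last_page, sorted(seen_ids))       # 7 ['sku-1', 'sku-2']
```

On restart, the pagination loop would begin at `last_page + 1` and skip any item whose ID is already in the loaded set.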
Rate Limiting and Politeness
Pagination scrapers fire many requests in a tight loop — exactly the pattern anti-bot systems watch for. Always add a randomized delay between page requests:
import time, random
time.sleep(random.uniform(1.5, 4.0))
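When a page request fails outright, a fixed pause is often not enough. A common complement is to retry with exponential backoff plus random jitter. The sketch below is one way to do it, not a definitive implementation: `fetch` is a hypothetical callable that raises an exception on failure, and the retry count and base delay are illustrative defaults.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=4, base_delay=1.5, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure before trying again."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:  # in real code, narrow this to your HTTP client's errors
            if attempt == retries:
                raise      # out of retries: surface the error
            # 1.5s, 3s, 6s, ... plus up to 1s of random jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))


# Demo: a fetch that fails twice, then succeeds. Waits are recorded, not slept.
calls = []


def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
print(fetch_with_backoff(flaky, "https://example.com/products?page=1", sleep=waits.append))
print(len(waits))
# ok
# 2
```

The `sleep` parameter exists only so the demo runs instantly; in a scraper you would leave it at its default.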
For large crawls, combine this with proxy rotation so no single IP bears your full request volume. The full anti-blocking playbook is here.
If you'd rather skip the infrastructure work entirely, managed scraping APIs handle proxy rotation, retries, and JavaScript rendering behind a single endpoint. ScraperAPI and ZenRows are popular options; Oxylabs covers enterprise-scale crawls. Compare them head-to-head in our scraping API reviews and Bright Data vs. Oxylabs comparison.
Quick Reference
| Pagination Pattern | How to Detect | Scraping Approach |
|---|---|---|
| Query string (?page=N) | URL parameter increments on "Next" | Loop incrementing page param |
| Offset (?offset=N) | URL offset increases by page size | Loop incrementing offset by page size |
| Next-button crawl | "Next" link present, no page count | Follow href of "Next" link |
| Infinite scroll — JSON API | XHR requests visible in DevTools | Call the XHR endpoint directly |
| Infinite scroll — rendered | No clean XHR endpoint found | Playwright scroll-to-bottom loop |
| Directory + detail pages | Index lists links, detail pages hold data | Two-pass loop |
The Bottom Line
Most pagination boils down to a loop: request a page, collect what you need, find the next URL, repeat until done. The tricky cases — infinite scroll, obfuscated APIs, aggressive anti-bot systems — are solved either by reverse-engineering the underlying network request or by driving a real browser.
Build in a termination condition, add politeness delays, and rotate your IPs on any large run. Those three habits transform a fragile one-off script into a reliable data pipeline.
New to scraping? Start with our Web Scraping with Python guide or read How to Avoid Getting Blocked While Scraping.