[{"data":1,"prerenderedAt":5009},["ShallowReactive",2],{"blog-list":3},[4,687,1136,1681,2223,2644,2972,3347,3659,3935,4267,4671],{"id":5,"title":6,"body":7,"date":666,"description":667,"draft":668,"extension":669,"meta":670,"navigation":149,"path":671,"readingTime":672,"seo":673,"stem":674,"tags":675,"takeaways":680,"updated":685,"__hash__":686},"blog\u002Fblog\u002Frotating-proxies-for-web-scraping.md","How to Integrate Rotating Proxies for Web Scraping (Without Getting Blocked)",{"type":8,"value":9,"toc":655},"minimark",[10,14,23,28,31,35,38,113,120,124,127,209,217,221,224,370,384,388,391,475,486,490,493,525,528,532,535,589,592,596,631,635,651],[11,12,6],"h1",{"id":13},"how-to-integrate-rotating-proxies-for-web-scraping-without-getting-blocked",[15,16,17,18,22],"p",{},"If your scraper works for the first hundred requests and then starts returning ",[19,20,21],"code",{},"403",", empty pages, or CAPTCHAs, you have an IP reputation problem, not a code problem. The fix is rotating proxies. This guide covers how to choose the right proxy type, integrate it into a Python scraper, and build the rotation and retry logic that keeps a job running at scale.",[24,25,27],"h2",{"id":26},"why-a-single-ip-gets-blocked","Why a single IP gets blocked",[15,29,30],{},"Every request you send carries your IP address. Anti-bot systems (Cloudflare, DataDome, Akamai, PerimeterX) track request volume, timing, and behavior per IP. A datacenter IP sending 500 requests a minute to a product page looks nothing like a human, so it gets rate-limited or banned. Rotating proxies spread your requests across many IPs so no single address crosses the threshold.",[24,32,34],{"id":33},"proxy-types-and-when-to-use-each","Proxy types, and when to use each",[15,36,37],{},"There are three categories, and picking the wrong one is the most common reason a scrape fails.",[39,40,41,60],"table",{},[42,43,44],"thead",{},[45,46,47,51,54,57],"tr",{},[48,49,50],"th",{},"Type",[48,52,53],{},"Cost",[48,55,56],{},"Detection risk",[48,58,59],{},"Best for",[61,62,63,81,97],"tbody",{},[45,64,65,72,75,78],{},[66,67,68],"td",{},[69,70,71],"strong",{},"Datacenter",[66,73,74],{},"Cheapest",[66,76,77],{},"High",[66,79,80],{},"Unprotected sites, internal tools, high volume where bans are cheap",[45,82,83,88,91,94],{},[66,84,85],{},[69,86,87],{},"Residential",[66,89,90],{},"Mid to high",[66,92,93],{},"Low",[66,95,96],{},"E-commerce, sites behind Cloudflare\u002FDataDome",[45,98,99,104,107,110],{},[66,100,101],{},[69,102,103],{},"Mobile (4G\u002F5G)",[66,105,106],{},"Highest",[66,108,109],{},"Lowest",[66,111,112],{},"The hardest targets like Instagram, sneaker sites, aggressive WAFs",[15,114,115,116,119],{},"Rule of thumb: ",[69,117,118],{},"start with datacenter, escalate to residential only when you see blocks."," Paying for residential on a site that doesn't need it just burns budget.",[24,121,123],{"id":122},"basic-integration-in-python-requests","Basic integration in Python (requests)",[15,125,126],{},"Most providers give you a single gateway endpoint that rotates the IP for you on every request:",[128,129,134],"pre",{"className":130,"code":131,"language":132,"meta":133,"style":133},"language-python shiki shiki-themes github-light github-dark","import requests\n\nPROXY = \"http:\u002F\u002FUSER:PASS@gateway.provider.com:7000\"\n\nproxies = {\"http\": PROXY, \"https\": PROXY}\n\nresp = requests.get(\n    \"https:\u002F\u002Fexample.com\u002Fproducts\",\n    proxies=proxies,\n    timeout=20,\n)\nprint(resp.status_code, resp.url)\n","python","",[19,135,136,144,151,157,162,168,173,179,185,191,197,203],{"__ignoreMap":133},[137,138,141],"span",{"class":139,"line":140},"line",1,[137,142,143],{},"import requests\n",[137,145,147],{"class":139,"line":146},2,[137,148,150],{"emptyLinePlaceholder":149},true,"\n",[137,152,154],{"class":139,"line":153},3,[137,155,156],{},"PROXY = \"http:\u002F\u002FUSER:PASS@gateway.provider.com:7000\"\n",[137,158,160],{"class":139,"line":159},4,[137,161,150],{"emptyLinePlaceholder":149},[137,163,165],{"class":139,"line":164},5,[137,166,167],{},"proxies = {\"http\": PROXY, \"https\": PROXY}\n",[137,169,171],{"class":139,"line":170},6,[137,172,150],{"emptyLinePlaceholder":149},[137,174,176],{"class":139,"line":175},7,[137,177,178],{},"resp = requests.get(\n",[137,180,182],{"class":139,"line":181},8,[137,183,184],{},"    \"https:\u002F\u002Fexample.com\u002Fproducts\",\n",[137,186,188],{"class":139,"line":187},9,[137,189,190],{},"    proxies=proxies,\n",[137,192,194],{"class":139,"line":193},10,[137,195,196],{},"    timeout=20,\n",[137,198,200],{"class":139,"line":199},11,[137,201,202],{},")\n",[137,204,206],{"class":139,"line":205},12,[137,207,208],{},"print(resp.status_code, resp.url)\n",[15,210,211,212,216],{},"This is the simplest setup: the provider's gateway hands you a fresh IP per request. It works, but it gives you no control over ",[213,214,215],"em",{},"when"," to rotate or how to react to a ban.",[24,218,220],{"id":219},"manual-rotation-with-a-proxy-pool","Manual rotation with a proxy pool",[15,222,223],{},"When you need control, for instance keeping the same IP across a multi-step login flow before rotating, manage the pool yourself:",[128,225,227],{"className":130,"code":226,"language":132,"meta":133,"style":133},"import random\nimport requests\n\nPROXY_POOL = [\n    \"http:\u002F\u002FUSER:PASS@p1.provider.com:8000\",\n    \"http:\u002F\u002FUSER:PASS@p2.provider.com:8000\",\n    \"http:\u002F\u002FUSER:PASS@p3.provider.com:8000\",\n]\n\ndef fetch(url: str, max_retries: int = 3) -> requests.Response | None:\n    tried = set()\n    for _ in range(max_retries):\n        proxy = random.choice([p for p in PROXY_POOL if p not in tried])\n        tried.add(proxy)\n        try:\n            resp = requests.get(\n                url,\n                proxies={\"http\": proxy, \"https\": proxy},\n                timeout=20,\n            )\n            if resp.status_code == 200:\n                return resp\n            # 403\u002F429 → this IP is burned, rotate\n        except requests.RequestException:\n            continue  # dead proxy, try the next one\n    return None\n",[19,228,229,234,238,242,247,252,257,262,267,271,276,281,286,292,298,304,310,316,322,328,334,340,346,352,358,364],{"__ignoreMap":133},[137,230,231],{"class":139,"line":140},[137,232,233],{},"import random\n",[137,235,236],{"class":139,"line":146},[137,237,143],{},[137,239,240],{"class":139,"line":153},[137,241,150],{"emptyLinePlaceholder":149},[137,243,244],{"class":139,"line":159},[137,245,246],{},"PROXY_POOL = [\n",[137,248,249],{"class":139,"line":164},[137,250,251],{},"    \"http:\u002F\u002FUSER:PASS@p1.provider.com:8000\",\n",[137,253,254],{"class":139,"line":170},[137,255,256],{},"    \"http:\u002F\u002FUSER:PASS@p2.provider.com:8000\",\n",[137,258,259],{"class":139,"line":175},[137,260,261],{},"    \"http:\u002F\u002FUSER:PASS@p3.provider.com:8000\",\n",[137,263,264],{"class":139,"line":181},[137,265,266],{},"]\n",[137,268,269],{"class":139,"line":187},[137,270,150],{"emptyLinePlaceholder":149},[137,272,273],{"class":139,"line":193},[137,274,275],{},"def fetch(url: str, max_retries: int = 3) -> requests.Response | None:\n",[137,277,278],{"class":139,"line":199},[137,279,280],{},"    tried = set()\n",[137,282,283],{"class":139,"line":205},[137,284,285],{},"    for _ in range(max_retries):\n",[137,287,289],{"class":139,"line":288},13,[137,290,291],{},"        proxy = random.choice([p for p in PROXY_POOL if p not in tried])\n",[137,293,295],{"class":139,"line":294},14,[137,296,297],{},"        tried.add(proxy)\n",[137,299,301],{"class":139,"line":300},15,[137,302,303],{},"        try:\n",[137,305,307],{"class":139,"line":306},16,[137,308,309],{},"            resp = requests.get(\n",[137,311,313],{"class":139,"line":312},17,[137,314,315],{},"                url,\n",[137,317,319],{"class":139,"line":318},18,[137,320,321],{},"                proxies={\"http\": proxy, \"https\": proxy},\n",[137,323,325],{"class":139,"line":324},19,[137,326,327],{},"                timeout=20,\n",[137,329,331],{"class":139,"line":330},20,[137,332,333],{},"            )\n",[137,335,337],{"class":139,"line":336},21,[137,338,339],{},"            if resp.status_code == 200:\n",[137,341,343],{"class":139,"line":342},22,[137,344,345],{},"                return resp\n",[137,347,349],{"class":139,"line":348},23,[137,350,351],{},"            # 403\u002F429 → this IP is burned, rotate\n",[137,353,355],{"class":139,"line":354},24,[137,356,357],{},"        except requests.RequestException:\n",[137,359,361],{"class":139,"line":360},25,[137,362,363],{},"            continue  # dead proxy, try the next one\n",[137,365,367],{"class":139,"line":366},26,[137,368,369],{},"    return None\n",[15,371,372,373,383],{},"The key ideas: ",[69,374,375,376,378,379,382],{},"track which proxies you've already tried for a given request, treat ",[19,377,21],{},"\u002F",[19,380,381],{},"429"," as a signal to rotate, and silently skip dead proxies."," Without retry logic, a single bad IP fails the whole job.",[24,385,387],{"id":386},"proxies-with-a-headless-browser-playwright","Proxies with a headless browser (Playwright)",[15,389,390],{},"For JavaScript-rendered sites you need a real browser. Playwright takes a proxy per context, which lets you isolate sessions:",[128,392,394],{"className":130,"code":393,"language":132,"meta":133,"style":133},"from playwright.async_api import async_playwright\n\nasync def scrape(url: str, proxy: str):\n    async with async_playwright() as p:\n        browser = await p.chromium.launch(\n            proxy={\n                \"server\": \"http:\u002F\u002Fgateway.provider.com:7000\",\n                \"username\": \"USER\",\n                \"password\": \"PASS\",\n            },\n        )\n        page = await browser.new_page()\n        await page.goto(url, wait_until=\"networkidle\")\n        html = await page.content()\n        await browser.close()\n        return html\n",[19,395,396,401,405,410,415,420,425,430,435,440,445,450,455,460,465,470],{"__ignoreMap":133},[137,397,398],{"class":139,"line":140},[137,399,400],{},"from playwright.async_api import async_playwright\n",[137,402,403],{"class":139,"line":146},[137,404,150],{"emptyLinePlaceholder":149},[137,406,407],{"class":139,"line":153},[137,408,409],{},"async def scrape(url: str, proxy: str):\n",[137,411,412],{"class":139,"line":159},[137,413,414],{},"    async with async_playwright() as p:\n",[137,416,417],{"class":139,"line":164},[137,418,419],{},"        browser = await p.chromium.launch(\n",[137,421,422],{"class":139,"line":170},[137,423,424],{},"            proxy={\n",[137,426,427],{"class":139,"line":175},[137,428,429],{},"                \"server\": \"http:\u002F\u002Fgateway.provider.com:7000\",\n",[137,431,432],{"class":139,"line":181},[137,433,434],{},"                \"username\": \"USER\",\n",[137,436,437],{"class":139,"line":187},[137,438,439],{},"                \"password\": \"PASS\",\n",[137,441,442],{"class":139,"line":193},[137,443,444],{},"            },\n",[137,446,447],{"class":139,"line":199},[137,448,449],{},"        )\n",[137,451,452],{"class":139,"line":205},[137,453,454],{},"        page = await browser.new_page()\n",[137,456,457],{"class":139,"line":288},[137,458,459],{},"        await page.goto(url, wait_until=\"networkidle\")\n",[137,461,462],{"class":139,"line":294},[137,463,464],{},"        html = await page.content()\n",[137,466,467],{"class":139,"line":300},[137,468,469],{},"        await browser.close()\n",[137,471,472],{"class":139,"line":306},[137,473,474],{},"        return html\n",[15,476,477,478,481,482,485],{},"One critical detail: ",[69,479,480],{},"match your proxy's geolocation to the site's expected audience."," Scraping a US retailer through a German residential IP often triggers extra verification. Most residential providers let you pin a country (",[19,483,484],{},"gateway.provider.com:7000?country=us",").",[24,487,489],{"id":488},"combining-proxies-with-fingerprint-stealth","Combining proxies with fingerprint stealth",[15,491,492],{},"Rotating IPs alone is not enough on aggressively protected sites. A fresh residential IP paired with an obvious headless-Chrome fingerprint still gets flagged. The full stack looks like:",[494,495,496,503,513,519],"ol",{},[497,498,499,502],"li",{},[69,500,501],{},"Residential\u002Fmobile proxy"," for a clean IP reputation.",[497,504,505,508,509,512],{},[69,506,507],{},"Fingerprint spoofing"," with realistic ",[19,510,511],{},"navigator"," properties, WebGL, canvas, fonts.",[497,514,515,518],{},[69,516,517],{},"Human-like timing"," using randomized delays, no perfectly even request intervals.",[497,520,521,524],{},[69,522,523],{},"Session persistence"," that reuses cookies and the same IP within a logical session, rotating between sessions.",[15,526,527],{},"Skip any one layer and the others can't compensate. This is why \"just add proxies\" often fails on Cloudflare-protected targets: the IP was clean, but the fingerprint gave it away.",[24,529,531],{"id":530},"a-retry-pattern-that-survives-real-jobs","A retry pattern that survives real jobs",[15,533,534],{},"In production I wrap every request in exponential backoff with proxy rotation on hard failures:",[128,536,538],{"className":130,"code":537,"language":132,"meta":133,"style":133},"import time\n\ndef fetch_with_backoff(url: str, max_attempts: int = 5):\n    for attempt in range(max_attempts):\n        resp = fetch(url)  # rotates proxy internally\n        if resp is not None:\n            return resp\n        sleep = min(2 ** attempt, 30)  # cap backoff at 30s\n        time.sleep(sleep)\n    raise RuntimeError(f\"Failed after {max_attempts} attempts: {url}\")\n",[19,539,540,545,549,554,559,564,569,574,579,584],{"__ignoreMap":133},[137,541,542],{"class":139,"line":140},[137,543,544],{},"import time\n",[137,546,547],{"class":139,"line":146},[137,548,150],{"emptyLinePlaceholder":149},[137,550,551],{"class":139,"line":153},[137,552,553],{},"def fetch_with_backoff(url: str, max_attempts: int = 5):\n",[137,555,556],{"class":139,"line":159},[137,557,558],{},"    for attempt in range(max_attempts):\n",[137,560,561],{"class":139,"line":164},[137,562,563],{},"        resp = fetch(url)  # rotates proxy internally\n",[137,565,566],{"class":139,"line":170},[137,567,568],{},"        if resp is not None:\n",[137,570,571],{"class":139,"line":175},[137,572,573],{},"            return resp\n",[137,575,576],{"class":139,"line":181},[137,577,578],{},"        sleep = min(2 ** attempt, 30)  # cap backoff at 30s\n",[137,580,581],{"class":139,"line":187},[137,582,583],{},"        time.sleep(sleep)\n",[137,585,586],{"class":139,"line":193},[137,587,588],{},"    raise RuntimeError(f\"Failed after {max_attempts} attempts: {url}\")\n",[15,590,591],{},"Exponential backoff prevents you from hammering a site that's already rate-limiting you, which on some WAFs escalates a soft block into a hard ban.",[24,593,595],{"id":594},"common-mistakes-to-avoid","Common mistakes to avoid",[597,598,599,609,619,625],"ul",{},[497,600,601,604,605,608],{},[69,602,603],{},"Rotating too aggressively."," A new IP on every single request can look ",[213,606,607],{},"more"," suspicious than a stable session. Match rotation to the site's tolerance.",[497,610,611,614,615,618],{},[69,612,613],{},"Ignoring response bodies."," A ",[19,616,617],{},"200"," status with a CAPTCHA page in the body is still a block. Validate content, not just status codes.",[497,620,621,624],{},[69,622,623],{},"Leaking your real IP."," WebRTC, DNS, and direct API calls can bypass the proxy. Test with an IP-check endpoint before trusting your setup.",[497,626,627,630],{},[69,628,629],{},"Buying the cheapest residential pool."," Oversold pools have burned IPs already flagged across thousands of sites.",[24,632,634],{"id":633},"need-this-built-for-your-project","Need this built for your project?",[15,636,637,638,645,646,650],{},"I build production scraping systems with proxy integration, anti-bot bypass, and the retry infrastructure to keep them running at scale, across Cloudflare, DataDome, and Akamai-protected sites. If you have a scraping or automation project, ",[639,640,644],"a",{"href":641,"rel":642},"https:\u002F\u002Fwww.upwork.com\u002Ffreelancers\u002Fphanvuong2",[643],"nofollow","hire me on Upwork"," or get in touch through the ",[639,647,649],{"href":648},"\u002F#contact","contact form",". I reply within 24 hours with a scope and quote.",[652,653,654],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}",{"title":133,"searchDepth":146,"depth":146,"links":656},[657,658,659,660,661,662,663,664,665],{"id":26,"depth":146,"text":27},{"id":33,"depth":146,"text":34},{"id":122,"depth":146,"text":123},{"id":219,"depth":146,"text":220},{"id":386,"depth":146,"text":387},{"id":488,"depth":146,"text":489},{"id":530,"depth":146,"text":531},{"id":594,"depth":146,"text":595},{"id":633,"depth":146,"text":634},"2026-06-12","A practical guide to integrating residential and rotating proxies into a Python scraper: proxy types, rotation strategies, retry logic, and how to avoid IP bans on protected sites.",false,"md",{},"\u002Fblog\u002Frotating-proxies-for-web-scraping","8 min read",{"title":6,"description":667},"blog\u002Frotating-proxies-for-web-scraping",[676,677,132,678,679],"web scraping","proxies","anti-bot","playwright",[681,682,683,684],"Datacenter proxies are cheapest but blocked fast; residential and mobile cost more but pass protected sites.","Start with datacenter and escalate to residential only when you actually see blocks.","Treat 403 and 429 responses as a signal to rotate, and silently skip dead proxies.","Match proxy geolocation to the site's audience, and pair proxies with fingerprint stealth and human-like timing.",null,"1Ocj1ZSzLA0gcR97EZs8BvMRpVTWAsngaK8NHENlbtM",{"id":688,"title":689,"body":690,"date":1121,"description":1122,"draft":668,"extension":669,"meta":1123,"navigation":149,"path":1124,"readingTime":1125,"seo":1126,"stem":1127,"tags":1128,"takeaways":1130,"updated":685,"__hash__":1135},"blog\u002Fblog\u002Fbypass-cloudflare-web-scraping.md","How to Scrape Cloudflare-Protected Sites in 2026 (A Practical Approach)",{"type":8,"value":691,"toc":1111},[692,695,701,705,708,751,758,762,776,822,829,833,840,970,984,988,995,1002,1006,1009,1023,1026,1030,1036,1081,1088,1092,1095,1099,1109],[11,693,689],{"id":694},"how-to-scrape-cloudflare-protected-sites-in-2026-a-practical-approach",[15,696,697,698,700],{},"Cloudflare protects a large share of the web, and its bot management has gotten much harder to beat. If you've hit the \"Checking your browser\" interstitial, a Turnstile challenge, or a silent ",[19,699,21],{},", this is what's actually happening and how to get through it reliably.",[24,702,704],{"id":703},"what-cloudflare-actually-checks","What Cloudflare actually checks",[15,706,707],{},"Cloudflare doesn't rely on one signal. It scores you across several layers, and failing any one can flag you:",[597,709,710,723,729,739,745],{},[497,711,712,715,716,719,720,722],{},[69,713,714],{},"TLS fingerprint (JA3\u002FJA4)."," The way your HTTP client negotiates TLS reveals whether you're a real browser or a Python ",[19,717,718],{},"requests"," session. This is why plain ",[19,721,718],{}," gets blocked instantly, before any JavaScript runs.",[497,724,725,728],{},[69,726,727],{},"HTTP\u002F2 fingerprint."," Header order, pseudo-header order, and frame settings differ between real Chrome and automation libraries.",[497,730,731,734,735,738],{},[69,732,733],{},"Browser fingerprint."," JavaScript challenges probe ",[19,736,737],{},"navigator.webdriver",", WebGL, canvas, installed fonts, screen properties, and dozens of other values.",[497,740,741,744],{},[69,742,743],{},"Behavioral signals."," Mouse movement, timing, and navigation patterns.",[497,746,747,750],{},[69,748,749],{},"IP reputation."," Datacenter IPs start with a low trust score.",[15,752,753,754,757],{},"The takeaway: ",[69,755,756],{},"a scraper that fixes only one layer still fails."," Clean IP with a headless fingerprint? Blocked. Perfect fingerprint from a flagged datacenter IP? Blocked.",[24,759,761],{"id":760},"why-plain-http-clients-cant-win","Why plain HTTP clients can't win",[15,763,764,765,767,768,771,772,775],{},"A request from ",[19,766,718],{}," or ",[19,769,770],{},"httpx"," is rejected at the TLS layer before Cloudflare even serves the challenge. Libraries like ",[19,773,774],{},"curl_cffi"," help by impersonating a real browser's TLS fingerprint:",[128,777,779],{"className":130,"code":778,"language":132,"meta":133,"style":133},"from curl_cffi import requests\n\n# Impersonate a real Chrome TLS + HTTP2 fingerprint\nresp = requests.get(\n    \"https:\u002F\u002Fprotected-site.com\",\n    impersonate=\"chrome131\",\n    timeout=20,\n)\nprint(resp.status_code)\n",[19,780,781,786,790,795,799,804,809,813,817],{"__ignoreMap":133},[137,782,783],{"class":139,"line":140},[137,784,785],{},"from curl_cffi import requests\n",[137,787,788],{"class":139,"line":146},[137,789,150],{"emptyLinePlaceholder":149},[137,791,792],{"class":139,"line":153},[137,793,794],{},"# Impersonate a real Chrome TLS + HTTP2 fingerprint\n",[137,796,797],{"class":139,"line":159},[137,798,178],{},[137,800,801],{"class":139,"line":164},[137,802,803],{},"    \"https:\u002F\u002Fprotected-site.com\",\n",[137,805,806],{"class":139,"line":170},[137,807,808],{},"    impersonate=\"chrome131\",\n",[137,810,811],{"class":139,"line":175},[137,812,196],{},[137,814,815],{"class":139,"line":181},[137,816,202],{},[137,818,819],{"class":139,"line":187},[137,820,821],{},"print(resp.status_code)\n",[15,823,824,825,828],{},"This gets you past the TLS check and works on Cloudflare's ",[213,826,827],{},"lower"," security settings. But on sites running a managed challenge or Turnstile, you need a real browser to execute the JavaScript.",[24,830,832],{"id":831},"the-reliable-approach-a-stealth-browser","The reliable approach: a stealth browser",[15,834,835,836,839],{},"For managed challenges, run an actual browser with anti-detection patches. With Playwright, the base setup looks like this, but the stock launch is ",[213,837,838],{},"not"," enough:",[128,841,843],{"className":130,"code":842,"language":132,"meta":133,"style":133},"from playwright.async_api import async_playwright\n\nasync def scrape(url: str):\n    async with async_playwright() as p:\n        browser = await p.chromium.launch(\n            headless=True,\n            args=[\n                \"--disable-blink-features=AutomationControlled\",\n            ],\n            proxy={\n                \"server\": \"http:\u002F\u002Fgateway.provider.com:7000\",\n                \"username\": \"USER\",\n                \"password\": \"PASS\",\n            },\n        )\n        ctx = await browser.new_context(\n            user_agent=\"Mozilla\u002F5.0 (Windows NT 10.0; Win64; x64) \"\n                       \"AppleWebKit\u002F537.36 (KHTML, like Gecko) \"\n                       \"Chrome\u002F131.0.0.0 Safari\u002F537.36\",\n            viewport={\"width\": 1920, \"height\": 1080},\n            locale=\"en-US\",\n        )\n        page = await ctx.new_page()\n        await page.goto(url, wait_until=\"domcontentloaded\")\n        # Wait out the challenge, then read the real content\n        await page.wait_for_load_state(\"networkidle\")\n        return await page.content()\n",[19,844,845,849,853,858,862,866,871,876,881,886,890,894,898,902,906,910,915,920,925,930,935,940,944,949,954,959,964],{"__ignoreMap":133},[137,846,847],{"class":139,"line":140},[137,848,400],{},[137,850,851],{"class":139,"line":146},[137,852,150],{"emptyLinePlaceholder":149},[137,854,855],{"class":139,"line":153},[137,856,857],{},"async def scrape(url: str):\n",[137,859,860],{"class":139,"line":159},[137,861,414],{},[137,863,864],{"class":139,"line":164},[137,865,419],{},[137,867,868],{"class":139,"line":170},[137,869,870],{},"            headless=True,\n",[137,872,873],{"class":139,"line":175},[137,874,875],{},"            args=[\n",[137,877,878],{"class":139,"line":181},[137,879,880],{},"                \"--disable-blink-features=AutomationControlled\",\n",[137,882,883],{"class":139,"line":187},[137,884,885],{},"            ],\n",[137,887,888],{"class":139,"line":193},[137,889,424],{},[137,891,892],{"class":139,"line":199},[137,893,429],{},[137,895,896],{"class":139,"line":205},[137,897,434],{},[137,899,900],{"class":139,"line":288},[137,901,439],{},[137,903,904],{"class":139,"line":294},[137,905,444],{},[137,907,908],{"class":139,"line":300},[137,909,449],{},[137,911,912],{"class":139,"line":306},[137,913,914],{},"        ctx = await browser.new_context(\n",[137,916,917],{"class":139,"line":312},[137,918,919],{},"            user_agent=\"Mozilla\u002F5.0 (Windows NT 10.0; Win64; x64) \"\n",[137,921,922],{"class":139,"line":318},[137,923,924],{},"                       \"AppleWebKit\u002F537.36 (KHTML, like Gecko) \"\n",[137,926,927],{"class":139,"line":324},[137,928,929],{},"                       \"Chrome\u002F131.0.0.0 Safari\u002F537.36\",\n",[137,931,932],{"class":139,"line":330},[137,933,934],{},"            viewport={\"width\": 1920, \"height\": 1080},\n",[137,936,937],{"class":139,"line":336},[137,938,939],{},"            locale=\"en-US\",\n",[137,941,942],{"class":139,"line":342},[137,943,449],{},[137,945,946],{"class":139,"line":348},[137,947,948],{},"        page = await ctx.new_page()\n",[137,950,951],{"class":139,"line":354},[137,952,953],{},"        await page.goto(url, wait_until=\"domcontentloaded\")\n",[137,955,956],{"class":139,"line":360},[137,957,958],{},"        # Wait out the challenge, then read the real content\n",[137,960,961],{"class":139,"line":366},[137,962,963],{},"        await page.wait_for_load_state(\"networkidle\")\n",[137,965,967],{"class":139,"line":966},27,[137,968,969],{},"        return await page.content()\n",[15,971,972,973,975,976,979,980,983],{},"The hidden work is in the patches that hide automation: removing ",[19,974,737],{},", spoofing the permissions API, faking plugins and WebGL vendor strings, and matching the user-agent to the actual browser build. Tools like ",[19,977,978],{},"playwright-stealth",", ",[19,981,982],{},"undetected-chromedialog",", or the Camoufox\u002Fnodriver projects automate much of this, but they need maintenance as Cloudflare updates its detection.",[24,985,987],{"id":986},"residential-proxies-are-not-optional-here","Residential proxies are not optional here",[15,989,990,991,994],{},"On Cloudflare-protected sites, datacenter IPs start with a trust deficit you usually can't overcome. Pair the stealth browser with residential or mobile proxies, and ",[69,992,993],{},"match the proxy country to the site's audience",". A US store accessed through a foreign IP often triggers extra verification even when everything else is perfect.",[15,996,997,998,1001],{},"See my detailed guide on ",[639,999,1000],{"href":671},"integrating rotating proxies"," for the rotation and retry logic.",[24,1003,1005],{"id":1004},"handling-turnstile-challenges","Handling Turnstile challenges",[15,1007,1008],{},"When a Turnstile or interactive challenge appears, you have two paths:",[494,1010,1011,1017],{},[497,1012,1013,1016],{},[69,1014,1015],{},"Let the stealth browser solve it passively."," With a clean fingerprint and good IP, Turnstile often passes without interaction.",[497,1018,1019,1022],{},[69,1020,1021],{},"Use a solver service"," (2Captcha, CapSolver) for the token when passive solving fails. The solver returns a token you inject into the form submission.",[15,1024,1025],{},"In practice, a well-configured stealth browser passes most non-interactive challenges on its own, and the solver is the fallback for the hardest cases.",[24,1027,1029],{"id":1028},"validate-the-response-not-just-the-status","Validate the response, not just the status",[15,1031,1032,1033,1035],{},"A ",[19,1034,617],{}," response can still be a block page. Always check the body:",[128,1037,1039],{"className":130,"code":1038,"language":132,"meta":133,"style":133},"def is_blocked(html: str) -> bool:\n    markers = [\n        \"cf-challenge\",\n        \"Checking your browser\",\n        \"Just a moment\",\n        \"cf-turnstile\",\n    ]\n    return any(m in html for m in markers)\n",[19,1040,1041,1046,1051,1056,1061,1066,1071,1076],{"__ignoreMap":133},[137,1042,1043],{"class":139,"line":140},[137,1044,1045],{},"def is_blocked(html: str) -> bool:\n",[137,1047,1048],{"class":139,"line":146},[137,1049,1050],{},"    markers = [\n",[137,1052,1053],{"class":139,"line":153},[137,1054,1055],{},"        \"cf-challenge\",\n",[137,1057,1058],{"class":139,"line":159},[137,1059,1060],{},"        \"Checking your browser\",\n",[137,1062,1063],{"class":139,"line":164},[137,1064,1065],{},"        \"Just a moment\",\n",[137,1067,1068],{"class":139,"line":170},[137,1069,1070],{},"        \"cf-turnstile\",\n",[137,1072,1073],{"class":139,"line":175},[137,1074,1075],{},"    ]\n",[137,1077,1078],{"class":139,"line":181},[137,1079,1080],{},"    return any(m in html for m in markers)\n",[15,1082,1083,1084,1087],{},"If ",[19,1085,1086],{},"is_blocked()"," returns true, rotate the proxy, back off, and retry. Do not treat it as success.",[24,1089,1091],{"id":1090},"when-this-gets-hard","When this gets hard",[15,1093,1094],{},"Cloudflare updates its detection continuously, so a setup that works today can break next month. A production scraper needs monitoring, alerting on block-rate spikes, and a maintenance plan, not a one-off script. That ongoing reliability is the real deliverable, and it's where most DIY scrapers fall apart.",[24,1096,1098],{"id":1097},"need-a-cloudflare-protected-site-scraped-reliably","Need a Cloudflare-protected site scraped reliably?",[15,1100,1101,1102,1105,1106,1108],{},"I build and maintain production scrapers that get through Cloudflare, DataDome, and Akamai, with the stealth, proxy, and monitoring infrastructure to keep them running. If you have a project, ",[639,1103,644],{"href":641,"rel":1104},[643]," or reach out via the ",[639,1107,649],{"href":648},". I respond within 24 hours.",[652,1110,654],{},{"title":133,"searchDepth":146,"depth":146,"links":1112},[1113,1114,1115,1116,1117,1118,1119,1120],{"id":703,"depth":146,"text":704},{"id":760,"depth":146,"text":761},{"id":831,"depth":146,"text":832},{"id":986,"depth":146,"text":987},{"id":1004,"depth":146,"text":1005},{"id":1028,"depth":146,"text":1029},{"id":1090,"depth":146,"text":1091},{"id":1097,"depth":146,"text":1098},"2026-06-10","What Cloudflare actually checks, why most scrapers fail against it, and the layered approach of stealth browsers, fingerprinting, and residential proxies that reliably gets through.",{},"\u002Fblog\u002Fbypass-cloudflare-web-scraping","7 min read",{"title":689,"description":1122},"blog\u002Fbypass-cloudflare-web-scraping",[676,1129,678,679,132],"cloudflare",[1131,1132,1133,1134],"Cloudflare scores you across TLS, HTTP\u002F2, browser fingerprint, behavior, and IP reputation.","Plain HTTP clients fail at the TLS layer; curl_cffi can impersonate a real browser.","Managed challenges need a real, patched stealth browser, not a stock headless launch.","Residential proxies matched to the site's country are required, and a 200 response can still be a block page.","t9vyXQhugYzXupp7mEvMSK7IHv5_iFwn9OJWAdEY8jQ",{"id":1137,"title":1138,"body":1139,"date":1666,"description":1667,"draft":668,"extension":669,"meta":1668,"navigation":149,"path":1669,"readingTime":672,"seo":1670,"stem":1671,"tags":1672,"takeaways":1675,"updated":685,"__hash__":1680},"blog\u002Fblog\u002Fsolving-captchas-2captcha-capsolver.md","Solving CAPTCHAs in Your Scraper with 2Captcha and CapSolver",{"type":8,"value":1140,"toc":1656},[1141,1144,1147,1151,1154,1157,1171,1174,1178,1181,1260,1263,1267,1274,1415,1419,1422,1470,1481,1485,1488,1597,1601,1604,1634,1638,1641,1645,1654],[11,1142,1138],{"id":1143},"solving-captchas-in-your-scraper-with-2captcha-and-capsolver",[15,1145,1146],{},"CAPTCHAs are the wall most scrapers hit once a site decides it does not trust you. The good news is that almost every common CAPTCHA can be solved programmatically through a solving service. This guide shows how to integrate 2Captcha and CapSolver, when each one fits, and how to keep costs under control.",[24,1148,1150],{"id":1149},"how-captcha-solving-services-work","How CAPTCHA solving services work",[15,1152,1153],{},"You do not solve the CAPTCHA yourself. Instead you send the challenge to a service, the service returns a token, and you inject that token into the page exactly as a real browser would after a human passed the test.",[15,1155,1156],{},"The flow is always the same:",[494,1158,1159,1162,1165,1168],{},[497,1160,1161],{},"Detect the CAPTCHA on the page and read its site key.",[497,1163,1164],{},"Send the site key and page URL to the solving service.",[497,1166,1167],{},"Poll until the service returns a solution token.",[497,1169,1170],{},"Inject the token into the hidden form field and submit.",[15,1172,1173],{},"The token, not the image, is what the target site validates. This is why solving services work even on invisible reCAPTCHA v3 where there is no puzzle to click.",[24,1175,1177],{"id":1176},"_2captcha-vs-capsolver-which-to-pick","2Captcha vs CapSolver: which to pick",[15,1179,1180],{},"Both services cover the major CAPTCHA types. The practical differences matter more than the feature list.",[39,1182,1183,1196],{},[42,1184,1185],{},[45,1186,1187,1190,1193],{},[48,1188,1189],{},"Factor",[48,1191,1192],{},"2Captcha",[48,1194,1195],{},"CapSolver",[61,1197,1198,1209,1219,1230,1239,1250],{},[45,1199,1200,1203,1206],{},[66,1201,1202],{},"Speed",[66,1204,1205],{},"Human powered, slower",[66,1207,1208],{},"AI powered, faster",[45,1210,1211,1214,1217],{},[66,1212,1213],{},"reCAPTCHA v2",[66,1215,1216],{},"Reliable",[66,1218,1216],{},[45,1220,1221,1224,1227],{},[66,1222,1223],{},"reCAPTCHA v3",[66,1225,1226],{},"Supported",[66,1228,1229],{},"Strong",[45,1231,1232,1235,1237],{},[66,1233,1234],{},"Cloudflare Turnstile",[66,1236,1226],{},[66,1238,1229],{},[45,1240,1241,1244,1247],{},[66,1242,1243],{},"Pricing model",[66,1245,1246],{},"Per solve",[66,1248,1249],{},"Per solve, cheaper at volume",[45,1251,1252,1254,1257],{},[66,1253,59],{},[66,1255,1256],{},"Image and token tasks",[66,1258,1259],{},"High volume token tasks",[15,1261,1262],{},"Rule of thumb: start with CapSolver for speed on token based challenges, keep 2Captcha as a fallback for odd image puzzles and broad coverage.",[24,1264,1266],{"id":1265},"solving-recaptcha-v2-with-2captcha","Solving reCAPTCHA v2 with 2Captcha",[15,1268,1269,1270,1273],{},"First find the site key in the page. It sits in the ",[19,1271,1272],{},"data-sitekey"," attribute of the reCAPTCHA element. Then send it to the service.",[128,1275,1277],{"className":130,"code":1276,"language":132,"meta":133,"style":133},"import time\nimport requests\n\nAPI_KEY = \"your_2captcha_key\"\n\ndef solve_recaptcha_v2(site_key: str, page_url: str) -> str:\n    # 1. Submit the task\n    r = requests.post(\"https:\u002F\u002F2captcha.com\u002Fin.php\", data={\n        \"key\": API_KEY,\n        \"method\": \"userrecaptcha\",\n        \"googlekey\": site_key,\n        \"pageurl\": page_url,\n        \"json\": 1,\n    }).json()\n    request_id = r[\"request\"]\n\n    # 2. Poll for the token\n    for _ in range(24):\n        time.sleep(5)\n        res = requests.get(\"https:\u002F\u002F2captcha.com\u002Fres.php\", params={\n            \"key\": API_KEY,\n            \"action\": \"get\",\n            \"id\": request_id,\n            \"json\": 1,\n        }).json()\n        if res[\"status\"] == 1:\n            return res[\"request\"]  # the g-recaptcha-response token\n    raise TimeoutError(\"CAPTCHA not solved in time\")\n",[19,1278,1279,1283,1287,1291,1296,1300,1305,1310,1315,1320,1325,1330,1335,1340,1345,1350,1354,1359,1364,1369,1374,1379,1384,1389,1394,1399,1404,1409],{"__ignoreMap":133},[137,1280,1281],{"class":139,"line":140},[137,1282,544],{},[137,1284,1285],{"class":139,"line":146},[137,1286,143],{},[137,1288,1289],{"class":139,"line":153},[137,1290,150],{"emptyLinePlaceholder":149},[137,1292,1293],{"class":139,"line":159},[137,1294,1295],{},"API_KEY = \"your_2captcha_key\"\n",[137,1297,1298],{"class":139,"line":164},[137,1299,150],{"emptyLinePlaceholder":149},[137,1301,1302],{"class":139,"line":170},[137,1303,1304],{},"def solve_recaptcha_v2(site_key: str, page_url: str) -> str:\n",[137,1306,1307],{"class":139,"line":175},[137,1308,1309],{},"    # 1. Submit the task\n",[137,1311,1312],{"class":139,"line":181},[137,1313,1314],{},"    r = requests.post(\"https:\u002F\u002F2captcha.com\u002Fin.php\", data={\n",[137,1316,1317],{"class":139,"line":187},[137,1318,1319],{},"        \"key\": API_KEY,\n",[137,1321,1322],{"class":139,"line":193},[137,1323,1324],{},"        \"method\": \"userrecaptcha\",\n",[137,1326,1327],{"class":139,"line":199},[137,1328,1329],{},"        \"googlekey\": site_key,\n",[137,1331,1332],{"class":139,"line":205},[137,1333,1334],{},"        \"pageurl\": page_url,\n",[137,1336,1337],{"class":139,"line":288},[137,1338,1339],{},"        \"json\": 1,\n",[137,1341,1342],{"class":139,"line":294},[137,1343,1344],{},"    }).json()\n",[137,1346,1347],{"class":139,"line":300},[137,1348,1349],{},"    request_id = r[\"request\"]\n",[137,1351,1352],{"class":139,"line":306},[137,1353,150],{"emptyLinePlaceholder":149},[137,1355,1356],{"class":139,"line":312},[137,1357,1358],{},"    # 2. Poll for the token\n",[137,1360,1361],{"class":139,"line":318},[137,1362,1363],{},"    for _ in range(24):\n",[137,1365,1366],{"class":139,"line":324},[137,1367,1368],{},"        time.sleep(5)\n",[137,1370,1371],{"class":139,"line":330},[137,1372,1373],{},"        res = requests.get(\"https:\u002F\u002F2captcha.com\u002Fres.php\", params={\n",[137,1375,1376],{"class":139,"line":336},[137,1377,1378],{},"            \"key\": API_KEY,\n",[137,1380,1381],{"class":139,"line":342},[137,1382,1383],{},"            \"action\": \"get\",\n",[137,1385,1386],{"class":139,"line":348},[137,1387,1388],{},"            \"id\": request_id,\n",[137,1390,1391],{"class":139,"line":354},[137,1392,1393],{},"            \"json\": 1,\n",[137,1395,1396],{"class":139,"line":360},[137,1397,1398],{},"        }).json()\n",[137,1400,1401],{"class":139,"line":366},[137,1402,1403],{},"        if res[\"status\"] == 1:\n",[137,1405,1406],{"class":139,"line":966},[137,1407,1408],{},"            return res[\"request\"]  # the g-recaptcha-response token\n",[137,1410,1412],{"class":139,"line":1411},28,[137,1413,1414],{},"    raise TimeoutError(\"CAPTCHA not solved in time\")\n",[24,1416,1418],{"id":1417},"injecting-the-token-with-playwright","Injecting the token with Playwright",[15,1420,1421],{},"The token is useless until it is placed in the page and the form is submitted. Inject it into the hidden textarea reCAPTCHA expects.",[128,1423,1425],{"className":130,"code":1424,"language":132,"meta":133,"style":133},"token = solve_recaptcha_v2(site_key, page_url)\n\nawait page.evaluate(\n    \"\"\"(token) => {\n        document.querySelector('#g-recaptcha-response').value = token;\n    }\"\"\",\n    token,\n)\nawait page.click(\"button[type=submit]\")\n",[19,1426,1427,1432,1436,1441,1446,1451,1456,1461,1465],{"__ignoreMap":133},[137,1428,1429],{"class":139,"line":140},[137,1430,1431],{},"token = solve_recaptcha_v2(site_key, page_url)\n",[137,1433,1434],{"class":139,"line":146},[137,1435,150],{"emptyLinePlaceholder":149},[137,1437,1438],{"class":139,"line":153},[137,1439,1440],{},"await page.evaluate(\n",[137,1442,1443],{"class":139,"line":159},[137,1444,1445],{},"    \"\"\"(token) => {\n",[137,1447,1448],{"class":139,"line":164},[137,1449,1450],{},"        document.querySelector('#g-recaptcha-response').value = token;\n",[137,1452,1453],{"class":139,"line":170},[137,1454,1455],{},"    }\"\"\",\n",[137,1457,1458],{"class":139,"line":175},[137,1459,1460],{},"    token,\n",[137,1462,1463],{"class":139,"line":181},[137,1464,202],{},[137,1466,1467],{"class":139,"line":187},[137,1468,1469],{},"await page.click(\"button[type=submit]\")\n",[15,1471,1472,1473,1476,1477,1480],{},"For reCAPTCHA v3 there is no checkbox. The token goes into whatever field the site reads, often a hidden input the site script populates, and you usually pass a ",[19,1474,1475],{},"min_score"," and ",[19,1478,1479],{},"action"," to the solver so the returned token matches what the site expects.",[24,1482,1484],{"id":1483},"solving-cloudflare-turnstile-with-capsolver","Solving Cloudflare Turnstile with CapSolver",[15,1486,1487],{},"Turnstile is increasingly common and CapSolver handles it well. The pattern is identical, only the task type changes.",[128,1489,1491],{"className":130,"code":1490,"language":132,"meta":133,"style":133},"import requests\n\nCAPSOLVER_KEY = \"your_capsolver_key\"\n\ndef solve_turnstile(site_key: str, page_url: str) -> str:\n    create = requests.post(\"https:\u002F\u002Fapi.capsolver.com\u002FcreateTask\", json={\n        \"clientKey\": CAPSOLVER_KEY,\n        \"task\": {\n            \"type\": \"AntiTurnstileTaskProxyLess\",\n            \"websiteURL\": page_url,\n            \"websiteKey\": site_key,\n        },\n    }).json()\n    task_id = create[\"taskId\"]\n\n    while True:\n        res = requests.post(\"https:\u002F\u002Fapi.capsolver.com\u002FgetTaskResult\", json={\n            \"clientKey\": CAPSOLVER_KEY,\n            \"taskId\": task_id,\n        }).json()\n        if res[\"status\"] == \"ready\":\n            return res[\"solution\"][\"token\"]\n",[19,1492,1493,1497,1501,1506,1510,1515,1520,1525,1530,1535,1540,1545,1550,1554,1559,1563,1568,1573,1578,1583,1587,1592],{"__ignoreMap":133},[137,1494,1495],{"class":139,"line":140},[137,1496,143],{},[137,1498,1499],{"class":139,"line":146},[137,1500,150],{"emptyLinePlaceholder":149},[137,1502,1503],{"class":139,"line":153},[137,1504,1505],{},"CAPSOLVER_KEY = \"your_capsolver_key\"\n",[137,1507,1508],{"class":139,"line":159},[137,1509,150],{"emptyLinePlaceholder":149},[137,1511,1512],{"class":139,"line":164},[137,1513,1514],{},"def solve_turnstile(site_key: str, page_url: str) -> str:\n",[137,1516,1517],{"class":139,"line":170},[137,1518,1519],{},"    create = requests.post(\"https:\u002F\u002Fapi.capsolver.com\u002FcreateTask\", json={\n",[137,1521,1522],{"class":139,"line":175},[137,1523,1524],{},"        \"clientKey\": CAPSOLVER_KEY,\n",[137,1526,1527],{"class":139,"line":181},[137,1528,1529],{},"        \"task\": {\n",[137,1531,1532],{"class":139,"line":187},[137,1533,1534],{},"            \"type\": \"AntiTurnstileTaskProxyLess\",\n",[137,1536,1537],{"class":139,"line":193},[137,1538,1539],{},"            \"websiteURL\": page_url,\n",[137,1541,1542],{"class":139,"line":199},[137,1543,1544],{},"            \"websiteKey\": site_key,\n",[137,1546,1547],{"class":139,"line":205},[137,1548,1549],{},"        },\n",[137,1551,1552],{"class":139,"line":288},[137,1553,1344],{},[137,1555,1556],{"class":139,"line":294},[137,1557,1558],{},"    task_id = create[\"taskId\"]\n",[137,1560,1561],{"class":139,"line":300},[137,1562,150],{"emptyLinePlaceholder":149},[137,1564,1565],{"class":139,"line":306},[137,1566,1567],{},"    while True:\n",[137,1569,1570],{"class":139,"line":312},[137,1571,1572],{},"        res = requests.post(\"https:\u002F\u002Fapi.capsolver.com\u002FgetTaskResult\", json={\n",[137,1574,1575],{"class":139,"line":318},[137,1576,1577],{},"            \"clientKey\": CAPSOLVER_KEY,\n",[137,1579,1580],{"class":139,"line":324},[137,1581,1582],{},"            \"taskId\": task_id,\n",[137,1584,1585],{"class":139,"line":330},[137,1586,1398],{},[137,1588,1589],{"class":139,"line":336},[137,1590,1591],{},"        if res[\"status\"] == \"ready\":\n",[137,1593,1594],{"class":139,"line":342},[137,1595,1596],{},"            return res[\"solution\"][\"token\"]\n",[24,1598,1600],{"id":1599},"keeping-costs-under-control","Keeping costs under control",[15,1602,1603],{},"Solving services charge per solve, and on a large job the bill adds up fast. The cheapest CAPTCHA is the one you never trigger.",[597,1605,1606,1616,1622,1628],{},[497,1607,1608,1611,1612,1615],{},[69,1609,1610],{},"Reduce triggers first."," A clean residential IP and a realistic browser fingerprint mean fewer CAPTCHAs in the first place. Solving is the fallback, not the strategy. See my guide on ",[639,1613,1614],{"href":1124},"bypassing Cloudflare"," for the stealth side.",[497,1617,1618,1621],{},[69,1619,1620],{},"Cache sessions."," Once you pass a challenge, reuse the cookies. Do not solve a fresh CAPTCHA on every request.",[497,1623,1624,1627],{},[69,1625,1626],{},"Solve only when blocked."," Detect the CAPTCHA and call the service only if it actually appears, rather than pre solving on every page.",[497,1629,1630,1633],{},[69,1631,1632],{},"Set a budget cap."," Track solves per run and stop the job if the count spikes, which usually means your fingerprint or proxy went bad.",[24,1635,1637],{"id":1636},"when-solving-services-are-not-enough","When solving services are not enough",[15,1639,1640],{},"Some sites layer behavioral analysis on top of the CAPTCHA. A valid token from a session that never moved a mouse or scrolled can still be rejected. In those cases you need the full stealth stack: residential proxies, a patched browser fingerprint, and human like interaction timing, with the solver as one piece rather than the whole answer.",[24,1642,1644],{"id":1643},"need-captchas-handled-in-your-scraping-project","Need CAPTCHAs handled in your scraping project?",[15,1646,1647,1648,1651,1652,650],{},"I build scraping systems that combine stealth, proxy rotation, and CAPTCHA solving so they keep running on protected sites. If you have a project that keeps hitting CAPTCHAs, ",[639,1649,644],{"href":641,"rel":1650},[643]," or reach out through the ",[639,1653,649],{"href":648},[652,1655,654],{},{"title":133,"searchDepth":146,"depth":146,"links":1657},[1658,1659,1660,1661,1662,1663,1664,1665],{"id":1149,"depth":146,"text":1150},{"id":1176,"depth":146,"text":1177},{"id":1265,"depth":146,"text":1266},{"id":1417,"depth":146,"text":1418},{"id":1483,"depth":146,"text":1484},{"id":1599,"depth":146,"text":1600},{"id":1636,"depth":146,"text":1637},{"id":1643,"depth":146,"text":1644},"2026-06-08","A practical guide to integrating CAPTCHA solving services into a Python scraper. Covers reCAPTCHA v2 and v3, hCaptcha, Cloudflare Turnstile, token injection, and cost control.",{},"\u002Fblog\u002Fsolving-captchas-2captcha-capsolver",{"title":1138,"description":1667},"blog\u002Fsolving-captchas-2captcha-capsolver",[1673,676,132,1674,678],"captcha","automation",[1676,1677,1678,1679],"Solving services return a token you inject; you do not solve the puzzle yourself.","CapSolver is fast for token challenges; 2Captcha gives broad coverage as a fallback.","The cheapest CAPTCHA is one you never trigger, so reduce triggers with clean IPs and fingerprints first.","Cache sessions and solve only when actually blocked to keep costs down.","vHMGfU8-0OOVYAG-0hnOxPDnNwGpE_VU2mwP_M-vp1U",{"id":1682,"title":1683,"body":1684,"date":2207,"description":2208,"draft":668,"extension":669,"meta":2209,"navigation":149,"path":2210,"readingTime":2211,"seo":2212,"stem":2213,"tags":2214,"takeaways":2217,"updated":685,"__hash__":2222},"blog\u002Fblog\u002Fscrapy-large-scale-scraping.md","Building a Large-Scale Web Scraper with Scrapy",{"type":8,"value":1685,"toc":2196},[1686,1689,1699,1703,1706,1732,1735,1739,1742,1834,1837,1841,1844,1910,1917,1942,1946,1949,1989,1992,1996,1999,2050,2056,2060,2063,2157,2164,2168,2175,2179,2182,2186,2194],[11,1687,1683],{"id":1688},"building-a-large-scale-web-scraper-with-scrapy",[15,1690,1691,1692,1694,1695,1698],{},"When a scraping job grows past a few thousand pages, a hand written script with ",[19,1693,718],{}," and a ",[19,1696,1697],{},"for"," loop starts to fall apart. Scrapy is the framework built for this scale. It handles concurrency, retries, throttling, and data export so you can focus on the extraction logic. This guide covers the parts that matter for production.",[24,1700,1702],{"id":1701},"why-scrapy-over-a-plain-script","Why Scrapy over a plain script",[15,1704,1705],{},"A simple script does one request at a time and breaks on the first unexpected error. Scrapy gives you the infrastructure for free:",[597,1707,1708,1714,1720,1726],{},[497,1709,1710,1713],{},[69,1711,1712],{},"Asynchronous by default."," It fetches many pages concurrently without you managing threads or async code by hand.",[497,1715,1716,1719],{},[69,1717,1718],{},"Built in retries and throttling."," Failed requests retry automatically, and AutoThrottle adapts the request rate to the server.",[497,1721,1722,1725],{},[69,1723,1724],{},"Middleware system."," Proxies, custom headers, and retry rules plug in cleanly.",[497,1727,1728,1731],{},[69,1729,1730],{},"Item pipelines."," Clean, validate, and store scraped data in stages.",[15,1733,1734],{},"The tradeoff is a steeper learning curve. For a one off scrape of a single page, Scrapy is overkill. For a recurring job across many pages, it pays for itself quickly.",[24,1736,1738],{"id":1737},"a-basic-spider","A basic spider",[15,1740,1741],{},"A spider defines where to start, how to follow links, and how to parse each page.",[128,1743,1745],{"className":130,"code":1744,"language":132,"meta":133,"style":133},"import scrapy\n\nclass ProductSpider(scrapy.Spider):\n    name = \"products\"\n    start_urls = [\"https:\u002F\u002Fexample.com\u002Fcategory\u002Fpage\u002F1\"]\n\n    def parse(self, response):\n        for product in response.css(\"div.product\"):\n            yield {\n                \"name\": product.css(\"h2.title::text\").get(),\n                \"price\": product.css(\"span.price::text\").get(),\n                \"url\": product.css(\"a::attr(href)\").get(),\n            }\n\n        # Follow pagination\n        next_page = response.css(\"a.next::attr(href)\").get()\n        if next_page:\n            yield response.follow(next_page, callback=self.parse)\n",[19,1746,1747,1752,1756,1761,1766,1771,1775,1780,1785,1790,1795,1800,1805,1810,1814,1819,1824,1829],{"__ignoreMap":133},[137,1748,1749],{"class":139,"line":140},[137,1750,1751],{},"import scrapy\n",[137,1753,1754],{"class":139,"line":146},[137,1755,150],{"emptyLinePlaceholder":149},[137,1757,1758],{"class":139,"line":153},[137,1759,1760],{},"class ProductSpider(scrapy.Spider):\n",[137,1762,1763],{"class":139,"line":159},[137,1764,1765],{},"    name = \"products\"\n",[137,1767,1768],{"class":139,"line":164},[137,1769,1770],{},"    start_urls = [\"https:\u002F\u002Fexample.com\u002Fcategory\u002Fpage\u002F1\"]\n",[137,1772,1773],{"class":139,"line":170},[137,1774,150],{"emptyLinePlaceholder":149},[137,1776,1777],{"class":139,"line":175},[137,1778,1779],{},"    def parse(self, response):\n",[137,1781,1782],{"class":139,"line":181},[137,1783,1784],{},"        for product in response.css(\"div.product\"):\n",[137,1786,1787],{"class":139,"line":187},[137,1788,1789],{},"            yield {\n",[137,1791,1792],{"class":139,"line":193},[137,1793,1794],{},"                \"name\": product.css(\"h2.title::text\").get(),\n",[137,1796,1797],{"class":139,"line":199},[137,1798,1799],{},"                \"price\": product.css(\"span.price::text\").get(),\n",[137,1801,1802],{"class":139,"line":205},[137,1803,1804],{},"                \"url\": product.css(\"a::attr(href)\").get(),\n",[137,1806,1807],{"class":139,"line":288},[137,1808,1809],{},"            }\n",[137,1811,1812],{"class":139,"line":294},[137,1813,150],{"emptyLinePlaceholder":149},[137,1815,1816],{"class":139,"line":300},[137,1817,1818],{},"        # Follow pagination\n",[137,1820,1821],{"class":139,"line":306},[137,1822,1823],{},"        next_page = response.css(\"a.next::attr(href)\").get()\n",[137,1825,1826],{"class":139,"line":312},[137,1827,1828],{},"        if next_page:\n",[137,1830,1831],{"class":139,"line":318},[137,1832,1833],{},"            yield response.follow(next_page, callback=self.parse)\n",[15,1835,1836],{},"Scrapy queues every yielded request and schedules it with the concurrency settings you choose, so following thousands of pagination links needs no extra code.",[24,1838,1840],{"id":1839},"item-pipelines-for-clean-data","Item pipelines for clean data",[15,1842,1843],{},"Raw scraped fields are messy. Prices have currency symbols, whitespace creeps in, and duplicates appear. Pipelines process each item before it is stored.",[128,1845,1847],{"className":130,"code":1846,"language":132,"meta":133,"style":133},"class CleanPricePipeline:\n    def process_item(self, item, spider):\n        if item.get(\"price\"):\n            item[\"price\"] = (\n                item[\"price\"].replace(\"$\", \"\").replace(\",\", \"\").strip()\n            )\n        return item\n\nclass DropEmptyPipeline:\n    def process_item(self, item, spider):\n        if not item.get(\"name\"):\n            raise scrapy.exceptions.DropItem(\"Missing name\")\n        return item\n",[19,1848,1849,1854,1859,1864,1869,1874,1878,1883,1887,1892,1896,1901,1906],{"__ignoreMap":133},[137,1850,1851],{"class":139,"line":140},[137,1852,1853],{},"class CleanPricePipeline:\n",[137,1855,1856],{"class":139,"line":146},[137,1857,1858],{},"    def process_item(self, item, spider):\n",[137,1860,1861],{"class":139,"line":153},[137,1862,1863],{},"        if item.get(\"price\"):\n",[137,1865,1866],{"class":139,"line":159},[137,1867,1868],{},"            item[\"price\"] = (\n",[137,1870,1871],{"class":139,"line":164},[137,1872,1873],{},"                item[\"price\"].replace(\"$\", \"\").replace(\",\", \"\").strip()\n",[137,1875,1876],{"class":139,"line":170},[137,1877,333],{},[137,1879,1880],{"class":139,"line":175},[137,1881,1882],{},"        return item\n",[137,1884,1885],{"class":139,"line":181},[137,1886,150],{"emptyLinePlaceholder":149},[137,1888,1889],{"class":139,"line":187},[137,1890,1891],{},"class DropEmptyPipeline:\n",[137,1893,1894],{"class":139,"line":193},[137,1895,1858],{},[137,1897,1898],{"class":139,"line":199},[137,1899,1900],{},"        if not item.get(\"name\"):\n",[137,1902,1903],{"class":139,"line":205},[137,1904,1905],{},"            raise scrapy.exceptions.DropItem(\"Missing name\")\n",[137,1907,1908],{"class":139,"line":288},[137,1909,1882],{},[15,1911,1912,1913,1916],{},"Register them in ",[19,1914,1915],{},"settings.py"," with a priority number that sets the order:",[128,1918,1920],{"className":130,"code":1919,"language":132,"meta":133,"style":133},"ITEM_PIPELINES = {\n    \"myproject.pipelines.CleanPricePipeline\": 100,\n    \"myproject.pipelines.DropEmptyPipeline\": 200,\n}\n",[19,1921,1922,1927,1932,1937],{"__ignoreMap":133},[137,1923,1924],{"class":139,"line":140},[137,1925,1926],{},"ITEM_PIPELINES = {\n",[137,1928,1929],{"class":139,"line":146},[137,1930,1931],{},"    \"myproject.pipelines.CleanPricePipeline\": 100,\n",[137,1933,1934],{"class":139,"line":153},[137,1935,1936],{},"    \"myproject.pipelines.DropEmptyPipeline\": 200,\n",[137,1938,1939],{"class":139,"line":159},[137,1940,1941],{},"}\n",[24,1943,1945],{"id":1944},"tuning-concurrency-without-getting-banned","Tuning concurrency without getting banned",[15,1947,1948],{},"The default settings are conservative. For a large job you want more throughput, but pushing too hard gets you blocked. The key settings:",[128,1950,1952],{"className":130,"code":1951,"language":132,"meta":133,"style":133},"# settings.py\nCONCURRENT_REQUESTS = 16\nCONCURRENT_REQUESTS_PER_DOMAIN = 8\nDOWNLOAD_DELAY = 0.5\nAUTOTHROTTLE_ENABLED = True\nAUTOTHROTTLE_TARGET_CONCURRENCY = 4.0\nRETRY_TIMES = 3\n",[19,1953,1954,1959,1964,1969,1974,1979,1984],{"__ignoreMap":133},[137,1955,1956],{"class":139,"line":140},[137,1957,1958],{},"# settings.py\n",[137,1960,1961],{"class":139,"line":146},[137,1962,1963],{},"CONCURRENT_REQUESTS = 16\n",[137,1965,1966],{"class":139,"line":153},[137,1967,1968],{},"CONCURRENT_REQUESTS_PER_DOMAIN = 8\n",[137,1970,1971],{"class":139,"line":159},[137,1972,1973],{},"DOWNLOAD_DELAY = 0.5\n",[137,1975,1976],{"class":139,"line":164},[137,1977,1978],{},"AUTOTHROTTLE_ENABLED = True\n",[137,1980,1981],{"class":139,"line":170},[137,1982,1983],{},"AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0\n",[137,1985,1986],{"class":139,"line":175},[137,1987,1988],{},"RETRY_TIMES = 3\n",[15,1990,1991],{},"AutoThrottle is the part most people miss. It watches response latency and slows down automatically when the server is under load, which keeps you below the rate that triggers bans. Start gentle and increase concurrency only while the block rate stays at zero.",[24,1993,1995],{"id":1994},"adding-proxies-with-middleware","Adding proxies with middleware",[15,1997,1998],{},"For protected sites you need rotating proxies. Scrapy applies them through downloader middleware so every request goes through the pool.",[128,2000,2002],{"className":130,"code":2001,"language":132,"meta":133,"style":133},"import random\n\nclass ProxyMiddleware:\n    PROXIES = [\n        \"http:\u002F\u002Fuser:pass@p1.provider.com:8000\",\n        \"http:\u002F\u002Fuser:pass@p2.provider.com:8000\",\n    ]\n\n    def process_request(self, request, spider):\n        request.meta[\"proxy\"] = random.choice(self.PROXIES)\n",[19,2003,2004,2008,2012,2017,2022,2027,2032,2036,2040,2045],{"__ignoreMap":133},[137,2005,2006],{"class":139,"line":140},[137,2007,233],{},[137,2009,2010],{"class":139,"line":146},[137,2011,150],{"emptyLinePlaceholder":149},[137,2013,2014],{"class":139,"line":153},[137,2015,2016],{},"class ProxyMiddleware:\n",[137,2018,2019],{"class":139,"line":159},[137,2020,2021],{},"    PROXIES = [\n",[137,2023,2024],{"class":139,"line":164},[137,2025,2026],{},"        \"http:\u002F\u002Fuser:pass@p1.provider.com:8000\",\n",[137,2028,2029],{"class":139,"line":170},[137,2030,2031],{},"        \"http:\u002F\u002Fuser:pass@p2.provider.com:8000\",\n",[137,2033,2034],{"class":139,"line":175},[137,2035,1075],{},[137,2037,2038],{"class":139,"line":181},[137,2039,150],{"emptyLinePlaceholder":149},[137,2041,2042],{"class":139,"line":187},[137,2043,2044],{},"    def process_request(self, request, spider):\n",[137,2046,2047],{"class":139,"line":193},[137,2048,2049],{},"        request.meta[\"proxy\"] = random.choice(self.PROXIES)\n",[15,2051,2052,2053,2055],{},"For the rotation, retry, and geolocation details that make this reliable, see my guide on ",[639,2054,1000],{"href":671},".",[24,2057,2059],{"id":2058},"exporting-to-a-database","Exporting to a database",[15,2061,2062],{},"For a real pipeline you want the data in a database, not a CSV. A storage pipeline writes each item as it is scraped.",[128,2064,2066],{"className":130,"code":2065,"language":132,"meta":133,"style":133},"import psycopg2\n\nclass PostgresPipeline:\n    def open_spider(self, spider):\n        self.conn = psycopg2.connect(\"dbname=scrape user=postgres\")\n        self.cur = self.conn.cursor()\n\n    def process_item(self, item, spider):\n        self.cur.execute(\n            \"INSERT INTO products (name, price, url) VALUES (%s, %s, %s) \"\n            \"ON CONFLICT (url) DO UPDATE SET price = EXCLUDED.price\",\n            (item[\"name\"], item[\"price\"], item[\"url\"]),\n        )\n        self.conn.commit()\n        return item\n\n    def close_spider(self, spider):\n        self.cur.close()\n        self.conn.close()\n",[19,2067,2068,2073,2077,2082,2087,2092,2097,2101,2105,2110,2115,2120,2125,2129,2134,2138,2142,2147,2152],{"__ignoreMap":133},[137,2069,2070],{"class":139,"line":140},[137,2071,2072],{},"import psycopg2\n",[137,2074,2075],{"class":139,"line":146},[137,2076,150],{"emptyLinePlaceholder":149},[137,2078,2079],{"class":139,"line":153},[137,2080,2081],{},"class PostgresPipeline:\n",[137,2083,2084],{"class":139,"line":159},[137,2085,2086],{},"    def open_spider(self, spider):\n",[137,2088,2089],{"class":139,"line":164},[137,2090,2091],{},"        self.conn = psycopg2.connect(\"dbname=scrape user=postgres\")\n",[137,2093,2094],{"class":139,"line":170},[137,2095,2096],{},"        self.cur = self.conn.cursor()\n",[137,2098,2099],{"class":139,"line":175},[137,2100,150],{"emptyLinePlaceholder":149},[137,2102,2103],{"class":139,"line":181},[137,2104,1858],{},[137,2106,2107],{"class":139,"line":187},[137,2108,2109],{},"        self.cur.execute(\n",[137,2111,2112],{"class":139,"line":193},[137,2113,2114],{},"            \"INSERT INTO products (name, price, url) VALUES (%s, %s, %s) \"\n",[137,2116,2117],{"class":139,"line":199},[137,2118,2119],{},"            \"ON CONFLICT (url) DO UPDATE SET price = EXCLUDED.price\",\n",[137,2121,2122],{"class":139,"line":205},[137,2123,2124],{},"            (item[\"name\"], item[\"price\"], item[\"url\"]),\n",[137,2126,2127],{"class":139,"line":288},[137,2128,449],{},[137,2130,2131],{"class":139,"line":294},[137,2132,2133],{},"        self.conn.commit()\n",[137,2135,2136],{"class":139,"line":300},[137,2137,1882],{},[137,2139,2140],{"class":139,"line":306},[137,2141,150],{"emptyLinePlaceholder":149},[137,2143,2144],{"class":139,"line":312},[137,2145,2146],{},"    def close_spider(self, spider):\n",[137,2148,2149],{"class":139,"line":318},[137,2150,2151],{},"        self.cur.close()\n",[137,2153,2154],{"class":139,"line":324},[137,2155,2156],{},"        self.conn.close()\n",[15,2158,2159,2160,2163],{},"The ",[19,2161,2162],{},"ON CONFLICT"," clause makes re-runs idempotent, so scraping the same page twice updates the price instead of creating a duplicate row.",[24,2165,2167],{"id":2166},"handling-javascript-heavy-pages","Handling JavaScript heavy pages",[15,2169,2170,2171,2174],{},"Scrapy fetches raw HTML and does not run JavaScript. For pages that render content client side, pair Scrapy with a browser using ",[19,2172,2173],{},"scrapy-playwright",", which lets a spider request a fully rendered page only when needed while keeping the fast path for static pages.",[24,2176,2178],{"id":2177},"when-scrapy-is-the-right-call","When Scrapy is the right call",[15,2180,2181],{},"Reach for Scrapy when the job is recurring, spans many pages, and needs reliability: price monitoring, catalog extraction, or any pipeline that runs on a schedule. For a quick one time grab of a single page, a small script is simpler. Match the tool to the job.",[24,2183,2185],{"id":2184},"need-a-production-scraping-pipeline-built","Need a production scraping pipeline built?",[15,2187,2188,2189,1651,2192,1108],{},"I build Scrapy based pipelines with proxy rotation, retry logic, and database export that run on a schedule and stay reliable at scale. If you have a recurring scraping need, ",[639,2190,644],{"href":641,"rel":2191},[643],[639,2193,649],{"href":648},[652,2195,654],{},{"title":133,"searchDepth":146,"depth":146,"links":2197},[2198,2199,2200,2201,2202,2203,2204,2205,2206],{"id":1701,"depth":146,"text":1702},{"id":1737,"depth":146,"text":1738},{"id":1839,"depth":146,"text":1840},{"id":1944,"depth":146,"text":1945},{"id":1994,"depth":146,"text":1995},{"id":2058,"depth":146,"text":2059},{"id":2166,"depth":146,"text":2167},{"id":2177,"depth":146,"text":2178},{"id":2184,"depth":146,"text":2185},"2026-06-05","How to use Scrapy for production scraping at scale. Covers spiders, item pipelines, concurrency tuning, proxy and retry middleware, and exporting to databases.",{},"\u002Fblog\u002Fscrapy-large-scale-scraping","9 min read",{"title":1683,"description":2208},"blog\u002Fscrapy-large-scale-scraping",[2215,676,132,2216,1674],"scrapy","data pipeline",[2218,2219,2220,2221],"Use Scrapy when the job is recurring and spans many pages, not for a one-off scrape.","Item pipelines clean, validate, and store data in stages.","AutoThrottle adapts the request rate to avoid bans; raise concurrency only while block rate stays at zero.","Use ON CONFLICT upserts so re-runs update existing rows instead of duplicating.","3Rf6J8LSUfGb3ScqiCyjd2tMEH31AVwQjz9pvRjjTaY",{"id":2224,"title":2225,"body":2226,"date":2628,"description":2629,"draft":668,"extension":669,"meta":2630,"navigation":149,"path":2631,"readingTime":1125,"seo":2632,"stem":2633,"tags":2634,"takeaways":2638,"updated":685,"__hash__":2643},"blog\u002Fblog\u002Fplaywright-vs-puppeteer-vs-selenium.md","Playwright vs Puppeteer vs Selenium for Web Scraping in 2026",{"type":8,"value":2227,"toc":2618},[2228,2231,2234,2238,2245,2343,2347,2354,2408,2411,2415,2418,2474,2481,2485,2488,2491,2530,2534,2540,2569,2575,2579,2582,2596,2600,2603,2607,2616],[11,2229,2225],{"id":2230},"playwright-vs-puppeteer-vs-selenium-for-web-scraping-in-2026",[15,2232,2233],{},"When a site renders content with JavaScript, a plain HTTP request returns an empty shell. You need a real browser, and that means one of three tools: Playwright, Puppeteer, or Selenium. They all drive a browser, but they are not interchangeable. This guide covers the practical differences that decide which one fits your scraping project.",[24,2235,2237],{"id":2236},"quick-verdict","Quick verdict",[15,2239,2240,2241,2244],{},"If you are starting fresh in 2026, ",[69,2242,2243],{},"use Playwright",". It is the most capable, has the cleanest API, and supports the most languages. The other two still have valid niches, which the rest of this guide covers.",[39,2246,2247,2262],{},[42,2248,2249],{},[45,2250,2251,2253,2256,2259],{},[48,2252,1189],{},[48,2254,2255],{},"Playwright",[48,2257,2258],{},"Puppeteer",[48,2260,2261],{},"Selenium",[61,2263,2264,2278,2292,2304,2317,2330],{},[45,2265,2266,2269,2272,2275],{},[66,2267,2268],{},"Languages",[66,2270,2271],{},"Python, JS, Java, .NET",[66,2273,2274],{},"JavaScript only",[66,2276,2277],{},"Almost every language",[45,2279,2280,2283,2286,2289],{},[66,2281,2282],{},"Browsers",[66,2284,2285],{},"Chromium, Firefox, WebKit",[66,2287,2288],{},"Chromium, Firefox",[66,2290,2291],{},"All major browsers",[45,2293,2294,2296,2299,2301],{},[66,2295,1202],{},[66,2297,2298],{},"Fast",[66,2300,2298],{},[66,2302,2303],{},"Slower",[45,2305,2306,2309,2312,2315],{},[66,2307,2308],{},"Auto waiting",[66,2310,2311],{},"Built in",[66,2313,2314],{},"Manual",[66,2316,2314],{},[45,2318,2319,2322,2325,2328],{},[66,2320,2321],{},"Stealth ecosystem",[66,2323,2324],{},"Growing",[66,2326,2327],{},"Mature",[66,2329,2327],{},[45,2331,2332,2334,2337,2340],{},[66,2333,59],{},[66,2335,2336],{},"New projects, cross browser",[66,2338,2339],{},"Node only Chrome work",[66,2341,2342],{},"Legacy, broad browser support",[24,2344,2346],{"id":2345},"playwright-the-default-choice","Playwright: the default choice",[15,2348,2349,2350,2353],{},"Playwright is the newest of the three and learned from the mistakes of the others. Its biggest practical advantage is ",[69,2351,2352],{},"auto waiting",": it waits for elements to be ready before acting, which eliminates most of the flaky timing bugs that plague Selenium scripts.",[128,2355,2357],{"className":130,"code":2356,"language":132,"meta":133,"style":133},"from playwright.async_api import async_playwright\n\nasync def scrape(url: str):\n    async with async_playwright() as p:\n        browser = await p.chromium.launch(headless=True)\n        page = await browser.new_page()\n        await page.goto(url)\n        # No manual wait needed, Playwright waits for the selector\n        title = await page.text_content(\"h1\")\n        await browser.close()\n        return title\n",[19,2358,2359,2363,2367,2371,2375,2380,2384,2389,2394,2399,2403],{"__ignoreMap":133},[137,2360,2361],{"class":139,"line":140},[137,2362,400],{},[137,2364,2365],{"class":139,"line":146},[137,2366,150],{"emptyLinePlaceholder":149},[137,2368,2369],{"class":139,"line":153},[137,2370,857],{},[137,2372,2373],{"class":139,"line":159},[137,2374,414],{},[137,2376,2377],{"class":139,"line":164},[137,2378,2379],{},"        browser = await p.chromium.launch(headless=True)\n",[137,2381,2382],{"class":139,"line":170},[137,2383,454],{},[137,2385,2386],{"class":139,"line":175},[137,2387,2388],{},"        await page.goto(url)\n",[137,2390,2391],{"class":139,"line":181},[137,2392,2393],{},"        # No manual wait needed, Playwright waits for the selector\n",[137,2395,2396],{"class":139,"line":187},[137,2397,2398],{},"        title = await page.text_content(\"h1\")\n",[137,2400,2401],{"class":139,"line":193},[137,2402,469],{},[137,2404,2405],{"class":139,"line":199},[137,2406,2407],{},"        return title\n",[15,2409,2410],{},"It also drives Chromium, Firefox, and WebKit with the same code, which matters when a site behaves differently across engines. For scraping, the WebKit support is useful for matching Safari behavior on sites that fingerprint the browser.",[24,2412,2414],{"id":2413},"puppeteer-great-if-you-live-in-node","Puppeteer: great if you live in Node",[15,2416,2417],{},"Puppeteer is Chrome focused and JavaScript only. If your stack is already Node and you only need Chromium, it is lean and well documented. The API is close to Playwright because the same team originally built both.",[128,2419,2423],{"className":2420,"code":2421,"language":2422,"meta":133,"style":133},"language-javascript shiki shiki-themes github-light github-dark","const puppeteer = require(\"puppeteer\");\n\n(async () => {\n  const browser = await puppeteer.launch({ headless: true });\n  const page = await browser.newPage();\n  await page.goto(\"https:\u002F\u002Fexample.com\");\n  const title = await page.$eval(\"h1\", el => el.textContent);\n  await browser.close();\n  console.log(title);\n})();\n","javascript",[19,2424,2425,2430,2434,2439,2444,2449,2454,2459,2464,2469],{"__ignoreMap":133},[137,2426,2427],{"class":139,"line":140},[137,2428,2429],{},"const puppeteer = require(\"puppeteer\");\n",[137,2431,2432],{"class":139,"line":146},[137,2433,150],{"emptyLinePlaceholder":149},[137,2435,2436],{"class":139,"line":153},[137,2437,2438],{},"(async () => {\n",[137,2440,2441],{"class":139,"line":159},[137,2442,2443],{},"  const browser = await puppeteer.launch({ headless: true });\n",[137,2445,2446],{"class":139,"line":164},[137,2447,2448],{},"  const page = await browser.newPage();\n",[137,2450,2451],{"class":139,"line":170},[137,2452,2453],{},"  await page.goto(\"https:\u002F\u002Fexample.com\");\n",[137,2455,2456],{"class":139,"line":175},[137,2457,2458],{},"  const title = await page.$eval(\"h1\", el => el.textContent);\n",[137,2460,2461],{"class":139,"line":181},[137,2462,2463],{},"  await browser.close();\n",[137,2465,2466],{"class":139,"line":187},[137,2467,2468],{},"  console.log(title);\n",[137,2470,2471],{"class":139,"line":193},[137,2472,2473],{},"})();\n",[15,2475,2476,2477,2480],{},"For stealth scraping, ",[19,2478,2479],{},"puppeteer-extra"," with the stealth plugin is a mature, battle tested option that hides many automation signals out of the box. This ecosystem maturity is Puppeteer's main edge over Playwright today, though the gap is closing.",[24,2482,2484],{"id":2483},"selenium-when-you-need-broad-browser-support","Selenium: when you need broad browser support",[15,2486,2487],{},"Selenium is the oldest and has the widest reach. It supports almost every programming language and every real browser, including older versions. If you must automate Internet Explorer mode, a specific Safari build, or you have an existing team skilled in Selenium, it remains a sensible choice.",[15,2489,2490],{},"The downsides for scraping are real: it is slower, has no built in auto waiting, and the WebDriver protocol adds overhead. For a new scraping project these costs usually outweigh the benefits.",[128,2492,2494],{"className":130,"code":2493,"language":132,"meta":133,"style":133},"from selenium import webdriver\nfrom selenium.webdriver.common.by import By\n\ndriver = webdriver.Chrome()\ndriver.get(\"https:\u002F\u002Fexample.com\")\ntitle = driver.find_element(By.TAG_NAME, \"h1\").text\ndriver.quit()\n",[19,2495,2496,2501,2506,2510,2515,2520,2525],{"__ignoreMap":133},[137,2497,2498],{"class":139,"line":140},[137,2499,2500],{},"from selenium import webdriver\n",[137,2502,2503],{"class":139,"line":146},[137,2504,2505],{},"from selenium.webdriver.common.by import By\n",[137,2507,2508],{"class":139,"line":153},[137,2509,150],{"emptyLinePlaceholder":149},[137,2511,2512],{"class":139,"line":159},[137,2513,2514],{},"driver = webdriver.Chrome()\n",[137,2516,2517],{"class":139,"line":164},[137,2518,2519],{},"driver.get(\"https:\u002F\u002Fexample.com\")\n",[137,2521,2522],{"class":139,"line":170},[137,2523,2524],{},"title = driver.find_element(By.TAG_NAME, \"h1\").text\n",[137,2526,2527],{"class":139,"line":175},[137,2528,2529],{},"driver.quit()\n",[24,2531,2533],{"id":2532},"the-stealth-question","The stealth question",[15,2535,2536,2537,2539],{},"For scraping protected sites, all three need patching to hide automation. The raw browser leaks ",[19,2538,737],{},", a headless user agent, and other tells. Each tool has a stealth path:",[597,2541,2542,2551,2560],{},[497,2543,2544,2547,2548,2550],{},[69,2545,2546],{},"Playwright:"," ",[19,2549,978],{},", or purpose built forks like Camoufox and nodriver.",[497,2552,2553,2547,2556,2559],{},[69,2554,2555],{},"Puppeteer:",[19,2557,2558],{},"puppeteer-extra-plugin-stealth",", the most mature option.",[497,2561,2562,2547,2565,2568],{},[69,2563,2564],{},"Selenium:",[19,2566,2567],{},"undetected-chromedriver",", widely used and effective.",[15,2570,2571,2572,2574],{},"No stealth plugin is permanent. Anti-bot vendors update their detection, and the plugins follow. This is why production scrapers need maintenance, not a one time setup. See my guide on ",[639,2573,1614],{"href":1124}," for the full stealth stack.",[24,2576,2578],{"id":2577},"performance-at-scale","Performance at scale",[15,2580,2581],{},"For large jobs, the bottleneck is rarely the tool and almost always the browser memory footprint. A real browser uses far more RAM than an HTTP request, so you cannot run thousands in parallel on one machine. The practical pattern:",[597,2583,2584,2587,2590,2593],{},[497,2585,2586],{},"Use a browser only for pages that truly need JavaScript.",[497,2588,2589],{},"Fall back to plain HTTP requests for static pages.",[497,2591,2592],{},"Reuse browser contexts instead of launching a new browser per page.",[497,2594,2595],{},"Run a pool of browsers across workers, not one giant instance.",[24,2597,2599],{"id":2598},"which-one-should-you-use","Which one should you use",[15,2601,2602],{},"For a new scraping project in 2026, Playwright is the right default for its API, cross browser support, and language options. Choose Puppeteer if you are committed to Node and Chrome only and want the mature stealth ecosystem. Choose Selenium only when you need a browser or language the others do not support.",[24,2604,2606],{"id":2605},"need-browser-automation-built-for-your-scraping-project","Need browser automation built for your scraping project?",[15,2608,2609,2610,1651,2613,2615],{},"I build scrapers on Playwright, Puppeteer, and Selenium with the stealth and proxy infrastructure to run reliably on protected sites. If you have a project that needs real browser automation, ",[639,2611,644],{"href":641,"rel":2612},[643],[639,2614,649],{"href":648},". I reply within 24 hours.",[652,2617,654],{},{"title":133,"searchDepth":146,"depth":146,"links":2619},[2620,2621,2622,2623,2624,2625,2626,2627],{"id":2236,"depth":146,"text":2237},{"id":2345,"depth":146,"text":2346},{"id":2413,"depth":146,"text":2414},{"id":2483,"depth":146,"text":2484},{"id":2532,"depth":146,"text":2533},{"id":2577,"depth":146,"text":2578},{"id":2598,"depth":146,"text":2599},{"id":2605,"depth":146,"text":2606},"2026-06-02","A practical comparison of the three main browser automation tools for scraping. Speed, stealth, language support, and which one to choose for your project.",{},"\u002Fblog\u002Fplaywright-vs-puppeteer-vs-selenium",{"title":2225,"description":2629},"blog\u002Fplaywright-vs-puppeteer-vs-selenium",[679,2635,2636,2637,676],"puppeteer","selenium","browser automation",[2639,2640,2641,2642],"For new scraping projects in 2026, Playwright is the best default choice.","Puppeteer fits Node-only Chrome work and has the most mature stealth ecosystem.","Selenium is for broad browser and language support or legacy needs.","All three need stealth patches and proxies to run on protected sites.","jvdEU_QeXUCy1D_3AfnA5R7pZQZSFJMTxWQyXbzD1Ik",{"id":2645,"title":2646,"body":2647,"date":2957,"description":2958,"draft":668,"extension":669,"meta":2959,"navigation":149,"path":2960,"readingTime":672,"seo":2961,"stem":2962,"tags":2963,"takeaways":2966,"updated":685,"__hash__":2971},"blog\u002Fblog\u002Fbypass-datadome-perimeterx.md","How to Scrape Sites Protected by DataDome and PerimeterX",{"type":8,"value":2648,"toc":2944},[2649,2652,2655,2659,2662,2688,2695,2699,2702,2705,2709,2712,2717,2723,2727,2733,2820,2823,2827,2830,2865,2868,2872,2875,2879,2885,2920,2923,2927,2930,2934,2942],[11,2650,2646],{"id":2651},"how-to-scrape-sites-protected-by-datadome-and-perimeterx",[15,2653,2654],{},"DataDome and PerimeterX (now part of HUMAN) are among the toughest bot protection systems on the web. They go beyond the IP and header checks of a basic firewall and build a behavioral profile of every visitor. If your scraper passes Cloudflare but dies on these, this guide explains why and what actually works.",[24,2656,2658],{"id":2657},"what-makes-them-harder-than-a-basic-waf","What makes them harder than a basic WAF",[15,2660,2661],{},"A simple firewall checks your IP reputation and a few headers. DataDome and PerimeterX collect far more signals and score them together with machine learning:",[597,2663,2664,2670,2676,2682],{},[497,2665,2666,2669],{},[69,2667,2668],{},"Deep browser fingerprinting."," Canvas, WebGL, audio context, installed fonts, screen metrics, and dozens of JavaScript properties.",[497,2671,2672,2675],{},[69,2673,2674],{},"Behavioral biometrics."," Mouse movement curves, scroll velocity, keystroke timing, and how naturally you navigate.",[497,2677,2678,2681],{},[69,2679,2680],{},"Device consistency."," Whether your user agent, fingerprint, and TLS signature all agree with each other.",[497,2683,2684,2687],{},[69,2685,2686],{},"Session reputation."," A score that builds over time, so a session that suddenly acts like a bot gets flagged even if it started clean.",[15,2689,2690,2691,2694],{},"The key insight: these systems look for ",[69,2692,2693],{},"consistency and humanity",", not just a clean IP. A perfect residential IP attached to an obvious headless browser fails immediately.",[24,2696,2698],{"id":2697},"why-most-scrapers-fail-here","Why most scrapers fail here",[15,2700,2701],{},"The common failure is fixing one layer and ignoring the rest. People add residential proxies and still get blocked because the browser fingerprint screams automation. Or they patch the fingerprint but run from a flagged datacenter IP. DataDome and PerimeterX correlate signals, so any single inconsistency is enough.",[15,2703,2704],{},"The second common failure is behavior. Even a flawless fingerprint and IP get caught if the session loads ten pages per second in a perfectly even rhythm no human could produce.",[24,2706,2708],{"id":2707},"the-layered-approach-that-works","The layered approach that works",[15,2710,2711],{},"Getting through requires all of these together, not any one alone.",[2713,2714,2716],"h3",{"id":2715},"_1-residential-or-mobile-proxies","1. Residential or mobile proxies",[15,2718,2719,2720,1001],{},"Datacenter IPs start with a trust deficit you cannot overcome here. Use residential or, for the hardest targets, mobile proxies, and match the proxy country to the site's audience. See my guide on ",[639,2721,2722],{"href":671},"rotating proxies",[2713,2724,2726],{"id":2725},"_2-a-genuinely-patched-browser-fingerprint","2. A genuinely patched browser fingerprint",[15,2728,2729,2730,2732],{},"The browser must present a consistent, realistic fingerprint with no automation tells. This means a real user agent that matches the actual browser build, correct WebGL vendor strings, a populated plugins array, and ",[19,2731,737],{}," removed. Purpose built tools like Camoufox and nodriver handle much of this, but they need updates as detection evolves.",[128,2734,2736],{"className":130,"code":2735,"language":132,"meta":133,"style":133},"from playwright.async_api import async_playwright\n\nasync def stealth_context(p, proxy):\n    browser = await p.chromium.launch(\n        headless=True,\n        args=[\"--disable-blink-features=AutomationControlled\"],\n        proxy=proxy,\n    )\n    ctx = await browser.new_context(\n        user_agent=\"Mozilla\u002F5.0 (Macintosh; Intel Mac OS X 10_15_7) \"\n                   \"AppleWebKit\u002F537.36 (KHTML, like Gecko) \"\n                   \"Chrome\u002F131.0.0.0 Safari\u002F537.36\",\n        viewport={\"width\": 1440, \"height\": 900},\n        locale=\"en-US\",\n        timezone_id=\"America\u002FNew_York\",\n    )\n    return ctx\n",[19,2737,2738,2742,2746,2751,2756,2761,2766,2771,2776,2781,2786,2791,2796,2801,2806,2811,2815],{"__ignoreMap":133},[137,2739,2740],{"class":139,"line":140},[137,2741,400],{},[137,2743,2744],{"class":139,"line":146},[137,2745,150],{"emptyLinePlaceholder":149},[137,2747,2748],{"class":139,"line":153},[137,2749,2750],{},"async def stealth_context(p, proxy):\n",[137,2752,2753],{"class":139,"line":159},[137,2754,2755],{},"    browser = await p.chromium.launch(\n",[137,2757,2758],{"class":139,"line":164},[137,2759,2760],{},"        headless=True,\n",[137,2762,2763],{"class":139,"line":170},[137,2764,2765],{},"        args=[\"--disable-blink-features=AutomationControlled\"],\n",[137,2767,2768],{"class":139,"line":175},[137,2769,2770],{},"        proxy=proxy,\n",[137,2772,2773],{"class":139,"line":181},[137,2774,2775],{},"    )\n",[137,2777,2778],{"class":139,"line":187},[137,2779,2780],{},"    ctx = await browser.new_context(\n",[137,2782,2783],{"class":139,"line":193},[137,2784,2785],{},"        user_agent=\"Mozilla\u002F5.0 (Macintosh; Intel Mac OS X 10_15_7) \"\n",[137,2787,2788],{"class":139,"line":199},[137,2789,2790],{},"                   \"AppleWebKit\u002F537.36 (KHTML, like Gecko) \"\n",[137,2792,2793],{"class":139,"line":205},[137,2794,2795],{},"                   \"Chrome\u002F131.0.0.0 Safari\u002F537.36\",\n",[137,2797,2798],{"class":139,"line":288},[137,2799,2800],{},"        viewport={\"width\": 1440, \"height\": 900},\n",[137,2802,2803],{"class":139,"line":294},[137,2804,2805],{},"        locale=\"en-US\",\n",[137,2807,2808],{"class":139,"line":300},[137,2809,2810],{},"        timezone_id=\"America\u002FNew_York\",\n",[137,2812,2813],{"class":139,"line":306},[137,2814,2775],{},[137,2816,2817],{"class":139,"line":312},[137,2818,2819],{},"    return ctx\n",[15,2821,2822],{},"Note the timezone and locale. DataDome checks whether your timezone matches your IP geolocation, so a US proxy with a European timezone is a red flag.",[2713,2824,2826],{"id":2825},"_3-human-like-behavior","3. Human like behavior",[15,2828,2829],{},"Add realistic interaction before extracting data. Move the mouse, scroll gradually, and vary your timing.",[128,2831,2833],{"className":130,"code":2832,"language":132,"meta":133,"style":133},"async def human_warmup(page):\n    await page.mouse.move(200, 300)\n    await page.wait_for_timeout(800)\n    await page.mouse.wheel(0, 600)\n    await page.wait_for_timeout(1200)\n    await page.mouse.move(500, 450)\n",[19,2834,2835,2840,2845,2850,2855,2860],{"__ignoreMap":133},[137,2836,2837],{"class":139,"line":140},[137,2838,2839],{},"async def human_warmup(page):\n",[137,2841,2842],{"class":139,"line":146},[137,2843,2844],{},"    await page.mouse.move(200, 300)\n",[137,2846,2847],{"class":139,"line":153},[137,2848,2849],{},"    await page.wait_for_timeout(800)\n",[137,2851,2852],{"class":139,"line":159},[137,2853,2854],{},"    await page.mouse.wheel(0, 600)\n",[137,2856,2857],{"class":139,"line":164},[137,2858,2859],{},"    await page.wait_for_timeout(1200)\n",[137,2861,2862],{"class":139,"line":170},[137,2863,2864],{},"    await page.mouse.move(500, 450)\n",[15,2866,2867],{},"This is not optional on PerimeterX, which weighs behavioral biometrics heavily. A session that never moves the mouse is an obvious bot.",[2713,2869,2871],{"id":2870},"_4-session-and-cookie-management","4. Session and cookie management",[15,2873,2874],{},"Both systems issue a cookie that carries your trust score. Once you earn a good score, reuse that session. Throwing away cookies and re-solving on every request both wastes effort and looks suspicious. Persist the session, rotate to a new one when the score degrades.",[24,2876,2878],{"id":2877},"detecting-when-you-are-blocked","Detecting when you are blocked",[15,2880,2881,2882,2884],{},"These systems often return a ",[19,2883,617],{}," with a block page or a challenge, not an obvious error. Always validate the body.",[128,2886,2888],{"className":130,"code":2887,"language":132,"meta":133,"style":133},"def is_blocked(html: str, status: int) -> bool:\n    if status in (403, 429):\n        return True\n    markers = [\"datadome\", \"px-captcha\", \"_px\", \"blocked by\"]\n    lowered = html.lower()\n    return any(m in lowered for m in markers)\n",[19,2889,2890,2895,2900,2905,2910,2915],{"__ignoreMap":133},[137,2891,2892],{"class":139,"line":140},[137,2893,2894],{},"def is_blocked(html: str, status: int) -> bool:\n",[137,2896,2897],{"class":139,"line":146},[137,2898,2899],{},"    if status in (403, 429):\n",[137,2901,2902],{"class":139,"line":153},[137,2903,2904],{},"        return True\n",[137,2906,2907],{"class":139,"line":159},[137,2908,2909],{},"    markers = [\"datadome\", \"px-captcha\", \"_px\", \"blocked by\"]\n",[137,2911,2912],{"class":139,"line":164},[137,2913,2914],{},"    lowered = html.lower()\n",[137,2916,2917],{"class":139,"line":170},[137,2918,2919],{},"    return any(m in lowered for m in markers)\n",[15,2921,2922],{},"When blocked, rotate the proxy and session together, back off, and retry. Hammering with the same flagged session escalates a soft block into a hard ban.",[24,2924,2926],{"id":2925},"a-realistic-expectation","A realistic expectation",[15,2928,2929],{},"DataDome and PerimeterX update their detection continuously. A setup that works this month may need adjustment next month. Scraping these sites reliably is an ongoing engineering effort with monitoring and maintenance, not a one time script. Anyone promising a permanent bypass is overselling.",[24,2931,2933],{"id":2932},"need-a-hard-target-scraped-reliably","Need a hard target scraped reliably?",[15,2935,2936,2937,1651,2940,1108],{},"I build and maintain scrapers that get through DataDome, PerimeterX, Cloudflare, and Akamai, with the stealth, proxy, and monitoring infrastructure to keep them running. If you have a tough target, ",[639,2938,644],{"href":641,"rel":2939},[643],[639,2941,649],{"href":648},[652,2943,654],{},{"title":133,"searchDepth":146,"depth":146,"links":2945},[2946,2947,2948,2954,2955,2956],{"id":2657,"depth":146,"text":2658},{"id":2697,"depth":146,"text":2698},{"id":2707,"depth":146,"text":2708,"children":2949},[2950,2951,2952,2953],{"id":2715,"depth":153,"text":2716},{"id":2725,"depth":153,"text":2726},{"id":2825,"depth":153,"text":2826},{"id":2870,"depth":153,"text":2871},{"id":2877,"depth":146,"text":2878},{"id":2925,"depth":146,"text":2926},{"id":2932,"depth":146,"text":2933},"2026-05-28","What DataDome and PerimeterX detect, why they are harder than basic WAFs, and the layered approach of stealth browsers, residential proxies, and session management that gets through.",{},"\u002Fblog\u002Fbypass-datadome-perimeterx",{"title":2646,"description":2958},"blog\u002Fbypass-datadome-perimeterx",[2964,2965,678,676,677],"datadome","perimeterx",[2967,2968,2969,2970],"DataDome and PerimeterX score consistency and humanity, not just IP reputation.","A clean IP with a headless fingerprint still fails; fix every layer together.","Match timezone and locale to the proxy's geolocation to stay consistent.","Add human-like mouse movement and scrolling, especially against PerimeterX.","K-G6DoY_R1THNq4eu7fjespGxALcbOItwJ97jfrHGPY",{"id":2973,"title":2974,"body":2975,"date":3331,"description":3332,"draft":668,"extension":669,"meta":3333,"navigation":149,"path":3334,"readingTime":672,"seo":3335,"stem":3336,"tags":3337,"takeaways":3341,"updated":685,"__hash__":3346},"blog\u002Fblog\u002Fscrape-amazon-product-data.md","How to Scrape Amazon Product Data Reliably",{"type":8,"value":2976,"toc":3320},[2977,2980,2983,2987,2990,3010,3013,3017,3020,3024,3027,3137,3140,3144,3147,3176,3183,3187,3193,3217,3220,3224,3227,3292,3296,3299,3303,3306,3310,3318],[11,2978,2974],{"id":2979},"how-to-scrape-amazon-product-data-reliably",[15,2981,2982],{},"Amazon is one of the most requested scraping targets and one of the most defended. Product data, pricing, and reviews drive competitive intelligence, repricing, and market research. This guide covers how to extract that data reliably, what breaks, and when to use the official channels instead.",[24,2984,2986],{"id":2985},"what-you-can-extract","What you can extract",[15,2988,2989],{},"A typical Amazon product scrape pulls:",[597,2991,2992,2995,2998,3001,3004,3007],{},[497,2993,2994],{},"Title, brand, and ASIN",[497,2996,2997],{},"Current price, list price, and any deal price",[497,2999,3000],{},"Star rating and review count",[497,3002,3003],{},"Availability and Buy Box seller",[497,3005,3006],{},"Images and bullet point features",[497,3008,3009],{},"Review text and ratings",[15,3011,3012],{},"Each of these lives in a predictable spot in the page, but Amazon changes its markup often and serves different layouts to different regions and visitors, which is the first thing that breaks naive scrapers.",[24,3014,3016],{"id":3015},"the-legal-and-policy-reality","The legal and policy reality",[15,3018,3019],{},"Scraping publicly visible product data is common, but Amazon's terms of service prohibit it, and Amazon actively defends against it. Be clear eyed: respect robots directives where it matters to you, do not scrape personal data, throttle your requests, and consider the official API for anything where compliance is a hard requirement. This guide is about the technical how, not a claim that it is permitted by Amazon.",[24,3021,3023],{"id":3022},"basic-extraction-with-selectors","Basic extraction with selectors",[15,3025,3026],{},"For a single region and layout, the selectors are straightforward. The challenge is that Amazon uses several layouts, so robust code tries multiple selectors per field.",[128,3028,3030],{"className":130,"code":3029,"language":132,"meta":133,"style":133},"from playwright.async_api import async_playwright\n\nasync def scrape_product(url: str):\n    async with async_playwright() as p:\n        browser = await p.chromium.launch(headless=True)\n        page = await browser.new_page()\n        await page.goto(url, wait_until=\"domcontentloaded\")\n\n        title = await page.text_content(\"#productTitle\")\n\n        # Price lives in different spots depending on layout\n        price = None\n        for sel in [\".a-price .a-offscreen\", \"#priceblock_ourprice\",\n                    \"#corePrice_feature_div .a-offscreen\"]:\n            el = await page.query_selector(sel)\n            if el:\n                price = (await el.text_content()).strip()\n                break\n\n        rating = await page.text_content(\"span[data-hook=rating-out-of-text]\")\n        await browser.close()\n        return {\"title\": title.strip() if title else None,\n                \"price\": price, \"rating\": rating}\n",[19,3031,3032,3036,3040,3045,3049,3053,3057,3061,3065,3070,3074,3079,3084,3089,3094,3099,3104,3109,3114,3118,3123,3127,3132],{"__ignoreMap":133},[137,3033,3034],{"class":139,"line":140},[137,3035,400],{},[137,3037,3038],{"class":139,"line":146},[137,3039,150],{"emptyLinePlaceholder":149},[137,3041,3042],{"class":139,"line":153},[137,3043,3044],{},"async def scrape_product(url: str):\n",[137,3046,3047],{"class":139,"line":159},[137,3048,414],{},[137,3050,3051],{"class":139,"line":164},[137,3052,2379],{},[137,3054,3055],{"class":139,"line":170},[137,3056,454],{},[137,3058,3059],{"class":139,"line":175},[137,3060,953],{},[137,3062,3063],{"class":139,"line":181},[137,3064,150],{"emptyLinePlaceholder":149},[137,3066,3067],{"class":139,"line":187},[137,3068,3069],{},"        title = await page.text_content(\"#productTitle\")\n",[137,3071,3072],{"class":139,"line":193},[137,3073,150],{"emptyLinePlaceholder":149},[137,3075,3076],{"class":139,"line":199},[137,3077,3078],{},"        # Price lives in different spots depending on layout\n",[137,3080,3081],{"class":139,"line":205},[137,3082,3083],{},"        price = None\n",[137,3085,3086],{"class":139,"line":288},[137,3087,3088],{},"        for sel in [\".a-price .a-offscreen\", \"#priceblock_ourprice\",\n",[137,3090,3091],{"class":139,"line":294},[137,3092,3093],{},"                    \"#corePrice_feature_div .a-offscreen\"]:\n",[137,3095,3096],{"class":139,"line":300},[137,3097,3098],{},"            el = await page.query_selector(sel)\n",[137,3100,3101],{"class":139,"line":306},[137,3102,3103],{},"            if el:\n",[137,3105,3106],{"class":139,"line":312},[137,3107,3108],{},"                price = (await el.text_content()).strip()\n",[137,3110,3111],{"class":139,"line":318},[137,3112,3113],{},"                break\n",[137,3115,3116],{"class":139,"line":324},[137,3117,150],{"emptyLinePlaceholder":149},[137,3119,3120],{"class":139,"line":330},[137,3121,3122],{},"        rating = await page.text_content(\"span[data-hook=rating-out-of-text]\")\n",[137,3124,3125],{"class":139,"line":336},[137,3126,469],{},[137,3128,3129],{"class":139,"line":342},[137,3130,3131],{},"        return {\"title\": title.strip() if title else None,\n",[137,3133,3134],{"class":139,"line":348},[137,3135,3136],{},"                \"price\": price, \"rating\": rating}\n",[15,3138,3139],{},"The multi selector fallback for price is the single most important reliability trick. Amazon shows at least four price layouts, and hard coding one guarantees breakage.",[24,3141,3143],{"id":3142},"handling-amazons-anti-bot-defenses","Handling Amazon's anti-bot defenses",[15,3145,3146],{},"Amazon serves CAPTCHAs and the \"Robot Check\" page when it suspects automation. To stay below that threshold:",[597,3148,3149,3158,3164,3170],{},[497,3150,3151,3154,3155,2055],{},[69,3152,3153],{},"Use residential proxies"," and rotate them. Datacenter IPs get the robot page fast. See my ",[639,3156,3157],{"href":671},"rotating proxies guide",[497,3159,3160,3163],{},[69,3161,3162],{},"Match the region."," Use a proxy in the same country as the Amazon domain you are scraping, or you get redirected and see wrong prices.",[497,3165,3166,3169],{},[69,3167,3168],{},"Slow down."," Amazon tolerates a steady, human like rate. Bursts trigger the check.",[497,3171,3172,3175],{},[69,3173,3174],{},"Persist sessions."," Reuse cookies once you have a clean session rather than starting fresh each request.",[15,3177,3178,3179,3182],{},"When the robot check does appear, you can solve it with a CAPTCHA service, covered in my guide on ",[639,3180,3181],{"href":1669},"solving CAPTCHAs",", but reducing how often it appears is cheaper than solving it.",[24,3184,3186],{"id":3185},"detecting-the-robot-check","Detecting the robot check",[15,3188,3189,3190,3192],{},"Like most protected sites, Amazon returns a ",[19,3191,617],{}," with the block page rather than an error. Validate the content.",[128,3194,3196],{"className":130,"code":3195,"language":132,"meta":133,"style":133},"def is_robot_check(html: str) -> bool:\n    markers = [\"Robot Check\", \"Enter the characters you see below\",\n               \"automated access\"]\n    return any(m in html for m in markers)\n",[19,3197,3198,3203,3208,3213],{"__ignoreMap":133},[137,3199,3200],{"class":139,"line":140},[137,3201,3202],{},"def is_robot_check(html: str) -> bool:\n",[137,3204,3205],{"class":139,"line":146},[137,3206,3207],{},"    markers = [\"Robot Check\", \"Enter the characters you see below\",\n",[137,3209,3210],{"class":139,"line":153},[137,3211,3212],{},"               \"automated access\"]\n",[137,3214,3215],{"class":139,"line":159},[137,3216,1080],{},[15,3218,3219],{},"If detected, rotate the proxy and session, back off, and retry.",[24,3221,3223],{"id":3222},"scraping-reviews-and-pagination","Scraping reviews and pagination",[15,3225,3226],{},"Reviews span many pages. Follow the next page link and respect a delay between requests so the review crawl does not spike your rate.",[128,3228,3230],{"className":130,"code":3229,"language":132,"meta":133,"style":133},"async def scrape_reviews(page, max_pages=10):\n    reviews = []\n    for _ in range(max_pages):\n        for r in await page.query_selector_all(\"div[data-hook=review]\"):\n            body = await r.query_selector(\"span[data-hook=review-body]\")\n            reviews.append((await body.text_content()).strip() if body else \"\")\n        nxt = await page.query_selector(\"li.a-last a\")\n        if not nxt:\n            break\n        await nxt.click()\n        await page.wait_for_timeout(2000)\n    return reviews\n",[19,3231,3232,3237,3242,3247,3252,3257,3262,3267,3272,3277,3282,3287],{"__ignoreMap":133},[137,3233,3234],{"class":139,"line":140},[137,3235,3236],{},"async def scrape_reviews(page, max_pages=10):\n",[137,3238,3239],{"class":139,"line":146},[137,3240,3241],{},"    reviews = []\n",[137,3243,3244],{"class":139,"line":153},[137,3245,3246],{},"    for _ in range(max_pages):\n",[137,3248,3249],{"class":139,"line":159},[137,3250,3251],{},"        for r in await page.query_selector_all(\"div[data-hook=review]\"):\n",[137,3253,3254],{"class":139,"line":164},[137,3255,3256],{},"            body = await r.query_selector(\"span[data-hook=review-body]\")\n",[137,3258,3259],{"class":139,"line":170},[137,3260,3261],{},"            reviews.append((await body.text_content()).strip() if body else \"\")\n",[137,3263,3264],{"class":139,"line":175},[137,3265,3266],{},"        nxt = await page.query_selector(\"li.a-last a\")\n",[137,3268,3269],{"class":139,"line":181},[137,3270,3271],{},"        if not nxt:\n",[137,3273,3274],{"class":139,"line":187},[137,3275,3276],{},"            break\n",[137,3278,3279],{"class":139,"line":193},[137,3280,3281],{},"        await nxt.click()\n",[137,3283,3284],{"class":139,"line":199},[137,3285,3286],{},"        await page.wait_for_timeout(2000)\n",[137,3288,3289],{"class":139,"line":205},[137,3290,3291],{},"    return reviews\n",[24,3293,3295],{"id":3294},"the-official-alternative-amazons-api","The official alternative: Amazon's API",[15,3297,3298],{},"If your use case allows it, Amazon's Product Advertising API and the Selling Partner API provide structured data without scraping. They have strict eligibility rules and rate limits, and they do not expose everything the website shows, but for compliant, stable access they are worth evaluating before building a scraper. For price monitoring at scale where the API does not fit, scraping remains the common path.",[24,3300,3302],{"id":3301},"keeping-it-reliable-over-time","Keeping it reliable over time",[15,3304,3305],{},"Amazon changes its markup and tightens its defenses regularly. A scraper that works today will break, so build for maintenance: monitor your success rate, alert when extraction returns nulls, and keep the selector fallbacks updated. The real deliverable is a pipeline that stays working, not a script that ran once.",[24,3307,3309],{"id":3308},"need-amazon-or-e-commerce-data-at-scale","Need Amazon or e-commerce data at scale?",[15,3311,3312,3313,1651,3316,1108],{},"I build e-commerce scrapers for price monitoring, catalog extraction, and competitor tracking, with the proxy and anti-bot infrastructure to run reliably. If you need product data at scale, ",[639,3314,644],{"href":641,"rel":3315},[643],[639,3317,649],{"href":648},[652,3319,654],{},{"title":133,"searchDepth":146,"depth":146,"links":3321},[3322,3323,3324,3325,3326,3327,3328,3329,3330],{"id":2985,"depth":146,"text":2986},{"id":3015,"depth":146,"text":3016},{"id":3022,"depth":146,"text":3023},{"id":3142,"depth":146,"text":3143},{"id":3185,"depth":146,"text":3186},{"id":3222,"depth":146,"text":3223},{"id":3294,"depth":146,"text":3295},{"id":3301,"depth":146,"text":3302},{"id":3308,"depth":146,"text":3309},"2026-05-24","A practical guide to scraping Amazon product listings, prices, and reviews at scale. Covers selectors, anti-bot handling, the official API alternative, and staying reliable.",{},"\u002Fblog\u002Fscrape-amazon-product-data",{"title":2974,"description":3332},"blog\u002Fscrape-amazon-product-data",[3338,3339,676,3340,132],"amazon","e-commerce","price monitoring",[3342,3343,3344,3345],"Amazon serves several price layouts, so use multiple fallback selectors per field.","Use residential proxies matched to the marketplace country to get correct prices.","The robot check returns a 200 with a block page, so validate the content.","Consider Amazon's official APIs where compliance is a hard requirement.","gmag89HmtLX-KNDs3LaFsWqF7-xNsBDmU6f3YfLQ-Ts",{"id":3348,"title":3349,"body":3350,"date":3645,"description":3646,"draft":668,"extension":669,"meta":3647,"navigation":149,"path":3648,"readingTime":672,"seo":3649,"stem":3650,"tags":3651,"takeaways":3653,"updated":685,"__hash__":3658},"blog\u002Fblog\u002Fresidential-proxy-services-compared.md","Residential Proxy Services Compared: Bright Data, Oxylabs, Smartproxy",{"type":8,"value":3351,"toc":3633},[3352,3355,3358,3362,3365,3370,3374,3462,3465,3468,3471,3474,3477,3480,3483,3486,3489,3493,3496,3500,3503,3523,3526,3530,3533,3586,3589,3593,3596,3616,3619,3623,3631],[11,3353,3349],{"id":3354},"residential-proxy-services-compared-bright-data-oxylabs-smartproxy",[15,3356,3357],{},"The proxy provider you choose decides whether a scraping job runs smoothly or burns money on banned IPs. Residential proxies are the standard for protected sites, but the providers differ a lot in price, pool quality, and features. This guide compares the main options and how to pick.",[24,3359,3361],{"id":3360},"why-residential-and-why-the-provider-matters","Why residential, and why the provider matters",[15,3363,3364],{},"Residential proxies route your requests through real consumer devices, so the target site sees a normal home IP rather than a datacenter range it can flag instantly. But not all residential pools are equal. A cheap, oversold pool is full of IPs already burned across thousands of sites. The provider's pool quality is often more important than the headline price.",[15,3366,3367,3368,2055],{},"For when to use residential versus datacenter or mobile, see my guide on ",[639,3369,2722],{"href":671},[24,3371,3373],{"id":3372},"the-main-providers-at-a-glance","The main providers at a glance",[39,3375,3376,3394],{},[42,3377,3378],{},[45,3379,3380,3383,3386,3389,3392],{},[48,3381,3382],{},"Provider",[48,3384,3385],{},"Pool size",[48,3387,3388],{},"Pricing",[48,3390,3391],{},"Strength",[48,3393,59],{},[61,3395,3396,3413,3428,3445],{},[45,3397,3398,3401,3404,3407,3410],{},[66,3399,3400],{},"Bright Data",[66,3402,3403],{},"Very large",[66,3405,3406],{},"Premium",[66,3408,3409],{},"Coverage, tooling, compliance",[66,3411,3412],{},"Enterprise, hard targets",[45,3414,3415,3418,3420,3422,3425],{},[66,3416,3417],{},"Oxylabs",[66,3419,3403],{},[66,3421,3406],{},[66,3423,3424],{},"Pool quality, support",[66,3426,3427],{},"Large scale, reliability",[45,3429,3430,3433,3436,3439,3442],{},[66,3431,3432],{},"Smartproxy",[66,3434,3435],{},"Large",[66,3437,3438],{},"Mid range",[66,3440,3441],{},"Value, ease of use",[66,3443,3444],{},"Small to mid projects",[45,3446,3447,3450,3453,3456,3459],{},[66,3448,3449],{},"IPRoyal",[66,3451,3452],{},"Mid",[66,3454,3455],{},"Budget",[66,3457,3458],{},"Low cost entry",[66,3460,3461],{},"Hobby, light jobs",[24,3463,3400],{"id":3464},"bright-data",[15,3466,3467],{},"Bright Data is the largest and most feature rich provider. Beyond raw proxies it offers a Web Unlocker that handles anti-bot bypass for you, a scraping browser, and pre built dataset products. It has strong geo targeting down to the city level and a compliance focused onboarding process.",[15,3469,3470],{},"The tradeoff is price and complexity. It is the most expensive option and the dashboard has a learning curve. For enterprise jobs on the hardest targets where reliability justifies the cost, it is the safe choice. For a small project it is overkill.",[24,3472,3417],{"id":3473},"oxylabs",[15,3475,3476],{},"Oxylabs sits alongside Bright Data at the premium tier. Its residential pool is large and well maintained, and its support is strong, which matters when a job breaks at 2 in the morning. It also offers a Web Unblocker product similar to Bright Data's for offloading anti-bot handling.",[15,3478,3479],{},"In practice the choice between Oxylabs and Bright Data often comes down to pricing for your specific volume and which support team you prefer. Both deliver high pool quality.",[24,3481,3432],{"id":3482},"smartproxy",[15,3484,3485],{},"Smartproxy is the sweet spot for most small to mid sized scraping projects. The pool is solid, the pricing is more approachable than the two premium providers, and the dashboard is genuinely easy to use. Geo targeting covers country and city level for most regions.",[15,3487,3488],{},"If you are scraping moderately protected sites and do not need enterprise scale, Smartproxy usually gives the best value. It is my common recommendation for projects that have outgrown datacenter proxies but do not need a premium pool.",[24,3490,3492],{"id":3491},"budget-options","Budget options",[15,3494,3495],{},"IPRoyal and similar lower cost providers can work for light jobs on weakly protected sites. The risk is pool quality: cheaper pools have more burned IPs, so you may pay less per gigabyte but waste more requests on bans. For anything where reliability matters, the savings often disappear once you account for failed requests.",[24,3497,3499],{"id":3498},"how-pricing-models-differ","How pricing models differ",[15,3501,3502],{},"Most residential providers charge per gigabyte of traffic, not per IP. This changes how you optimize:",[597,3504,3505,3511,3517],{},[497,3506,3507,3510],{},[69,3508,3509],{},"Block images and assets"," you do not need, since they count against your bandwidth.",[497,3512,3513,3516],{},[69,3514,3515],{},"Avoid re-fetching"," pages you already have.",[497,3518,3519,3522],{},[69,3520,3521],{},"Use datacenter proxies"," for the easy pages and save residential bandwidth for the protected ones.",[15,3524,3525],{},"A few providers offer per IP or unlimited plans, which can be cheaper for high bandwidth jobs like scraping image heavy catalogs. Match the pricing model to your traffic shape.",[24,3527,3529],{"id":3528},"a-simple-integration","A simple integration",[15,3531,3532],{},"Whichever provider you choose, the integration is the same gateway pattern. The provider rotates the IP for you, and you can usually pin a country with a parameter.",[128,3534,3536],{"className":130,"code":3535,"language":132,"meta":133,"style":133},"import requests\n\n# Most providers give a rotating gateway endpoint\nPROXY = \"http:\u002F\u002FUSER:PASS@gate.smartproxy.com:7000\"\n\nresp = requests.get(\n    \"https:\u002F\u002Fexample.com\",\n    proxies={\"http\": PROXY, \"https\": PROXY},\n    timeout=20,\n)\nprint(resp.status_code)\n",[19,3537,3538,3542,3546,3551,3556,3560,3564,3569,3574,3578,3582],{"__ignoreMap":133},[137,3539,3540],{"class":139,"line":140},[137,3541,143],{},[137,3543,3544],{"class":139,"line":146},[137,3545,150],{"emptyLinePlaceholder":149},[137,3547,3548],{"class":139,"line":153},[137,3549,3550],{},"# Most providers give a rotating gateway endpoint\n",[137,3552,3553],{"class":139,"line":159},[137,3554,3555],{},"PROXY = \"http:\u002F\u002FUSER:PASS@gate.smartproxy.com:7000\"\n",[137,3557,3558],{"class":139,"line":164},[137,3559,150],{"emptyLinePlaceholder":149},[137,3561,3562],{"class":139,"line":170},[137,3563,178],{},[137,3565,3566],{"class":139,"line":175},[137,3567,3568],{},"    \"https:\u002F\u002Fexample.com\",\n",[137,3570,3571],{"class":139,"line":181},[137,3572,3573],{},"    proxies={\"http\": PROXY, \"https\": PROXY},\n",[137,3575,3576],{"class":139,"line":187},[137,3577,196],{},[137,3579,3580],{"class":139,"line":193},[137,3581,202],{},[137,3583,3584],{"class":139,"line":199},[137,3585,821],{},[15,3587,3588],{},"For sticky sessions where you keep the same IP across a login flow, providers offer a session parameter in the username or a dedicated sticky port.",[24,3590,3592],{"id":3591},"how-to-choose","How to choose",[15,3594,3595],{},"The decision comes down to your target difficulty and budget:",[597,3597,3598,3604,3610],{},[497,3599,3600,3603],{},[69,3601,3602],{},"Hard targets, enterprise scale:"," Bright Data or Oxylabs, and consider their managed unblocker products.",[497,3605,3606,3609],{},[69,3607,3608],{},"Mid sized projects, good value:"," Smartproxy.",[497,3611,3612,3615],{},[69,3613,3614],{},"Light, low budget jobs on easy sites:"," a budget provider or even datacenter proxies.",[15,3617,3618],{},"Start with the cheapest tier that works for your target, and escalate only when you see blocks. Paying for a premium pool on a site that does not need it is wasted money.",[24,3620,3622],{"id":3621},"need-help-choosing-and-integrating-proxies","Need help choosing and integrating proxies?",[15,3624,3625,3626,1651,3629,1108],{},"I build scraping systems with the right proxy setup for each target, from datacenter to premium residential, with the rotation and retry logic to run reliably. If you need help, ",[639,3627,644],{"href":641,"rel":3628},[643],[639,3630,649],{"href":648},[652,3632,654],{},{"title":133,"searchDepth":146,"depth":146,"links":3634},[3635,3636,3637,3638,3639,3640,3641,3642,3643,3644],{"id":3360,"depth":146,"text":3361},{"id":3372,"depth":146,"text":3373},{"id":3464,"depth":146,"text":3400},{"id":3473,"depth":146,"text":3417},{"id":3482,"depth":146,"text":3432},{"id":3491,"depth":146,"text":3492},{"id":3498,"depth":146,"text":3499},{"id":3528,"depth":146,"text":3529},{"id":3591,"depth":146,"text":3592},{"id":3621,"depth":146,"text":3622},"2026-05-20","A practical comparison of the major residential proxy providers for web scraping. Pricing models, pool quality, geo targeting, and how to choose for your project.",{},"\u002Fblog\u002Fresidential-proxy-services-compared",{"title":3349,"description":3646},"blog\u002Fresidential-proxy-services-compared",[677,676,3652,3473,678],"bright data",[3654,3655,3656,3657],"Pool quality often matters more than the headline price.","Bright Data and Oxylabs suit enterprise and hard targets; Smartproxy is the value pick.","Most providers bill per gigabyte, so block images and assets you do not need.","Start with the cheapest tier that works and escalate only when you see blocks.","vs9SoqfmhrC9zezWjqpjTzBsCXd7CHdO-_wh4h0Nn8U",{"id":3660,"title":3661,"body":3662,"date":3920,"description":3921,"draft":668,"extension":669,"meta":3922,"navigation":149,"path":3923,"readingTime":1125,"seo":3924,"stem":3925,"tags":3926,"takeaways":3929,"updated":685,"__hash__":3934},"blog\u002Fblog\u002Fno-code-scraping-automation-n8n.md","Automating Scraping Workflows with n8n, Make, and Zapier",{"type":8,"value":3663,"toc":3910},[3664,3667,3670,3674,3677,3691,3694,3698,3757,3768,3772,3775,3803,3806,3810,3813,3816,3824,3827,3831,3834,3851,3859,3863,3866,3886,3889,3893,3896,3900,3908],[11,3665,3661],{"id":3666},"automating-scraping-workflows-with-n8n-make-and-zapier",[15,3668,3669],{},"Scraping data is only half the job. The value comes from getting that data into your business systems automatically: a spreadsheet, a CRM, a Slack alert, or a database. No-code automation tools like n8n, Make, and Zapier are the glue that connects a scraper to everything else. This guide shows how to wire them together.",[24,3671,3673],{"id":3672},"the-pattern-scraper-plus-automation-layer","The pattern: scraper plus automation layer",[15,3675,3676],{},"The reliable architecture separates two concerns:",[494,3678,3679,3685],{},[497,3680,3681,3684],{},[69,3682,3683],{},"A scraper"," that does the hard part: fetching pages, handling anti-bot, and extracting clean data.",[497,3686,3687,3690],{},[69,3688,3689],{},"An automation layer"," that moves that data where it needs to go and runs the whole thing on a schedule.",[15,3692,3693],{},"You can try to do everything inside a no-code tool, but the scraping itself is where these tools are weakest. They struggle with JavaScript rendering, proxies, and CAPTCHAs. The robust pattern is a real scraper exposed as an endpoint, with the no-code tool orchestrating around it.",[24,3695,3697],{"id":3696},"comparing-the-three-tools","Comparing the three tools",[39,3699,3700,3714],{},[42,3701,3702],{},[45,3703,3704,3707,3710,3712],{},[48,3705,3706],{},"Tool",[48,3708,3709],{},"Hosting",[48,3711,3391],{},[48,3713,59],{},[61,3715,3716,3730,3744],{},[45,3717,3718,3721,3724,3727],{},[66,3719,3720],{},"n8n",[66,3722,3723],{},"Self host or cloud",[66,3725,3726],{},"Flexible, code friendly, cheap at scale",[66,3728,3729],{},"Technical teams, data heavy flows",[45,3731,3732,3735,3738,3741],{},[66,3733,3734],{},"Make",[66,3736,3737],{},"Cloud",[66,3739,3740],{},"Visual, powerful, good value",[66,3742,3743],{},"Mid complexity automations",[45,3745,3746,3749,3751,3754],{},[66,3747,3748],{},"Zapier",[66,3750,3737],{},[66,3752,3753],{},"Largest app catalog, easiest",[66,3755,3756],{},"Simple flows, many integrations",[15,3758,3759,3761,3762,3764,3765,3767],{},[69,3760,3720],{}," is the best fit for scraping work because you can self host it, run custom JavaScript or Python in a node, and it does not charge per task the way the others do. ",[69,3763,3734],{}," is a strong visual middle ground. ",[69,3766,3748],{}," is the easiest and has the most app integrations, but it gets expensive at volume and is the most limited for custom logic.",[24,3769,3771],{"id":3770},"connecting-a-scraper-with-a-webhook","Connecting a scraper with a webhook",[15,3773,3774],{},"The cleanest integration is a webhook. Your scraper finishes a run and posts the results to the automation tool, which then distributes them.",[128,3776,3778],{"className":130,"code":3777,"language":132,"meta":133,"style":133},"import requests\n\ndef send_to_automation(data: list[dict]):\n    webhook_url = \"https:\u002F\u002Fyour-n8n-instance.com\u002Fwebhook\u002Fscrape-results\"\n    requests.post(webhook_url, json={\"items\": data}, timeout=10)\n",[19,3779,3780,3784,3788,3793,3798],{"__ignoreMap":133},[137,3781,3782],{"class":139,"line":140},[137,3783,143],{},[137,3785,3786],{"class":139,"line":146},[137,3787,150],{"emptyLinePlaceholder":149},[137,3789,3790],{"class":139,"line":153},[137,3791,3792],{},"def send_to_automation(data: list[dict]):\n",[137,3794,3795],{"class":139,"line":159},[137,3796,3797],{},"    webhook_url = \"https:\u002F\u002Fyour-n8n-instance.com\u002Fwebhook\u002Fscrape-results\"\n",[137,3799,3800],{"class":139,"line":164},[137,3801,3802],{},"    requests.post(webhook_url, json={\"items\": data}, timeout=10)\n",[15,3804,3805],{},"In n8n you create a Webhook node that receives this payload, then chain nodes to write to Google Sheets, insert into a database, or send a Slack message. No polling, no glue scripts, the data arrives the moment the scrape finishes.",[24,3807,3809],{"id":3808},"scheduling-the-scrape","Scheduling the scrape",[15,3811,3812],{},"You have two scheduling options. Either the automation tool triggers the scraper on a schedule, or the scraper runs on its own cron and pushes results out.",[15,3814,3815],{},"For an automation tool driven schedule, n8n and Make both have a Schedule trigger. It calls your scraper's endpoint, waits for the data, and processes it:",[128,3817,3822],{"className":3818,"code":3820,"language":3821},[3819],"language-text","[Schedule: every day 6am] -> [HTTP Request: call scraper API]\n  -> [Filter: only price drops] -> [Slack: alert team]\n  -> [Google Sheets: append rows]\n","text",[19,3823,3820],{"__ignoreMap":133},[15,3825,3826],{},"This is a genuinely useful pattern for price monitoring: scrape daily, filter for changes, alert on drops, and log everything to a sheet, with no manual step.",[24,3828,3830],{"id":3829},"a-real-example-price-drop-alerts","A real example: price drop alerts",[15,3832,3833],{},"Putting it together, a price monitoring automation looks like this:",[494,3835,3836,3839,3842,3845,3848],{},[497,3837,3838],{},"n8n Schedule trigger fires every morning.",[497,3840,3841],{},"HTTP Request node calls your scraper, which returns current prices for a watchlist.",[497,3843,3844],{},"A Function node compares against yesterday's stored prices.",[497,3846,3847],{},"An IF node branches on whether any price dropped.",[497,3849,3850],{},"On a drop, a Slack node alerts the team and a Sheets node logs it.",[15,3852,3853,3854,1476,3856,3858],{},"The scraper handles proxies and anti-bot, covered in my guides on ",[639,3855,2722],{"href":671},[639,3857,1614],{"href":1124},". The automation layer handles everything after the data exists.",[24,3860,3862],{"id":3861},"when-to-add-custom-code","When to add custom code",[15,3864,3865],{},"No-code tools cover most of the orchestration, but you will hit limits. Add custom code when you need:",[597,3867,3868,3874,3880],{},[497,3869,3870,3873],{},[69,3871,3872],{},"Complex data transforms"," beyond what the built in nodes offer. n8n's Code node runs JavaScript or Python inline.",[497,3875,3876,3879],{},[69,3877,3878],{},"Real scraping logic"," with browser automation, which belongs in a dedicated service, not a no-code node.",[497,3881,3882,3885],{},[69,3883,3884],{},"Stateful comparisons"," across runs that need a real database rather than a spreadsheet.",[15,3887,3888],{},"The right balance is no-code for orchestration and notifications, real code for the scraping and any heavy logic.",[24,3890,3892],{"id":3891},"why-not-do-it-all-in-zapier","Why not do it all in Zapier",[15,3894,3895],{},"It is tempting to use Zapier's built in web request actions to scrape directly. This works only for simple, unprotected, static pages. The moment a site uses JavaScript rendering, anti-bot protection, or pagination, the no-code request action fails. Use the automation tool for what it is good at, and pair it with a proper scraper for the extraction.",[24,3897,3899],{"id":3898},"need-a-scraping-and-automation-pipeline-built","Need a scraping and automation pipeline built?",[15,3901,3902,3903,1651,3906,1108],{},"I build scrapers that plug into n8n, Make, Zapier, or a custom backend, so your data flows into your business systems automatically and runs on a schedule. If you want an end to end pipeline, ",[639,3904,644],{"href":641,"rel":3905},[643],[639,3907,649],{"href":648},[652,3909,654],{},{"title":133,"searchDepth":146,"depth":146,"links":3911},[3912,3913,3914,3915,3916,3917,3918,3919],{"id":3672,"depth":146,"text":3673},{"id":3696,"depth":146,"text":3697},{"id":3770,"depth":146,"text":3771},{"id":3808,"depth":146,"text":3809},{"id":3829,"depth":146,"text":3830},{"id":3861,"depth":146,"text":3862},{"id":3891,"depth":146,"text":3892},{"id":3898,"depth":146,"text":3899},"2026-05-15","How to connect a scraper to no-code automation tools so data flows into your business systems automatically. Covers webhooks, scheduling, and when to add custom code.",{},"\u002Fblog\u002Fno-code-scraping-automation-n8n",{"title":3661,"description":3921},"blog\u002Fno-code-scraping-automation-n8n",[1674,3720,3927,3928,676],"make","zapier",[3930,3931,3932,3933],"Separate concerns: a real scraper for extraction, no-code tools for orchestration.","n8n is the best fit for scraping work thanks to self-hosting, custom code, and no per-task fees.","Connect a scraper to the automation layer with a webhook for instant data flow.","Do not scrape protected sites directly in Zapier; it only handles simple static pages.","gZvRnORXzlfkdszoddvt5Tn29iDTZvo8B0mSu31H5Zo",{"id":3936,"title":3937,"body":3938,"date":4250,"description":4251,"draft":668,"extension":669,"meta":4252,"navigation":149,"path":4253,"readingTime":1125,"seo":4254,"stem":4255,"tags":4256,"takeaways":4261,"updated":685,"__hash__":4266},"blog\u002Fblog\u002Fscrape-google-search-results-serpapi.md","Scraping Google Search Results: SerpAPI vs Building Your Own",{"type":8,"value":3939,"toc":4241},[3940,3943,3946,3950,3953,3967,3970,3974,4020,4027,4031,4034,4101,4104,4107,4111,4114,4138,4203,4210,4214,4217,4220,4224,4227,4231,4239],[11,3941,3937],{"id":3942},"scraping-google-search-results-serpapi-vs-building-your-own",[15,3944,3945],{},"Google search data powers SEO tools, rank tracking, and market research. But Google is one of the most aggressively defended scraping targets, so the question is rarely \"how do I parse the HTML\" and more \"do I build this or pay a service.\" This guide covers both paths and when each makes sense.",[24,3947,3949],{"id":3948},"why-google-is-hard-to-scrape-directly","Why Google is hard to scrape directly",[15,3951,3952],{},"Google detects automated queries fast and responds with a CAPTCHA or a block. The defenses include:",[597,3954,3955,3958,3961,3964],{},[497,3956,3957],{},"Rapid rate limiting per IP, far stricter than most sites.",[497,3959,3960],{},"A frequent \"unusual traffic\" CAPTCHA.",[497,3962,3963],{},"Constantly changing HTML markup that breaks selectors.",[497,3965,3966],{},"Different layouts by region, device, and personalization.",[15,3968,3969],{},"This combination means a naive scraper gets blocked within a handful of requests, and even a working scraper needs constant maintenance as the markup shifts.",[24,3971,3973],{"id":3972},"the-two-paths","The two paths",[39,3975,3976,3990],{},[42,3977,3978],{},[45,3979,3980,3983,3985,3988],{},[48,3981,3982],{},"Path",[48,3984,53],{},[48,3986,3987],{},"Maintenance",[48,3989,59],{},[61,3991,3992,4006],{},[45,3993,3994,3997,4000,4003],{},[66,3995,3996],{},"SERP API service",[66,3998,3999],{},"Per query fee",[66,4001,4002],{},"None",[66,4004,4005],{},"Most cases, low to mid volume",[45,4007,4008,4011,4014,4017],{},[66,4009,4010],{},"Custom scraper",[66,4012,4013],{},"Proxy and dev cost",[66,4015,4016],{},"High, ongoing",[66,4018,4019],{},"Very high volume, special needs",[15,4021,4022,4023,4026],{},"The honest default: ",[69,4024,4025],{},"for most projects, a SERP API is the right answer."," Building your own only pays off at very high volume or when you need something the APIs do not offer.",[24,4028,4030],{"id":4029},"option-1-serp-api-services","Option 1: SERP API services",[15,4032,4033],{},"Services like SerpAPI, Bright Data's SERP API, Oxylabs, and others handle the proxies, CAPTCHA solving, and parsing for you. You send a query and get structured JSON back.",[128,4035,4037],{"className":130,"code":4036,"language":132,"meta":133,"style":133},"import requests\n\ndef search(query: str, api_key: str):\n    resp = requests.get(\"https:\u002F\u002Fserpapi.com\u002Fsearch\", params={\n        \"q\": query,\n        \"engine\": \"google\",\n        \"api_key\": api_key,\n    })\n    data = resp.json()\n    return [\n        {\"position\": r[\"position\"], \"title\": r[\"title\"], \"link\": r[\"link\"]}\n        for r in data.get(\"organic_results\", [])\n    ]\n",[19,4038,4039,4043,4047,4052,4057,4062,4067,4072,4077,4082,4087,4092,4097],{"__ignoreMap":133},[137,4040,4041],{"class":139,"line":140},[137,4042,143],{},[137,4044,4045],{"class":139,"line":146},[137,4046,150],{"emptyLinePlaceholder":149},[137,4048,4049],{"class":139,"line":153},[137,4050,4051],{},"def search(query: str, api_key: str):\n",[137,4053,4054],{"class":139,"line":159},[137,4055,4056],{},"    resp = requests.get(\"https:\u002F\u002Fserpapi.com\u002Fsearch\", params={\n",[137,4058,4059],{"class":139,"line":164},[137,4060,4061],{},"        \"q\": query,\n",[137,4063,4064],{"class":139,"line":170},[137,4065,4066],{},"        \"engine\": \"google\",\n",[137,4068,4069],{"class":139,"line":175},[137,4070,4071],{},"        \"api_key\": api_key,\n",[137,4073,4074],{"class":139,"line":181},[137,4075,4076],{},"    })\n",[137,4078,4079],{"class":139,"line":187},[137,4080,4081],{},"    data = resp.json()\n",[137,4083,4084],{"class":139,"line":193},[137,4085,4086],{},"    return [\n",[137,4088,4089],{"class":139,"line":199},[137,4090,4091],{},"        {\"position\": r[\"position\"], \"title\": r[\"title\"], \"link\": r[\"link\"]}\n",[137,4093,4094],{"class":139,"line":205},[137,4095,4096],{},"        for r in data.get(\"organic_results\", [])\n",[137,4098,4099],{"class":139,"line":288},[137,4100,1075],{},[15,4102,4103],{},"You get clean results, no blocks to manage, and the service absorbs every Google change. The cost is per query, which adds up at high volume but is cheap compared to the engineering time of maintaining your own.",[15,4105,4106],{},"These services parse far more than organic links: featured snippets, the People Also Ask box, local packs, ads, and related searches all come back structured. Replicating that parsing yourself is significant work.",[24,4108,4110],{"id":4109},"option-2-building-your-own-scraper","Option 2: building your own scraper",[15,4112,4113],{},"If you have very high volume where per query fees become prohibitive, or you need data the APIs do not expose, you build your own. This means combining several techniques from my other guides:",[597,4115,4116,4124,4132],{},[497,4117,4118,4121,4122,2055],{},[69,4119,4120],{},"Residential proxies"," with aggressive rotation, since Google rate limits hard. See my ",[639,4123,3157],{"href":671},[497,4125,4126,4129,4130,2055],{},[69,4127,4128],{},"CAPTCHA solving"," for the \"unusual traffic\" page, covered in my guide on ",[639,4131,3181],{"href":1669},[497,4133,4134,4137],{},[69,4135,4136],{},"Resilient parsing"," with fallback selectors, because the markup changes often.",[128,4139,4141],{"className":130,"code":4140,"language":132,"meta":133,"style":133},"async def scrape_serp(page, query: str):\n    await page.goto(f\"https:\u002F\u002Fwww.google.com\u002Fsearch?q={query}\")\n    results = []\n    for el in await page.query_selector_all(\"div.g\"):\n        link = await el.query_selector(\"a\")\n        title = await el.query_selector(\"h3\")\n        if link and title:\n            results.append({\n                \"title\": (await title.text_content()),\n                \"link\": await link.get_attribute(\"href\"),\n            })\n    return results\n",[19,4142,4143,4148,4153,4158,4163,4168,4173,4178,4183,4188,4193,4198],{"__ignoreMap":133},[137,4144,4145],{"class":139,"line":140},[137,4146,4147],{},"async def scrape_serp(page, query: str):\n",[137,4149,4150],{"class":139,"line":146},[137,4151,4152],{},"    await page.goto(f\"https:\u002F\u002Fwww.google.com\u002Fsearch?q={query}\")\n",[137,4154,4155],{"class":139,"line":153},[137,4156,4157],{},"    results = []\n",[137,4159,4160],{"class":139,"line":159},[137,4161,4162],{},"    for el in await page.query_selector_all(\"div.g\"):\n",[137,4164,4165],{"class":139,"line":164},[137,4166,4167],{},"        link = await el.query_selector(\"a\")\n",[137,4169,4170],{"class":139,"line":170},[137,4171,4172],{},"        title = await el.query_selector(\"h3\")\n",[137,4174,4175],{"class":139,"line":175},[137,4176,4177],{},"        if link and title:\n",[137,4179,4180],{"class":139,"line":181},[137,4181,4182],{},"            results.append({\n",[137,4184,4185],{"class":139,"line":187},[137,4186,4187],{},"                \"title\": (await title.text_content()),\n",[137,4189,4190],{"class":139,"line":193},[137,4191,4192],{},"                \"link\": await link.get_attribute(\"href\"),\n",[137,4194,4195],{"class":139,"line":199},[137,4196,4197],{},"            })\n",[137,4199,4200],{"class":139,"line":205},[137,4201,4202],{},"    return results\n",[15,4204,4205,4206,4209],{},"This works, but expect to spend real time keeping it alive. The ",[19,4207,4208],{},"div.g"," selector and its siblings change, and Google rolls out layout tests continuously.",[24,4211,4213],{"id":4212},"the-cost-comparison-that-actually-matters","The cost comparison that actually matters",[15,4215,4216],{},"The per query fee of a SERP API looks expensive until you price the alternative. A custom scraper costs you residential proxy bandwidth, CAPTCHA solving fees, and most importantly engineering time to build and maintain it. For anything under tens of thousands of queries a day, the API is almost always cheaper once you count your time.",[15,4218,4219],{},"Build your own when the math flips: extreme volume, or a specialized need like scraping a niche Google product the APIs skip.",[24,4221,4223],{"id":4222},"a-note-on-compliance","A note on compliance",[15,4225,4226],{},"Google's terms prohibit automated scraping of search results. SERP API services operate in a legal gray area and take that risk on themselves, which is part of what you pay for. If compliance is critical to your business, evaluate Google's official APIs, like the Custom Search JSON API, which are limited but sanctioned.",[24,4228,4230],{"id":4229},"need-serp-or-search-data-for-your-project","Need SERP or search data for your project?",[15,4232,4233,4234,1651,4237,1108],{},"I build rank tracking and SERP data pipelines using the right mix of API services and custom scraping for your volume and budget. If you need Google search data, ",[639,4235,644],{"href":641,"rel":4236},[643],[639,4238,649],{"href":648},[652,4240,654],{},{"title":133,"searchDepth":146,"depth":146,"links":4242},[4243,4244,4245,4246,4247,4248,4249],{"id":3948,"depth":146,"text":3949},{"id":3972,"depth":146,"text":3973},{"id":4029,"depth":146,"text":4030},{"id":4109,"depth":146,"text":4110},{"id":4212,"depth":146,"text":4213},{"id":4222,"depth":146,"text":4223},{"id":4229,"depth":146,"text":4230},"2026-05-10","How to extract Google search data for SEO and research. Compares SERP API services with a custom scraper, covering cost, reliability, and when each makes sense.",{},"\u002Fblog\u002Fscrape-google-search-results-serpapi",{"title":3937,"description":4251},"blog\u002Fscrape-google-search-results-serpapi",[4257,4258,4259,676,4260],"serp scraping","google","seo","api",[4262,4263,4264,4265],"For most projects a SERP API is cheaper than building and maintaining your own scraper.","Build your own only at extreme volume or for data the APIs do not expose.","Custom Google scraping needs residential proxies plus CAPTCHA solving.","Always validate against block pages and expect frequent markup changes.","GNtBLD7qhGvYQ8VQ7dKeJORn2ReuNU82kLzZtr8Bg5Y",{"id":4268,"title":4269,"body":4270,"date":4656,"description":4657,"draft":668,"extension":669,"meta":4658,"navigation":149,"path":4659,"readingTime":672,"seo":4660,"stem":4661,"tags":4662,"takeaways":4665,"updated":685,"__hash__":4670},"blog\u002Fblog\u002Fscheduling-monitoring-scrapers-production.md","Running Scrapers in Production: Scheduling, Queues, and Monitoring",{"type":8,"value":4271,"toc":4645},[4272,4275,4278,4282,4285,4302,4305,4309,4312,4359,4362,4366,4369,4432,4435,4439,4442,4501,4510,4514,4517,4546,4549,4579,4583,4586,4590,4593,4620,4623,4627,4630,4634,4642],[11,4273,4269],{"id":4274},"running-scrapers-in-production-scheduling-queues-and-monitoring",[15,4276,4277],{},"A scraper that runs once on your laptop is a script. A scraper that runs every day, recovers from failures, and tells you when something breaks is a production system. The gap between the two is where most scraping projects fail. This guide covers the infrastructure that makes a scraper reliable.",[24,4279,4281],{"id":4280},"what-production-actually-means","What \"production\" actually means",[15,4283,4284],{},"A production scraper has to handle the messy reality that the script ignores:",[597,4286,4287,4290,4293,4296,4299],{},[497,4288,4289],{},"The target site goes down or changes its markup.",[497,4291,4292],{},"A proxy gets banned mid run.",[497,4294,4295],{},"A page times out or returns garbage.",[497,4297,4298],{},"The job needs to run on a schedule without you starting it.",[497,4300,4301],{},"You need to know when it breaks, before the client does.",[15,4303,4304],{},"None of this is the extraction logic. It is the infrastructure around it, and it is the actual deliverable for a paying client.",[24,4306,4308],{"id":4307},"scheduling-from-cron-to-schedulers","Scheduling: from cron to schedulers",[15,4310,4311],{},"The simplest scheduling is cron. For a single daily job it is fine.",[128,4313,4317],{"className":4314,"code":4315,"language":4316,"meta":133,"style":133},"language-bash shiki shiki-themes github-light github-dark","# Run the scraper every day at 6am\n0 6 * * * \u002Fusr\u002Fbin\u002Fpython3 \u002Fopt\u002Fscraper\u002Frun.py >> \u002Fvar\u002Flog\u002Fscraper.log 2>&1\n","bash",[19,4318,4319,4325],{"__ignoreMap":133},[137,4320,4321],{"class":139,"line":140},[137,4322,4324],{"class":4323},"sJ8bj","# Run the scraper every day at 6am\n",[137,4326,4327,4331,4335,4338,4340,4342,4346,4349,4353,4356],{"class":139,"line":146},[137,4328,4330],{"class":4329},"sScJk","0",[137,4332,4334],{"class":4333},"sj4cs"," 6",[137,4336,4337],{"class":4333}," *",[137,4339,4337],{"class":4333},[137,4341,4337],{"class":4333},[137,4343,4345],{"class":4344},"sZZnC"," \u002Fusr\u002Fbin\u002Fpython3",[137,4347,4348],{"class":4344}," \u002Fopt\u002Fscraper\u002Frun.py",[137,4350,4352],{"class":4351},"szBVR"," >>",[137,4354,4355],{"class":4344}," \u002Fvar\u002Flog\u002Fscraper.log",[137,4357,4358],{"class":4351}," 2>&1\n",[15,4360,4361],{},"Cron breaks down when you have many jobs, dependencies between them, or need visibility into runs. At that point move to a real scheduler like APScheduler for in process scheduling, or a workflow tool like Airflow or Prefect when jobs depend on each other and you want a dashboard of run history.",[24,4363,4365],{"id":4364},"task-queues-for-scale-and-isolation","Task queues for scale and isolation",[15,4367,4368],{},"When you scrape thousands of URLs, you do not want one process working through them serially, and you do not want one failure to kill the whole run. A task queue solves both. Celery with Redis is the common Python choice.",[128,4370,4372],{"className":130,"code":4371,"language":132,"meta":133,"style":133},"from celery import Celery\n\napp = Celery(\"scraper\", broker=\"redis:\u002F\u002Flocalhost:6379\u002F0\")\n\n@app.task(bind=True, max_retries=3, default_retry_delay=60)\ndef scrape_url(self, url: str):\n    try:\n        data = fetch_and_parse(url)\n        save(data)\n    except TemporaryError as exc:\n        # Retry with backoff on transient failures\n        raise self.retry(exc=exc)\n",[19,4373,4374,4379,4383,4388,4392,4397,4402,4407,4412,4417,4422,4427],{"__ignoreMap":133},[137,4375,4376],{"class":139,"line":140},[137,4377,4378],{},"from celery import Celery\n",[137,4380,4381],{"class":139,"line":146},[137,4382,150],{"emptyLinePlaceholder":149},[137,4384,4385],{"class":139,"line":153},[137,4386,4387],{},"app = Celery(\"scraper\", broker=\"redis:\u002F\u002Flocalhost:6379\u002F0\")\n",[137,4389,4390],{"class":139,"line":159},[137,4391,150],{"emptyLinePlaceholder":149},[137,4393,4394],{"class":139,"line":164},[137,4395,4396],{},"@app.task(bind=True, max_retries=3, default_retry_delay=60)\n",[137,4398,4399],{"class":139,"line":170},[137,4400,4401],{},"def scrape_url(self, url: str):\n",[137,4403,4404],{"class":139,"line":175},[137,4405,4406],{},"    try:\n",[137,4408,4409],{"class":139,"line":181},[137,4410,4411],{},"        data = fetch_and_parse(url)\n",[137,4413,4414],{"class":139,"line":187},[137,4415,4416],{},"        save(data)\n",[137,4418,4419],{"class":139,"line":193},[137,4420,4421],{},"    except TemporaryError as exc:\n",[137,4423,4424],{"class":139,"line":199},[137,4425,4426],{},"        # Retry with backoff on transient failures\n",[137,4428,4429],{"class":139,"line":205},[137,4430,4431],{},"        raise self.retry(exc=exc)\n",[15,4433,4434],{},"Each URL becomes an independent task. Workers process them in parallel, failures retry on their own, and one bad page does not stop the rest. You can also scale by adding workers without touching the code.",[24,4436,4438],{"id":4437},"retries-and-backoff-done-right","Retries and backoff done right",[15,4440,4441],{},"Transient failures are normal in scraping. The discipline is retrying the recoverable ones and giving up on the rest.",[128,4443,4445],{"className":130,"code":4444,"language":132,"meta":133,"style":133},"import time\n\ndef fetch_with_retry(url: str, max_attempts: int = 4):\n    for attempt in range(max_attempts):\n        try:\n            resp = fetch(url)  # rotates proxy internally\n            if resp.status_code == 200 and not is_blocked(resp.text):\n                return resp\n        except (TimeoutError, ConnectionError):\n            pass\n        time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped\n    raise RuntimeError(f\"Failed after {max_attempts} attempts: {url}\")\n",[19,4446,4447,4451,4455,4460,4464,4468,4473,4478,4482,4487,4492,4497],{"__ignoreMap":133},[137,4448,4449],{"class":139,"line":140},[137,4450,544],{},[137,4452,4453],{"class":139,"line":146},[137,4454,150],{"emptyLinePlaceholder":149},[137,4456,4457],{"class":139,"line":153},[137,4458,4459],{},"def fetch_with_retry(url: str, max_attempts: int = 4):\n",[137,4461,4462],{"class":139,"line":159},[137,4463,558],{},[137,4465,4466],{"class":139,"line":164},[137,4467,303],{},[137,4469,4470],{"class":139,"line":170},[137,4471,4472],{},"            resp = fetch(url)  # rotates proxy internally\n",[137,4474,4475],{"class":139,"line":175},[137,4476,4477],{},"            if resp.status_code == 200 and not is_blocked(resp.text):\n",[137,4479,4480],{"class":139,"line":181},[137,4481,345],{},[137,4483,4484],{"class":139,"line":187},[137,4485,4486],{},"        except (TimeoutError, ConnectionError):\n",[137,4488,4489],{"class":139,"line":193},[137,4490,4491],{},"            pass\n",[137,4493,4494],{"class":139,"line":199},[137,4495,4496],{},"        time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped\n",[137,4498,4499],{"class":139,"line":205},[137,4500,588],{},[15,4502,4503,4504,4507,4508,2055],{},"Retry on timeouts, connection errors, and soft blocks. Do not retry on a clean ",[19,4505,4506],{},"404",", which will never succeed. Cap the backoff so a struggling target does not stall the queue forever. The proxy rotation that pairs with this is covered in my ",[639,4509,3157],{"href":671},[24,4511,4513],{"id":4512},"monitoring-and-alerting","Monitoring and alerting",[15,4515,4516],{},"The single most important production feature is knowing when the scraper breaks. A scraper that silently returns empty data for a week is worse than one that crashes loudly. Track these signals:",[597,4518,4519,4525,4531,4540],{},[497,4520,4521,4524],{},[69,4522,4523],{},"Success rate."," The percentage of requests returning valid data. A drop means the site changed or your proxies are failing.",[497,4526,4527,4530],{},[69,4528,4529],{},"Null rate."," How often a field comes back empty. A spike means a selector broke even though requests succeed.",[497,4532,4533,4536,4537,4539],{},[69,4534,4535],{},"Block rate."," How often you hit CAPTCHAs or ",[19,4538,21],{},"s. Rising means your fingerprint or proxy pool needs attention.",[497,4541,4542,4545],{},[69,4543,4544],{},"Run duration."," A sudden change signals trouble.",[15,4547,4548],{},"Wire these to an alert. Even a simple Slack message on a threshold breach turns a silent failure into a same day fix.",[128,4550,4552],{"className":130,"code":4551,"language":132,"meta":133,"style":133},"def check_health(stats: dict):\n    if stats[\"success_rate\"] \u003C 0.85:\n        alert(f\"Scraper success rate dropped to {stats['success_rate']:.0%}\")\n    if stats[\"null_rate\"] > 0.2:\n        alert(f\"High null rate {stats['null_rate']:.0%}, a selector likely broke\")\n",[19,4553,4554,4559,4564,4569,4574],{"__ignoreMap":133},[137,4555,4556],{"class":139,"line":140},[137,4557,4558],{},"def check_health(stats: dict):\n",[137,4560,4561],{"class":139,"line":146},[137,4562,4563],{},"    if stats[\"success_rate\"] \u003C 0.85:\n",[137,4565,4566],{"class":139,"line":153},[137,4567,4568],{},"        alert(f\"Scraper success rate dropped to {stats['success_rate']:.0%}\")\n",[137,4570,4571],{"class":139,"line":159},[137,4572,4573],{},"    if stats[\"null_rate\"] > 0.2:\n",[137,4575,4576],{"class":139,"line":164},[137,4577,4578],{},"        alert(f\"High null rate {stats['null_rate']:.0%}, a selector likely broke\")\n",[24,4580,4582],{"id":4581},"proxy-health-monitoring","Proxy health monitoring",[15,4584,4585],{},"Your proxy pool degrades over time as IPs get flagged. Track per proxy success rates and drop the bad ones automatically, so a few burned IPs do not drag down the whole job. Many providers expose usage stats through an API you can poll, and rotating out underperformers keeps the success rate high.",[24,4587,4589],{"id":4588},"storing-data-idempotently","Storing data idempotently",[15,4591,4592],{},"Production scrapers re-run, so writes must be safe to repeat. Use an upsert keyed on a stable identifier so a re-run updates rather than duplicates.",[128,4594,4598],{"className":4595,"code":4596,"language":4597,"meta":133,"style":133},"language-sql shiki shiki-themes github-light github-dark","INSERT INTO products (url, price, scraped_at)\nVALUES (%s, %s, now())\nON CONFLICT (url) DO UPDATE\nSET price = EXCLUDED.price, scraped_at = now();\n","sql",[19,4599,4600,4605,4610,4615],{"__ignoreMap":133},[137,4601,4602],{"class":139,"line":140},[137,4603,4604],{},"INSERT INTO products (url, price, scraped_at)\n",[137,4606,4607],{"class":139,"line":146},[137,4608,4609],{},"VALUES (%s, %s, now())\n",[137,4611,4612],{"class":139,"line":153},[137,4613,4614],{},"ON CONFLICT (url) DO UPDATE\n",[137,4616,4617],{"class":139,"line":159},[137,4618,4619],{},"SET price = EXCLUDED.price, scraped_at = now();\n",[15,4621,4622],{},"This way a partial re-run after a crash is harmless, which is essential when jobs fail halfway and restart.",[24,4624,4626],{"id":4625},"the-maintenance-reality","The maintenance reality",[15,4628,4629],{},"Even a well built scraper needs ongoing care because the targets change. The difference between a script and a system is that the system tells you when it needs attention and recovers from the failures it can. Budget for maintenance, because a scraper is a living thing, not a one time build.",[24,4631,4633],{"id":4632},"need-a-production-grade-scraping-system","Need a production grade scraping system?",[15,4635,4636,4637,1651,4640,1108],{},"I build scraping systems with scheduling, queues, retries, and monitoring so they run reliably and alert you when something needs attention. If you need a scraper that runs in production, ",[639,4638,644],{"href":641,"rel":4639},[643],[639,4641,649],{"href":648},[652,4643,4644],{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .sJ8bj, html code.shiki .sJ8bj{--shiki-default:#6A737D;--shiki-dark:#6A737D}html pre.shiki code .sScJk, html code.shiki .sScJk{--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sj4cs, html code.shiki .sj4cs{--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sZZnC, html code.shiki .sZZnC{--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .szBVR, html code.shiki .szBVR{--shiki-default:#D73A49;--shiki-dark:#F97583}",{"title":133,"searchDepth":146,"depth":146,"links":4646},[4647,4648,4649,4650,4651,4652,4653,4654,4655],{"id":4280,"depth":146,"text":4281},{"id":4307,"depth":146,"text":4308},{"id":4364,"depth":146,"text":4365},{"id":4437,"depth":146,"text":4438},{"id":4512,"depth":146,"text":4513},{"id":4581,"depth":146,"text":4582},{"id":4588,"depth":146,"text":4589},{"id":4625,"depth":146,"text":4626},{"id":4632,"depth":146,"text":4633},"2026-05-05","How to take a scraper from a script to a reliable production system. Covers scheduling, task queues, retries, error alerting, and proxy health monitoring.",{},"\u002Fblog\u002Fscheduling-monitoring-scrapers-production",{"title":4269,"description":4657},"blog\u002Fscheduling-monitoring-scrapers-production",[1674,676,4663,4664,132],"production","monitoring",[4666,4667,4668,4669],"The difference between a script and a system is recovery and alerting.","Use a task queue like Celery so one failure does not kill the whole run.","Retry transient failures with capped exponential backoff, but do not retry hard 404s.","Monitor success rate, null rate, and block rate, and alert on threshold breaches.","ghrBBI5AG5ffIt71IeZ2583gtIpsFSLVGGGtVli40Kc",{"id":4672,"title":4673,"body":4674,"date":4995,"description":4996,"draft":668,"extension":669,"meta":4997,"navigation":149,"path":4998,"readingTime":672,"seo":4999,"stem":5000,"tags":5001,"takeaways":5003,"updated":685,"__hash__":5008},"blog\u002Fblog\u002Freverse-engineering-private-apis.md","Reverse Engineering Private APIs for Faster, Cleaner Scraping",{"type":8,"value":4675,"toc":4984},[4676,4679,4682,4686,4689,4715,4718,4722,4729,4732,4746,4749,4753,4756,4831,4834,4838,4841,4871,4874,4878,4881,4901,4941,4945,4948,4952,4960,4963,4967,4970,4974,4982],[11,4677,4673],{"id":4678},"reverse-engineering-private-apis-for-faster-cleaner-scraping",[15,4680,4681],{},"The fastest scraper is often not a scraper at all. Most modern websites load their data from an internal API, then render it in the browser. If you call that API directly, you skip the HTML parsing, the browser overhead, and a lot of the anti-bot friction. This guide shows how to find and use those private APIs.",[24,4683,4685],{"id":4684},"why-apis-beat-html-scraping","Why APIs beat HTML scraping",[15,4687,4688],{},"When a site fetches its data from a backend endpoint, that endpoint usually returns clean JSON. Calling it directly has big advantages over scraping rendered HTML:",[597,4690,4691,4697,4703,4709],{},[497,4692,4693,4696],{},[69,4694,4695],{},"Structured data."," JSON with named fields, no fragile CSS selectors.",[497,4698,4699,4702],{},[69,4700,4701],{},"Speed."," A single HTTP request instead of launching a browser and rendering a page.",[497,4704,4705,4708],{},[69,4706,4707],{},"Stability."," Internal APIs change less often than visual markup.",[497,4710,4711,4714],{},[69,4712,4713],{},"Efficiency."," No proxy bandwidth wasted on images, CSS, and fonts.",[15,4716,4717],{},"The catch is that these APIs are undocumented and not meant for public use, so you have to discover them and figure out how they work.",[24,4719,4721],{"id":4720},"finding-the-api-in-network-traffic","Finding the API in network traffic",[15,4723,4724,4725,4728],{},"Open the site in your browser, open DevTools, and go to the Network tab. Filter to ",[19,4726,4727],{},"Fetch\u002FXHR",". As you interact with the page, watch the requests that return the data you want.",[15,4730,4731],{},"What you are looking for:",[597,4733,4734,4737,4740,4743],{},[497,4735,4736],{},"A request whose response is JSON containing the data shown on the page.",[497,4738,4739],{},"The full URL, including query parameters.",[497,4741,4742],{},"The request method and any headers, especially auth tokens.",[497,4744,4745],{},"The payload, if it is a POST.",[15,4747,4748],{},"Once you find the right request, DevTools lets you right click and \"Copy as cURL,\" which gives you the exact request with all headers. That is your starting point.",[24,4750,4752],{"id":4751},"replicating-the-request-in-python","Replicating the request in Python",[15,4754,4755],{},"Translate the copied request into code. Start with everything the browser sent, then trim it down to what is actually required.",[128,4757,4759],{"className":130,"code":4758,"language":132,"meta":133,"style":133},"import requests\n\nresp = requests.get(\n    \"https:\u002F\u002Fexample.com\u002Fapi\u002Fv2\u002Fproducts\",\n    params={\"category\": \"electronics\", \"page\": 1},\n    headers={\n        \"Accept\": \"application\u002Fjson\",\n        \"User-Agent\": \"Mozilla\u002F5.0 ...\",\n        \"Referer\": \"https:\u002F\u002Fexample.com\u002Fcategory\u002Felectronics\",\n    },\n    timeout=20,\n)\ndata = resp.json()\nfor product in data[\"results\"]:\n    print(product[\"name\"], product[\"price\"])\n",[19,4760,4761,4765,4769,4773,4778,4783,4788,4793,4798,4803,4808,4812,4816,4821,4826],{"__ignoreMap":133},[137,4762,4763],{"class":139,"line":140},[137,4764,143],{},[137,4766,4767],{"class":139,"line":146},[137,4768,150],{"emptyLinePlaceholder":149},[137,4770,4771],{"class":139,"line":153},[137,4772,178],{},[137,4774,4775],{"class":139,"line":159},[137,4776,4777],{},"    \"https:\u002F\u002Fexample.com\u002Fapi\u002Fv2\u002Fproducts\",\n",[137,4779,4780],{"class":139,"line":164},[137,4781,4782],{},"    params={\"category\": \"electronics\", \"page\": 1},\n",[137,4784,4785],{"class":139,"line":170},[137,4786,4787],{},"    headers={\n",[137,4789,4790],{"class":139,"line":175},[137,4791,4792],{},"        \"Accept\": \"application\u002Fjson\",\n",[137,4794,4795],{"class":139,"line":181},[137,4796,4797],{},"        \"User-Agent\": \"Mozilla\u002F5.0 ...\",\n",[137,4799,4800],{"class":139,"line":187},[137,4801,4802],{},"        \"Referer\": \"https:\u002F\u002Fexample.com\u002Fcategory\u002Felectronics\",\n",[137,4804,4805],{"class":139,"line":193},[137,4806,4807],{},"    },\n",[137,4809,4810],{"class":139,"line":199},[137,4811,196],{},[137,4813,4814],{"class":139,"line":205},[137,4815,202],{},[137,4817,4818],{"class":139,"line":288},[137,4819,4820],{},"data = resp.json()\n",[137,4822,4823],{"class":139,"line":294},[137,4824,4825],{},"for product in data[\"results\"]:\n",[137,4827,4828],{"class":139,"line":300},[137,4829,4830],{},"    print(product[\"name\"], product[\"price\"])\n",[15,4832,4833],{},"The pagination is usually a simple parameter, so iterating every page is a clean loop instead of clicking through a rendered interface.",[24,4835,4837],{"id":4836},"trimming-headers-to-what-matters","Trimming headers to what matters",[15,4839,4840],{},"The browser sends many headers, but most are not required. Remove them one at a time and see what breaks. Usually only a few matter:",[597,4842,4843,4849,4855,4861],{},[497,4844,4845,4848],{},[69,4846,4847],{},"Authorization or API key headers."," These are mandatory if present.",[497,4850,4851,4854],{},[69,4852,4853],{},"Referer or Origin."," Some APIs check these to block off site calls.",[497,4856,4857,4860],{},[69,4858,4859],{},"User-Agent."," Some reject default library agents, so set a browser like value.",[497,4862,4863,4866,4867,4870],{},[69,4864,4865],{},"A custom token header."," Sites often add a header like ",[19,4868,4869],{},"x-api-token"," their frontend generates.",[15,4872,4873],{},"Knowing the minimal set keeps your requests clean and makes them less fragile.",[24,4875,4877],{"id":4876},"handling-authentication","Handling authentication",[15,4879,4880],{},"Private APIs use a few common auth patterns, and each has a way to handle it:",[597,4882,4883,4889,4895],{},[497,4884,4885,4888],{},[69,4886,4887],{},"Bearer token in a header."," Capture it from a logged in session and include it. Tokens expire, so you may need to refresh them.",[497,4890,4891,4894],{},[69,4892,4893],{},"Session cookies."," Log in once with a browser or a login request, then reuse the cookie jar for API calls.",[497,4896,4897,4900],{},[69,4898,4899],{},"A token generated by frontend JavaScript."," The hardest case. Sometimes you can replicate the token logic, and sometimes you need a real browser to generate it, then hand it to your API calls.",[128,4902,4904],{"className":130,"code":4903,"language":132,"meta":133,"style":133},"session = requests.Session()\nsession.post(\"https:\u002F\u002Fexample.com\u002Fapi\u002Flogin\", json={\n    \"email\": \"user@example.com\",\n    \"password\": \"secret\",\n})\n# The session now holds the auth cookie for subsequent calls\ndata = session.get(\"https:\u002F\u002Fexample.com\u002Fapi\u002Fv2\u002Forders\").json()\n",[19,4905,4906,4911,4916,4921,4926,4931,4936],{"__ignoreMap":133},[137,4907,4908],{"class":139,"line":140},[137,4909,4910],{},"session = requests.Session()\n",[137,4912,4913],{"class":139,"line":146},[137,4914,4915],{},"session.post(\"https:\u002F\u002Fexample.com\u002Fapi\u002Flogin\", json={\n",[137,4917,4918],{"class":139,"line":153},[137,4919,4920],{},"    \"email\": \"user@example.com\",\n",[137,4922,4923],{"class":139,"line":159},[137,4924,4925],{},"    \"password\": \"secret\",\n",[137,4927,4928],{"class":139,"line":164},[137,4929,4930],{},"})\n",[137,4932,4933],{"class":139,"line":170},[137,4934,4935],{},"# The session now holds the auth cookie for subsequent calls\n",[137,4937,4938],{"class":139,"line":175},[137,4939,4940],{},"data = session.get(\"https:\u002F\u002Fexample.com\u002Fapi\u002Fv2\u002Forders\").json()\n",[24,4942,4944],{"id":4943},"when-the-token-comes-from-javascript","When the token comes from JavaScript",[15,4946,4947],{},"Some sites sign each request with a token computed in obfuscated JavaScript. You have two options. Replicate the algorithm in your own code if it is simple enough to read, which is fragile but fast. Or run a real browser to load the page, extract the token it generates, and feed it into your direct API calls, which is more robust. The hybrid approach, a browser for the token and plain HTTP for the data, often gives the best of both.",[24,4949,4951],{"id":4950},"respecting-limits-and-staying-reliable","Respecting limits and staying reliable",[15,4953,4954,4955,1476,4957,2055],{},"A private API is still subject to rate limiting and can still ban you. Apply the same discipline as any scraper: rotate proxies if needed, throttle your rate, and add retries with backoff. See my guides on ",[639,4956,2722],{"href":671},[639,4958,4959],{"href":4659},"running scrapers in production",[15,4961,4962],{},"Watch for the API changing. Internal APIs are more stable than markup but not permanent. A version bump in the path or a new required header will break your client, so monitor your success rate.",[24,4964,4966],{"id":4965},"when-this-approach-does-not-fit","When this approach does not fit",[15,4968,4969],{},"Direct API access is not always possible. Some sites render everything server side with no JSON endpoint, in which case you are back to HTML scraping. Others sign requests so heavily that replicating the auth is more work than just driving a browser. Inspect first, and pick the cheaper path for that specific site.",[24,4971,4973],{"id":4972},"need-an-api-integration-or-reverse-engineering-done","Need an API integration or reverse engineering done?",[15,4975,4976,4977,1651,4980,1108],{},"I find and integrate private and undocumented APIs to build fast, clean data pipelines, and fall back to browser scraping when an API is not available. If you have a project, ",[639,4978,644],{"href":641,"rel":4979},[643],[639,4981,649],{"href":648},[652,4983,654],{},{"title":133,"searchDepth":146,"depth":146,"links":4985},[4986,4987,4988,4989,4990,4991,4992,4993,4994],{"id":4684,"depth":146,"text":4685},{"id":4720,"depth":146,"text":4721},{"id":4751,"depth":146,"text":4752},{"id":4836,"depth":146,"text":4837},{"id":4876,"depth":146,"text":4877},{"id":4943,"depth":146,"text":4944},{"id":4950,"depth":146,"text":4951},{"id":4965,"depth":146,"text":4966},{"id":4972,"depth":146,"text":4973},"2026-04-30","How to find and use a site's internal API instead of scraping HTML. Covers inspecting network traffic, replicating requests, handling auth, and why it beats browser scraping.",{},"\u002Fblog\u002Freverse-engineering-private-apis",{"title":4673,"description":4996},"blog\u002Freverse-engineering-private-apis",[4260,5002,676,1674,132],"reverse engineering",[5004,5005,5006,5007],"Many sites load their data from an internal JSON API you can call directly.","Find it in the browser Network tab, then replicate the request and trim headers.","Handle auth with captured tokens or reused session cookies.","APIs are faster and cleaner than HTML scraping but still need rate limiting and monitoring.","v7AztrDT_9Sc6w8JFtACzdAYRLIzy3RfMDe1-kcVQEY",1781254278073]