[{"data":1,"prerenderedAt":459},["ShallowReactive",2],{"blog-\u002Fblog\u002Fscheduling-monitoring-scrapers-production":3},{"id":4,"title":5,"body":6,"date":438,"description":439,"draft":440,"extension":441,"meta":442,"navigation":134,"path":443,"readingTime":444,"seo":445,"stem":446,"tags":447,"takeaways":452,"updated":457,"__hash__":458},"blog\u002Fblog\u002Fscheduling-monitoring-scrapers-production.md","Running Scrapers in Production: Scheduling, Queues, and Monitoring",{"type":7,"value":8,"toc":427},"minimark",[9,13,17,22,25,44,47,51,54,108,111,115,118,195,198,202,205,269,282,286,289,320,323,353,357,360,364,367,394,397,401,404,408,423],[10,11,5],"h1",{"id":12},"running-scrapers-in-production-scheduling-queues-and-monitoring",[14,15,16],"p",{},"A scraper that runs once on your laptop is a script. A scraper that runs every day, recovers from failures, and tells you when something breaks is a production system. The gap between the two is where most scraping projects fail. This guide covers the infrastructure that makes a scraper reliable.",[18,19,21],"h2",{"id":20},"what-production-actually-means","What \"production\" actually means",[14,23,24],{},"A production scraper has to handle the messy reality that the script ignores:",[26,27,28,32,35,38,41],"ul",{},[29,30,31],"li",{},"The target site goes down or changes its markup.",[29,33,34],{},"A proxy gets banned mid run.",[29,36,37],{},"A page times out or returns garbage.",[29,39,40],{},"The job needs to run on a schedule without you starting it.",[29,42,43],{},"You need to know when it breaks, before the client does.",[14,45,46],{},"None of this is the extraction logic. It is the infrastructure around it, and it is the actual deliverable for a paying client.",[18,48,50],{"id":49},"scheduling-from-cron-to-schedulers","Scheduling: from cron to schedulers",[14,52,53],{},"The simplest scheduling is cron. 
For a single daily job it is fine.",[55,56,61],"pre",{"className":57,"code":58,"language":59,"meta":60,"style":60},"language-bash shiki shiki-themes github-light github-dark","# Run the scraper every day at 6am\n0 6 * * * \u002Fusr\u002Fbin\u002Fpython3 \u002Fopt\u002Fscraper\u002Frun.py >> \u002Fvar\u002Flog\u002Fscraper.log 2>&1\n","bash","",[62,63,64,73],"code",{"__ignoreMap":60},[65,66,69],"span",{"class":67,"line":68},"line",1,[65,70,72],{"class":71},"sJ8bj","# Run the scraper every day at 6am\n",[65,74,76,80,84,87,89,91,95,98,102,105],{"class":67,"line":75},2,[65,77,79],{"class":78},"sScJk","0",[65,81,83],{"class":82},"sj4cs"," 6",[65,85,86],{"class":82}," *",[65,88,86],{"class":82},[65,90,86],{"class":82},[65,92,94],{"class":93},"sZZnC"," \u002Fusr\u002Fbin\u002Fpython3",[65,96,97],{"class":93}," \u002Fopt\u002Fscraper\u002Frun.py",[65,99,101],{"class":100},"szBVR"," >>",[65,103,104],{"class":93}," \u002Fvar\u002Flog\u002Fscraper.log",[65,106,107],{"class":100}," 2>&1\n",[14,109,110],{},"Cron breaks down when you have many jobs, dependencies between them, or need visibility into runs. At that point move to a real scheduler like APScheduler for in process scheduling, or a workflow tool like Airflow or Prefect when jobs depend on each other and you want a dashboard of run history.",[18,112,114],{"id":113},"task-queues-for-scale-and-isolation","Task queues for scale and isolation",[14,116,117],{},"When you scrape thousands of URLs, you do not want one process working through them serially, and you do not want one failure to kill the whole run. A task queue solves both. 
Celery with Redis is the common Python choice.",[55,119,123],{"className":120,"code":121,"language":122,"meta":60,"style":60},"language-python shiki shiki-themes github-light github-dark","from celery import Celery\n\napp = Celery(\"scraper\", broker=\"redis:\u002F\u002Flocalhost:6379\u002F0\")\n\n@app.task(bind=True, max_retries=3, default_retry_delay=60)\ndef scrape_url(self, url: str):\n    try:\n        data = fetch_and_parse(url)\n        save(data)\n    except TemporaryError as exc:\n        # Retry with backoff on transient failures\n        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)\n","python",[62,124,125,130,136,142,147,153,159,165,171,177,183,189],{"__ignoreMap":60},[65,126,127],{"class":67,"line":68},[65,128,129],{},"from celery import Celery\n",[65,131,132],{"class":67,"line":75},[65,133,135],{"emptyLinePlaceholder":134},true,"\n",[65,137,139],{"class":67,"line":138},3,[65,140,141],{},"app = Celery(\"scraper\", broker=\"redis:\u002F\u002Flocalhost:6379\u002F0\")\n",[65,143,145],{"class":67,"line":144},4,[65,146,135],{"emptyLinePlaceholder":134},[65,148,150],{"class":67,"line":149},5,[65,151,152],{},"@app.task(bind=True, max_retries=3, default_retry_delay=60)\n",[65,154,156],{"class":67,"line":155},6,[65,157,158],{},"def scrape_url(self, url: str):\n",[65,160,162],{"class":67,"line":161},7,[65,163,164],{},"    try:\n",[65,166,168],{"class":67,"line":167},8,[65,169,170],{},"        data = fetch_and_parse(url)\n",[65,172,174],{"class":67,"line":173},9,[65,175,176],{},"        save(data)\n",[65,178,180],{"class":67,"line":179},10,[65,181,182],{},"    except TemporaryError as exc:\n",[65,184,186],{"class":67,"line":185},11,[65,187,188],{},"        # Retry with backoff on transient failures\n",[65,190,192],{"class":67,"line":191},12,[65,193,194],{},"        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)\n",[14,196,197],{},"Each URL becomes an independent task. Workers process them in parallel, failures retry on their own, and one bad page does not stop the rest. 
You can also scale by adding workers without touching the code.",[18,199,201],{"id":200},"retries-and-backoff-done-right","Retries and backoff done right",[14,203,204],{},"Transient failures are normal in scraping. The discipline is retrying the recoverable ones and giving up on the rest.",[55,206,208],{"className":120,"code":207,"language":122,"meta":60,"style":60},"import time\n\ndef fetch_with_retry(url: str, max_attempts: int = 4):\n    for attempt in range(max_attempts):\n        try:\n            resp = fetch(url)  # rotates proxy internally\n            if resp.status_code == 200 and not is_blocked(resp.text):\n                return resp\n        except (TimeoutError, ConnectionError):\n            pass\n        time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped\n    raise RuntimeError(f\"Failed after {max_attempts} attempts: {url}\")\n",[62,209,210,215,219,224,229,234,239,244,249,254,259,264],{"__ignoreMap":60},[65,211,212],{"class":67,"line":68},[65,213,214],{},"import time\n",[65,216,217],{"class":67,"line":75},[65,218,135],{"emptyLinePlaceholder":134},[65,220,221],{"class":67,"line":138},[65,222,223],{},"def fetch_with_retry(url: str, max_attempts: int = 4):\n",[65,225,226],{"class":67,"line":144},[65,227,228],{},"    for attempt in range(max_attempts):\n",[65,230,231],{"class":67,"line":149},[65,232,233],{},"        try:\n",[65,235,236],{"class":67,"line":155},[65,237,238],{},"            resp = fetch(url)  # rotates proxy internally\n",[65,240,241],{"class":67,"line":161},[65,242,243],{},"            if resp.status_code == 200 and not is_blocked(resp.text):\n",[65,245,246],{"class":67,"line":167},[65,247,248],{},"                return resp\n",[65,250,251],{"class":67,"line":173},[65,252,253],{},"        except (TimeoutError, ConnectionError):\n",[65,255,256],{"class":67,"line":179},[65,257,258],{},"            pass\n",[65,260,261],{"class":67,"line":185},[65,262,263],{},"        time.sleep(min(2 ** attempt, 30))  # exponential backoff, 
capped\n",[65,265,266],{"class":67,"line":191},[65,267,268],{},"    raise RuntimeError(f\"Failed after {max_attempts} attempts: {url}\")\n",[14,270,271,272,275,276,281],{},"Retry on timeouts, connection errors, and soft blocks. Do not retry on a clean ",[62,273,274],{},"404",", which will never succeed. Cap the backoff so a struggling target does not stall the queue forever. The proxy rotation that pairs with this is covered in my ",[277,278,280],"a",{"href":279},"\u002Fblog\u002Frotating-proxies-for-web-scraping","rotating proxies guide",".",[18,283,285],{"id":284},"monitoring-and-alerting","Monitoring and alerting",[14,287,288],{},"The single most important production feature is knowing when the scraper breaks. A scraper that silently returns empty data for a week is worse than one that crashes loudly. Track these signals:",[26,290,291,298,304,314],{},[29,292,293,297],{},[294,295,296],"strong",{},"Success rate."," The percentage of requests returning valid data. A drop means the site changed or your proxies are failing.",[29,299,300,303],{},[294,301,302],{},"Null rate."," How often a field comes back empty. A spike means a selector broke even though requests succeed.",[29,305,306,309,310,313],{},[294,307,308],{},"Block rate."," How often you hit CAPTCHAs or ",[62,311,312],{},"403","s. Rising means your fingerprint or proxy pool needs attention.",[29,315,316,319],{},[294,317,318],{},"Run duration."," A sudden change signals trouble.",[14,321,322],{},"Wire these to an alert. 
Even a simple Slack message on a threshold breach turns a silent failure into a same day fix.",[55,324,326],{"className":120,"code":325,"language":122,"meta":60,"style":60},"def check_health(stats: dict):\n    if stats[\"success_rate\"] \u003C 0.85:\n        alert(f\"Scraper success rate dropped to {stats['success_rate']:.0%}\")\n    if stats[\"null_rate\"] > 0.2:\n        alert(f\"High null rate {stats['null_rate']:.0%}, a selector likely broke\")\n",[62,327,328,333,338,343,348],{"__ignoreMap":60},[65,329,330],{"class":67,"line":68},[65,331,332],{},"def check_health(stats: dict):\n",[65,334,335],{"class":67,"line":75},[65,336,337],{},"    if stats[\"success_rate\"] \u003C 0.85:\n",[65,339,340],{"class":67,"line":138},[65,341,342],{},"        alert(f\"Scraper success rate dropped to {stats['success_rate']:.0%}\")\n",[65,344,345],{"class":67,"line":144},[65,346,347],{},"    if stats[\"null_rate\"] > 0.2:\n",[65,349,350],{"class":67,"line":149},[65,351,352],{},"        alert(f\"High null rate {stats['null_rate']:.0%}, a selector likely broke\")\n",[18,354,356],{"id":355},"proxy-health-monitoring","Proxy health monitoring",[14,358,359],{},"Your proxy pool degrades over time as IPs get flagged. Track per proxy success rates and drop the bad ones automatically, so a few burned IPs do not drag down the whole job. Many providers expose usage stats through an API you can poll, and rotating out underperformers keeps the success rate high.",[18,361,363],{"id":362},"storing-data-idempotently","Storing data idempotently",[14,365,366],{},"Production scrapers re-run, so writes must be safe to repeat. 
Use an upsert keyed on a stable identifier so a re-run updates rather than duplicates.",[55,368,372],{"className":369,"code":370,"language":371,"meta":60,"style":60},"language-sql shiki shiki-themes github-light github-dark","INSERT INTO products (url, price, scraped_at)\nVALUES (%s, %s, now())\nON CONFLICT (url) DO UPDATE\nSET price = EXCLUDED.price, scraped_at = now();\n","sql",[62,373,374,379,384,389],{"__ignoreMap":60},[65,375,376],{"class":67,"line":68},[65,377,378],{},"INSERT INTO products (url, price, scraped_at)\n",[65,380,381],{"class":67,"line":75},[65,382,383],{},"VALUES (%s, %s, now())\n",[65,385,386],{"class":67,"line":138},[65,387,388],{},"ON CONFLICT (url) DO UPDATE\n",[65,390,391],{"class":67,"line":144},[65,392,393],{},"SET price = EXCLUDED.price, scraped_at = now();\n",[14,395,396],{},"This way a partial re-run after a crash is harmless, which is essential when jobs fail halfway and restart.",[18,398,400],{"id":399},"the-maintenance-reality","The maintenance reality",[14,402,403],{},"Even a well built scraper needs ongoing care because the targets change. The difference between a script and a system is that the system tells you when it needs attention and recovers from the failures it can. Budget for maintenance, because a scraper is a living thing, not a one time build.",[18,405,407],{"id":406},"need-a-production-grade-scraping-system","Need a production grade scraping system?",[14,409,410,411,417,418,422],{},"I build scraping systems with scheduling, queues, retries, and monitoring so they run reliably and alert you when something needs attention. If you need a scraper that runs in production, ",[277,412,416],{"href":413,"rel":414},"https:\u002F\u002Fwww.upwork.com\u002Ffreelancers\u002Fphanvuong2",[415],"nofollow","hire me on Upwork"," or reach out through the ",[277,419,421],{"href":420},"\u002F#contact","contact form",". 
I respond within 24 hours.",[424,425,426],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .sJ8bj, html code.shiki .sJ8bj{--shiki-default:#6A737D;--shiki-dark:#6A737D}html pre.shiki code .sScJk, html code.shiki .sScJk{--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sj4cs, html code.shiki .sj4cs{--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sZZnC, html code.shiki .sZZnC{--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .szBVR, html code.shiki .szBVR{--shiki-default:#D73A49;--shiki-dark:#F97583}",{"title":60,"searchDepth":75,"depth":75,"links":428},[429,430,431,432,433,434,435,436,437],{"id":20,"depth":75,"text":21},{"id":49,"depth":75,"text":50},{"id":113,"depth":75,"text":114},{"id":200,"depth":75,"text":201},{"id":284,"depth":75,"text":285},{"id":355,"depth":75,"text":356},{"id":362,"depth":75,"text":363},{"id":399,"depth":75,"text":400},{"id":406,"depth":75,"text":407},"2026-05-05","How to take a scraper from a script to a reliable production system. 
Covers scheduling, task queues, retries, error alerting, and proxy health monitoring.",false,"md",{},"\u002Fblog\u002Fscheduling-monitoring-scrapers-production","8 min read",{"title":5,"description":439},"blog\u002Fscheduling-monitoring-scrapers-production",[448,449,450,451,122],"automation","web scraping","production","monitoring",[453,454,455,456],"The difference between a script and a system is recovery and alerting.","Use a task queue like Celery so one failure does not kill the whole run.","Retry transient failures with capped exponential backoff, but do not retry hard 404s.","Monitor success rate, null rate, and block rate, and alert on threshold breaches.",null,"ghrBBI5AG5ffIt71IeZ2583gtIpsFSLVGGGtVli40Kc",1781254278432]