[{"data":1,"prerenderedAt":602},["ShallowReactive",2],{"blog-\u002Fblog\u002Fscrapy-large-scale-scraping":3},{"id":4,"title":5,"body":6,"date":581,"description":582,"draft":583,"extension":584,"meta":585,"navigation":95,"path":586,"readingTime":587,"seo":588,"stem":589,"tags":590,"takeaways":595,"updated":600,"__hash__":601},"blog\u002Fblog\u002Fscrapy-large-scale-scraping.md","Building a Large-Scale Web Scraper with Scrapy",{"type":7,"value":8,"toc":570},"minimark",[9,13,26,31,34,63,66,70,73,191,194,198,201,268,275,300,304,307,347,350,354,357,410,419,423,426,522,529,533,540,544,547,551,566],[10,11,5],"h1",{"id":12},"building-a-large-scale-web-scraper-with-scrapy",[14,15,16,17,21,22,25],"p",{},"When a scraping job grows past a few thousand pages, a hand-written script with ",[18,19,20],"code",{},"requests"," and a ",[18,23,24],{},"for"," loop starts to fall apart. Scrapy is the framework built for this scale. It handles concurrency, retries, throttling, and data export so you can focus on the extraction logic. This guide covers the parts that matter for production.",[27,28,30],"h2",{"id":29},"why-scrapy-over-a-plain-script","Why Scrapy over a plain script",[14,32,33],{},"A simple script does one request at a time and breaks on the first unexpected error. Scrapy gives you the infrastructure for free:",[35,36,37,45,51,57],"ul",{},[38,39,40,44],"li",{},[41,42,43],"strong",{},"Asynchronous by default."," It fetches many pages concurrently without you managing threads or async code by hand.",[38,46,47,50],{},[41,48,49],{},"Built-in retries and throttling."," Failed requests retry automatically, and AutoThrottle adapts the request rate to the server.",[38,52,53,56],{},[41,54,55],{},"Middleware system."," Proxies, custom headers, and retry rules plug in cleanly.",[38,58,59,62],{},[41,60,61],{},"Item pipelines."," Clean, validate, and store scraped data in stages.",[14,64,65],{},"The tradeoff is a steeper learning curve. 
For a one-off scrape of a single page, Scrapy is overkill. For a recurring job across many pages, it pays for itself quickly.",[27,67,69],{"id":68},"a-basic-spider","A basic spider",[14,71,72],{},"A spider defines where to start, how to follow links, and how to parse each page.",[74,75,80],"pre",{"className":76,"code":77,"language":78,"meta":79,"style":79},"language-python shiki shiki-themes github-light github-dark","import scrapy\n\nclass ProductSpider(scrapy.Spider):\n    name = \"products\"\n    start_urls = [\"https:\u002F\u002Fexample.com\u002Fcategory\u002Fpage\u002F1\"]\n\n    def parse(self, response):\n        for product in response.css(\"div.product\"):\n            yield {\n                \"name\": product.css(\"h2.title::text\").get(),\n                \"price\": product.css(\"span.price::text\").get(),\n                \"url\": product.css(\"a::attr(href)\").get(),\n            }\n\n        # Follow pagination\n        next_page = response.css(\"a.next::attr(href)\").get()\n        if next_page:\n            yield response.follow(next_page, callback=self.parse)\n","python","",[18,81,82,90,97,103,109,115,120,126,132,138,144,150,156,162,167,173,179,185],{"__ignoreMap":79},[83,84,87],"span",{"class":85,"line":86},"line",1,[83,88,89],{},"import scrapy\n",[83,91,93],{"class":85,"line":92},2,[83,94,96],{"emptyLinePlaceholder":95},true,"\n",[83,98,100],{"class":85,"line":99},3,[83,101,102],{},"class ProductSpider(scrapy.Spider):\n",[83,104,106],{"class":85,"line":105},4,[83,107,108],{},"    name = \"products\"\n",[83,110,112],{"class":85,"line":111},5,[83,113,114],{},"    start_urls = [\"https:\u002F\u002Fexample.com\u002Fcategory\u002Fpage\u002F1\"]\n",[83,116,118],{"class":85,"line":117},6,[83,119,96],{"emptyLinePlaceholder":95},[83,121,123],{"class":85,"line":122},7,[83,124,125],{},"    def parse(self, response):\n",[83,127,129],{"class":85,"line":128},8,[83,130,131],{},"        for product in 
response.css(\"div.product\"):\n",[83,133,135],{"class":85,"line":134},9,[83,136,137],{},"            yield {\n",[83,139,141],{"class":85,"line":140},10,[83,142,143],{},"                \"name\": product.css(\"h2.title::text\").get(),\n",[83,145,147],{"class":85,"line":146},11,[83,148,149],{},"                \"price\": product.css(\"span.price::text\").get(),\n",[83,151,153],{"class":85,"line":152},12,[83,154,155],{},"                \"url\": product.css(\"a::attr(href)\").get(),\n",[83,157,159],{"class":85,"line":158},13,[83,160,161],{},"            }\n",[83,163,165],{"class":85,"line":164},14,[83,166,96],{"emptyLinePlaceholder":95},[83,168,170],{"class":85,"line":169},15,[83,171,172],{},"        # Follow pagination\n",[83,174,176],{"class":85,"line":175},16,[83,177,178],{},"        next_page = response.css(\"a.next::attr(href)\").get()\n",[83,180,182],{"class":85,"line":181},17,[83,183,184],{},"        if next_page:\n",[83,186,188],{"class":85,"line":187},18,[83,189,190],{},"            yield response.follow(next_page, callback=self.parse)\n",[14,192,193],{},"Scrapy queues every yielded request and schedules it with the concurrency settings you choose, so following thousands of pagination links needs no extra code.",[27,195,197],{"id":196},"item-pipelines-for-clean-data","Item pipelines for clean data",[14,199,200],{},"Raw scraped fields are messy. Prices have currency symbols, whitespace creeps in, and duplicates appear. 
Pipelines process each item before it is stored.",[74,202,204],{"className":76,"code":203,"language":78,"meta":79,"style":79},"class CleanPricePipeline:\n    def process_item(self, item, spider):\n        if item.get(\"price\"):\n            item[\"price\"] = (\n                item[\"price\"].replace(\"$\", \"\").replace(\",\", \"\").strip()\n            )\n        return item\n\nclass DropEmptyPipeline:\n    def process_item(self, item, spider):\n        if not item.get(\"name\"):\n            raise scrapy.exceptions.DropItem(\"Missing name\")\n        return item\n",[18,205,206,211,216,221,226,231,236,241,245,250,254,259,264],{"__ignoreMap":79},[83,207,208],{"class":85,"line":86},[83,209,210],{},"class CleanPricePipeline:\n",[83,212,213],{"class":85,"line":92},[83,214,215],{},"    def process_item(self, item, spider):\n",[83,217,218],{"class":85,"line":99},[83,219,220],{},"        if item.get(\"price\"):\n",[83,222,223],{"class":85,"line":105},[83,224,225],{},"            item[\"price\"] = (\n",[83,227,228],{"class":85,"line":111},[83,229,230],{},"                item[\"price\"].replace(\"$\", \"\").replace(\",\", \"\").strip()\n",[83,232,233],{"class":85,"line":117},[83,234,235],{},"            )\n",[83,237,238],{"class":85,"line":122},[83,239,240],{},"        return item\n",[83,242,243],{"class":85,"line":128},[83,244,96],{"emptyLinePlaceholder":95},[83,246,247],{"class":85,"line":134},[83,248,249],{},"class DropEmptyPipeline:\n",[83,251,252],{"class":85,"line":140},[83,253,215],{},[83,255,256],{"class":85,"line":146},[83,257,258],{},"        if not item.get(\"name\"):\n",[83,260,261],{"class":85,"line":152},[83,262,263],{},"            raise scrapy.exceptions.DropItem(\"Missing name\")\n",[83,265,266],{"class":85,"line":158},[83,267,240],{},[14,269,270,271,274],{},"Register them in ",[18,272,273],{},"settings.py"," with a priority number that sets the order:",[74,276,278],{"className":76,"code":277,"language":78,"meta":79,"style":79},"ITEM_PIPELINES = {\n    
\"myproject.pipelines.CleanPricePipeline\": 100,\n    \"myproject.pipelines.DropEmptyPipeline\": 200,\n}\n",[18,279,280,285,290,295],{"__ignoreMap":79},[83,281,282],{"class":85,"line":86},[83,283,284],{},"ITEM_PIPELINES = {\n",[83,286,287],{"class":85,"line":92},[83,288,289],{},"    \"myproject.pipelines.CleanPricePipeline\": 100,\n",[83,291,292],{"class":85,"line":99},[83,293,294],{},"    \"myproject.pipelines.DropEmptyPipeline\": 200,\n",[83,296,297],{"class":85,"line":105},[83,298,299],{},"}\n",[27,301,303],{"id":302},"tuning-concurrency-without-getting-banned","Tuning concurrency without getting banned",[14,305,306],{},"The default settings are conservative. For a large job you want more throughput, but pushing too hard gets you blocked. The key settings:",[74,308,310],{"className":76,"code":309,"language":78,"meta":79,"style":79},"# settings.py\nCONCURRENT_REQUESTS = 16\nCONCURRENT_REQUESTS_PER_DOMAIN = 8\nDOWNLOAD_DELAY = 0.5\nAUTOTHROTTLE_ENABLED = True\nAUTOTHROTTLE_TARGET_CONCURRENCY = 4.0\nRETRY_TIMES = 3\n",[18,311,312,317,322,327,332,337,342],{"__ignoreMap":79},[83,313,314],{"class":85,"line":86},[83,315,316],{},"# settings.py\n",[83,318,319],{"class":85,"line":92},[83,320,321],{},"CONCURRENT_REQUESTS = 16\n",[83,323,324],{"class":85,"line":99},[83,325,326],{},"CONCURRENT_REQUESTS_PER_DOMAIN = 8\n",[83,328,329],{"class":85,"line":105},[83,330,331],{},"DOWNLOAD_DELAY = 0.5\n",[83,333,334],{"class":85,"line":111},[83,335,336],{},"AUTOTHROTTLE_ENABLED = True\n",[83,338,339],{"class":85,"line":117},[83,340,341],{},"AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0\n",[83,343,344],{"class":85,"line":122},[83,345,346],{},"RETRY_TIMES = 3\n",[14,348,349],{},"AutoThrottle is the part most people miss. It watches response latency and slows down automatically when the server is under load, which keeps you below the rate that triggers bans. 
Start gentle and increase concurrency only while the block rate stays at zero.",[27,351,353],{"id":352},"adding-proxies-with-middleware","Adding proxies with middleware",[14,355,356],{},"For protected sites you need rotating proxies. Scrapy applies them through downloader middleware so every request goes through the pool.",[74,358,360],{"className":76,"code":359,"language":78,"meta":79,"style":79},"import random\n\nclass ProxyMiddleware:\n    PROXIES = [\n        \"http:\u002F\u002Fuser:pass@p1.provider.com:8000\",\n        \"http:\u002F\u002Fuser:pass@p2.provider.com:8000\",\n    ]\n\n    def process_request(self, request, spider):\n        request.meta[\"proxy\"] = random.choice(self.PROXIES)\n",[18,361,362,367,371,376,381,386,391,396,400,405],{"__ignoreMap":79},[83,363,364],{"class":85,"line":86},[83,365,366],{},"import random\n",[83,368,369],{"class":85,"line":92},[83,370,96],{"emptyLinePlaceholder":95},[83,372,373],{"class":85,"line":99},[83,374,375],{},"class ProxyMiddleware:\n",[83,377,378],{"class":85,"line":105},[83,379,380],{},"    PROXIES = [\n",[83,382,383],{"class":85,"line":111},[83,384,385],{},"        \"http:\u002F\u002Fuser:pass@p1.provider.com:8000\",\n",[83,387,388],{"class":85,"line":117},[83,389,390],{},"        \"http:\u002F\u002Fuser:pass@p2.provider.com:8000\",\n",[83,392,393],{"class":85,"line":122},[83,394,395],{},"    ]\n",[83,397,398],{"class":85,"line":128},[83,399,96],{"emptyLinePlaceholder":95},[83,401,402],{"class":85,"line":134},[83,403,404],{},"    def process_request(self, request, spider):\n",[83,406,407],{"class":85,"line":140},[83,408,409],{},"        request.meta[\"proxy\"] = random.choice(self.PROXIES)\n",[14,411,412,413,418],{},"For the rotation, retry, and geolocation details that make this reliable, see my guide on ",[414,415,417],"a",{"href":416},"\u002Fblog\u002Frotating-proxies-for-web-scraping","integrating rotating proxies",".",[27,420,422],{"id":421},"exporting-to-a-database","Exporting to a 
database",[14,424,425],{},"For a real pipeline you want the data in a database, not a CSV. A storage pipeline writes each item as it is scraped.",[74,427,429],{"className":76,"code":428,"language":78,"meta":79,"style":79},"import psycopg2\n\nclass PostgresPipeline:\n    def open_spider(self, spider):\n        self.conn = psycopg2.connect(\"dbname=scrape user=postgres\")\n        self.cur = self.conn.cursor()\n\n    def process_item(self, item, spider):\n        self.cur.execute(\n            \"INSERT INTO products (name, price, url) VALUES (%s, %s, %s) \"\n            \"ON CONFLICT (url) DO UPDATE SET price = EXCLUDED.price\",\n            (item[\"name\"], item[\"price\"], item[\"url\"]),\n        )\n        self.conn.commit()\n        return item\n\n    def close_spider(self, spider):\n        self.cur.close()\n        self.conn.close()\n",[18,430,431,436,440,445,450,455,460,464,468,473,478,483,488,493,498,502,506,511,516],{"__ignoreMap":79},[83,432,433],{"class":85,"line":86},[83,434,435],{},"import psycopg2\n",[83,437,438],{"class":85,"line":92},[83,439,96],{"emptyLinePlaceholder":95},[83,441,442],{"class":85,"line":99},[83,443,444],{},"class PostgresPipeline:\n",[83,446,447],{"class":85,"line":105},[83,448,449],{},"    def open_spider(self, spider):\n",[83,451,452],{"class":85,"line":111},[83,453,454],{},"        self.conn = psycopg2.connect(\"dbname=scrape user=postgres\")\n",[83,456,457],{"class":85,"line":117},[83,458,459],{},"        self.cur = self.conn.cursor()\n",[83,461,462],{"class":85,"line":122},[83,463,96],{"emptyLinePlaceholder":95},[83,465,466],{"class":85,"line":128},[83,467,215],{},[83,469,470],{"class":85,"line":134},[83,471,472],{},"        self.cur.execute(\n",[83,474,475],{"class":85,"line":140},[83,476,477],{},"            \"INSERT INTO products (name, price, url) VALUES (%s, %s, %s) \"\n",[83,479,480],{"class":85,"line":146},[83,481,482],{},"            \"ON CONFLICT (url) DO UPDATE SET price = 
EXCLUDED.price\",\n",[83,484,485],{"class":85,"line":152},[83,486,487],{},"            (item[\"name\"], item[\"price\"], item[\"url\"]),\n",[83,489,490],{"class":85,"line":158},[83,491,492],{},"        )\n",[83,494,495],{"class":85,"line":164},[83,496,497],{},"        self.conn.commit()\n",[83,499,500],{"class":85,"line":169},[83,501,240],{},[83,503,504],{"class":85,"line":175},[83,505,96],{"emptyLinePlaceholder":95},[83,507,508],{"class":85,"line":181},[83,509,510],{},"    def close_spider(self, spider):\n",[83,512,513],{"class":85,"line":187},[83,514,515],{},"        self.cur.close()\n",[83,517,519],{"class":85,"line":518},19,[83,520,521],{},"        self.conn.close()\n",[14,523,524,525,528],{},"The ",[18,526,527],{},"ON CONFLICT"," clause makes re-runs idempotent, so scraping the same page twice updates the price instead of creating a duplicate row.",[27,530,532],{"id":531},"handling-javascript-heavy-pages","Handling JavaScript-heavy pages",[14,534,535,536,539],{},"Scrapy fetches raw HTML and does not run JavaScript. For pages that render content client side, pair Scrapy with a browser using ",[18,537,538],{},"scrapy-playwright",", which lets a spider request a fully rendered page only when needed while keeping the fast path for static pages.",[27,541,543],{"id":542},"when-scrapy-is-the-right-call","When Scrapy is the right call",[14,545,546],{},"Reach for Scrapy when the job is recurring, spans many pages, and needs reliability: price monitoring, catalog extraction, or any pipeline that runs on a schedule. For a quick one-time grab of a single page, a small script is simpler. Match the tool to the job.",[27,548,550],{"id":549},"need-a-production-scraping-pipeline-built","Need a production scraping pipeline built?",[14,552,553,554,560,561,565],{},"I build Scrapy-based pipelines with proxy rotation, retry logic, and database export that run on a schedule and stay reliable at scale. 
If you have a recurring scraping need, ",[414,555,559],{"href":556,"rel":557},"https:\u002F\u002Fwww.upwork.com\u002Ffreelancers\u002Fphanvuong2",[558],"nofollow","hire me on Upwork"," or reach out through the ",[414,562,564],{"href":563},"\u002F#contact","contact form",". I respond within 24 hours.",[567,568,569],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}",{"title":79,"searchDepth":92,"depth":92,"links":571},[572,573,574,575,576,577,578,579,580],{"id":29,"depth":92,"text":30},{"id":68,"depth":92,"text":69},{"id":196,"depth":92,"text":197},{"id":302,"depth":92,"text":303},{"id":352,"depth":92,"text":353},{"id":421,"depth":92,"text":422},{"id":531,"depth":92,"text":532},{"id":542,"depth":92,"text":543},{"id":549,"depth":92,"text":550},"2026-06-05","How to use Scrapy for production scraping at scale. 
Covers spiders, item pipelines, concurrency tuning, proxy and retry middleware, and exporting to databases.",false,"md",{},"\u002Fblog\u002Fscrapy-large-scale-scraping","9 min read",{"title":5,"description":582},"blog\u002Fscrapy-large-scale-scraping",[591,592,78,593,594],"scrapy","web scraping","data pipeline","automation",[596,597,598,599],"Use Scrapy when the job is recurring and spans many pages, not for a one-off scrape.","Item pipelines clean, validate, and store data in stages.","AutoThrottle adapts the request rate to avoid bans; raise concurrency only while block rate stays at zero.","Use ON CONFLICT upserts so re-runs update existing rows instead of duplicating.",null,"3Rf6J8LSUfGb3ScqiCyjd2tMEH31AVwQjz9pvRjjTaY",1781254278206]