Scrapy is a powerful framework for large-scale web scraping. The best way to manage proxies in Scrapy is by using a custom Downloader Middleware. This allows you to centralize your proxy logic and keep your spiders clean.

Step 1: Create the Proxy Middleware

First, let’s create a middleware that will attach our proxy credentials to every request.
  1. In your Scrapy project, open the middlewares.py file.
  2. Add the following class. It reads credentials from your settings.py and can be easily extended to handle dynamic parameters.
middlewares.py
# my_project/middlewares.py
import base64

class Scrapy2extractProxyMiddleware:
    def process_request(self, request, spider):
        # By default, use the static proxy settings
        proxy_user = spider.settings.get('PROXY_USER')
        proxy_pass = spider.settings.get('PROXY_PASS')

        # Allow spiders to dynamically override proxy settings per-request
        if 'proxy_username_params' in request.meta:
            params = request.meta['proxy_username_params']
            proxy_user = f"{proxy_user}{params}"

        # Set the proxy for the request
        request.meta['proxy'] = f"http://{spider.settings.get('PROXY_HOST')}:{spider.settings.get('PROXY_PORT')}"

        # Set the Proxy-Authorization header for authentication
        proxy_auth = f"{proxy_user}:{proxy_pass}"
        auth_header = 'Basic ' + base64.b64encode(proxy_auth.encode()).decode()
        request.headers['Proxy-Authorization'] = auth_header

        spider.logger.info(f"Using proxy user: {proxy_user} for {request.url}")
This middleware lets a spider pass dynamic parameters (such as -country-de) to the proxy on a per-request basis via request.meta['proxy_username_params'].
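For example, a spider can route a single request through a German exit node simply by setting that meta key. Below is a minimal sketch; the spider name and target URL are placeholders, and the -country-de parameter follows the format mentioned above (check your 2extract.com dashboard for the exact parameters your plan supports).
spiders/country_demo_spider.py
# my_project/spiders/country_demo_spider.py
import scrapy

class CountryDemoSpider(scrapy.Spider):
    name = "country_demo"

    def start_requests(self):
        # The middleware appends this string to PROXY_USER for this request only
        yield scrapy.Request(
            "https://example.com/",
            callback=self.parse,
            meta={'proxy_username_params': '-country-de'},
        )

    def parse(self, response):
        self.logger.info(f"Got {response.status} from {response.url} via the DE proxy")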

Step 2: Configure settings.py

Now, let’s enable the middleware and add your credentials.
  1. Open your settings.py file.
  2. Add your base proxy credentials.
  3. Enable the Scrapy2extractProxyMiddleware in DOWNLOADER_MIDDLEWARES.
settings.py
# my_project/settings.py

# --- 2extract.com Proxy Credentials ---
PROXY_HOST = "proxy.2extract.net"
PROXY_PORT = 5555
PROXY_USER = "PROXY_USERNAME"  # Base username only; routing parameters are added by the middleware
PROXY_PASS = "PROXY_PASSWORD"

# --- Enable the Middleware ---
# The priority (610) runs this middleware before Scrapy's built-in
# HttpProxyMiddleware (750), so the proxy and auth header are set early.
DOWNLOADER_MIDDLEWARES = {
   'my_project.middlewares.Scrapy2extractProxyMiddleware': 610,
}
Your basic setup is now complete! Every request from every spider in your project will automatically use your 2extract.com proxy.
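To verify the setup before writing a real spider, you can point a throwaway spider at an IP-echo endpoint and check that the logged address belongs to the proxy network rather than your own machine. This is just a sanity-check sketch; httpbin.org/ip is used here only as an example echo service.
spiders/proxy_check_spider.py
# my_project/spiders/proxy_check_spider.py
import scrapy

class ProxyCheckSpider(scrapy.Spider):
    name = "proxy_check"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        # httpbin returns the IP address the request arrived from,
        # which should be one of the proxy's exit IPs
        self.logger.info(f"Exit IP response: {response.text}")
Run it with scrapy crawl proxy_check and compare the logged IP with your own.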

Real-World Example: Scraping Steam Specials

Now, let’s use the middleware we just created to scrape all discounted games from the Steam store. Because all the proxy logic lives in the middleware, the spider itself stays focused on scraping.

The Spider (steam_specials_spider.py)

Notice how clean this spider is. It only focuses on scraping logic and passing dynamic parameters to the middleware.
spiders/steam_specials_spider.py
import scrapy
import random

class SteamSpecialsSpider(scrapy.Spider):
    name = "steam_specials"
    start_urls = ["https://store.steampowered.com/specials"]

    def start_requests(self):
        # Create a unique session ID for this job
        session_id = f"steam-special-{random.randint(1000, 9999)}"

        # Define the parameters we want to pass to our middleware
        # We will use the same sticky session for all pages
        proxy_params = f"-country-us-session-{session_id}"

        cookies = {'birthtime': '568022401', 'wants_mature_content': '1', 'steamCountry': 'US|...'}

        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # Pass our dynamic parameters to the middleware
                meta={'proxy_username_params': proxy_params},
                cookies=cookies,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
                }
            )

    def parse(self, response):
        # Extract data from the current page
        games = response.css('a.salepreviewwidgets_SaleItemBrowserRow_y9MSd')

        if not games:
            self.logger.warning("No games found on page. The CSS selector might be outdated.")
            return

        for game in games:
            yield {
                'title': game.css('div.salepreviewwidgets_StoreSaleWidgetTitle_3jI46::text').get(),
                'original_price': game.css('div.salepreviewwidgets_StoreOriginalPrice_2e_2G::text').get(),
                'discount_price': game.css('div.salepreviewwidgets_StoreSalePriceBox_Wh0L8::text').get(),
            }

        # Find and follow the "Next" page link
        next_page_link = response.css('a.paging_Arrow_2JZe3[href*="specials?page="]:last-of-type::attr(href)').get()
        if next_page_link:
            # `meta` is not carried over to follow-up requests automatically,
            # so pass the proxy parameters again to keep the same sticky session.
            yield response.follow(
                next_page_link,
                callback=self.parse,
                meta={'proxy_username_params': response.meta.get('proxy_username_params')},
            )
        else:
            self.logger.info("Last page reached. Finishing crawl.")

CSS Selectors change! Websites like Steam frequently update their design. The CSS selectors in this example are correct at the time of writing but may need to be updated in the future.
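To run the example and export the results, Scrapy's standard feed exports work as usual. For instance (the -O flag, available in Scrapy 2.0+, overwrites the output file; use -o to append on older versions):
scrapy crawl steam_specials -O steam_specials.json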

What This Example Demonstrates

  • Best Practice: The spider is clean and focused on what to scrape. The middleware handles how to connect. This is a robust and scalable approach.
  • Dynamic Control: The spider can easily change its geo-location or session for any request by simply changing the proxy_username_params value in the meta dictionary.
  • Code Reusability: The Scrapy2extractProxyMiddleware can now be reused by all spiders in your project without duplicating code.