Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium. It’s essential for scraping modern, JavaScript-driven websites (Single Page Applications).
Integrating 2extract.com proxies with Puppeteer requires passing the `--proxy-server` argument when launching the browser.
Basic Setup
Here’s how to launch a Puppeteer instance that routes all its traffic through our proxy gateway.
You’ll need `puppeteer` installed in your project: `npm install puppeteer`
```javascript
const puppeteer = require('puppeteer');

// 1. Get these from your proxy's "Connection Details" page
const proxyHost = "proxy.2extract.net";
const proxyPort = 5555;
const proxyUser = "PROXY_USERNAME";
const proxyPass = "PROXY_PASSWORD";

const proxyServer = `http://${proxyHost}:${proxyPort}`;

(async () => {
  console.log('Launching browser with proxy...');
  const browser = await puppeteer.launch({
    headless: false, // Set to true for production, false for debugging
    // 2. Pass the proxy server URL as an argument
    args: [`--proxy-server=${proxyServer}`]
  });

  const page = await browser.newPage();

  // 3. Authenticate the proxy for this page
  await page.authenticate({
    username: proxyUser,
    password: proxyPass
  });

  console.log('Navigating to IP checker...');
  await page.goto('https://api.ipify.org?format=json', { waitUntil: 'networkidle0' });

  // 4. Get the content and verify the IP
  const content = await page.evaluate(() => document.body.textContent);
  console.log('Success! Your proxy IP is:', JSON.parse(content).ip);

  await browser.close();
})();
```
Important: Puppeteer requires you to authenticate the proxy on a per-page basis using `page.authenticate()`. You must call this method for every new page (`browser.newPage()`) you open.
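If your script opens many pages, a small wrapper keeps that step from being forgotten. Here is a minimal sketch, assuming the `proxyUser` and `proxyPass` constants from the setup above; `newProxiedPage` is just an illustrative name, not a Puppeteer API:

```javascript
// Hypothetical helper: create a page that is already authenticated against the proxy.
async function newProxiedPage(browser) {
  const page = await browser.newPage();
  await page.authenticate({ username: proxyUser, password: proxyPass });
  return page;
}

// Usage: const page = await newProxiedPage(browser);
```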
Real-World Example: Taking a Screenshot of Amazon Search Results
A common use case for Puppeteer is to render a full page with dynamic content and take a screenshot, for example, to monitor product rankings or search results on a major eCommerce site like Amazon.
Let’s take a screenshot of the search results for “web scraping books” on amazon.com, making the request appear as if it’s coming from the United States.
```javascript
const puppeteer = require('puppeteer');

// --- Your Base Credentials ---
const BASE_USERNAME = "PROXY_USERNAME";
const PASSWORD = "PROXY_PASSWORD";
const PROXY_HOST = "proxy.2extract.net";
const PROXY_PORT = 5555;
const proxyServer = `http://${PROXY_HOST}:${PROXY_PORT}`;

// --- Target Information ---
const SEARCH_QUERY = "web scraping books";
const TARGET_URL = `https://www.amazon.com/s?k=${encodeURIComponent(SEARCH_QUERY)}`;
const REGION = "us"; // United States

(async () => {
  console.log("Launching Puppeteer browser...");
  const browser = await puppeteer.launch({
    headless: true, // Use true for automated scripts
    args: [`--proxy-server=${proxyServer}`]
  });

  const page = await browser.newPage();

  // Dynamically construct the username for the target region
  const proxyUsername = `${BASE_USERNAME}-country-${REGION}`;

  // Authenticate using the dynamically created username
  await page.authenticate({
    username: proxyUsername,
    password: PASSWORD
  });

  // Set a realistic viewport and user agent to mimic a real user
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36');

  console.log(`Navigating to Amazon search results for "${SEARCH_QUERY}" via a ${REGION.toUpperCase()} proxy...`);

  try {
    // Navigate to the page and wait for the network to be mostly idle
    await page.goto(TARGET_URL, { waitUntil: 'networkidle2', timeout: 60000 });

    // (Optional) A small delay to ensure all dynamic content loads
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Take a screenshot and save it
    await page.screenshot({ path: `amazon_search_${REGION}.png`, fullPage: true });
    console.log(`Success! Screenshot saved as 'amazon_search_${REGION}.png'`);
  } catch (error) {
    console.error(`An error occurred: ${error.message}`);
    // In case of an error, it's useful to save the HTML for debugging
    const errorHtml = await page.content();
    require('fs').writeFileSync('error.html', errorHtml);
    console.log("Saved the page's HTML to error.html for debugging.");
  } finally {
    await browser.close();
    console.log("Browser closed.");
  }
})();
```
Amazon is a challenging target! They employ sophisticated anti-bot measures. While this script works, you may need to implement more advanced techniques for large-scale scraping, such as rotating User-Agents, handling CAPTCHAs, and mimicking human-like browsing behavior (e.g., random delays, mouse movements).
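As a starting point, here is a minimal sketch of two of those techniques, random delays and a rotating User-Agent, built only from standard Puppeteer and Node.js calls. The User-Agent strings and delay bounds are illustrative assumptions, not recommended values:

```javascript
// Small pool of User-Agent strings to rotate through (example values only).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
];
const randomUserAgent = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

// Wait a random amount of time between actions to look less robotic.
const randomDelay = (minMs, maxMs) =>
  new Promise(resolve => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));

// Inside the script above, before and between navigations:
// await page.setUserAgent(randomUserAgent());
// await randomDelay(1000, 3000);
// await page.mouse.move(200 + Math.random() * 400, 150 + Math.random() * 300);
```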
What This Example Demonstrates
- Scraping a Real, Complex Website: How to approach a major target like Amazon.
- Proxy Authentication in Puppeteer: The correct two-step process of setting `--proxy-server` at launch and calling `page.authenticate()` on each page.
- Dynamic Geo-Targeting: Shows how to change your perceived location by modifying the username (a multi-region sketch follows this list).
- Mimicking a Real User: Demonstrates best practices like setting a realistic viewport and `User-Agent` to reduce the chance of being detected by anti-bot systems.
- Error Handling: Includes a `try...catch` block and saves the page’s HTML on failure, which is a crucial technique for debugging scraping issues.
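Building on the geo-targeting point, the same pattern extends to several regions in one run. A minimal sketch, assuming the `-country-<code>` username format from the example also accepts other ISO country codes (check your provider’s documentation) and reusing the `browser`, `BASE_USERNAME`, `PASSWORD`, and `TARGET_URL` already defined there:

```javascript
// Runs inside an async function/IIFE, after the browser has been launched.
for (const region of ['us', 'de', 'jp']) {
  const page = await browser.newPage();
  // The proxy picks the exit country from the username suffix.
  await page.authenticate({
    username: `${BASE_USERNAME}-country-${region}`,
    password: PASSWORD
  });
  await page.goto(TARGET_URL, { waitUntil: 'networkidle2', timeout: 60000 });
  await page.screenshot({ path: `amazon_search_${region}.png`, fullPage: true });
  await page.close();
}
```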