How to parse sites with Puppeteer Stealth

Anyone who parses websites knows that site owners don't like their servers being overloaded with automated requests, so they try to detect and block bots. That's why parsers need add-ons that help them avoid detection.

Puppeteer controls a real browser, so it interacts with a site the way a normal browser does, and the Stealth plugin hides the most common signs of automation. You need this for parsing, especially if you work with sites that actively block parsers. That list includes most major websites and applications.

In this article we will tell you how to use Puppeteer Stealth for parsing.

What is parsing and why do you need Puppeteer Stealth?

Parsing (also known as web scraping) is the automated collection of data from websites by a program, for example information about prices, discounts, and news. Many websites protect themselves against such bots.

Puppeteer Stealth addresses this problem. It makes the script's browser look like one operated by a real person. To the site, it appears that an ordinary visitor arrived, clicked around, and used the search.

Proxies spread these requests across several IP addresses, so no single address sends a suspicious number of requests. Keep the overall request rate moderate as well: proxies hide where the traffic comes from, but they do not reduce the load on the server.

Parser: what it is and how it works with Puppeteer

A parser is a tool that automates data extraction. Puppeteer with the Stealth plugin makes the parser much harder for anti-bot systems to detect, so you can keep working with a lower risk of getting banned.

Data is the key to successful parsing

Data is the main purpose of parsing. Puppeteer can collect whatever a page renders, from text and images to tables and lists. Stealth helps you do this discreetly, which comes in handy when you regularly collect information from a single site.

How to parse: install Puppeteer and Puppeteer Stealth

Step 1: Install Puppeteer and Stealth

The first step to successfully parsing sites is to install Puppeteer and Stealth. Both are npm packages, so Node.js must already be installed. Open a terminal (Command Prompt on Windows) in your project folder and run:

npm install puppeteer

npm install puppeteer-extra puppeteer-extra-plugin-stealth

You will see npm download and install the required packages.

Step 2: Customize the code

After installation, you need to add the plugin to the script. Puppeteer scripts are written in JavaScript and run with Node.js, so open the JavaScript file that contains your parser and add the code:

const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

// Connecting Stealth
puppeteer.use(stealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  
  // Go to the website
  await page.goto('https://example.com');

  // Perform the necessary actions on the site (here, save a screenshot)
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

This script:

  1. Connects Puppeteer and the Stealth plugin.
  2. Launches the browser in headless mode (that is, without a visible window), which is convenient for most tasks.
  3. Opens a new tab, goes to the site, and takes a screenshot as an example.
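
The "perform the necessary actions" step is where data extraction usually goes. Below is a minimal sketch, assuming you want the page title and the text of every paragraph. The selector 'p' and the helper names cleanTexts and extractParagraphs are illustrative choices, not part of Puppeteer:

```javascript
// Collapse whitespace and drop empty strings from scraped text fragments.
// cleanTexts is a hypothetical helper, not part of Puppeteer.
function cleanTexts(texts) {
  return texts
    .map((t) => t.replace(/\s+/g, ' ').trim())
    .filter((t) => t.length > 0);
}

// Collect the title and paragraph texts from an already opened page.
// `page` is a Puppeteer Page, such as the one created in the script above.
async function extractParagraphs(page) {
  const title = await page.title();
  // $$eval runs the callback inside the browser on all matching elements.
  const raw = await page.$$eval('p', (nodes) =>
    nodes.map((n) => n.textContent || '')
  );
  return { title, paragraphs: cleanTexts(raw) };
}
```

Inside the script above, calling `const data = await extractParagraphs(page);` after `page.goto(...)` would return the collected text as a plain object that can be logged or saved.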

How do I connect a proxy to a script with Puppeteer Stealth?

To spread your requests across several IP addresses, you need to set up a proxy. This lowers the risk of bans from sites that watch for frequent requests from a single IP.
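
One possible way to spread work over several IPs is to assign target URLs to proxies in round-robin order. The sketch below only plans the assignment and does not open a browser. The proxy addresses and URLs are placeholders, and assignProxies is a hypothetical helper:

```javascript
// Assign each URL to a proxy in round-robin order.
// assignProxies is a hypothetical helper, not part of Puppeteer.
function assignProxies(urls, proxies) {
  if (proxies.length === 0) {
    throw new Error('At least one proxy is required');
  }
  return urls.map((url, i) => ({ url, proxy: proxies[i % proxies.length] }));
}

// Placeholder values for illustration only.
const plan = assignProxies(
  ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
  ['http://123.45.67.89:8080', 'http://123.45.67.90:8080']
);
console.log(plan);
```

Each entry of the plan can then be handled by a browser launched with that entry's proxy, as set up in the steps that follow.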

Step 1: Choose a proxy

You can use server proxies with either protocol: SOCKS5 or HTTP/HTTPS.

Copy the connection details from your account with the proxy provider: IP address, port, login, and password.

Step 2: Add proxy to the script

Now add the proxy to the Puppeteer script. Here is an example of what the code looks like:
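
A minimal sketch of such a script follows. It assumes an HTTP proxy at the example address 123.45.67.89:8080 and placeholder credentials. The PROXY object and the buildProxyArg helper are hypothetical names introduced for illustration; --proxy-server and page.authenticate() are the actual Chromium flag and Puppeteer method:

```javascript
// Placeholder connection details: replace with the values from your provider.
const PROXY = {
  host: '123.45.67.89',
  port: 8080,
  username: 'your-login',
  password: 'your-password',
};

// Build the Chromium flag that routes browser traffic through the proxy.
// buildProxyArg is a hypothetical helper; --proxy-server is the real flag.
function buildProxyArg({ host, port }, scheme = 'http') {
  return `--proxy-server=${scheme}://${host}:${port}`;
}

async function run() {
  // Loaded inside the function so the helper above has no dependencies.
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());

  // 1. Start the browser with the proxy address passed in args.
  const browser = await puppeteer.launch({
    headless: true,
    args: [buildProxyArg(PROXY)],
  });
  try {
    const page = await browser.newPage();
    // 2. Supply the proxy login and password before the first request.
    await page.authenticate({
      username: PROXY.username,
      password: PROXY.password,
    });
    // 3. Work with the page as usual.
    await page.goto('https://example.com');
    await page.screenshot({ path: 'example.png' });
  } finally {
    // Close the browser even if one of the steps above fails.
    await browser.close();
  }
}
```

To execute it, call the function at the end of the file, for example `run().catch(console.error);`.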

Let's break down the code to make it clearer:

  1. Starting a browser with a proxy. The address of the proxy server is passed in the args launch option. For an HTTP proxy it looks like --proxy-server=http://123.45.67.89:8080, where 123.45.67.89 is the IP and 8080 is the proxy port.
  2. Proxy authentication. If the proxy requires a login and password, call the page.authenticate() method with your credentials before opening the first page.
  3. Working with Puppeteer. The script then works as usual: the page is opened and actions are performed (in this case, a screenshot is taken).

Step 3: Testing the proxy

To check that the proxy is working correctly, you can run the script and check the IP address through a service that shows the IP:

await page.goto('https://www.whatismyip.com/');

If the site shows the proxy's IP address rather than your own, everything is configured correctly.
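
To avoid checking by eye, the script can read the reported address itself. The sketch below assumes a service that returns the caller's IP as plain text. The URL https://api.ipify.org is used only as an assumed example of such a service, and extractIPv4 and isUsingProxy are hypothetical helpers:

```javascript
// Pull the first IPv4-looking address out of a piece of text.
// extractIPv4 is a hypothetical helper, not part of Puppeteer.
function extractIPv4(text) {
  const match = text.match(/\b(?:\d{1,3}\.){3}\d{1,3}\b/);
  return match ? match[0] : null;
}

// Visit an IP-echo service and compare the result with the proxy's address.
// `page` is a Puppeteer Page from a browser launched with the proxy.
async function isUsingProxy(page, proxyIp, serviceUrl = 'https://api.ipify.org') {
  await page.goto(serviceUrl);
  const bodyText = await page.evaluate(() => document.body.innerText);
  return extractIPv4(bodyText) === proxyIp;
}
```

Some providers send traffic out through a different address than the one you connect to. In that case, compare against the exit IP that the provider lists.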

Step 4: Check Stealth operation

Now let's check whether Stealth is working. One way is to open a page that tests for bots, such as https://bot.sannysoft.com/. Run the script with Puppeteer Stealth and look at the results. If everything is configured correctly, the checks on the page should not flag the browser as automated.
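
A quick self-check can complement the visual test. The sketch below saves a screenshot of the test page and reads a few properties that detection scripts commonly inspect. The helper names and the output file name stealth-check.png are illustrative assumptions, and the three checks are a small sample rather than the full set real services use:

```javascript
// List the automation signals found in a fingerprint snapshot.
// findBotSignals is a hypothetical helper, not part of the Stealth plugin.
function findBotSignals({ webdriver, languages, userAgent }) {
  const signals = [];
  if (webdriver === true) {
    signals.push('navigator.webdriver is true');
  }
  if (!languages || languages.length === 0) {
    signals.push('navigator.languages is empty');
  }
  if (/HeadlessChrome/.test(userAgent || '')) {
    signals.push('user agent mentions HeadlessChrome');
  }
  return signals;
}

// Open the test page, save a screenshot, and return any signals found.
// `page` is a Puppeteer Page from a browser launched with the Stealth plugin.
async function checkStealth(page) {
  await page.goto('https://bot.sannysoft.com/');
  await page.screenshot({ path: 'stealth-check.png', fullPage: true });
  // This callback runs inside the browser, where `navigator` is available.
  const snapshot = await page.evaluate(() => ({
    webdriver: navigator.webdriver,
    languages: Array.from(navigator.languages || []),
    userAgent: navigator.userAgent,
  }));
  return findBotSignals(snapshot);
}
```

An empty array from `await checkStealth(page)` means none of these particular signals were present. It does not guarantee that a site's own detection will be fooled.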

Step 5: Optimize settings

Now that the basic setup is ready, you can optimize the script for specific tasks. Two things help the bot look more like a person:

  1. Simulating mouse movement. Some sites track the pointer to tell people from scripts. The Stealth plugin does not move the mouse by itself, but you can generate random movements with Puppeteer's page.mouse API so that the session looks like it is driven by a real person.
  2. Hiding signs of automation. The Stealth plugin masks properties such as navigator.webdriver, which otherwise tell the site that it is dealing with a bot.
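
Random mouse movements can be scripted with Puppeteer's page.mouse API. Below is a minimal sketch: randomPoints and wanderMouse are hypothetical helpers, and the point count, area size, and delays are arbitrary values chosen for illustration:

```javascript
// Generate `count` random points inside an area of the given size.
// randomPoints is a hypothetical helper; `rng` defaults to Math.random.
function randomPoints(count, width, height, rng = Math.random) {
  return Array.from({ length: count }, () => ({
    x: Math.floor(rng() * width),
    y: Math.floor(rng() * height),
  }));
}

// Move the pointer through random points with short pauses in between.
// `page` is a Puppeteer Page; 800x600 matches Puppeteer's default viewport.
async function wanderMouse(page, count = 5, width = 800, height = 600) {
  for (const { x, y } of randomPoints(count, width, height)) {
    // `steps` makes Puppeteer send intermediate mousemove events.
    await page.mouse.move(x, y, { steps: 10 });
    // Pause for a random 100 to 500 ms before the next movement.
    await new Promise((resolve) =>
      setTimeout(resolve, 100 + Math.random() * 400)
    );
  }
}
```

Calling `await wanderMouse(page);` before clicking or scrolling adds some pointer activity to the session. How much this helps depends on what the target site actually measures.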

Parser: what it is and how to choose

A good parser is one that runs fast and without crashes. Puppeteer in combination with Stealth helps you avoid bans while collecting data. Even if you're not familiar with programming, it's easy to understand how the tool works: follow the instructions and adapt the code to your needs.

Once you have set up Puppeteer Stealth, you'll have a tool that parses data or tests sites without attracting much attention.