Parsing with Crawlee
Parsing means extracting data from websites and saving it in convenient tables. The information collected varies:
- Prices;
- Reviews;
- Ratings;
- Product names.
It can be any kind of information. What matters is the mechanism: a site holds a large mass of data, and you pull out only the pieces you need.
Crawlee is a library that helps collect this data quickly. In this article, we'll show you a template used to write parsers that gather information from websites. But first, let's clarify the difference between parsers and crawlers; without that, it's hard to grasp what Crawlee actually is.
The difference between crawlers and parsers
In discussions of data collection, the terms “crawler” and “parser” are often used as synonyms. However, this is a mistake.
Crawler
A crawler is a program that “traverses” websites. To collect data, it follows links within pages.
The crawler's task is to find pages from which to collect data. It moves through sites like a spider on a web (hence the name web crawler), going from page to page according to the logic set by the developer.
For example: a crawler starts on the home page of a site, follows links to subpages, and loads HTML code.
Crawling is what search engines do when indexing pages.
Parser
A parser is a program that takes the crawler's output (a list of website pages) and structures it. The parser's task is to extract the necessary information from the HTML code. It works only with what the crawler has collected.
Example: a parser extracts product names, prices, and reviews from an online store page. It takes only the necessary information, and only from the pages the crawler has given it.
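To make the division concrete, here is a tiny standalone sketch with the Cheerio library (which Crawlee uses under the hood): the parser receives ready-made HTML (the snippet below is made up) and only pulls values out of it.
import * as cheerio from 'cheerio';

// HTML that a crawler has already fetched (made-up markup)
const html = '<div class="product"><h2>Mug</h2><span class="price">$9</span></div>';

const $ = cheerio.load(html);
console.log($('h2').text());     // Mug
console.log($('.price').text()); // $9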
Is Crawlee a "crawler" or a "parser" after all?
The thing about Crawlee is that it combines both roles:
- Runs crawlers to traverse websites.
- Uses built-in parsers or custom scripts to extract data.
Example:
- The crawler finds all the pages of a product catalog.
- The parser extracts the title, price, and rating from each page.
- Crawlee saves the data to its built-in storage (sketched below).
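In Crawlee code, that pipeline can look like this minimal sketch (the start URL, link selector, and field selector are hypothetical):
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        // Crawler role: queue further catalog pages found on this one
        await enqueueLinks({ selector: 'a.catalog-link' });
        // Parser role: extract a field from the current page
        const title = $('h1').text();
        // Storage role: save the structured record
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com/catalog']);
The rest of the article fills in these pieces: installation, request handlers, proxies, and storage.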
What Crawlee is and why it's worth parsing with it
Crawlee is a JavaScript library for creating parsers. Its features are:
- Task queue management;
- Automatic data saving;
- Ability to parse complex sites.
Crawlee is indispensable when you need to collect data from dynamic sites whose content is loaded via JavaScript, for example, extracting reviews from Amazon while skipping dynamically loaded ad blocks.
Crawlee is also useful for those who collect large amounts of data. Let's see how to install it and get started.
How to install Crawlee
To install Crawlee, do the following:
- Install Node.js. You can download it from the official website.
- In the terminal, run the command:
npm install crawlee
- Create a project:
mkdir crawlee-project && cd crawlee-project
npm init -y
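- The examples below use ES module syntax (import and top-level await), so also mark the project as a module:
npm pkg set type=module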
As a result, you will have a working environment for parsing.
Getting started with Crawlee
Create an index.js file in a text editor (even Notepad will do) and write a basic parsing script in it. Something like this:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Called for every fetched page; $ is the page's HTML loaded into Cheerio
    requestHandler: async ({ request, $ }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
    },
});

await crawler.run(['https://example.com']);
This code requests the page and prints its title. Crawlee manages the request queue and responses automatically, while Cheerio parses the returned HTML.
Cheerio works on static HTML pages; for dynamic sites, Crawlee offers browser-based crawlers (covered in the next section). In practice this means that with Crawlee, you'll collect data even from complex sites.
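Selectors work just like in jQuery. For instance, here is a sketch that pulls product names and prices from a listing page (the URL and CSS classes are hypothetical):
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Iterate over every product card on the page (made-up markup)
        $('.product-card').each((_, el) => {
            const name = $(el).find('.product-name').text().trim();
            const price = $(el).find('.product-price').text().trim();
            console.log(`${name}: ${price} (${request.url})`);
        });
    },
});

await crawler.run(['https://example.com/catalog']);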
Handling dynamic sites
For sites that render content with JavaScript, plain HTTP requests are not enough. Crawlee supports Puppeteer and Playwright to handle such resources.
Example:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait until client-side JavaScript has rendered the heading
        await page.waitForSelector('h1');
        const data = await page.evaluate(() => document.querySelector('h1').innerText);
        console.log(data);
    },
});

await crawler.run(['https://example.com']);
This script uses a Playwright-driven browser to load the site, waits for the scripts to render the heading, and extracts the data.
Handling captcha and blocking
Crawlee supports working through proxies. Rent proxies and add a pool to your code:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// The proxyConfiguration option expects a ProxyConfiguration instance
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy1', 'http://proxy2'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, $ }) => {
        console.log(`Processing ${request.url}`);
    },
});
To use your own proxies, replace ['http://proxy1', 'http://proxy2'] with a list of proxy URLs built from the details you receive when renting: IP, port, login, and password.
That is, each proxy in your pool must be written in the format:
http://login:password@IP:port
For example, if:
- IP: 192.168.0.1
- Port: 8080
- Login: user123
- Password: pass123
Then the string will be:
http://user123:pass123@192.168.0.1:8080
If you have multiple proxies, record them in an array. Example:
const proxies = [
    'http://user1:pass1@192.168.0.1:8080',
    'http://user2:pass2@192.168.0.2:8080',
    'http://user3:pass3@192.168.0.3:8080',
];
Now pass the proxies array to ProxyConfiguration as the proxyUrls parameter:
const crawler = new CheerioCrawler({
    proxyConfiguration: new ProxyConfiguration({ proxyUrls: proxies }),
    requestHandler: async ({ request, $ }) => {
        console.log(`Processing ${request.url}`);
    },
});
Here's the result:
- Crawlee will rotate through the proxyUrls array, assigning a proxy to each request.
- The proxy is applied to requests automatically, and the login and password are passed to the proxy server for authorization.
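If you want to see the rotation in action outside a crawler, ProxyConfiguration exposes a newUrl() method; a quick sketch:
import { ProxyConfiguration } from 'crawlee';

// proxies is the array defined above
const proxyConfiguration = new ProxyConfiguration({ proxyUrls: proxies });

// Each call returns the next proxy URL from the pool
console.log(await proxyConfiguration.newUrl());
console.log(await proxyConfiguration.newUrl());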
Data storage
The collected data can be saved to a file or database. Crawlee supports a built-in Dataset. Example:
import { Dataset } from 'crawlee';

// Appends a record to the default dataset, stored locally under ./storage/datasets/default
await Dataset.pushData({ title: 'Example', url: 'https://example.com' });
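To work with the stored records directly, you can open the dataset yourself; a small sketch using Dataset.open() and getData():
import { Dataset } from 'crawlee';

// Open the default dataset explicitly
const dataset = await Dataset.open();

// pushData also accepts an array of records
await dataset.pushData([
    { title: 'First', url: 'https://example.com/1' },
    { title: 'Second', url: 'https://example.com/2' },
]);

// Read the stored records back
const { items } = await dataset.getData();
console.log(items.length); // 2, if the dataset was empty before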
With this feature, hundreds of thousands of records can be collected without loss: each record is persisted as soon as it is pushed, rather than held in memory.
Crawlee turns a complex parsing process into a sequence of understandable steps. This article doesn't walk you through building a complete working parser, but it has laid out the sequence of those steps and what Crawlee makes possible.