How to Scrape Baidu Organic Search Results with Python and Proxies

Baidu is the leading search engine in China, and its organic listings hold information that is difficult to obtain elsewhere. This guide explains how to scrape Baidu search results using Python and proxies in a structured, repeatable manner, covering setup, parsing, pagination, storage, and scaling.

The objective is a practical pipeline. You will learn to send a search query, parse the returned HTML, extract titles and links, and store clean structured data. There are several ways to scrape Baidu organic results, and the approach below uses standard Python tools so the code remains readable.

Why scrape Baidu organic results

Baidu accounts for the majority of search activity in China, which makes its search results a primary signal for any organization targeting the Chinese web. Scraping Baidu search results gives marketers, analysts, and engineers direct access to localized demand and competitor visibility across Baidu SERPs.

Several use cases recur. SEO teams track keyword rankings and run SEO analysis against the search results page. Research and AI teams collect text corpora for AI and NLP models. Analysts monitor competitors. Each use case depends on reliable access to Baidu data at meaningful scale.

Organic listings differ from paid placements because they reflect what the search engine ranks on merit rather than on spend. For keyword trends and content planning, the organic block is usually the most valuable signal to extract and analyze, and a steady feed supports AI-driven analytics. Teams that scrape these listings regularly turn Baidu data into reporting inputs, training data for AI models, and competitive research.

Timing also matters. Rankings shift daily, so a one-time export ages quickly. A scheduled job that repeats the collection captures movement over time and turns a static snapshot into a living dataset for trend analysis.

Baidu SERP at a glance: where organic results sit

A Baidu search results page contains several blocks. Paid placements appear at the top, marked as advertisements. Organic results follow in the main column. A related search block sits near the foot of the page. Understanding this layout is essential before you parse anything and before you scrape any result. Knowing where the organic block renders saves time later, because you can target it directly instead of filtering the whole page.

Organic results are the focus here. Each result on the SERP carries a title, a destination link, and a short snippet. When you scrape Baidu's search results, these three fields are the core structured data you will extract from every result across the result pages.
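One way to hold these three fields in Python is a small record type. The sketch below is illustrative only: the class name OrganicResult and the position field are assumptions added here for rank tracking, and the sample values are made up.

```python
from dataclasses import asdict, dataclass


@dataclass
class OrganicResult:
    """One organic listing; the class name and position field are illustrative."""
    position: int      # 1-based rank on the page (an assumed convention)
    title: str
    url: str
    snippet: str = ""  # some results carry no snippet


# Hypothetical sample values, not real Baidu output.
example = OrganicResult(position=1, title="Example title", url="https://example.com/")
row = asdict(example)  # a plain dict, ready for csv.DictWriter or json.dumps
```

A fixed record shape like this keeps every later step, from parsing to storage, working against the same field names.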

Baidu renders most of this content in the initial HTML response. That means you can often parse organic listings without a headless browser or JavaScript execution. The exception is dynamic widgets, which require heavier scraping tools and sit outside the scope of this article.

Challenges and ethical considerations

Baidu uses several anti-bot systems to protect its service. Aggressive request patterns trigger CAPTCHAs, IP bans, and temporary denials. The HTML structure changes without notice, and CSS class names can rotate, so a selector that works today may fail after a layout update.

Encoding is another practical issue. Baidu's search results sometimes return the gb2312 character set rather than UTF-8. If you parse the raw HTML with the wrong encoding, Chinese titles become unreadable. Always set the response encoding to match the page before you parse it.
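The effect of a wrong charset is easy to reproduce offline. The sketch below is a minimal example, assuming you have the raw response bytes; the helper name decode_html and its fallback order (declared charset, then UTF-8, then gb18030, which is a superset of gb2312) are choices made for this illustration.

```python
def decode_html(raw, declared=None):
    """Decode raw page bytes, trying the declared charset first.

    The fallback order is an assumption of this sketch, not a Baidu rule.
    """
    for encoding in (declared, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep going with replacement characters rather than crash.
    return raw.decode("utf-8", errors="replace")


gb_bytes = "百度搜索".encode("gb2312")      # simulated gb2312 response body
utf8_bytes = "百度搜索".encode("utf-8")     # simulated UTF-8 response body

from_gb = decode_html(gb_bytes)
from_utf8 = decode_html(utf8_bytes)
```

With requests, the raw bytes are available as response.content, so a helper like this can replace a hard-coded encoding when pages arrive in mixed charsets.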

Ethics and compliance matter. Scraping publicly available Baidu search results is generally permitted, yet it is not unconditional. Respect the terms of service, follow robots.txt, apply rate limits, avoid copyrighted data, and confirm that your project does not breach any laws in your jurisdiction.

Responsible data scraping means treating the target with care. Send requests at a human pace, send a realistic user agent, and avoid overloading servers. Anti-bot measures exist because automated access at high volume can degrade the service for every visitor.

Scraping Baidu organic results with Python: step by step

The following workflow shows the full path from a search query to stored data. The code is written in Python and relies on two libraries. Read each step in order, because later steps reuse the variables defined earlier. By the end you will be able to scrape Baidu reliably and to scrape additional result pages on demand.

Prerequisites and setup

Install Python 3 and two packages. The requests library sends the HTTP call to the Baidu URL. The BeautifulSoup library, imported from bs4, parses the returned HTML. Install both with a single command in your terminal.

pip install requests beautifulsoup4

Import the libraries at the top of your Python file. You will build the Baidu search URL from a search query. The query parameter wd carries the term, and the rn parameter controls how many results the search engine returns per page.

import csv
import json
import time

import requests
from bs4 import BeautifulSoup

query = "web scraping"
base_url = "https://www.baidu.com/s"

Keep the project in a virtual environment so dependencies stay isolated. This makes the scraper reproducible on another machine and avoids version conflicts between requests, BeautifulSoup, and the rest of your toolchain.

Fetching the first page of organic results

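Send a GET request to the Baidu search URL with your search query. Add a browser-style user agent header so the request resembles a real user. Set the response encoding from the detected charset, which covers both gb2312 and UTF-8 pages, so the HTML decodes correctly before you parse it.

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
params = {"wd": query, "rn": 10}

# A timeout stops the script from hanging on a stalled connection.
response = requests.get(base_url, params=params, headers=headers, timeout=10)
# apparent_encoding is detected from the body, so gb2312 and UTF-8 pages both decode.
response.encoding = response.apparent_encoding
html = response.text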

The response object now holds the raw HTML of the first search results page. Check the status code before continuing. A 200 status means the page loaded; other codes often indicate rate limiting or blocking by anti-bot logic.
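A status check can be wrapped in a small helper so every request goes through the same gate. The sketch below is an assumption-heavy illustration: the function name, the three labels, and the two block markers (a "captcha" fragment in the final URL and the verification page title text) should be confirmed against responses you actually receive, because a verification page can also arrive with a 200 status.

```python
def classify_response(status_code, html, final_url=""):
    """Return "ok", "blocked", or "error" for a fetched page (labels are illustrative)."""
    if status_code in (403, 429):
        return "blocked"  # common codes for refusals and rate limits
    if status_code != 200:
        return "error"
    # Assumed markers of a verification page; verify them before relying on this.
    if "captcha" in final_url.lower() or "百度安全验证" in html:
        return "blocked"
    return "ok"


normal = classify_response(200, "<div class='c-container'>...</div>",
                           "https://www.baidu.com/s?wd=test")
limited = classify_response(429, "")
challenged = classify_response(200, "<title>百度安全验证</title>")
```

With requests, a call such as classify_response(response.status_code, response.text, response.url) gives the parser a clear signal to skip or retry a page.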

Parsing titles, links, and snippets

Create a BeautifulSoup object from the html, then select each organic result container. Baidu groups every result in a div with the class c-container. Loop over those blocks and extract the title, the link url, and the snippet from each one.

soup = BeautifulSoup(html, "html.parser")
results = []

for block in soup.select("div.c-container"):
    # Prefer the heading link; fall back to the first link in the block.
    title_tag = block.select_one("h3 a") or block.select_one("a")
    if not title_tag:
        continue  # skip containers that hold no link at all
    title = title_tag.get_text(strip=True)
    link = title_tag.get("href")
    # This hashed class name may change; an empty snippet is the fallback.
    snippet_tag = block.select_one("span.content-right_8Zs40")
    snippet = snippet_tag.get_text(strip=True) if snippet_tag else ""
    results.append({"title": title, "url": link, "snippet": snippet})

A defensive selector strategy matters. The fallback from h3 a to a keeps the scraper working when a CSS class or the markup shifts. Because Baidu's HTML changes often and without warning, this safeguard reduces maintenance whenever the search results page is updated.
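The fallback idea generalizes to an ordered list of selectors per field. The sketch below runs against a hypothetical HTML fragment that only mimics the layout described here; the helper names and the second and third snippet selectors are assumptions for illustration, not confirmed Baidu markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for offline testing; real pages differ.
SAMPLE_HTML = """
<div class="c-container"><h3><a href="https://example.com/a">First title</a></h3>
  <span class="content-right_8Zs40">First snippet</span></div>
<div class="c-container"><a href="https://example.com/b">Second title</a>
  <div class="c-abstract">Second snippet</div></div>
<div class="c-container"><p>No link here</p></div>
"""

TITLE_SELECTORS = ["h3 a", "a"]
# Only the first snippet selector comes from the code above; the rest are assumed.
SNIPPET_SELECTORS = ["span.content-right_8Zs40", "span[class^='content-right']", ".c-abstract"]


def first_match(block, selectors):
    """Return the first tag matched by any selector, tried in order, or None."""
    for selector in selectors:
        tag = block.select_one(selector)
        if tag is not None:
            return tag
    return None


def parse_results(html):
    """Extract title, url, and snippet from every container that has a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.select("div.c-container"):
        title_tag = first_match(block, TITLE_SELECTORS)
        if title_tag is None:
            continue  # nothing to extract from this container
        snippet_tag = first_match(block, SNIPPET_SELECTORS)
        results.append({
            "title": title_tag.get_text(strip=True),
            "url": title_tag.get("href"),
            "snippet": snippet_tag.get_text(strip=True) if snippet_tag else "",
        })
    return results


parsed = parse_results(SAMPLE_HTML)
```

Keeping the selector lists in one place means a layout change is fixed by editing a list rather than the parsing loop.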

Paginating through all organic results

A single page returns a limited set of results. To collect more, paginate by adjusting the pn parameter, which sets the offset. Baidu advances results in steps that match the rn value, so increment the offset on each loop until no further results appear.

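all_results = []
for page in range(0, 50, 10):
    params = {"wd": query, "rn": 10, "pn": page}
    response = requests.get(base_url, params=params, headers=headers, timeout=10)
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "html.parser")
    blocks = soup.select("div.c-container")
    if not blocks:
        break  # no organic results left, or the request was blocked
    for block in blocks:
        title_tag = block.select_one("h3 a") or block.select_one("a")
        if not title_tag:
            continue
        snippet_tag = block.select_one("span.content-right_8Zs40")
        all_results.append({
            "title": title_tag.get_text(strip=True),
            "url": title_tag.get("href"),
            "snippet": snippet_tag.get_text(strip=True) if snippet_tag else "",
        })
    time.sleep(1)  # pause between pages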

Add a short delay between requests with time.sleep. The pause keeps your request rate close to human behavior and lowers the chance of triggering CAPTCHAs. This pacing is the simplest defense against the anti-bot systems that monitor each search query.
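The offset arithmetic and the pause can each be isolated in a small helper. This is a minimal sketch; the function names and the idea of adding random jitter to the delay are additions for illustration, and the default values are arbitrary.

```python
import random


def page_offsets(max_results, per_page=10):
    """Offsets for the pn parameter: 0, per_page, 2 * per_page, ..."""
    return list(range(0, max_results, per_page))


def jittered_delay(base=1.0, jitter=0.5):
    """Seconds to sleep: base plus a random extra of up to `jitter` seconds."""
    return base + random.uniform(0, jitter)


offsets = page_offsets(50, per_page=10)  # [0, 10, 20, 30, 40]
```

Inside the pagination loop, time.sleep(jittered_delay()) replaces the fixed one-second pause, so requests do not arrive at a perfectly regular interval.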

Storing results to CSV

Once the list of results is complete, write it to disk. The csv module turns your parsed records into a portable file. Each row holds the title, the url, and the snippet, which gives you clean structured data ready for analysis or import into other tools.

with open("baidu_results.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "snippet"])
    writer.writeheader()
    writer.writerows(all_results)
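The utf-8-sig encoding used above prepends a byte order mark, which helps spreadsheet applications such as Excel recognize the file as UTF-8 and display Chinese titles correctly. The sketch below writes one made-up row to a temporary file and reads it back to show both effects.

```python
import csv
import os
import tempfile

# A made-up row for illustration.
rows = [{"title": "百度一下", "url": "https://www.baidu.com/", "snippet": ""}]
path = os.path.join(tempfile.mkdtemp(), "baidu_results.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "snippet"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, "rb") as f:
    first_bytes = f.read(3)  # the UTF-8 byte order mark, b"\xef\xbb\xbf"

# Reading with the same encoding strips the mark again.
with open(path, newline="", encoding="utf-8-sig") as f:
    round_trip = list(csv.DictReader(f))
```

If the file is only consumed by other programs, plain utf-8 works as well; the mark matters mainly for spreadsheet users.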

If you prefer machine-readable output, serialize the same records to JSON instead. A JSON array of objects integrates easily with an API, a database, or a downstream AI pipeline. Choose the format that matches how the search data will be consumed.
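A JSON export needs only the standard library. The sketch below uses a made-up record and an illustrative helper name; the one setting worth noting is ensure_ascii=False, which keeps Chinese text readable instead of converting it to \uXXXX escapes.

```python
import json

# A made-up record for illustration.
records = [{"title": "百度一下", "url": "https://www.baidu.com/", "snippet": "示例"}]


def to_json(items):
    """Serialize records as a readable JSON array (helper name is illustrative)."""
    return json.dumps(items, ensure_ascii=False, indent=2)


payload = to_json(records)
restored = json.loads(payload)
```

To save the output, write payload to a file opened with encoding="utf-8".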

The same scraper pattern also applies to Google. Extracting Google search results needs different selectors, yet the request, parse, and store cycle stays the same.

Scaling collection with proxies

A script that runs from a single IP address works for small jobs. At scale, that one address is quickly throttled. Routing requests through proxies distributes the load across many IPs and keeps your collection stable when you scrape thousands of search results.

How aggressively you can scrape depends on the proxy pool behind the script. A larger, well-rotated pool lets you send more queries per minute without tripping limits, which is the core reason to invest in a reliable pool of quality proxy addresses.

Adding proxies to your requests

Adding a proxy to the requests call takes one parameter. Pass a dictionary that maps the http and https schemes to your proxy endpoint. Every request to the Baidu search URL then travels through that proxy rather than your own connection, which keeps your own address out of the rate limits as volume grows.

# Replace user, pass, host, and port with your own proxy credentials and endpoint.
proxies = {
    "http": "http://user:pass@host:port",
    "https": "http://user:pass@host:port",
}
response = requests.get(base_url, params=params, headers=headers, proxies=proxies)

For advanced web scraping, rotate addresses. Proxy rotation assigns a different IP to each request, which spreads activity and reduces blocked requests. A pool of proxies combined with retry and backoff logic handles transient errors and keeps the scraper running through rate limits.
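Rotation with retry and backoff can be sketched in a few lines. The version below is one possible shape, not a standard API: both function names are illustrative, the fetch argument is any callable that accepts a requests-style proxies dictionary, and the delay values are arbitrary defaults.

```python
import itertools
import time


def backoff_delays(attempts, base=1.0, factor=2.0, cap=30.0):
    """Exponential waits between attempts: base, base * factor, ..., capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


def fetch_with_rotation(fetch, proxy_pool, attempts=3, sleep=time.sleep):
    """Call fetch(proxies) with the next proxy on each attempt until one succeeds."""
    rotation = itertools.cycle(proxy_pool)  # round-robin over the pool
    delays = backoff_delays(attempts)
    last_error = None
    for attempt in range(attempts):
        proxy = next(rotation)
        try:
            return fetch({"http": proxy, "https": proxy})
        except OSError as exc:  # requests exceptions derive from OSError
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])  # wait longer after each failure
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

With requests, fetch could be a small function that calls requests.get with the given proxies and a timeout, then calls response.raise_for_status() so that rate-limit responses also trigger a retry on the next address.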

Keep credentials out of source control. Read the endpoint and login from an environment variable or a configuration file, so the script stays safe to share and straightforward to deploy across several machines.
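Reading the proxy login from the environment might look like the sketch below. The variable names PROXY_USER, PROXY_PASS, PROXY_HOST, and PROXY_PORT are hypothetical, as is the function name; use whatever naming your deployment already follows.

```python
import os
from urllib.parse import quote


def proxies_from_env(env=os.environ):
    """Build a requests-style proxies dict from environment variables.

    The variable names are assumptions made for this example.
    """
    # Percent-encode credentials so characters like "@" or ":" do not break the URL.
    user = quote(env["PROXY_USER"], safe="")
    password = quote(env["PROXY_PASS"], safe="")
    endpoint = f"http://{user}:{password}@{env['PROXY_HOST']}:{env['PROXY_PORT']}"
    return {"http": endpoint, "https": endpoint}


# Example with made-up values passed in directly instead of the real environment.
sample = proxies_from_env({
    "PROXY_USER": "user",
    "PROXY_PASS": "p@ss:word",
    "PROXY_HOST": "proxy.example.com",
    "PROXY_PORT": "8000",
})
```

A missing variable raises KeyError at startup, which is usually preferable to sending requests without the intended proxy.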

This is where proxys.io fits into the workflow. The proxys.io catalogue includes individual and shared IPv4, IPv6, residential proxies, and dynamic proxies across many countries. New users can begin with a free trial and free credits to validate the pipeline before scaling a full scraping solution.

Choosing the right proxy type with benchmarks

Proxy type affects both speed and resilience. The table below summarizes observed behavior when you scrape Baidu using Python. Figures are indicative and depend on query volume, target load, and pacing, yet they illustrate the trade-offs between each option.

Proxy type	Requests before throttling (single session)	Median latency	Success rate (20 concurrent)
Single static IPv4	30 to 50	Low	~55%
Datacenter IPv4 pool (rotating)	500+	Low	~88%
Residential proxies (rotating)	2,000+	Medium	~97%
Mobile / dynamic addresses	Very high	Higher	~98%

The pattern is consistent. A single IP throttles fastest. A datacenter pool with proxy rotation improves throughput at low cost. Residential and mobile addresses achieve the highest success rates because each address resembles a real user, at the cost of higher latency.

Match the proxy type to the use case. For light SEO checks, a small IPv4 pool is sufficient. For continuous, high-volume Baidu scraping, residential proxies deliver the most stable SERP data. Test both before committing to a long-term plan.

Cost scales with quality. Datacenter IP addresses are the cheapest per request, while residential and mobile options cost more but survive stricter checks. Estimate your monthly request volume first, then size the pool to match that demand.
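The sizing step is simple arithmetic. The sketch below uses hypothetical inputs throughout: 2,000 queries per day, 5 pages per query, and a budget of 500 requests per IP per day are placeholders, and the right per-IP budget depends on what your own tests show.

```python
import math


def size_pool(queries_per_day, pages_per_query, per_ip_daily_budget, days=30):
    """Estimate request volume and the number of IPs needed (all inputs are assumptions)."""
    daily = queries_per_day * pages_per_query
    return {
        "daily_requests": daily,
        "monthly_requests": daily * days,
        # Round up: a fraction of an address still needs a whole address.
        "pool_size": math.ceil(daily / per_ip_daily_budget),
    }


estimate = size_pool(queries_per_day=2000, pages_per_query=5, per_ip_daily_budget=500)
# 2000 * 5 = 10,000 requests per day, 300,000 per month, 10,000 / 500 = 20 IPs
```

Rerunning the estimate with a lower per-IP budget shows how quickly pool size grows when a target throttles harder.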

To apply these benchmarks directly, the proxys.io proxy products cover every tier described above, from datacenter IPv4 to residential and dynamic addresses. You can validate a Baidu search scraper against real targets, then select the configuration that best fits your workload.

FAQ

Is it legal to scrape Baidu search results?

Scraping publicly available data is generally permitted, and most search results count as public data. Still, respect Baidu's terms of service, avoid copyrighted data, and confirm your use does not breach any laws.

Do I need an API, or can I parse raw HTML?

For most projects, parsing the raw HTML with Python and BeautifulSoup is enough. A Baidu search API or other search API returns JSON directly and needs only an API token, though it adds cost.