When collecting data with a crawler, you sometimes need a proxy IP to hide your real IP address or to work around restrictions imposed by certain websites. HTML parsing with CSS selectors then lets you locate and extract specific data from the page. The following walks through the basic steps, using Python's requests and BeautifulSoup libraries as an example:


Step 1: Install necessary libraries

First, you need to install the requests and BeautifulSoup libraries. You can use pip to install them:


bash
pip install requests beautifulsoup4


Step 2: Set the proxy IP

When sending HTTP requests, you can set the proxy IP through the proxies parameter. Here is an example:


python
import requests

proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port',
}

response = requests.get('http://example.com', proxies=proxies)


In the above code, you need to replace 'your_proxy_ip:port' with your proxy IP and port.
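
If your proxy requires authentication, requests accepts credentials embedded in the proxy URL; adding a timeout and basic error handling also helps, since some proxies are slow or unstable. Here is a minimal sketch, where username, password, and your_proxy_ip:port are placeholders for your provider's values:

python
import requests

# Placeholder credentials and endpoint; substitute your provider's values
proxies = {
    'http': 'http://username:password@your_proxy_ip:port',
    'https': 'http://username:password@your_proxy_ip:port',
}

try:
    # A timeout keeps a slow or dead proxy from hanging the request
    response = requests.get('http://example.com', proxies=proxies, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f'Request through proxy failed: {e}')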


Step 3: Parse HTML and extract data

You can use the BeautifulSoup library to parse HTML and extract data. Here is an example:


python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Use a CSS selector to extract matching elements
data = soup.select('css_selector')

for item in data:
    print(item.text)


In the code above, replace 'css_selector' with an actual CSS selector. CSS selectors locate elements within an HTML page; for example, to extract all paragraph text, use 'p' as the selector.
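
A few more common selector patterns, applied to the soup object from the step above (the class and id names here are made-up examples):

python
# All paragraph elements
paragraphs = soup.select('p')

# Elements with a given class (hypothetical class name 'title')
titles = soup.select('.title')

# The element with a given id (hypothetical id 'content')
content = soup.select('#content')

# Nested selection: <a> tags inside <div> elements with class 'news'
for link in soup.select('div.news a'):
    print(link.get('href'))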


Note: When using crawlers, make sure to comply with the website's robots.txt file and all applicable laws and regulations; avoid putting excessive load on the website or crawling maliciously. Also keep in mind that some proxy IPs are unstable or require payment, so choose a proxy service that fits your needs.
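
Python's standard library can help with the robots.txt check mentioned above. A minimal sketch using urllib.robotparser, with http://example.com standing in for the target site:

python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# Check whether a given path may be fetched before crawling it
if rp.can_fetch('*', 'http://example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')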
