When collecting data with a crawler, we sometimes need a proxy IP to hide the real IP address or work around restrictions imposed by certain websites. HTML parsing with CSS selectors then lets us locate and extract specific data from the page. The following is a basic step-by-step walkthrough, using Python's requests and BeautifulSoup libraries as examples:
Step 1: Install necessary libraries
First, you need to install the requests and beautifulsoup4 packages, which you can do with pip:
```bash
pip install requests beautifulsoup4
```
Step 2: Set the proxy IP
When sending HTTP requests, you can set the proxy IP through the proxies parameter. Here is an example:
```python
import requests

# Note: depending on the proxy type, the 'https' entry often also
# uses the 'http://' scheme (an HTTP proxy tunneling HTTPS traffic).
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port',
}

response = requests.get('http://example.com', proxies=proxies)
```
In the above code, you need to replace 'your_proxy_ip:port' with your proxy IP and port.
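Before crawling, it helps to confirm that the proxy actually works. Below is a minimal sketch that sends a test request through the proxy with a timeout and basic error handling; the proxy address is a placeholder, and httpbin.org/ip is used here only as a convenient service that echoes back the requesting IP:

```python
import requests

# Placeholder proxy address; substitute your own.
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port',
}

try:
    # httpbin.org/ip returns the IP the request arrived from,
    # so the response should show the proxy's IP, not yours.
    response = requests.get('https://httpbin.org/ip',
                            proxies=proxies, timeout=10)
    response.raise_for_status()
    print('Proxy is working, outbound IP:', response.json()['origin'])
except requests.exceptions.RequestException as e:
    print('Proxy check failed:', e)
```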
Step 3: Parse HTML and extract data
You can use the BeautifulSoup library to parse HTML and extract data. Here is an example:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

# Use a CSS selector to extract the matching elements
data = soup.select('css_selector')
for item in data:
    print(item.text)
```
In the above code, you need to replace 'css_selector' with an actual CSS selector. CSS selectors locate elements in an HTML page; for example, using 'p' as the selector would extract every paragraph, and the loop above would print each paragraph's text. A few more selector patterns are sketched below.
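Here is a small self-contained sketch of common selector types against an inline HTML snippet; the tag, class, and attribute names are hypothetical and would need to match the target page's actual markup:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a fetched page.
html = """
<div class="article">
  <h2 class="title">Example headline</h2>
  <a href="https://example.com/story">Read more</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Tag selector: every <h2> element
for h2 in soup.select('h2'):
    print(h2.get_text(strip=True))

# Class selector: elements with class="title"
for title in soup.select('.title'):
    print(title.get_text(strip=True))

# Attribute selector: links that have an href, plus the href value
for link in soup.select('a[href]'):
    print(link.get_text(strip=True), '->', link['href'])
```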
Note: When using crawlers, please make sure to comply with the website's robots.txt file and any relevant laws and regulations, and avoid putting excessive load on the site or crawling abusively. Also, some proxy IPs may be unstable or require payment, so choose a proxy service that fits your needs. One way to cope with unstable proxies is to keep a small pool and retry failures, as sketched below.
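The following sketch illustrates that pattern: try each proxy in a pool until one succeeds. The proxy addresses are placeholders and the retry policy is deliberately simple; a real crawler might add delays, randomization, or health tracking:

```python
import requests

# Hypothetical proxy pool; replace with real addresses.
PROXY_POOL = [
    'http://proxy1_ip:port',
    'http://proxy2_ip:port',
    'http://proxy3_ip:port',
]

def fetch_with_proxy_pool(url, timeout=10):
    """Try each proxy in turn; return the first successful response."""
    last_error = None
    for proxy in PROXY_POOL:
        proxies = {'http': proxy, 'https': proxy}
        try:
            response = requests.get(url, proxies=proxies, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            last_error = e  # remember the failure and move to the next proxy
    raise RuntimeError(f'All proxies failed for {url}') from last_error

# Usage:
# response = fetch_with_proxy_pool('http://example.com')
# print(response.status_code)
```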