Discussions of web crawling tend to center on two problems: how to avoid being blocked by the target server, and how to improve the quality of the retrieved data. Proven techniques for avoiding blocks already exist, such as using proxies and rotating IP addresses. Another technique plays a similar role but is often overlooked: using and optimizing HTTP headers. Well-chosen headers can also reduce the chance that a crawler is blocked by a data source, and they help ensure that the retrieved data is of high quality. Here are the five most commonly used headers:


HTTP Header User-Agent

The User-Agent header conveys the application type, operating system, software and software version, and lets the data target decide which HTML layout to send in response. Mobile phones, tablets and PCs may each be served a different HTML layout.

Web servers often verify the User-Agent header; it is the first layer of protection for a website and lets the data source identify suspicious requests. Experienced crawler operators therefore rotate the User-Agent header through a set of different strings, so that the server sees what looks like requests from many organic users.
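The rotation described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library: the USER_AGENTS pool and the build_request helper are hypothetical names, the strings are sample values that would need to be kept current, and no request is actually sent.

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url, rng=random):
    """Return a urllib Request carrying a randomly chosen User-Agent."""
    user_agent = rng.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Build (but do not send) a request and inspect the header it would carry.
# urllib stores header names capitalized as "User-agent".
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

Picking a fresh string per request is the simplest policy; a crawler that keeps sessions may prefer to hold one string fixed for the lifetime of each session.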


HTTP Header Accept-Language

The Accept-Language header tells the web server which languages the client understands and which one is preferred for the response. It usually comes into play when the web server cannot determine the preferred language by other means.
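As a rough sketch of how a language preference is expressed, the helper below assembles an Accept-Language value from language tags and quality weights (the q parameter). The function name accept_language and the chosen languages are assumptions made for illustration.

```python
def accept_language(preferences):
    """Build an Accept-Language value from (language_tag, weight) pairs."""
    parts = []
    for tag, weight in preferences:
        # A weight of 1.0 is the default, so the q parameter can be omitted.
        parts.append(tag if weight >= 1.0 else f"{tag};q={weight:.1f}")
    return ",".join(parts)


headers = {
    "Accept-Language": accept_language([("en-US", 1.0), ("en", 0.9), ("de", 0.5)])
}
print(headers["Accept-Language"])  # en-US,en;q=0.9,de;q=0.5
```

The languages listed should plausibly match the location the request appears to come from, for example the country of the proxy in use.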


HTTP Header Accept-Encoding

The Accept-Encoding header tells the web server which compression algorithms the client can handle. In other words, it confirms that the response may be compressed on its way from the web server to the client, provided the server supports compression. Using this header saves traffic, which benefits both the client and the web server in terms of load.
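To illustrate the traffic saving, the sketch below advertises gzip support and decompresses a body according to the Content-Encoding the server reports. The server side is simulated locally with gzip.compress, and decode_body is a hypothetical helper, not part of any library.

```python
import gzip

# Header a client would send to advertise gzip support.
HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body  # no compression applied


# Simulate a server compressing a repetitive HTML page before sending it.
page = b"<html>" + b"<p>row</p>" * 500 + b"</html>"
compressed = gzip.compress(page)

print(len(page), "bytes uncompressed,", len(compressed), "bytes on the wire")
restored = decode_body(compressed, "gzip")
```

A client should only advertise encodings it can actually decode; otherwise the response body will be unreadable.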


HTTP Header Accept

The Accept header belongs to the content negotiation category; it tells the web server which data formats can be returned to the client. A properly configured Accept header makes the communication between client and server look more like real user behavior, which reduces the chance of the crawler being blocked.
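One way to keep the Accept header consistent with what is being requested is to choose its value by resource type. The mapping below is a hypothetical sketch: the values are modelled on what browsers commonly send, and the exact strings vary by browser and version.

```python
# Hypothetical mapping from the kind of resource to an Accept value.
ACCEPT_BY_RESOURCE = {
    "page": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "api": "application/json",
    "image": "image/avif,image/webp,*/*;q=0.8",
}


def headers_for(resource_kind):
    """Return an Accept header suited to the resource; fall back to */*."""
    return {"Accept": ACCEPT_BY_RESOURCE.get(resource_kind, "*/*")}


print(headers_for("page"))
print(headers_for("api"))
```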


HTTP Header Referer

The Referer header provides the address of the web page the user was on before the request was sent to the web server. It may seem that the Referer header has little effect on whether a website blocks a crawler. However, a real user is likely to be online for hours at a time and usually reaches a page from some other page, so supplying a plausible Referer can make crawler traffic look more natural.
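A simple way to mimic that browsing pattern is to give each request the previously visited URL as its Referer. In this sketch, the with_referers helper and the default entry_referer are assumptions made for illustration; the function only plans headers and sends nothing.

```python
def with_referers(urls, entry_referer="https://www.google.com/"):
    """Pair each URL with a Referer: the entry page first, then the previous URL."""
    previous = entry_referer
    plan = []
    for url in urls:
        plan.append((url, {"Referer": previous}))
        previous = url  # the next request appears to come from this page
    return plan


pages = ["https://example.com/", "https://example.com/products"]
for url, headers in with_referers(pages):
    print(url, headers)
```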
