When we crawl target data, especially when the volume of data is large, the crawl often feels slow. So what methods can be used to speed a crawler up? Let's briefly discuss a few ways to improve crawling efficiency.


1. Streamline the crawling process and avoid repeated visits.

In the process of crawling data, a large part of the time is spent waiting for network responses, so cutting out unnecessary requests saves time and improves efficiency. Optimize the workflow, streamline it as much as possible, and avoid visiting the same pages more than once. Deduplication is also a very important means: uniqueness is usually judged by URL or ID, and pages that have already been crawled do not need to be crawled again.
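As a minimal sketch of this kind of de-duplication, assuming the whole crawl runs in one process and an in-memory set is enough to hold the seen URLs (a persistent store such as Redis would be needed for larger or restartable crawls); fetch and extract_links are hypothetical placeholders, not a real library API:

    # Skip any URL that has already been crawled.
    seen = set()

    def crawl(start_urls):
        queue = list(start_urls)
        while queue:
            url = queue.pop()
            if url in seen:                        # already crawled, skip it
                continue
            seen.add(url)
            html = fetch(url)                      # placeholder for an HTTP request
            queue.extend(extract_links(html))      # placeholder link extractor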


2. Multi-threaded and distributed crawling. Many hands make light work, and the same holds for crawling: if one machine is not enough, add a few more, and if that is still not enough, add a few more again.
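For the multi-threaded part, a simple sketch using Python's standard concurrent.futures module is shown below; the URL list, the thread count of 20, and the 10-second timeout are illustrative choices, not recommendations:

    # Fetch many pages concurrently so threads overlap their network waits.
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    urls = ["https://example.com/"] * 10           # placeholder; use the real URL list
    with ThreadPoolExecutor(max_workers=20) as pool:
        pages = list(pool.map(fetch, urls))        # results come back in the same order as urls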

First of all, distribution is not the essence of a crawler, nor is it strictly necessary. For tasks that are independent of each other and need no communication, you can split the work by hand and run it on several machines; each machine handles a fraction of the work and the total time drops accordingly. For example, if there are 2 million web pages to crawl, 5 machines can each crawl 400,000 non-overlapping pages, and the elapsed time is roughly one fifth of what a single machine would need.
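One simple way to produce such non-overlapping shards is to hash each URL and take it modulo the number of machines, so every URL belongs to exactly one machine. A sketch, assuming the full URL list is known up front; shard_for and my_urls are hypothetical helper names:

    # Partition a fixed URL list into non-overlapping shards, one per machine.
    # machine_id would be 0..4 on the five machines; hashing keeps the split stable.
    import hashlib

    def shard_for(url, num_machines=5):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_machines

    def my_urls(all_urls, machine_id, num_machines=5):
        return [u for u in all_urls if shard_for(u, num_machines) == machine_id]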


If communication is needed, for example because the queue of pages to crawl keeps changing as the crawl proceeds, then even a pre-divided workload will end up with overlap, since each machine sees a different queue while the program runs. In that case the only option is a truly distributed setup: a master stores the queue, and the worker machines each take tasks from it, so the queue is shared and mutually exclusive access keeps the same page from being crawled twice.
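One common way to build such a shared queue is a Redis list on the master that every worker pops from; because each pop atomically removes the URL, two workers can never receive the same task. A sketch, assuming the redis-py package and a reachable Redis server; the host name, key names, and the fetch and extract_links helpers are placeholders:

    # Shared crawl queue: the master holds one Redis list, workers pop from it.
    import redis

    r = redis.Redis(host="master-host", port=6379)     # placeholder address

    # Seeding (run once, anywhere): r.rpush("crawl:queue", *start_urls)

    def worker():
        while True:
            url = r.lpop("crawl:queue")                # atomic pop, no duplicates
            if url is None:                            # queue drained
                break
            if r.sadd("crawl:seen", url) == 0:         # already crawled by someone
                continue
            html = fetch(url.decode())                 # placeholder fetch function
            for link in extract_links(html):           # placeholder link extractor
                r.rpush("crawl:queue", link)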
