HTTP architects generally combine multiple submodules into an HTTP service using a variety of fairly complex mechanisms. Four basic deployment patterns have emerged, and they apply to web crawler services as well. If you have written Python code for a web crawler that generates dynamic content, and you have chosen an API or framework that supports WSGI, how should you deploy the HTTP service online?


The first pattern is to run a server that is itself written in Python, so that the server code can call the WSGI interface directly. The Green Unicorn (Gunicorn) server is the popular choice today, but other pure-Python servers are also suitable for production use.
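As a rough illustration of a server calling the WSGI interface directly, the sketch below defines a minimal WSGI callable and serves it with the standard library's wsgiref server. The names `app` and `serve`, the port, and the file name `hello.py` are assumptions made for this example; wsgiref is meant for development, and a production setup would hand the same callable to a server such as Gunicorn.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Minimal WSGI application that returns a plain-text greeting."""
    body = b"Hello from a pure-Python server\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Serve `app` with the standard-library server (development use only)."""
    with make_server(host, port, app) as httpd:
        httpd.serve_forever()

# With Gunicorn installed and this file saved as hello.py (a hypothetical
# name), the same callable could instead be served with:
#     gunicorn hello:app
```

Calling `serve()` blocks and answers requests on the chosen port. The point is that the server and the application share one Python process and talk through the WSGI call signature alone.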


The second pattern is to run Apache with mod_wsgi configured so that the Python code runs in a separate WSGIDaemonProcess, a daemon process that mod_wsgi starts and manages.
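A minimal sketch of what the Python side of the mod_wsgi arrangement might look like follows. The file name, the process group name `crawler`, the path, and the process and thread counts are placeholders chosen for illustration. The Apache directives appear only as comments and would need adapting to a real installation.

```python
# app.wsgi (hypothetical file name), loaded by mod_wsgi inside the daemon process.
#
# Apache directives that could accompany it (all values are placeholders):
#     WSGIDaemonProcess crawler processes=2 threads=15
#     WSGIProcessGroup crawler
#     WSGIScriptAlias / /srv/crawler/app.wsgi


def application(environ, start_response):
    """mod_wsgi looks for a callable named `application` by default."""
    body = b"Served from a mod_wsgi daemon process\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because the Python code lives in its own daemon process, it can be restarted or resized without touching the Apache workers that handle client connections.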


The third pattern is to run a Python HTTP server such as Gunicorn (or any server that supports your chosen asynchronous framework) on the back end, and to put a web server in front of it that serves static files directly and reverse-proxies requests for dynamic resources to the Python server.
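In practice the front end would be a server such as nginx or Apache, but the routing decision it makes can be sketched in Python. In the toy model below, `STATIC_FILES`, `fetch_from_backend`, `make_front_end`, and the back-end URL are all hypothetical names and values. It handles only successful GET requests and ignores headers, so treat it as an illustration of the split between static and dynamic paths rather than a usable proxy.

```python
import urllib.request

# Hypothetical stand-in for files the front end would read from disk.
STATIC_FILES = {"/static/robots.txt": b"User-agent: *\n"}


def fetch_from_backend(path, backend_url="http://127.0.0.1:8000"):
    """Forward a GET for `path` to the Python back end (assumed address)."""
    with urllib.request.urlopen(backend_url + path) as resp:
        return f"{resp.status} {resp.reason}", resp.read()


def make_front_end(fetch=fetch_from_backend):
    """Build a WSGI app that serves static paths itself and proxies the rest."""
    def front_end(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path in STATIC_FILES:
            # Static content never reaches the Python back end.
            status, body = "200 OK", STATIC_FILES[path]
        else:
            # Dynamic content is fetched from the back end on the client's behalf.
            status, body = fetch(path)
        start_response(status, [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return front_end
```

Passing the fetch function in as a parameter keeps the routing logic separate from the network call, which also makes the sketch easy to exercise without a running back end.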


The fourth pattern is to run a pure reverse proxy (such as Varnish) at the front, Apache or nginx behind it, and an HTTP server written in Python behind that. This is a three-tier architecture. The reverse proxies can be distributed across different geographical locations, so that a request is answered from the cache of the proxy closest to the client that sent it.
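The caching behaviour of the front tier can be modelled in a few lines of Python. The class below is a deliberately simplified sketch, not how Varnish works internally. `CachingProxy`, its fixed `ttl`, and the injected `fetch` and `clock` callables are assumptions for illustration, and a real cache would also honour headers such as Cache-Control.

```python
import time


class CachingProxy:
    """Toy caching reverse proxy: remembers each response for `ttl` seconds."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch    # callable that asks the tier behind us for a path
        self._ttl = ttl        # how long a cached body stays fresh, in seconds
        self._clock = clock    # injectable clock, handy for testing
        self._cache = {}       # path -> (expires_at, body)

    def get(self, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and entry[0] > now:
            # Fresh copy available: answer without contacting the back end.
            return entry[1]
        # Miss or stale entry: fetch from the next tier and remember the result.
        body = self._fetch(path)
        self._cache[path] = (now + self._ttl, body)
        return body
```

Each geographically distributed proxy would hold its own cache of this kind, so repeated requests from nearby clients are answered locally and only misses travel back to the Python server.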


For a long time, the choice among these four architectures was driven mainly by three runtime characteristics of CPython: the interpreter uses a large amount of memory, the interpreter is slow, and the Global Interpreter Lock (GIL) prevents multiple threads from running Python bytecode at the same time. As a consequence, the limiting factor was how many Python instances could be loaded into memory at once.
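Because the GIL pushes deployments toward multiple worker processes, the memory constraint can be turned into a quick back-of-the-envelope calculation. The helper below is a hypothetical sketch. `workers_that_fit` and its parameters are invented for this example, and the per-worker memory figure has to be measured for a real application. The `2 * cores + 1` figure is the starting point suggested in the Gunicorn documentation.

```python
def workers_that_fit(memory_budget_mb, per_worker_mb, cpu_cores):
    """Estimate how many Python worker processes a host can reasonably run."""
    # Each worker is a separate interpreter with its own memory footprint.
    by_memory = memory_budget_mb // per_worker_mb
    # Rule of thumb from the Gunicorn docs for an initial worker count.
    by_cpu = 2 * cpu_cores + 1
    # Run at least one worker, and never more than memory or CPU allows.
    return max(1, min(by_memory, by_cpu))
```

For example, with a 512 MB budget and workers of roughly 100 MB each, memory rather than CPU becomes the binding limit on a four-core machine.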
