The essence of a crawler is to extract valuable data from target websites, so almost every website adopts some anti-crawler technology to keep crawlers out. Checking the User-Agent request header to verify that the client is a browser, and loading resources dynamically with JavaScript, are two of the conventional anti-crawler methods. This article covers some stronger anti-crawler techniques and shows how to bypass them in a Scrapy project.
1. Use A Dynamic Proxy Server To Bypass IP Address Verification
Some websites use IP address verification as an anti-crawler measure: the website checks the IP address of each client, and if the same IP address requests data too frequently, the website concludes that the client is a crawler.
To deal with this, we can let Scrapy switch to a random proxy server for each request so that the target website sees a different IP address every time. This can be done with a custom downloader middleware. Below are the steps to rotate proxy servers randomly in a Scrapy project.
- Open the middlewares.py file in the Scrapy project and add the class below to the file. A minimal sketch of the get_random_proxy() helper that this class relies on is shown after these steps.
class RandomProxyMiddleware:
    # Dynamically set the proxy server for each outgoing request.
    def process_request(self, request, spider):
        # get_random_proxy() returns the IP address and port number of a randomly
        # chosen proxy server. This requires the developer to prepare a series of
        # proxy servers in advance; the function picks one of them at random.
        request.meta["proxy"] = get_random_proxy()
- Add the configuration below to the Scrapy project's settings.py file to enable the custom middleware.
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'TestScrapyProject.middlewares.RandomProxyMiddleware': 543,
}
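The number 543 in DOWNLOADER_MIDDLEWARES is the middleware's order value: Scrapy calls process_request() on downloader middlewares in ascending order of this value, so you can adjust it if the proxy middleware needs to run before or after other middlewares.

The middleware above assumes a get_random_proxy() helper that returns the address of one proxy server from a pool prepared in advance. The sketch below shows one hypothetical way to write such a helper; the proxy addresses are placeholders and should be replaced with your own proxy servers.

import random

# Hypothetical proxy pool prepared by the developer in advance;
# replace these placeholder addresses with your own proxy servers.
PROXY_LIST = [
    "http://111.111.111.111:8000",
    "http://112.112.112.112:8000",
    "http://113.113.113.113:8000",
]

def get_random_proxy():
    # Return the address (scheme, IP address and port) of one randomly chosen proxy server.
    return random.choice(PROXY_LIST)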
2. Disable Cookies
Some websites identify whether requests come from the same client by tracking cookies. Scrapy enables cookies by default, so the target website can recognize the crawler program as a single client based on its cookies.
If the same client sends requests too frequently within a short period, the target website can conclude that the client is not a normal user but an automated program (such as a crawler), and it can then block that client's access.
In this situation, you can disable cookies in the Scrapy project's settings.py file by uncommenting the following line. Before doing so, make sure the Scrapy program does not need to log in to the target website, because the login process usually requires cookies to be enabled.
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
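If cookies are only a problem for some spiders in the project, Scrapy also lets a single spider override project-wide settings through its custom_settings class attribute. Below is a minimal sketch of that approach; the spider name and start URL are hypothetical placeholders.

import scrapy

class NoCookieSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, used only for illustration.
    name = "no_cookie_spider"
    start_urls = ["https://example.com"]

    # Override the project-wide setting for this spider only.
    custom_settings = {"COOKIES_ENABLED": False}

    def parse(self, response):
        # Process the downloaded page here.
        pass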
If you find this article useful, we will continue with other techniques in later chapters, such as how to bypass robots.txt crawler rules, access-frequency restrictions, and graphic verification codes (CAPTCHAs).