Web scrape troubleshooting: Difference between revisions

Line 15: Line 15:
2. The IP was banned by the server


* Random Delays: Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP
* Smart Retry: '''Automatic retry''' or '''Exponential Backoff'''<ref>[https://en.wikipedia.org/wiki/Exponential_backoff Exponential backoff - Wikipedia]: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."</ref> on network errors (up to 3 times)
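
The random delay and the retry with exponential backoff described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the helper names (<code>polite_delay</code>, <code>backoff_delay</code>, <code>fetch</code>, <code>fetch_with_retry</code>, <code>crawl</code>) and the default values (1 to 3 seconds between requests, a 1 second base delay, a 30 second cap) are assumptions chosen for the example. Only the limit of 3 retries comes from the text above.

```python
import random
import time
import urllib.error
import urllib.request


def polite_delay(min_s=1.0, max_s=3.0):
    """Return a random delay in seconds between min_s and max_s (assumed defaults)."""
    return random.uniform(min_s, max_s)


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch(url, timeout=10):
    """Download a URL with the standard library (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetcher=fetch, max_retries=3, sleep=time.sleep):
    """Call fetcher(url); on network errors, retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fetcher(url)
        except urllib.error.HTTPError:
            # HTTP status errors such as 403 are not retried in this sketch.
            raise
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # retries exhausted, let the caller handle it
            sleep(backoff_delay(attempt))  # waits 1 s, 2 s, 4 s with the defaults


def crawl(urls, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    pages = []
    for url in urls:
        pages.append(fetch_with_retry(url, fetcher=fetcher, sleep=sleep))
        sleep(polite_delay())
    return pages
```

Calling <code>crawl(["https://example.com/a", "https://example.com/b"])</code> would fetch both pages with a random pause after each one. A 403 response is deliberately not retried here, since repeating the same request from a banned IP is unlikely to succeed.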


3. [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
Line 84: Line 86:
* ''target website'' + download / downloader site:github.com
* ''target website'' + browser client site:github.com


== Common Web Scraping Issues and Solutions ==
