| Line 15: | Line 15: | ||
2. The IP was banned from server | 2. The IP was banned from server | ||
* Setting a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]]. | * Random Delays: Setting a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]]. | ||
* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP | * The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP | ||
* Smart Retry: '''Automatic retry''' or '''Exponential Backoff'''<ref>[https://en.wikipedia.org/wiki/Exponential_backoff Exponential backoff - Wikipedia]: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."</ref> on network errors (up to 3 times) | |||
3. [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] | 3. [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] | ||
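The random-delay and smart-retry bullets under item 2 could be combined roughly as follows. This is only a minimal sketch, not the page author's implementation: the names <code>backoff_delays</code> and <code>fetch_with_retry</code>, the 1-second base delay, the doubling factor, and the 0.5-second jitter are all assumptions chosen for illustration. The <code>fetch</code> argument stands for whatever function performs the actual request, and network failures are assumed to surface as <code>OSError</code> subclasses (as <code>ConnectionError</code> and <code>TimeoutError</code> do).

```python
import random
import time


def backoff_delays(max_retries=3, base_delay=1.0, factor=2.0, jitter=0.5):
    """Return the wait in seconds before each retry.

    Retry n waits base_delay * factor**n plus a random extra of up to
    `jitter` seconds, so consecutive requests are not evenly spaced.
    All default values here are illustrative assumptions.
    """
    return [
        base_delay * factor ** n + random.uniform(0, jitter)
        for n in range(max_retries)
    ]


def fetch_with_retry(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a network error, wait and retry up to max_retries times.

    `fetch` is a hypothetical zero-argument callable that performs the request.
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            # Out of retries: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
```

With the defaults, a request that keeps failing is attempted four times in total (one initial call plus three retries), waiting roughly 1, 2, and 4 seconds between attempts. A 403 response is a different case: it usually means the IP is already blocked, so retrying from the same address is unlikely to help.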
| Line 84: | Line 86: | ||
* ''target website'' + download / downloader site:github.com | * ''target website'' + download / downloader site:github.com | ||
* ''target website'' + browser client site:github.com | * ''target website'' + browser client site:github.com | ||
== Common Web Scraping Issues and Solutions == | == Common Web Scraping Issues and Solutions == | ||