Web scrape troubleshooting



1. The web page content changed (revision): the expected content of the specified DOM element became empty.

* Use multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
* Back up the HTML text of the parent DOM element.
* (optional) Back up the complete HTML file.
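
The fallback and backup ideas above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names `extract_with_fallback` and `by_regex`, the `scrape_backups` directory, and the "price" patterns are all hypothetical, and regular expressions stand in for whatever DOM selector library the scraper actually uses.

```python
import re
import time
from pathlib import Path


def extract_with_fallback(html, extractors, backup_dir="scrape_backups"):
    """Return the first non-empty value produced by any extractor.

    If every extractor comes back empty, save the HTML so the changed
    markup can be inspected later, and return None.
    """
    for extract in extractors:
        value = extract(html)
        if value and value.strip():
            return value.strip()
    # All sources were empty: keep a copy of the markup for debugging.
    backup_path = Path(backup_dir)
    backup_path.mkdir(parents=True, exist_ok=True)
    filename = backup_path / f"empty_{time.time_ns()}.html"
    filename.write_text(html, encoding="utf-8")
    return None


def by_regex(pattern):
    """Build an extractor that returns group 1 of the first match, or None."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html):
        match = compiled.search(html)
        return match.group(1) if match else None

    return extract


# Two hypothetical sources for the same "price" column.
price_extractors = [
    by_regex(r'<span class="price">(.*?)</span>'),
    by_regex(r'<meta itemprop="price" content="(.*?)"'),
]
```

Here `html` would be the HTML text of the parent DOM element (or the whole page, for the optional complete backup). If the first source comes back empty after a site revision, the second one is tried before anything is written to disk.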


2. The IP address was banned by the server

* Set a delay (sleep time) between requests, e.g.: [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension - Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
* The server responded with status 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]' --> change the network IP.
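
A rough sketch of the two points above: sleep a random number of seconds between requests, and stop as soon as the server answers 403. The names `random_delay` and `fetch_all`, and the `fetch` callable that returns a `(status, body)` pair, are assumptions made for illustration; any HTTP client could be wrapped to fit that shape.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=5.0):
    """Pick a random delay so requests do not arrive at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def fetch_all(urls, fetch, min_seconds=1.0, max_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, sleeping a random time between requests.

    `fetch(url)` is a caller-supplied function returning (status, body).
    Stops at the first 403, since continuing from a banned IP is pointless.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            sleep(random_delay(min_seconds, max_seconds))
        status, body = fetch(url)
        if status == 403:
            # Probably banned: stop here and change the network IP.
            break
        results[url] = body
    return results
```

The `sleep` parameter defaults to `time.sleep`; it is exposed only so the delay can be replaced in a dry run.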


3. [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]


4. AJAX

* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}}, written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski)]


5. The web page requires signing in
