Web scrape troubleshooting: Difference between revisions

From LemonWiki共筆
Jump to navigation Jump to search
No edit summary
mNo edit summary
Line 5: Line 5:
#* (optional) Complete HTML file backup
#* (optional) Complete HTML file backup
# The IP was banned from server
# The IP was banned from server
#* Setting the temporization (sleep time) between pages ex: [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation]
#* Setting the temporization (sleep time) between each request ex: [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation]
#* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP
#* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP
# CATCHA
# CATCHA

Revision as of 17:06, 6 March 2018

list of technical issues

  1. Content of web page was changed (revision): Th expected web content (of specified DOM element) became empty.
    • Multiple sources of same column such as different HTML DOM but have the same column value.
    • Backup the HTML text of parent DOM element
    • (optional) Complete HTML file backup
  2. The IP was banned from server
  3. CATCHA
  4. AJAX

Further reading