Web scrape troubleshooting: Difference between revisions

From LemonWiki共筆
Line 1:
list of technical issues
# The content of the web page changed (site revision): the expected content of the specified DOM element became empty.
#* Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
#* Back up the HTML text of the parent DOM element.
#* (optional) Back up the complete HTML file.
# Server IP ban
#* Set a delay (sleep time) between page requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation]
# CAPTCHA
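
The sleep-time advice under the server IP ban item can be sketched roughly as follows. This is only an illustration, not the method used by the linked PHP or Scrapy documentation: the function name <code>fetch_with_delay</code>, its parameters, and the default delay values are assumptions made for this example, and the caller supplies the actual download function.

```python
import random
import time
from typing import Callable, Dict, Iterable


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    base_delay: float = 2.0,   # assumed default, tune per target site
    jitter: float = 1.0,       # assumed default, adds 0..jitter extra seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one. The random
            # jitter keeps the request interval from being perfectly regular.
            sleep(base_delay + random.uniform(0.0, jitter))
        pages[url] = fetch(url)
    return pages
```

Passing <code>sleep</code> as a parameter is a design choice for this sketch: it lets the pause be replaced by a stub when testing, so no real waiting is needed.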
Line 10:


Further reading
* Stateless: [https://stackoverflow.com/questions/13200152/why-say-that-http-is-a-stateless-protocol Why say that HTTP is a stateless protocol? - Stack Overflow]
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]


[[Category:Programming]]
[[Category:Data science]]
[[Category:Data collecting]]

Revision as of 10:37, 22 August 2017

list of technical issues

  1. The content of the web page changed (site revision): the expected content of the specified DOM element became empty.
    • Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
    • Back up the HTML text of the parent DOM element.
    • (optional) Back up the complete HTML file.
  2. Server IP ban
  3. CAPTCHA
  4. AJAX
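
The advice under item 1 (several sources for the same column, plus an HTML backup) might look roughly like the sketch below. Everything named here is hypothetical and chosen for illustration: the helpers regex_extractor, first_non_empty and backup_html, the sample HTML, and the two source patterns. Regular expressions stand in for a real HTML parser only to keep the example self-contained.

```python
import re
import time
from pathlib import Path
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor returning group 1 of `pattern`, or None if empty."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        if match is None:
            return None
        value = match.group(1).strip()
        return value or None  # treat an empty element as "not found"

    return extract


def first_non_empty(
    html: str, sources: List[Tuple[str, Extractor]]
) -> Optional[Tuple[str, str]]:
    """Try each source in order; return (source name, value) of the first hit."""
    for name, extract in sources:
        value = extract(html)
        if value is not None:
            return name, value
    return None


def backup_html(html: str, directory: Path, label: str) -> Path:
    """Save the raw HTML so a failed extraction can be inspected later."""
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    path = directory / f"{label}-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path


# Hypothetical page: the primary element is empty after a site revision,
# but a second element still carries the same column value.
SAMPLE_HTML = (
    '<div id="product"><h1 class="title"> </h1>'
    '<meta property="og:title" content="Blue Widget"></div>'
)
SOURCES = [
    ("h1.title", regex_extractor(r'<h1 class="title">(.*?)</h1>')),
    ("og:title", regex_extractor(r'<meta property="og:title" content="(.*?)"')),
]
```

A caller could run first_non_empty(SAMPLE_HTML, SOURCES) and, when it returns None, pass the parent element's HTML (or the whole page) to backup_html so the changed markup can be examined afterwards.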

Further reading