Web scrape troubleshooting: Difference between revisions
Revision as of 10:37, 22 August 2017
List of technical issues
1. Content of the web page was changed (revision): the expected web content (of the specified DOM element) became empty.
   - Use multiple sources for the same column, such as different HTML DOM elements that carry the same column value.
   - Back up the HTML text of the parent DOM element.
   - (Optional) Back up the complete HTML file.
2. Server IP ban
   - Set a delay (sleep time) between page requests, e.g. the PHP sleep() manual (http://php.net/manual/en/function.sleep.php) or the Scrapy 1.0 AutoThrottle extension documentation (http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle).
3. CAPTCHA
4. AJAX
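
The first two issues above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the helper names (`text_by_id`, `extract_column`, `fetch_all`), the element ids, the backup file naming, and the `fetch` callable are all assumptions made for the example. For brevity it backs up the complete page (the optional step) rather than only the parent element.

```python
import time
from html.parser import HTMLParser
from pathlib import Path

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # > 0 while inside the target element
        self.done = False    # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the stripped text of the element with this id, or ''."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def extract_column(html, candidate_ids, backup_dir):
    """Try each candidate source in turn; back up the page if all are empty."""
    for element_id in candidate_ids:
        value = text_by_id(html, element_id)
        if value:
            return value
    # Every source came back empty: keep the raw HTML for later inspection.
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup_path = backup_dir / f"page-{int(time.time())}.html"
    backup_path.write_text(html, encoding="utf-8")
    return None


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no delay before the first request
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

A call such as `extract_column(html, ["price", "price-alt"], "backups")` returns the first non-empty value, or writes the page to the `backups` directory and returns `None` when the layout has changed. `fetch_all` takes whatever download function is in use as `fetch`; a fixed delay is the simplest option, and a suitable value depends on the target server.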
Further reading