Web scrape troubleshooting: Difference between revisions
Revision as of 10:17, 13 June 2017
List of technical issues
- Website revision: the expected web content (of a DOM element) was empty
- Define multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
- Back up the HTML text of the parent DOM element
- (optional) Back up the complete HTML file
- Server IP ban
- Set a delay (sleep time) between page requests, e.g. PHP: sleep - Manual, AutoThrottle extension - Scrapy 1.0.3 documentation
- CAPTCHA
- AJAX
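
The website revision items above (several sources for one column, plus an HTML backup) could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the "price" column, the two regex patterns, the `html_backup` directory and all function names are hypothetical, and a real scraper would more likely use an HTML parser than regular expressions.

```python
import re
from pathlib import Path

# Hypothetical alternative sources for one column ("price"): each pattern
# targets a different DOM element that is assumed to carry the same value.
PRICE_SOURCES = [
    r'<span class="price">\s*([^<]+?)\s*</span>',
    r'<meta itemprop="price" content="([^"]+)"',
]


def extract_first(html, patterns):
    """Return the first non-empty match among the alternative sources, else None."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None


def backup_html(html, directory, name):
    """Save the HTML text so an empty result can be diagnosed later."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path


def scrape_price(html, backup_dir="html_backup", name="page"):
    """Try every source; if all are empty, keep a copy of the HTML."""
    value = extract_first(html, PRICE_SOURCES)
    if value is None:
        backup_html(html, backup_dir, name)
    return value
```

If the site is revised and the first element comes back empty, the second source is tried; only when every source fails is the page saved for inspection.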
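
For the server IP ban item, the sleep time between pages might look like the sketch below. The function name `crawl_politely`, the `fetch` callable and the delay bounds are assumptions made for illustration; suitable delays depend on the target site.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Here `fetch` is any function that downloads one page, and `sleep` is injectable so the pauses can be replaced in tests. Scrapy users can get comparable behaviour from the AutoThrottle extension mentioned in the list.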
Further reading