Web scrape troubleshooting
Revision as of 17:06, 6 March 2018
List of technical issues
- The content of the web page was changed (revised): the expected web content (of the specified DOM element) became empty.
  - The same column may have multiple sources: different HTML DOM elements can hold the same column value.
  - Back up the HTML text of the parent DOM element.
  - (optional) Back up the complete HTML file.
- The IP was banned by the server.
  - Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual (http://php.net/manual/en/function.sleep.php), AutoThrottle extension — Scrapy 1.0.3 documentation (http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle).
  - The server responded with a status of 403, '403 Forbidden' (https://zh.wikipedia.org/wiki/HTTP_403): change the network IP.
- CAPTCHA
- AJAX
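Three of the remedies above (a delay between requests, detecting a 403 response, and backing up the HTML when the expected content comes back empty) can be sketched in Python with only the standard library. This is a minimal illustration, not the method used by this wiki: the function names `fetch`, `crawl` and `backup_if_empty`, the `User-Agent` string, the 2-second default delay and the hashed file naming are all assumptions made for the example. For simplicity it saves the complete HTML rather than only the parent DOM element.

```python
import hashlib
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch(url, timeout=10.0):
    """Return (status, body). HTTP errors such as 403 are returned, not raised."""
    # Hypothetical User-Agent value, chosen only for this example.
    request = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        return error.code, ""


def crawl(urls, fetcher=fetch, delay=2.0, sleep=time.sleep):
    """Fetch urls in order, sleeping `delay` seconds between requests.

    Stops at the first 403 and returns (pages, blocked_url); blocked_url is
    None when no 403 was seen.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # sleep time between each request
        status, body = fetcher(url)
        if status == 403:
            # The server refuses us; further requests would not help until
            # the delay or the network IP is changed.
            return pages, url
        if status == 200:
            pages[url] = body
    return pages, None


def backup_if_empty(url, html, value, backup_dir):
    """Save the page HTML when the extracted value is empty.

    Returns the backup path, or None when the value was present.
    """
    if value and value.strip():
        return None
    folder = Path(backup_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: a short hash of the URL keeps file names safe.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = folder / name
    path.write_text(html, encoding="utf-8")
    return path
```

A caller would run `crawl` over its URL list, extract the wanted column from each page with its own parser, and pass the extracted value to `backup_if_empty` so that pages whose layout changed can be inspected later. Passing `fetcher` and `sleep` as parameters is a design choice that lets the loop be tested without network access.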
Further reading