Web scrape troubleshooting: Difference between revisions
(19 intermediate revisions by the same user not shown) | |||
Line 8: | Line 8: | ||
#* Set a delay (sleep time) between requests, e.g.: [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension - Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]]. | #* Set a delay (sleep time) between requests, e.g.: [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension - Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]]. | ||
#* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP | #* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]' --> Change the network IP | ||
# | # [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] | ||
# AJAX | # AJAX | ||
# The web page requires signing in | # The web page requires signing in | ||
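The delay and 403 items in the list above can be sketched in Python as follows. This is a minimal illustration, not the page author's method: <code>random_delay</code>, <code>fetch_politely</code>, and the injected <code>fetch</code> callable (assumed to return a status code and a body) are hypothetical names, and the 1 to 3 second range is an arbitrary assumption.

```python
import random
import time
from typing import Callable, Iterable, List, Tuple


def random_delay(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
    min_seconds: float = 1.0,
    max_seconds: float = 3.0,
) -> List[Tuple[str, int, str]]:
    """Fetch each URL in turn, pausing a random time between requests.

    `fetch` is a hypothetical callable returning (status_code, body).
    The loop stops at the first 403 so the caller can react, for example
    by changing the network IP as the list above suggests.
    """
    results: List[Tuple[str, int, str]] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Sleep before every request except the first one.
            sleep(random_delay(min_seconds, max_seconds))
        status, body = fetch(url)
        results.append((url, status, body))
        if status == 403:
            break
    return results
```

Passing <code>fetch</code> and <code>sleep</code> as parameters keeps the sketch independent of any particular HTTP library and lets it be tried without network access; in real use, <code>fetch</code> would wrap whatever HTTP client the scraper already relies on.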
Line 46: | Line 46: | ||
</div> | </div> | ||
== Before start to web | == Before starting to write the web scraping (crawler) script == | ||
* Are | * Does the website offer datasets or files? e.g. open data | ||
* Are | * Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)? | ||
'''How to find an unofficial (third-party) web crawler?''' Suggested search keywords: | |||
* ''target website'' + crawler | |||
* ''target website'' + bot | |||
* ''target website'' + download / downloader | |||
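The keyword patterns above can be turned into ready-to-paste search queries with a small helper. This is only a sketch: <code>build_search_queries</code> and <code>SUFFIXES</code> are hypothetical names, and the suffix list simply mirrors the bullets above.

```python
from typing import List

# Suffixes taken from the keyword list above.
SUFFIXES = ("crawler", "bot", "download", "downloader")


def build_search_queries(target_website: str) -> List[str]:
    """Combine the target website name with each suggested keyword."""
    name = target_website.strip()
    return [f"{name} {suffix}" for suffix in SUFFIXES]
```

For example, <code>build_search_queries("example.com")</code> yields one query per keyword, starting with <code>example.com crawler</code>.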
== Further reading == | == Further reading == | ||
Line 55: | Line 60: | ||
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition] | * Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition] | ||
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia] | * [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia] | ||
* [[Skill tree of web scraping]] | |||
== References == | == References == | ||
Line 63: | Line 69: | ||
[[Category:Programming]] | [[Category:Programming]] | ||
[[Category:Data | [[Category:Data Science]] | ||
[[Category:Data collecting]] | [[Category:Data collecting]] | ||
[[Category:web scraping]] |