Web scrape troubleshooting: Difference between revisions

#* Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* The server responded with a status of 403 ('[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]') --> change the network IP address.
# [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
# AJAX
# The web page requires signing in
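
The delay and 403 items above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name <code>crawl_politely</code>, the <code>fetch</code> callable (assumed to return an HTTP status code), and the default delay bounds are all hypothetical and not taken from this page.

```python
import random
import time
from typing import Callable, Dict, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], int],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is a hypothetical callable that returns the HTTP status code.
    Crawling stops at the first 403 so the caller can back off or change IP.
    """
    statuses: Dict[str, int] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        status = fetch(url)
        statuses[url] = status
        if status == 403:
            # The server is refusing requests; further ones would likely fail too.
            break
    return statuses
```

In real use, <code>fetch</code> would wrap a call to an HTTP client. Passing <code>sleep</code> as a parameter keeps the sketch testable without actually waiting.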
</div>


== Before you start writing a web scraping script (crawler) ==


* Does the website offer datasets or files (e.g. open data)?
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?
 
'''How to find an unofficial (third-party) web crawler?''' Suggested search keywords:
* ''target website'' + crawler
* ''target website'' + bot
* ''target website'' + download / downloader
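
The keyword strategy above can be sketched as a small helper that builds the search queries for a given site. The function name <code>build_search_queries</code> and the constant <code>SUFFIXES</code> are hypothetical names chosen for illustration.

```python
from typing import List

# Keywords taken from the list above; "download / downloader" is split in two.
SUFFIXES = ["crawler", "bot", "download", "downloader"]


def build_search_queries(target_website: str) -> List[str]:
    """Return one search query per keyword for the given target website."""
    name = target_website.strip()
    return [f"{name} {suffix}" for suffix in SUFFIXES]
```

Each returned string can be pasted into a search engine or a code hosting site's search box.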


== Further reading ==
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]


== References ==


[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]