Web scrape troubleshooting: Difference between revisions

* Do they offer datasets?
* Do they offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)?
== Skill tree of web scraping ==
Data extraction
* How the website is built
** Understanding the navigation system ★★☆☆☆
** Parse the sitemap XML file ★★☆☆☆
* Understanding the web technology
** HTTP GET/POST ★★☆☆☆
** CSS selectors and DOM (Document Object Model) elements ★★☆☆☆
** AJAX (Asynchronous JavaScript and XML) ★★★☆☆
* Using an API to retrieve data ★★☆☆☆
* Parse remote files to retrieve data ★★☆☆☆
* Using unofficial methods to retrieve data
** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
* Form submission
** Submit the form without logging in ★★☆☆☆
** Submit the form after logging in to an account ★★★☆☆
* Etiquette of web scraping
** Limit the rate of web requests ★★☆☆☆
* Tom and Jerry (the cat-and-mouse game with anti-scraping measures)
** VPN and proxy ★★☆☆☆
** Decode the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
** Decentralized web scraping ★★★★☆
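Two of the skills above, parsing a sitemap XML file and limiting the rate of web requests, can be sketched with the Python standard library alone. This is a minimal illustration, not a complete scraper: the names <code>extract_sitemap_urls</code>, <code>Throttle</code> and <code>SAMPLE</code> are hypothetical, the sample sitemap uses the placeholder domain example.org, and it assumes the sitemap follows the standard sitemaps.org schema.

```python
import time
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap XML document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Hypothetical sample sitemap; a real one would be downloaded from the site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)  # assumed delay; adjust per site policy
    for url in extract_sitemap_urls(SAMPLE):
        throttle.wait()  # call this before each real HTTP request
        print(url)
```

In a real scraper, <code>throttle.wait()</code> would be called immediately before each HTTP request, and the interval would be chosen according to the site's robots.txt or terms of use.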
Data transforming
* Data cleaning, e.g. removing unprintable characters ★★☆☆☆
* Selection of database engine ★★★☆☆
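As an illustration of the data cleaning item above, one possible way to drop unprintable characters is to filter on Unicode general categories. The function name <code>strip_unprintable</code> is hypothetical, and the choice of what counts as "unprintable" (every category starting with "C", except newline and tab) is an assumption that may be too aggressive for some data, since it also removes format characters such as the zero-width joiner.

```python
import unicodedata


def strip_unprintable(text):
    """Remove control, format, and other 'C' category characters from text.

    Newlines and tabs are kept because they usually carry layout meaning.
    """
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(ch)
        elif unicodedata.category(ch).startswith("C"):
            # Cc = control, Cf = format (e.g. zero-width space), Cs/Co/Cn = other
            continue
        else:
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    raw = "price:\x00 42\u200b USD\n"
    print(repr(strip_unprintable(raw)))
```

Ordinary spaces and the non-breaking space (U+00A0) belong to category "Zs" and are left untouched by this filter.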


== Further reading ==