Web scrape troubleshooting: Difference between revisions
→Before starting to web scrape
| Line 50: | Line 50: | ||
* Do they offer datasets? | * Do they offer datasets? | ||
* Do they offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)? | * Do they offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)? | ||
== Skill tree of web scraping == | |||
Data extraction | |||
* How the website is built | |||
** Understanding the navigation system ★★☆☆☆ | |||
** Parsing the sitemap XML file ★★☆☆☆ | |||
* Understanding the web technology | |||
** HTTP GET/POST ★★☆☆☆ | |||
** CSS selectors and DOM (Document Object Model) elements ★★☆☆☆ | |||
** AJAX (Asynchronous JavaScript and XML) ★★★☆☆ | |||
* Using an API to retrieve data ★★☆☆☆ | |||
* Parsing remote files to retrieve data ★★☆☆☆ | |||
* Using unofficial methods to retrieve data | |||
** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆ | |||
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆ | |||
* Form submission | |||
** Submitting the form without logging in ★★☆☆☆ | |||
** Submitting the form after logging in to an account ★★★☆☆ | |||
* Etiquette of web scraping | |||
** Limiting the rate of web requests ★★☆☆☆ | |||
* Tom and Jerry (the cat-and-mouse game with anti-scraping measures) | |||
** VPN and proxy ★★☆☆☆ | |||
** Decode the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆ | |||
** Decentralized web scraping ★★★★☆ | |||
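As a small illustration of the sitemap item in the list above, the following is a minimal sketch of reading page URLs out of a sitemap XML document with Python's standard library. The sample sitemap and the helper name <code>extract_sitemap_urls</code> are assumptions made for this example. A real site's sitemap would be downloaded first, and it may be a sitemap index that points to further sitemap files.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap, inlined so the sketch runs without network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page-1</loc></url>
</urlset>"""

urls = extract_sitemap_urls(SAMPLE_SITEMAP)
for url in urls:
    print(url)
```

The returned list can then feed whichever fetching method is chosen, ideally with a pause between requests in line with the etiquette item above.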
Data transforming | |||
* Data cleaning, e.g. removing unprintable characters ★★☆☆☆ | |||
* Selection of database engine ★★★☆☆ | |||
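The data cleaning item above can be sketched as follows, assuming that "unprintable" means Unicode control and format characters (general categories beginning with <code>C</code>). The function name <code>remove_unprintable</code> and the sample string are hypothetical. Which characters to keep (here newline and tab) depends on the data set.

```python
import unicodedata


def remove_unprintable(text, keep="\n\t"):
    """Drop characters whose Unicode category starts with 'C'
    (control, format, unassigned, ...), except those listed in `keep`."""
    return "".join(
        ch
        for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )


# Hypothetical scraped value containing a NUL byte, a zero-width space and a BEL.
raw = "Widget\x00 A\u200b\x07\tblue\n"
cleaned = remove_unprintable(raw)
print(repr(cleaned))
```

Filtering by Unicode category rather than by a fixed ASCII range also catches invisible characters such as the zero-width space, which often survive copy-and-paste from web pages.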
== Further reading == | == Further reading == | ||