Skill tree of web scraping

From LemonWiki共筆
Latest revision as of 00:02, 29 April 2025

Data extraction

  • How websites are built, and their information architecture
    • Understanding the navigation system e.g. pagination ★★☆☆☆
    • Understanding the classification system ★★☆☆☆
    • Parsing the sitemap XML file ★★☆☆☆
  • Understanding the web technology
    • HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
    • HTML/CSS/JavaScript ★★☆☆☆
    • CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
    • AJAX (Asynchronous JavaScript and XML) ★★★★☆
  • Using an API to retrieve data ★★☆☆☆
  • Parsing remote files to retrieve data ★★☆☆☆
  • Using unofficial methods to retrieve data
  • Form submission
    • Submitting the form without logging in ★★☆☆☆
    • Submitting the form after logging in to an account ★★★☆☆
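  • Detection of abnormal data
    • HTTP status codes ★★☆☆☆ Related page: Troubleshooting of HTTP errors
    • Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆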
  • Etiquette of web scraping
    • Limiting the rate of web requests ★★☆☆☆
  • Tom and Jerry (the cat-and-mouse game with anti-scraping measures)
    • VPN and proxy ★★☆☆☆
    • Decode the CAPTCHA ★★★★☆
    • Decentralized web scraping ★★★★☆
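
As an illustration of the "sitemap XML file" item above, here is a minimal sketch in Python using only the standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema and is already available as a string; the function name extract_sitemap_urls and the sample URLs are made up for this example, and fetching the file over HTTP is deliberately left out.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol (assumed schema version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:  # skip empty <loc/> elements
            urls.append(loc.text.strip())
    return urls


# Hypothetical sample data; a real sitemap would be downloaded from the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/page/2 </loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

The returned list can serve as the crawl queue. A sitemap index file (one that lists other sitemaps rather than pages) uses different element names and is not handled by this sketch.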

Data transformation

  • Character encoding ★☆☆☆☆
  • Data cleaning, e.g. removing unprintable characters ★★☆☆☆
  • Regular expressions ★★★☆☆
  • Selection of database engine ★★★★☆
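
The data cleaning and regular expression items above can be sketched together in Python with the standard library. This is only one possible approach: it treats every character whose Unicode category starts with "C" (control, format and similar) as unprintable, except newline and tab, and the helper names and the sample string are invented for this example.

```python
import re
import unicodedata


def strip_unprintable(text: str) -> str:
    """Drop control/format characters (Unicode categories C*), keeping newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def collapse_spaces(text: str) -> str:
    """Replace runs of spaces or tabs with a single space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def extract_first_integer(text: str):
    """Return the first integer in the text, ignoring thousands separators, or None."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


if __name__ == "__main__":
    # Hypothetical scraped value containing a NUL byte, a zero-width space and a BEL.
    raw = "Price:\x00 1,200\u200b  USD\x07"
    cleaned = collapse_spaces(strip_unprintable(raw))
    print(cleaned)
    print(extract_first_integer(cleaned))
```

Cleaning before applying the regular expression keeps invisible characters from splitting a match; which characters count as unprintable is a choice that depends on the data source.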