Skill tree of web scraping

From LemonWiki共筆

Data extraction

  • How they build the website & information architecture
    • Understanding the navigation system e.g. pagination ★★☆☆☆
      • Understanding the classification system ★★☆☆☆
      • Parse the sitemap XML file ★★☆☆☆
  • Understanding the web technology
    • HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
    • HTML/CSS/JavaScript ★★☆☆☆
    • CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
    • AJAX (Asynchronous JavaScript and XML) ★★★★☆
  • Using an API to retrieve data ★★☆☆☆
  • Parse remote files to retrieve data ★★☆☆☆
  • Using unofficial methods to retrieve data
  • Form submission
    • Submit the form without logging in ★★☆☆☆
    • Submit the form after logging in to an account ★★★☆☆
  • Detection of abnormal data
    • HTTP status codes ★★☆☆☆
    • Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
  • Etiquette of web scraping
    • Limiting the rate of web requests ★★☆☆☆
  • Tom and Jerry (the cat-and-mouse game with anti-scraping measures)
    • VPN and proxy ★★☆☆☆
    • Decode the CAPTCHA ★★★★☆
    • Decentralized web scraping ★★★★☆
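
As an illustration of the "Parse the sitemap XML file" item in the list above, here is a minimal sketch using only the Python standard library. It assumes the sitemap follows the sitemaps.org protocol and its namespace. The function name parse_sitemap and the sample document are hypothetical and used only for illustration. Fetching the file over HTTP is left out so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Namespace used by documents that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the <loc> URLs found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    urls = []
    # <loc> appears under <url> in a sitemap and under <sitemap> in an index.
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sample document (example.com is a placeholder domain).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in parse_sitemap(SAMPLE):
        print(url)
```

The returned list can then feed a crawl queue. When the file is a sitemap index, the same function yields the URLs of the child sitemaps, which can be parsed in turn.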

Data transforming

  • Character encoding ★☆☆☆☆
  • Data cleaning e.g. unprintable characters ★★☆☆☆
  • Regular expression ★★★☆☆
  • Selection of database engine ★★★★☆
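
The character encoding, data cleaning and regular expression items above can be combined in one small step. The sketch below is one possible approach, not a definitive recipe: it assumes the scraped payload arrives as bytes with a known encoding, and the function name clean_text is hypothetical.

```python
import re
import unicodedata

# ASCII control characters, excluding tab, newline and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Runs of spaces, tabs and non-breaking spaces.
_SPACES = re.compile(r"[ \t\u00a0]+")


def clean_text(raw, encoding="utf-8"):
    """Decode scraped bytes and strip unprintable characters."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    # Normalize composed/decomposed forms so equal strings compare equal.
    text = unicodedata.normalize("NFC", text)
    text = _CONTROL_CHARS.sub("", text)
    text = _SPACES.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text(b"Price:\x00 \xc2\xa0 42\x07\n"))
```

Using errors="replace" keeps the pipeline running on bad input. Counting the replacement characters in the result is one way to notice that the assumed encoding was wrong.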