Skill tree of web scraping

From LemonWiki共筆

Data extraction

  • How the website is built & its information architecture
    • Understanding the navigation system, e.g. pagination ★★☆☆☆
      • Understanding the classification system ★★☆☆☆
      • Parse the sitemap XML file ★★☆☆☆
  • Understanding the web technology
    • HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
    • HTML/CSS/JavaScript ★★☆☆☆
    • CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
    • AJAX (Asynchronous JavaScript and XML) ★★★★☆
  • Using an API to retrieve data ★★☆☆☆
  • Parse remote files to retrieve data ★★☆☆☆
  • Using unofficial methods to retrieve data
  • Form submission
    • Submit the form without logging in ★★☆☆☆
    • Submit the form after logging in to an account ★★★☆☆
  • Detection of abnormal data
    • HTTP status codes ★★☆☆☆
    • Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
  • Etiquette of web scraping
    • Limiting the rate of web requests ★★☆☆☆
  • Tom and Jerry (the cat-and-mouse game with anti-scraping measures)
    • VPN and proxy ★★☆☆☆
    • Decode the CAPTCHA ★★★★☆
    • Decentralized web scraping ★★★★☆
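
A few of the extraction skills above can be illustrated with a short Python sketch. It is only one possible approach, uses the standard library alone, and makes no network requests. The names parse_sitemap, looks_abnormal and Throttle, and the "required marker" idea, are hypothetical helpers invented for this illustration. It covers parsing a sitemap XML file, flagging a response that is wrong despite HTTP 200, and spacing out requests.

```python
import time
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the page URLs listed in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def looks_abnormal(status_code, body, required_marker):
    """Flag a response as suspicious, even when the status code is 200.

    required_marker is an assumption of this sketch: a piece of text that
    every valid page of the target site is expected to contain.
    """
    if status_code != 200:
        return True
    return required_marker not in body


class Throttle:
    """Enforce a minimum interval (in seconds) between two requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

In practice, throttle.wait() would be called before each request, and looks_abnormal() on each response before the data is stored. A login page or an empty template served with status 200 is a typical case the marker check is meant to catch.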

Data transforming

  • Character encoding ★☆☆☆☆
  • Data cleaning, e.g. unprintable characters ★★☆☆☆
  • Regular expressions ★★★☆☆
  • Selection of database engine ★★★★☆
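
The first three transforming skills above can be sketched in Python as follows. This is a minimal illustration, not a definitive recipe. The helper names decode_bytes, strip_unprintable and extract_prices are hypothetical, and so are the candidate encoding list and the "$" price pattern, which are assumptions chosen only for the example.

```python
import re
import unicodedata


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order and return the decoded text.

    The candidate list is an assumption; adjust it to the target site.
    If every candidate fails, fall back to UTF-8 with replacement characters.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def strip_unprintable(text):
    """Drop control and format characters, keeping newlines and tabs."""
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


# A dollar sign, optional spaces, digits, then an optional 1-2 digit fraction.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{1,2})?)")


def extract_prices(text):
    """Return every price found in the text as a float."""
    return [float(match) for match in PRICE_RE.findall(text)]
```

Decoding first, cleaning second and pattern matching last is a common order, since regular expressions behave more predictably on text that has already been normalized.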