Skill tree of web scraping: Difference between revisions

From LemonWiki共筆



Revision as of 10:20, 25 August 2020

Data extraction

  • How the website is built & its information architecture
    • Understanding the navigation system e.g. pagination ★★☆☆☆
      • Understanding the classification system ★★☆☆☆
      • Parse the sitemap XML file ★★☆☆☆
  • Understanding the web technology
    • HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
    • HTML/CSS/JavaScript ★★☆☆☆
    • CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
    • AJAX (Asynchronous JavaScript and XML) ★★★★☆
  • Using API to retrieve data ★★☆☆☆
  • Parse remote files to retrieve data ★★☆☆☆
  • Using unofficial methods to retrieve data
  • Form submission
    • Submit the form without logging in ★★☆☆☆
    • Submit the form after logging in to the account ★★★☆☆
  • Detection of abnormal data
    • HTTP status codes ★★☆☆☆
    • Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
  • Etiquette of web scraping
    • Limit the rate of web requests ★★☆☆☆
  • Tom and Jerry
    • VPN and proxy ★★☆☆☆
    • Decode the CAPTCHA ★★★★☆
    • Decentralized web scraping ★★★★☆
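
Two of the items above, detecting abnormal data behind an HTTP 200 and limiting the request rate, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker strings, the `min_length` threshold, and the names `looks_abnormal` and `Throttle` are hypothetical and not taken from this page; real sites signal soft errors in their own ways.

```python
import time
from typing import Callable, Optional

# Hypothetical marker strings (assumption); adjust them per target site.
SOFT_ERROR_MARKERS = ("access denied", "too many requests", "captcha")


def looks_abnormal(status_code: int, body: str, min_length: int = 50) -> bool:
    """Return True when a response should not be trusted as real data."""
    if status_code != 200:
        return True  # ordinary HTTP-level failure
    text = body.strip().lower()
    if len(text) < min_length:
        return True  # HTTP 200, but the page is empty or suspiciously short
    # HTTP 200, but the body is actually an error or block page
    return any(marker in text for marker in SOFT_ERROR_MARKERS)


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

In use, a scraper would call `Throttle(2.0).wait()` before each request and pass the status code and body of each response to `looks_abnormal` before storing anything. The clock and sleep functions are injectable only so the throttle can be exercised without real waiting.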

Data transforming

  • Character encoding ★☆☆☆☆
  • Data cleaning e.g. unprintable characters ★★☆☆☆
  • Regular expression ★★★☆☆
  • Selection of database engine ★★★★☆
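
The first three transforming skills (character encoding, cleaning unprintable characters, regular expressions) might look like the following in Python. It is a rough sketch, not a recommended pipeline: the candidate encoding list (including `big5`), the price pattern, and the function names are assumptions made for illustration.

```python
import re
import unicodedata
from typing import Optional, Sequence


def decode_bytes(raw: bytes, encodings: Sequence[str] = ("utf-8", "big5")) -> str:
    """Try candidate encodings in order (the list itself is an assumption)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def strip_unprintable(text: str) -> str:
    """Drop control and format characters (Unicode categories C*), keeping newline and tab."""
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


# Hypothetical pattern: digits with optional thousands separators and decimals.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_price(text: str) -> Optional[float]:
    """Return the first number found in the text as a float, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))
```

For example, `extract_price(strip_unprintable(decode_bytes(raw)))` chains the three steps on the raw bytes of a scraped field. Trying encodings in a fixed order is a simple heuristic; when the server declares a charset in its headers or markup, that declaration is usually the better first guess.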