Skill tree of web scraping
Data extraction
- How the website is built and its information architecture
- Understanding the navigation system e.g. pagination ★★☆☆☆
- Understanding the classification system ★★☆☆☆
- Parse the sitemap XML file ★★☆☆☆
- Understanding the web technology
- HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
- HTML/CSS/JavaScript ★★☆☆☆
- CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
- AJAX (Asynchronous JavaScript and XML) ★★★★☆
- Using API to retrieve data ★★☆☆☆
- Parse remote files to retrieve data ★★☆☆☆
- Using unofficial methods to retrieve data
- curl command or GNU Wget command ★★☆☆☆ Related page: Troubleshooting of curl errors
- SeleniumHQ Browser Automation ★★★☆☆
- Form submission
- Submit the form without logging in ★★☆☆☆
- Submit the form after logging in to the account ★★★☆☆
- Detection of abnormal data
- HTTP status codes ★★☆☆☆
- Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
- Etiquette of web scraping
- Limiting the rate of web requests ★★☆☆☆
- Tom and Jerry
- VPN and proxy ★★☆☆☆
- Decode the CAPTCHA ★★★★☆
- Decentralized web scraping ★★★★☆
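As an illustration of the "Parse the sitemap XML file" skill listed above, the following is a minimal sketch using only the Python standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema; the function name `parse_sitemap`, the `SAMPLE` document and the example.com URLs are hypothetical and exist only for this example. Fetching the file over HTTP is left out on purpose.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema (assumed for this sketch).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    # ".//sm:loc" matches <loc> elements at any depth, so the same code
    # handles both <urlset> files and <sitemapindex> files.
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample document standing in for a downloaded sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in parse_sitemap(SAMPLE):
        print(url)
```

The list of URLs returned this way can serve as the crawl queue, which is often simpler than following pagination links by hand.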
Data transforming
- Character encoding ★☆☆☆☆
- Data cleaning e.g. unprintable characters ★★☆☆☆
- Regular expression ★★★☆☆
- Selection of database engine ★★★★☆
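The character encoding, data cleaning and regular expression items above can be combined in one small step. The sketch below is one possible approach, not a fixed recipe: it assumes the scraped bytes are UTF-8 unless told otherwise, and the function name `clean_text` is hypothetical. Control and format characters (Unicode categories beginning with "C") are treated as the unprintable characters to drop.

```python
import re
import unicodedata

# One or more whitespace characters, including tabs and newlines.
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode scraped bytes, drop unprintable characters, normalize spaces."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    # Keep ordinary whitespace; remove other control/format characters
    # such as NUL (U+0000) or zero-width space (U+200B).
    kept = "".join(
        ch
        for ch in text
        if ch in "\t\n\r" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of whitespace into a single space and trim the ends.
    return _WHITESPACE.sub(" ", kept).strip()


if __name__ == "__main__":
    print(clean_text(b"Hello\x00 \xe2\x80\x8bWorld\t\n"))
```

Cleaning before the data reaches the database keeps invisible characters from breaking later exact-match queries.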