Skill tree of web scraping
Latest revision as of 00:02, 29 April 2025
Data extraction
- How the website is built and its information architecture
- Understanding the navigation system e.g. pagination ★★☆☆☆
- Understanding the classification system ★★☆☆☆
- Parse the sitemap XML file ★★☆☆☆
- Understanding the web technology
- HTTP GET/POST ★★☆☆☆ Related page: HTTP request and response data tool
- HTML/CSS/JavaScript ★★☆☆☆
- CSS selectors, XPath expressions and DOM (Document Object Model) elements ★★☆☆☆ Related page: Xpath tools
- AJAX (Asynchronous JavaScript and XML) ★★★★☆
- Using an API to retrieve data ★★☆☆☆
- Parse remote files to retrieve data ★★☆☆☆
- Using unofficial methods to retrieve data
- curl command or GNU Wget command ★★☆☆☆ Related page: Troubleshooting of curl errors
- SeleniumHQ Browser Automation ★★★☆☆
- Form submission
- Submit the form without logging in ★★☆☆☆
- Submit the form after logging in to an account ★★★☆☆
- Detection of abnormal data
- HTTP status codes (Troubleshooting of HTTP errors) ★★☆☆☆
- Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
- Etiquette of web scraping
- Limit the rate of web requests ★★☆☆☆
- Tom and Jerry
- VPN and proxy ★★☆☆☆
- Decode the CAPTCHA ★★★★☆
- Decentralized web scraping ★★★★☆
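Two of the data extraction skills above, parsing a sitemap XML file and detecting abnormal data behind an HTTP 200 response, can be illustrated with a minimal Python sketch. The function names `sitemap_urls` and `looks_abnormal`, and the idea of checking for a `required_marker` string, are assumptions made for illustration and are not taken from this page; the namespace URI is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def looks_abnormal(status_code, body, required_marker):
    """Hypothetical check: flag a response as suspicious.

    A response is treated as abnormal when the status is not 200,
    when the body is empty, or when a string that every valid page
    is expected to contain (required_marker) is missing.
    """
    if status_code != 200:
        return True
    if not body.strip():
        return True
    return required_marker not in body
```

In practice the marker would be something page-specific, such as a CSS class that only appears when the real data was served, so that a login wall or block page returned with status 200 is still caught.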
Data transforming
- Character encoding ★☆☆☆☆
- Data cleaning e.g. unprintable characters ★★☆☆☆
- Regular expression ★★★☆☆
- Selection of database engine ★★★★☆
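The first three data transforming skills (character encoding, removing unprintable characters, regular expressions) could be combined roughly as follows. This is a sketch only: the helper names `decode_bytes` and `clean_text` and the fallback encoding list are assumptions chosen for illustration, and a real scraper should prefer the encoding declared by the server or the page.

```python
import re
import unicodedata


def decode_bytes(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; fall back to lossy UTF-8."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")


def clean_text(raw):
    """Remove unprintable characters and collapse horizontal whitespace."""
    # Unicode categories starting with "C" cover control and format
    # characters (e.g. NUL, zero-width space); newline and tab are kept.
    kept = "".join(
        ch for ch in raw
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Regular expression: collapse runs of spaces and tabs into one space.
    return re.sub(r"[ \t]+", " ", kept).strip()
```

For example, `clean_text(decode_bytes(raw_bytes))` would give a string that is safer to store, whichever database engine is selected.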