Web scrape troubleshooting: Difference between revisions

== Skill tree of web scraping ==


Data extraction
[[Skill tree of web scraping]]
* How the website is built & [[Information Architecture | information architecture]]
** Understanding the navigation system ★★☆☆☆
*** Understanding the classification system ★★☆☆☆
*** Parse the sitemap XML file ★★☆☆☆
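
As a rough illustration of the sitemap item above, the following sketch reads the <code>loc</code> and <code>lastmod</code> fields from a sitemap using only the Python standard library. The sample XML and the function name <code>parse_sitemap</code> are made up for this example; a real sitemap would be downloaded from the target site and may instead be a sitemap index that points to further files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content used only for illustration.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when absent."""
    # Parse from bytes so the encoding declaration in the XML header is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc.strip(), lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in parse_sitemap(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The list of URLs obtained this way can serve as a starting point for understanding the navigation and classification system of the site.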
 
* Understanding the web technology
** HTTP GET/POST ★★☆☆☆ Related page: [[HTTP request and response data tool]]
** HTML/CSS/JavaScript ★★☆☆☆
** CSS selector and DOM (Document Object Model) elements ★★☆☆☆ Related page: [[Xpath tools]]
** AJAX (Asynchronous JavaScript and XML) ★★★★☆
* Using an API to retrieve data ★★☆☆☆
* Parsing remote files to retrieve data ★★☆☆☆
* Using unofficial methods to retrieve data
** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆ Related page: [[Troubleshooting of curl errors]]
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
* Form submission
** Submit the form without logging in ★★☆☆☆
** Submit the form after logging in to the account ★★★☆☆
* Detection of abnormal data
** [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes HTTP status codes] ★★☆☆☆
** Data is wrong even though the server returns an HTTP 200 status code ★★★☆☆
* Etiquette of web scraping
** Limit the rate of web requests ★★☆☆☆
* Tom and Jerry
** VPN and proxy ★★☆☆☆
** Decode the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
** Decentralized web scraping ★★★★☆
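
The "Detection of abnormal data" items above can be sketched as a small validation step that runs on every response before it is stored. This is only a minimal sketch: the function name <code>check_response</code>, the marker strings, and the idea of matching fixed substrings are assumptions made for illustration, and a real scraper would pick markers that fit the target page.

```python
def check_response(status_code, body, required_markers, error_markers=()):
    """Return (ok, reason) for one downloaded page.

    required_markers: substrings that a valid page is expected to contain.
    error_markers: substrings that indicate a block page or an error page.
    Matching is case-insensitive.
    """
    # Anything other than 200 is treated as a failure in this sketch.
    if status_code != 200:
        return False, f"unexpected status {status_code}"
    # A 200 response with no content is still unusable.
    if not body.strip():
        return False, "empty body"
    lowered = body.lower()
    # Block pages and soft errors are often served with status 200.
    for marker in error_markers:
        if marker.lower() in lowered:
            return False, f"error marker found: {marker}"
    # The page should contain the elements the scraper depends on.
    for marker in required_markers:
        if marker.lower() not in lowered:
            return False, f"required marker missing: {marker}"
    return True, "ok"


if __name__ == "__main__":
    page = "<html><div class='price'>10</div></html>"
    print(check_response(200, page, ["class='price'"]))
    print(check_response(200, "<html>Access denied</html>", ["price"], ["access denied"]))
```

Logging the returned reason makes it easier to tell apart a blocked request, a changed page layout, and a genuine server error.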
 
Data transforming
* Character encoding ★☆☆☆☆
* Data cleaning e.g. unprintable characters ★★☆☆☆
* [https://en.wikipedia.org/wiki/Regular_expression Regular expression] ★★★☆☆
* Selection of database engine ★★★★☆
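
The character encoding, data cleaning, and regular expression items above can be combined into one cleaning step. The sketch below assumes the scraped bytes are UTF-8 unless told otherwise, and the function name <code>clean_text</code> is hypothetical. It decodes the input, normalises it, drops unprintable characters, and collapses whitespace.

```python
import re
import unicodedata

# Matches one or more whitespace characters (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_text(raw, encoding="utf-8"):
    """Decode scraped bytes (or accept a str) and remove unprintable characters."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode(encoding, errors="replace") if isinstance(raw, bytes) else raw
    # NFKC folds compatibility characters, e.g. a no-break space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    # Drop characters in the Unicode "Other" categories (control, format,
    # surrogate, private use, unassigned), but keep whitespace such as tabs.
    kept = [
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    ]
    # Collapse whitespace runs into a single space and trim both ends.
    return _WHITESPACE_RUN.sub(" ", "".join(kept)).strip()


if __name__ == "__main__":
    print(clean_text(b"caf\xc3\xa9\x00 \t price:\xc2\xa0 10\xe2\x80\x8b"))
```

Cleaning the text before it reaches the database keeps later searches and comparisons from failing on invisible characters.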


== Further reading ==