Web scrape troubleshooting
Revision as of 12:03, 23 August 2020
List of technical issues
- The content of the web page changed (revision): the expected content of the specified DOM element became empty.
  - Multiple sources for the same column, such as different HTML DOM elements that hold the same column value.
  - Back up the HTML text of the parent DOM element.
  - (optional) Back up the complete HTML file.
- The IP was banned by the server.
  - Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension (Scrapy 1.0.3 documentation), or Sleep random seconds in programming.
  - The server responded with status 403 ('403 Forbidden'): change the network IP.
- CAPTCHA
- AJAX
- The web page requires signing in.
- The server blocks requests that lack a Referer or other headers.
- Connection timeout during an HTTP request, e.g. in PHP the default_socket_timeout setting defaults to 60 seconds[1][2].
- Language and URL-encoded strings
- Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
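Several of the issues above (bans caused by rapid requests, requests blocked for missing headers, connection timeouts) can be addressed in the request code itself. The following is a minimal Python sketch using only the standard library, not a definitive implementation. The names `build_request`, `random_delay`, `fetch_all` and `DEFAULT_USER_AGENT`, the delay range and the 30-second timeout are all assumptions chosen for illustration.

```python
import random
import time
import urllib.request

# Hypothetical identifier for the scraper; replace with your own.
DEFAULT_USER_AGENT = "example-scraper/0.1"


def build_request(url, referer=None, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request that carries a User-Agent and, optionally, a Referer."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Pick a random sleep time so requests are not sent at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def fetch_all(urls, timeout=30, referer=None):
    """Fetch each URL in turn, sleeping between requests.

    `timeout` is passed to urlopen so a stalled connection raises an
    error instead of hanging forever.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(random_delay())
        request = build_request(url, referer=referer)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            pages[url] = response.read()
    return pages
```

A call such as `fetch_all(["https://example.com/a", "https://example.com/b"], referer="https://example.com/")` would fetch the two placeholder URLs with a random pause between them.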
| Implementation difficulty | Approach | Comments |
|---|---|---|
| Easy | The URL is the dataset resource itself. | |
| More difficult | The URL is the dataset resource, but the request must simulate a POST form submission with the form data or a user agent. | Use an HTTP request and response data tool or PHP: cURL. |
| More difficult | The scraper must simulate user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file. | Use Selenium or Headless Chrome. |
| Difficult | AJAX | |
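The second row of the table (simulating a POST form submission with form data and a user agent) can be sketched in Python with the standard library. This is an illustrative sketch only: `build_form_request`, the user agent string and the field names in the usage note are hypothetical, and a real site may also require cookies or hidden form fields.

```python
import urllib.parse
import urllib.request


def build_form_request(url, form_data, user_agent="example-scraper/0.1"):
    """Build a POST request whose body is URL-encoded form data.

    Supplying `data` makes urllib send the request as POST.
    """
    body = urllib.parse.urlencode(form_data).encode("ascii")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers)


def submit_form(url, form_data, timeout=30):
    """Send the form and return the response body as bytes."""
    request = build_form_request(url, form_data)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `submit_form("https://example.com/search", {"q": "web scrape", "page": 1})` would post the two fields to the placeholder URL.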
Before starting to web scrape
- Do they offer datasets?
- Do they offer an API (application programming interface)?
Skill tree of web scraping
Data extraction
- How the website is built
  - Understanding the navigation system ★★☆☆☆
  - Parsing the sitemap XML file ★★☆☆☆
- Understanding the web technology
  - HTTP GET/POST ★★☆☆☆
  - CSS selectors and DOM (Document Object Model) elements ★★☆☆☆
  - AJAX (Asynchronous JavaScript and XML) ★★★☆☆
- Using an API to retrieve data ★★☆☆☆
- Parsing remote files to retrieve data ★★☆☆☆
- Using unofficial methods to retrieve data
  - curl command or GNU Wget command ★★☆☆☆
  - SeleniumHQ Browser Automation ★★★☆☆
- Form submission
  - Submitting the form without logging in ★★☆☆☆
  - Submitting the form after logging in to an account ★★★☆☆
- Etiquette of web scraping
  - Limiting the number of web requests ★★☆☆☆
- Tom and Jerry
  - VPN and proxy ★★☆☆☆
  - Decoding the CAPTCHA ★★★★☆
  - Decentralized web scraping ★★★★☆
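As an illustration of the "sitemap XML file" skill listed above, the sketch below extracts page URLs from a sitemap with Python's standard library. It assumes the sitemap uses the standard sitemaps.org namespace; the function name `sitemap_urls` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap used only to demonstrate the function.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for page_url in sitemap_urls(SAMPLE_SITEMAP):
        print(page_url)
```

Because the search matches `<loc>` at any depth, the same function also lists the child sitemap locations of a sitemap index file.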
Data transformation
- Data cleaning, e.g. removing unprintable characters ★★☆☆☆
- Selection of database engine ★★★☆☆
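The data cleaning item above (unprintable characters, and the non-breaking space mentioned in the list of technical issues) can be sketched as follows. This is one possible approach, not the only one: the function name `clean_text` is hypothetical, and the rule of dropping every character in a Unicode "C" (control, format, unassigned) category is an assumption that may be too aggressive for some datasets.

```python
import unicodedata


def clean_text(value):
    """Normalize whitespace and drop unprintable characters.

    - Any whitespace character (including the non-breaking space U+00A0,
      tabs and newlines) becomes a plain space.
    - Characters in a Unicode "C" category (control, format, unassigned),
      such as U+0007 or the zero-width space U+200B, are removed.
    - Runs of spaces are collapsed and the result is trimmed.
    """
    kept = []
    for char in value:
        if char.isspace():
            kept.append(" ")
        elif unicodedata.category(char).startswith("C"):
            continue
        else:
            kept.append(char)
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(clean_text("Price:\u00a0 42\u200b\x07 USD\n"))
```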
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
References