Web scrape troubleshooting: Difference between revisions

Revision as of 18:03, 16 January 2024

Content of the web page changed (revision): the expected content of the specified DOM element became empty.
- Define multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
- Back up the HTML text of the parent DOM element.
- (Optional) Back up the complete HTML file.
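
The fallback idea above can be sketched as follows. This is a minimal sketch using only the Python standard library, not a definitive implementation: the helper names (`TextById`, `extract_text`, `extract_with_fallback`), the element ids and the `snapshot.html` file name are hypothetical, and it backs up the complete page rather than only the parent element.

```python
from html.parser import HTMLParser
from pathlib import Path
import tempfile

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while we are inside the target element
        self.done = False     # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, element_id):
    """Return the stripped text of the element, or '' if missing or empty."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


def extract_with_fallback(html, element_ids, backup_dir):
    """Try each candidate id in order; back up the HTML if all are empty."""
    for element_id in element_ids:
        text = extract_text(html, element_id)
        if text:
            return text
    # Nothing matched: keep a copy of the page so the change can be inspected.
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    (backup_dir / "snapshot.html").write_text(html, encoding="utf-8")
    return None
```

Calling `extract_with_fallback(html, ["old-id", "new-id"], "backup")` returns the first non-empty text, or `None` after writing the snapshot.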

The IP was banned by the server
- Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension — Scrapy 1.0.3 documentation, or Sleep random seconds in programming.
- The server responded with status 403 ('403 Forbidden') --> change the network IP.
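
A rough illustration of the two points above, assuming Python's standard `urllib`. The function name `polite_fetch`, the delay bounds and the injectable `fetch`/`sleep` parameters are assumptions made for this sketch. It cannot change the network IP itself; it only stops on a 403 so the IP can be changed before resuming.

```python
import random
import time
import urllib.error
import urllib.request


def _default_fetch(url):
    """Download one URL with the standard library."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def polite_fetch(urls, min_delay=1.0, max_delay=3.0, fetch=None, sleep=time.sleep):
    """Fetch URLs one by one, sleeping a random time between requests.

    Stops at the first HTTP 403 and returns what was collected so far.
    """
    fetch = fetch or _default_fetch
    results = {}
    for index, url in enumerate(urls):
        if index:
            # Random delay so the requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        try:
            results[url] = fetch(url)
        except urllib.error.HTTPError as err:
            if err.code == 403:
                # Likely banned: stop here instead of hammering the server.
                break
            raise
    return results
```

The `fetch` and `sleep` parameters exist only so the loop can be tried without real network traffic; in normal use they can be left at their defaults.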

Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds^[1]^[2].
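
For comparison, a minimal Python sketch of setting an explicit timeout and retrying when it expires. The function name `fetch_with_timeout`, the default values and the `opener` parameter are assumptions for illustration; `urllib.request.urlopen` and its `timeout` argument are part of the standard library.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=10.0, retries=2, opener=urllib.request.urlopen):
    """Fetch a URL, retrying only when the request times out."""
    last_error = None
    for _ in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except socket.timeout as err:
            # Timeout while reading the response.
            last_error = err
        except urllib.error.URLError as err:
            # urlopen wraps connect timeouts in URLError; re-raise anything else.
            if not isinstance(err.reason, socket.timeout):
                raise
            last_error = err
    raise TimeoutError(f"{url} timed out after {retries + 1} attempts") from last_error
```

Other failures (HTTP errors, DNS errors) are deliberately not retried here, so a timeout problem is not confused with a ban or a broken URL.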

Difficulty of implementation	Description	Approach	Comments
Easy	Well-formatted HTML elements	The URL is the resource of the dataset.
Advanced	Interactive websites	The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or user agent.	Use an HTTP request and response data tool or PHP: cURL
More difficult	Interactive websites	Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file.	Use Selenium or Headless Chrome
Difficult	Interactive websites	Ajax
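
As an illustration of the "Advanced" row, the sketch below builds a POST form request with a custom user agent using Python's standard `urllib` (the table itself suggests PHP: cURL; Python is used here only as an example). The function name `build_form_request`, the URL, the form fields and the user agent string are hypothetical.

```python
import urllib.parse
import urllib.request


def build_form_request(url, form_data, user_agent):
    """Build (but do not send) a POST request that mimics a form submission."""
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Passing data makes urllib use the POST method.
    return urllib.request.Request(url, data=body, headers=headers)


request = build_form_request(
    "http://example.test/search",          # placeholder URL
    {"q": "web scrape", "page": 2},        # placeholder form fields
    "demo-agent/0.1",                      # placeholder user agent
)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

The field names a real site expects can be read from the form's HTML or from the browser's network inspector before writing the request.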

How to find an unofficial (3rd-party) web crawler? A suggested search keyword strategy is as follows:

Troubleshooting of ...

Template

@@ Line 11: / Line 11: @@
 #* Backup the HTML text of parent DOM element
 #* (optional) Complete HTML file backup
 # The IP was banned by the server
 #* Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation], or [[Sleep | Sleep random seconds in programming]].
 #* The server responded with status 403 ('[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]') --> change the network IP
 # [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
 # AJAX
 #* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}} written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski) / Twitter]
@@ Line 26: / Line 29: @@
 # Language and [http://php.net/manual/en/function.urlencode.php URL-encodes string]
-# [[Extract article text from web page]]
+# [[How to extract content from websites]]
 # [[Data cleaning#Data_handling | Data cleaning]] issues e.g. [https://en.wikipedia.org/wiki/Non-breaking_space Non-breaking space] or other [https://en.wikipedia.org/wiki/Whitespace_character Whitespace character]