Web scrape troubleshooting: Difference between revisions

Revision as of 11:59, 20 November 2020

The content of the web page was changed (revised): the expected content of the specified DOM element became empty.
- Use multiple sources for the same column, e.g. different HTML DOM elements that carry the same column value.
- Back up the HTML text of the parent DOM element
- (optional) Complete HTML file backup
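
The "multiple sources of the same column" idea can be sketched as a list of extractors that are tried in order until one returns a non-empty value. This is only a minimal illustration: the `price` markup, the `by_regex` helper and the `PRICE_SOURCES` list are hypothetical, and regular expressions stand in for a real HTML parser to keep the example self-contained.

```python
import re
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def by_regex(pattern: str) -> Extractor:
    """Build a toy extractor that returns the first capture group, stripped."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def first_non_empty(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try each source in order and return the first non-empty value."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value
    # Every source failed: the caller should back up the raw HTML here.
    return None


# Hypothetical page where the same value is available in two places.
PRICE_SOURCES = [
    by_regex(r'<span class="price">(.*?)</span>'),
    by_regex(r'<meta itemprop="price" content="(.*?)"'),
]

# After a redesign the <span> is empty, so the <meta> tag is used instead.
html_after_redesign = (
    '<span class="price"></span><meta itemprop="price" content="19.90">'
)
price = first_non_empty(html_after_redesign, PRICE_SOURCES)
```

When `first_non_empty` returns `None`, that is the point at which saving the parent element's HTML (or the complete file) helps to diagnose what changed.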
The IP was banned by the server
- Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension (Scrapy 1.0.3 documentation), or Sleep random seconds in programming.
- If the server responds with status 403 ('403 Forbidden'), change the network IP
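
A random delay between requests might look like the following sketch. The function name `crawl_with_delay`, its parameters and the 1 to 3 second range are assumptions for illustration only; `fetch` is whatever function actually downloads a URL.

```python
import random
import time
from typing import Callable, Iterable, List


def crawl_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL, sleeping a random time between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No delay before the first request; randomize all later ones.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting, for example by passing `delays.append` with a list named `delays`.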
CAPTCHA
AJAX
The web page requires signing in
The server blocks requests without a Referer or other headers.
Connection timeout during an HTTP request, e.g. in PHP, default_socket_timeout defaults to 60 seconds^[1]^[2].
Language and URL-encoded strings
Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
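
Several of the issues above (missing headers, timeouts, URL encoding and whitespace cleaning) can be handled with the Python standard library alone. The sketch below is an assumption-laden illustration: the function names, the `example-scraper` user agent string and the 10 second timeout are hypothetical choices, not values taken from this page.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url: str, params: dict, referer: str) -> Request:
    """Build a GET request with an encoded query string and explicit headers."""
    # urlencode percent-encodes non-ASCII values as UTF-8 by default.
    url = base_url + "?" + urlencode(params)
    headers = {
        "Referer": referer,
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    }
    return Request(url, headers=headers)


def fetch(request: Request, timeout_s: float = 10.0) -> bytes:
    """Send the request with an explicit timeout instead of the default."""
    with urlopen(request, timeout=timeout_s) as response:
        return response.read()


def clean_text(text: str) -> str:
    """Collapse whitespace runs, including non-breaking spaces, to one space."""
    # For str patterns, \s also matches Unicode whitespace such as U+00A0.
    return re.sub(r"\s+", " ", text).strip()


request = build_request(
    "https://example.com/search", {"q": "café au lait"}, "https://example.com/"
)
cleaned = clean_text("  price:\u00a0100\n\tUSD ")
```

Here `request.full_url` carries the percent-encoded query, and `cleaned` no longer contains the non-breaking space. `fetch` is defined but not called, since it needs a reachable server.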

Implementation difficulty	Approach	Comments
easy	The URL is the dataset resource itself
more difficult	The URL is the dataset resource, but the request must simulate a POST form submission with form data or a user agent	Use an HTTP request and response data tool or PHP: cURL
more difficult	The scraper must simulate user behavior in a browser, such as clicking a button, submitting a form, and finally obtaining the file	Use Selenium or Headless Chrome
difficult	AJAX
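
For the second row of the table, simulating a POST form submission could be sketched as follows with Python's standard library (used here in place of PHP: cURL). The URL, the form field names and the user agent string are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url: str, form_data: dict, user_agent: str) -> Request:
    """Build a request that mimics a browser submitting an HTML form."""
    # Supplying a body makes urllib send the request as POST.
    body = urlencode(form_data).encode("ascii")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers)


form_request = build_form_post(
    "https://example.com/export",
    {"year": "2020", "format": "csv"},
    "Mozilla/5.0 (compatible; example-scraper)",
)
```

The request can then be sent with `urllib.request.urlopen(form_request, timeout=...)`. Pages that need clicks or JavaScript execution fall under the Selenium or Headless Chrome row instead.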

How to find an unofficial (3rd-party) web crawler? The suggested search keyword strategy is as follows:


@@ Line 46: / Line 46: @@
 </div>
-== Before start to web scrpae ==
+== Before starting to write the web scraping (crawler) script ==
-* Are they offer datasets? e.g. open data
+* Does the website offer datasets or files? e.g. open data
-* Are they offer [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)?
+* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)?
 '''How to find the unofficial (3rd-party) web crawler?''' Search keyword strategy suggested as follows:
 * ''target website'' + crawler
 * ''target website'' + bot
-* ''target website'' + download
+* ''target website'' + download / downloader
 == Further reading ==