Web scrape troubleshooting: Difference between revisions

Revision as of 11:33, 10 January 2025

Before writing a web scraping (crawler) script

Does the website offer datasets or files, e.g. open data?
Does the website offer an API (application programming interface)?

List of technical issues

1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.

Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
Back up the HTML text of the parent DOM element.
(Optional) Back up the complete HTML file.
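A minimal sketch of the fallback and backup ideas above, assuming the same value can be read from more than one place in the parent element's HTML. The patterns, the sample markup and the function names are hypothetical; a real scraper would more likely use an HTML parser than regular expressions.

```python
import re

# Hypothetical patterns: two places where the same "price" value may appear.
PRICE_PATTERNS = [
    r'<span id="price">\s*([^<]+?)\s*</span>',
    r'<meta itemprop="price" content="([^"]+)"',
]


def extract_first(html, patterns):
    """Return the first non-empty match among the patterns, or None."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None


def extract_with_backup(html, patterns, backup):
    """Extract a value; if every source is empty, keep the raw HTML for inspection."""
    value = extract_first(html, patterns)
    if value is None:
        # Keeping the parent element's HTML makes it possible to see what changed.
        backup.append(html)
    return value
```

If the first source becomes empty after a site revision, the second one is tried; if both fail, the stored HTML can be used to update the patterns later.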

2. The IP address was banned by the server

Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension (Scrapy 1.0.3 documentation), or Sleep random seconds in programming.
If the server responds with status 403 ('403 Forbidden'), change the network IP.
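The delay between requests can be sketched as follows. This is an illustration only: `polite_delay`, `crawl` and the `fetch` callable (assumed to return a `(status, body)` pair) are hypothetical names, and the delay range is an arbitrary assumption.

```python
import random
import time


def polite_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between two requests and return the duration."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests; stop on HTTP 403."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            polite_delay(min_seconds, max_seconds, sleep)
        status, body = fetch(url)
        if status == 403:
            # The server refuses this client; continuing from the same IP is unlikely to help.
            break
        results[url] = body
    return results
```

The `sleep` parameter is injectable so that the loop can be exercised without actually waiting.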

3. CAPTCHA

4. AJAX

Autoscroll on Chrome or Edge, written by Peter Legierski (@PeterLegierski)

5. The web page requires signing in

6. Requests without a Referer or other headers are blocked.
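A minimal sketch of sending the headers a server may expect, using Python's standard `urllib.request`. The function name, URLs and User-Agent string are assumptions for illustration; which headers a given site actually checks has to be found by inspecting a normal browser request.

```python
import urllib.request


def build_request(url, referer, user_agent="Mozilla/5.0 (compatible; example-crawler)"):
    """Build a request that carries Referer and User-Agent headers."""
    headers = {"Referer": referer, "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


# Hypothetical URLs; pass the result to urllib.request.urlopen() to send it.
request = build_request("https://example.com/data", "https://example.com/")
```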

7. Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds^[1]^[2].
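The same concern applies in other languages. A sketch in Python, assuming an explicit timeout is preferable to relying on a default; the `fetch` helper and the 10-second value are hypothetical choices.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body, or None on a timeout or URL error.

    urllib.error.URLError also covers HTTP error statuses (HTTPError).
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError):
        return None
```

Returning `None` lets the caller decide whether to retry, skip the URL, or log the failure.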

8. Language and URL-encoded strings
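For example, non-ASCII keywords must be percent-encoded before they are placed in a URL. A short sketch with Python's standard `urllib.parse`; the keyword and the query parameter names are illustrative assumptions.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "中文"
encoded = quote(keyword)    # percent-encodes the UTF-8 bytes of the keyword
decoded = unquote(encoded)  # reverses the encoding

# urlencode() builds a whole query string and encodes spaces as "+".
query = urlencode({"q": "web scrape", "lang": "zh-TW"})

print(encoded)  # %E4%B8%AD%E6%96%87
print(query)    # q=web+scrape&lang=zh-TW
```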

9. How to extract content from websites

10. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
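A minimal cleaning sketch, assuming the goal is to collapse every kind of whitespace into single ASCII spaces. `clean_text` is a hypothetical helper, and the list of zero-width characters is not exhaustive.

```python
def clean_text(text):
    """Normalize whitespace in scraped text."""
    # Zero-width characters are not whitespace for str.split(), so remove them first.
    for ch in ("\u200b", "\ufeff"):
        text = text.replace(ch, "")
    # str.split() with no arguments splits on any Unicode whitespace,
    # including U+00A0 (non-breaking space) and U+3000 (ideographic space).
    return " ".join(text.split())
```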

11. Is the link a permanent link?

12. Enabling/disabling CSS or JavaScript

Difficulty of implementation	Description	Approach	Comments
Easy	Well-formatted HTML elements	The URL is the dataset resource.
Advanced	Interactive websites	The URL is the dataset resource. Requires simulating a POST form submission with the form data or user agent.	Use an HTTP request and response data tool or PHP: cURL
More difficult	Interactive websites	Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file.	Use Selenium or Headless Chrome
Difficult	Interactive websites	AJAX
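The "Advanced" row (simulating a POST form submission with form data and a user agent) can be sketched with Python's standard library. The URL, field names and User-Agent string below are hypothetical; the real form fields have to be read from the page or from the browser's network panel.

```python
import urllib.parse
import urllib.request


def build_form_post(url, form_data, user_agent="Mozilla/5.0 (compatible; example-crawler)"):
    """Build a POST request that submits form_data as a URL-encoded form."""
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes urllib send a POST instead of a GET
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Hypothetical form; pass the result to urllib.request.urlopen() to submit it.
request = build_form_post("https://example.com/search", {"year": "2024", "page": "1"})
```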

Search keyword strategy

How can you find unofficial (third-party) web crawlers? The following search keyword strategies are suggested:

target website + crawler site:github.com
target website + scraper site:github.com
target website + bot site:github.com
target website + download / downloader site:github.com
target website + browser client site:github.com
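The keyword patterns above can be generated for any target site. A small sketch; `build_queries` and the example target are hypothetical.

```python
# Keywords taken from the list above.
SUFFIXES = ["crawler", "scraper", "bot", "download", "downloader", "browser client"]


def build_queries(target, site="github.com"):
    """Return one search query per keyword, restricted to the given site."""
    return [f"{target} {suffix} site:{site}" for suffix in SUFFIXES]


for query in build_queries("example.com"):
    print(query)
```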

References

Troubleshooting of ...

Troubleshooting of Excel errors

PHP, cUrl, Python, selenium, HTTP status code errors

Database: SQL syntax debug, MySQL errors, MySQLTuner errors or PostgreSQL errors

Troubleshooting of regular expression

HTML/Javascript: Troubleshooting of javascript, XPath

Software: Mediawiki, Docker, FTP problems, online conference software

Test connectivity for the web service, Web Ping, Network problem, Web user behavior, Web scrape troubleshooting

Template

Bug report template

[1] PHP: Runtime Configuration - Manual

[2] url - Error Codes


