Web scrape troubleshooting

Latest revision as of 06:23, 23 December 2025

Before starting to write a web scraping (crawler) script

  • Does the website offer datasets or files, e.g. open data?
  • Does the website offer an API (application programming interface)?


List of technical issues

1. The content of the web page was changed (revised): the expected content (of the specified DOM element) became empty.

  • Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
  • Back up the HTML text of the parent DOM element.
  • (Optional) Back up the complete HTML file.
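
The ideas in item 1 (several sources for the same column, plus an HTML backup) might be sketched as follows. This is a minimal illustration using only the Python standard library. The helper names `regex_extractor`, `first_non_empty`, and `backup_html`, the `price` column, and the patterns are all hypothetical, and a real scraper would more likely use an HTML parser than regular expressions.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group of `pattern`."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1) if match else None

    return extract


def first_non_empty(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try each source in order and return the first non-blank value."""
    for extract in extractors:
        value = extract(html)
        if value and value.strip():
            return value.strip()
    return None


def backup_html(html: str, path: Path) -> None:
    """Keep the raw HTML so a failed extraction can be inspected later."""
    path.write_text(html, encoding="utf-8")


# Hypothetical example: the same "price" column appears in two DOM locations.
PRICE_SOURCES = [
    regex_extractor(r'<span id="price">(.*?)</span>'),
    regex_extractor(r'<meta itemprop="price" content="(.*?)"'),
]
```

If `first_non_empty` returns `None` for a page, the saved backup makes it possible to check how the markup changed without requesting the page again.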

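2. The IP was banned by the server

  • Random delays: set a delay (sleep time) between requests, e.g. PHP: sleep - Manual (http://php.net/manual/en/function.sleep.php), the AutoThrottle extension in the Scrapy 1.0.3 documentation (http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle), or the wiki page "Sleep random seconds in programming".
  • The server responded with a status of 403 (403 Forbidden): change the network IP.
  • Smart retry: automatic retry or exponential backoff[1] on network errors.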
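
For item 2, random delays between requests and retries with exponential backoff could be sketched as below. This is only an illustration under assumed defaults: the function names, the 1 to 3 second delay range, the 60 second cap, and the choice to treat `OSError` (which covers connection and timeout errors in Python) as retryable are all assumptions, not values taken from this page.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def polite_sleep(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Sleep for a random time between requests and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the wait before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds, waiting longer after each network error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            limit = backoff_delay(attempt, base, cap)
            # Jitter: wait a random time in [limit / 2, limit] so retries do not synchronize.
            sleep(random.uniform(limit / 2, limit))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.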


3. CAPTCHA

4. AJAX

5. The web page requires signing in

6. The server blocks requests without a Referer or other headers.

7. Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout defaults to 60 seconds[2][3].
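
Items 6 and 7 can be illustrated together: send the headers a browser would send, and set an explicit timeout instead of relying on a default. The sketch below uses Python's standard `urllib.request`. The function names, the header values, and the 10 second timeout are assumptions chosen for illustration.

```python
import urllib.request


def build_request(url: str, referer: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request carrying headers that some servers require."""
    return urllib.request.Request(
        url,
        headers={
            "Referer": referer,
            "User-Agent": user_agent,
            "Accept-Language": "en-US,en;q=0.8",
        },
    )


def fetch(request: urllib.request.Request, timeout_seconds: float = 10.0) -> bytes:
    """Send the request; `timeout` limits how long blocking socket operations may wait."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read()
```

Which headers a given server checks varies, so it may help to compare the request against what a browser sends for the same page.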

8. Language and URL-encoded strings
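
As a small illustration of item 8, the standard `urllib.parse` module percent-encodes non-ASCII text. The search term and parameter names below are made up. The last line assumes a site that expects a legacy character set (Latin-1 here) rather than UTF-8, which is one way language settings and URL encoding interact.

```python
from urllib.parse import quote, unquote, urlencode

term = "café au lait"  # hypothetical search term containing a non-ASCII character

path_part = quote(term)                        # UTF-8 bytes, percent-encoded
round_trip = unquote(path_part)                # decodes back to the original text
query = urlencode({"q": term, "page": 2})      # form-style encoding: spaces become "+"
legacy = quote(term, encoding="latin-1")       # same text for a site expecting Latin-1

print(path_part)  # caf%C3%A9%20au%20lait
print(query)      # q=caf%C3%A9+au+lait&page=2
print(legacy)     # caf%E9%20au%20lait
```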

9. How to extract content from websites

10. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
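
One possible way to handle the whitespace problem in item 10 is shown below. In Python, `\s` in a `str` pattern already matches Unicode whitespace such as the non-breaking space (U+00A0). Zero-width characters are not whitespace, so this sketch removes them separately. The function name and the list of zero-width characters are assumptions.

```python
import re

# Zero-width characters are invisible but are not matched by \s, so drop them explicitly.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"), None)


def clean_text(raw: str) -> str:
    """Remove zero-width characters and collapse all Unicode whitespace to single spaces."""
    without_invisible = raw.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", without_invisible).strip()


print(clean_text("Price:\u00a0 42\u3000USD\u200b\n"))  # Price: 42 USD
```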

11. Is the link a permanent link?

12. Enabling/disabling CSS or JavaScript

Difficulty in implementing | Description | Approach | Comments
Easy | Well-formatted HTML elements | The URL is the resource of the dataset. |
Advanced | Interactive websites | The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or user agent. | Use an HTTP request and response data tool or PHP: cURL
More difficult | Interactive websites | Requires simulating user behavior in the browser, such as clicking a button, submitting a form, and finally obtaining the file. | Use Selenium or Headless Chrome
Difficult | Interactive websites | Ajax |
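
The "Advanced" case above (simulating a POST form submission with form data and a user agent) might look roughly like this with Python's standard library instead of PHP cURL. The function name, URL, and form field names are hypothetical. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request
from typing import Mapping


def build_form_post(url: str, form_data: Mapping[str, str], user_agent: str) -> urllib.request.Request:
    """Build a request that mimics submitting an HTML form."""
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes urllib send a POST request
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_form_post(
    "https://example.com/export",           # hypothetical form action URL
    {"year": "2024", "format": "csv"},      # hypothetical form fields
    "Mozilla/5.0 (compatible; example-bot)",
)
```

The field names usually have to be read from the form's HTML or from the browser's network inspector, since they are specific to each site.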


Search keyword strategy

How can you find an unofficial (third-party) web crawler? The following search keywords are suggested:

  • target website + crawler site:github.com
  • target website + scraper site:github.com
  • target website + bot site:github.com
  • target website + download / downloader site:github.com
  • target website + browser client site:github.com
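
The keyword patterns above can be generated for any target site with a few lines of code. This is a trivial sketch; the function name and the example target are made up.

```python
from typing import List

KEYWORDS = ["crawler", "scraper", "bot", "download", "downloader", "browser client"]


def github_search_queries(target: str) -> List[str]:
    """Combine the target website name with each suggested keyword."""
    return [f"{target} {keyword} site:github.com" for keyword in KEYWORDS]


for query in github_search_queries("example.com"):
    print(query)
```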

Common Web Scraping Issues and Solutions

Complex Webpage Structure

One frequent challenge in web scraping is dealing with overly complex webpage structures that are difficult to parse. Here's how to address this:

Solution: Find Alternative Page Versions

Look for simpler versions of the same webpage content through:

  • Mobile versions of the site
  • AMP (Accelerated Mobile Pages) versions

Example:

The AMP version typically offers a more streamlined structure that's easier to parse, while containing the same core content.
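
One way to locate an AMP version programmatically is sketched below. It assumes the regular page advertises its AMP counterpart with a `<link rel="amphtml" href="...">` element, which is the usual convention but is not guaranteed for every site. The class and function names and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class AmpLinkFinder(HTMLParser):
    """Record the href of the first <link rel="amphtml"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.amp_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_url is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; attribute values can be None.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "amphtml" in rel_values:
            self.amp_url = attributes.get("href")


def find_amp_url(html: str) -> Optional[str]:
    """Return the advertised AMP URL, or None if the page does not declare one."""
    finder = AmpLinkFinder()
    finder.feed(html)
    return finder.amp_url


page = (
    '<html><head><link rel="canonical" href="https://example.com/a">'
    '<link rel="amphtml" href="https://example.com/amp/a"></head></html>'
)
print(find_amp_url(page))  # https://example.com/amp/a
```

When no such link is declared, the mobile version of the site is the remaining alternative mentioned above.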

Further reading

References

  1. Exponential backoff - Wikipedia: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."
  2. PHP: Runtime Configuration - Manual
  3. libcurl - Error Codes

