Web scrape troubleshooting: Difference between revisions

== Before starting to write the script for web scraping (crawler) ==
* Does the website offer datasets or files? e.g. open data
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?
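If the site does offer an API, retrieving structured data directly is usually more reliable than scraping HTML. A minimal Python sketch with the standard library; the endpoint URL and the response shape are hypothetical, not from any real portal:

```python
import json
import urllib.request

# Hypothetical open-data API endpoint; real portals document their own URLs.
url = "https://example.com/api/v1/datasets?format=json"

# The live request would be:
# with urllib.request.urlopen(url, timeout=30) as resp:
#     records = json.load(resp)

# Parsing a JSON payload of the assumed shape (sample data, not a real response):
payload = '{"datasets": [{"id": 1, "title": "Air quality"}]}'
records = json.loads(payload)
titles = [d["title"] for d in records["datasets"]]
```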
== List of technical issues ==


# Content of the web page was changed (revision): the expected web content (of the specified DOM element) became empty.
#* Multiple sources of the same column, e.g. different HTML DOM elements that carry the same column value.
#* Back up the HTML text of the parent DOM element.
#* (optional) Back up the complete HTML file.
# The IP was banned by the server.
#* Random delays: set a temporization (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* The server responded with a status of 403: '[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]' --> change the network IP.
#* Smart retry: '''automatic retry''' or '''exponential backoff'''<ref>[https://en.wikipedia.org/wiki/Exponential_backoff Exponential backoff - Wikipedia]: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."</ref> on network errors.
# [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
# AJAX
#* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}}, written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski) / Twitter]
# The web page requires signing in.
# The server blocks requests that lack a {{kbd | key= Referer}} or other headers.
# Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} is 30 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.
# Language and [http://php.net/manual/en/function.urlencode.php URL-encoded strings]
# [[How to extract content from websites]]
# [[Data cleaning#Data_handling | Data cleaning]] issues, e.g. [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking space] or other [https://en.wikipedia.org/wiki/Whitespace_character whitespace characters]
# Is the link a permanent link?
# Enabling/disabling CSS or JavaScript
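The "random delays" and "smart retry" advice above can be sketched in Python with the standard library. The delay bounds, retry count, and <code>polite_get</code> helper name are illustrative choices, not recommendations from this page:

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base_delay=1.0):
    """Exponential backoff: wait 1 s, 2 s, 4 s, ... before retry `attempt`."""
    return base_delay * (2 ** attempt)

def polite_get(url, max_retries=4):
    """Fetch a URL with a random inter-request delay and retry on errors."""
    for attempt in range(max_retries):
        # Random delay (temporization) between requests, to avoid an IP ban.
        time.sleep(random.uniform(1.0, 3.0))
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 403:
                raise  # 403 Forbidden: retrying rarely helps; change the IP.
            time.sleep(backoff_delay(attempt))
        except urllib.error.URLError:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```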

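For issue 8 above (language and URL-encoded strings): non-ASCII characters in a URL path or query string must be percent-encoded, which Python's standard library handles. The host and keyword below are placeholders:

```python
from urllib.parse import quote, urlencode

# Non-ASCII path segments and query values must be percent-encoded,
# otherwise requests for e.g. Chinese-language pages can fail.
keyword = "開放資料"  # "open data" in Chinese; a sample value
path = "/search/" + quote(keyword)
query = urlencode({"q": keyword, "lang": "zh-TW"})
url = "https://example.com" + path + "?" + query  # placeholder host
```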

<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table class="wikitable">
   <tr>
     <th>Difficulty in implementing</th>
     <th>Description</th>
     <th>Approach</th>
     <th>Comments</th>
   </tr>
   <tr>
     <td>Easy</td>
     <td>Well-formatted HTML elements</td>
     <td>The URL is the resource of the dataset.</td>
     <td></td>
   </tr>
   <tr>
     <td>Advanced</td>
     <td>Interactive websites</td>
     <td>The URL is the resource of the dataset. Requires simulating a POST form submit with the form data or [[User agent|user agent]].</td>
     <td>Using [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
   </tr>
   <tr>
     <td>More difficult</td>
     <td>Interactive websites</td>
     <td>Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file.</td>
     <td>Using [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
   </tr>
   <tr>
     <td>Difficult</td>
     <td>Interactive websites</td>
     <td>Ajax</td>
     <td></td>
   </tr>
</table>
</div>
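The "Advanced" row above (simulating a POST form submit with form data, a user agent, and a Referer header) can be sketched with Python's standard library instead of PHP cURL. The target URL and form field names are hypothetical:

```python
import urllib.parse
import urllib.request

# Hypothetical target and form fields, for illustration only.
url = "https://example.com/search"
form_data = urllib.parse.urlencode({"keyword": "open data", "page": "1"}).encode()

# Build a browser-like POST request: some servers reject requests
# that lack a User-Agent or Referer header.
req = urllib.request.Request(
    url,
    data=form_data,
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        "Referer": "https://example.com/",
    },
)
# Sending it would be:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```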


== Skill tree of web scraping ==
=== Search keyword strategy ===
'''How to find an unofficial (third-party) web crawler?''' Suggested search keyword patterns:
* ''target website'' + crawler site:github.com
* ''target website'' + scraper site:github.com
* ''target website'' + bot site:github.com
* ''target website'' + download / downloader site:github.com
* ''target website'' + browser client site:github.com
 
=== Data extraction ===
* How they build the website
** Understanding the navigation system ★★☆☆☆
** Parsing the sitemap XML file ★★☆☆☆
* Understanding the web technology
** HTTP GET/POST ★★☆☆☆
** CSS selectors and DOM (Document Object Model) elements ★★☆☆☆
** AJAX (Asynchronous JavaScript and XML) ★★★☆☆
* Using an API to retrieve data ★★☆☆☆
* Parsing remote files to retrieve data ★★☆☆☆
* Using unofficial methods to retrieve data
** The [https://curl.haxx.se/ curl] command or the [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
* Form submission
** Submitting the form without logging in ★★☆☆☆
** Submitting the form after logging into an account ★★★☆☆
* Etiquette of web scraping
** Limiting the rate of web requests ★★☆☆☆
* Tom and Jerry (cat and mouse)
** VPN and proxy ★★☆☆☆
** Decoding the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
** Decentralized web scraping ★★★★☆

=== Data transforming ===
* Data cleaning, e.g. unprintable characters ★★☆☆☆
* Selection of the database engine ★★★☆☆

== Common Web Scraping Issues and Solutions ==

=== Complex Webpage Structure ===
One frequent challenge in web scraping is an overly complex webpage structure that is difficult to parse.

'''Solution: find alternative page versions.''' Look for simpler versions of the same webpage content:
# Mobile versions of the site
# AMP (Accelerated Mobile Pages) versions

'''Example:'''
* Standard webpage: <code>https://www.ettoday.net/news/20250107/2888050.htm</code>
* AMP version: <code>https://www.ettoday.net/amp/amp_news.php7?news_id=2888050&ref=mw&from=google.com</code>

The AMP version typically offers a more streamlined structure that is easier to parse, while containing the same core content.
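Extracting the content of a specific DOM element (the "CSS selectors and DOM elements" skill above) is usually done with BeautifulSoup or lxml; a dependency-free sketch with Python's standard-library <code>html.parser</code> is shown below. The sample HTML and the <code>h1.title</code> target are invented for illustration:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside <h1 class="title"> elements.

    Real scrapers typically use CSS selectors via BeautifulSoup or lxml;
    this standard-library sketch shows the underlying DOM-event idea.
    """
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h1" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

# Sample HTML standing in for a fetched page (hypothetical content).
html_doc = '<html><body><h1 class="title">Open data portal</h1></body></html>'
parser = TitleExtractor()
parser.feed(html_doc)
```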


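The data-cleaning issues above (non-breaking spaces and other whitespace look-alikes) can be handled in Python with Unicode normalization; the sample string is invented:

```python
import unicodedata

# Whitespace look-alikes that often survive scraping: non-breaking space
# (U+00A0), zero-width space (U+200B), etc.
raw = "Price:\u00a0 1,234\u200b TWD"

def clean_text(text):
    """Normalize Unicode and collapse all whitespace runs to single spaces."""
    text = unicodedata.normalize("NFKC", text)   # U+00A0 -> regular space
    text = text.replace("\u200b", "")            # NFKC keeps zero-width spaces
    return " ".join(text.split())

cleaned = clean_text(raw)
```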
== Further reading ==
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
* [https://blog.gslin.org/archives/2022/08/22/10850/%e5%8d%97%e9%9f%93%e6%9c%80%e9%ab%98%e6%b3%95%e9%99%a2%e4%b9%9f%e5%b0%8d-web-scraping-%e7%b5%a6%e5%87%ba%e4%ba%86%e9%a1%9e%e4%bc%bc%e7%be%8e%e5%9c%8b%e7%9a%84%e5%88%a4%e4%be%8b/ South Korea's Supreme Court also handed down a web-scraping ruling similar to the US precedent – Gea-Suan Lin's BLOG]
* [https://www.yzu.edu.tw/library/index.php/tw/news-tw/943-20181226-1 Yuan Ze University - Library - Tips you should know about suspected copyright infringement]
* [http://www.naipo.com/Portals/1/web_tw/Knowledge_Center/Infringement_Case/IPNC_170125_0501.htm North America Intellectual Property Report (北美智權報) No. 177: Big data and fair use under copyright law]


== References ==
<references />

[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:Web scraping]]

Latest revision as of 06:23, 23 December 2025
