Web scrape troubleshooting

== Before starting to write the web-scraping script (crawler) ==
* Does the website offer datasets or downloadable files (e.g. open data)?
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?
== List of technical issues ==
# The content of the web page was changed (site revision): the expected content of the specified DOM element became empty.
#* Keep multiple sources for the same column, i.e. different HTML DOM elements that carry the same value.
#* Back up the HTML text of the parent DOM element.
#* (optional) Back up the complete HTML file.
# The IP was banned by the server.
#* Random delays: set a temporization (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension — Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
#* The server responded with a status of 403 '[https://zh.wikipedia.org/wiki/HTTP_403 403 Forbidden]' → change the network IP.
#* Smart retry: '''automatic retry''' or '''exponential backoff'''<ref>[https://en.wikipedia.org/wiki/Exponential_backoff Exponential backoff - Wikipedia]: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."</ref> on network errors.
# [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]
# AJAX
#* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}}, written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski) / Twitter]
# The web page requires signing in.
# The server blocks requests that lack a {{kbd | key= Referer}} or other headers.
# Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} defaults to 30 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.
# Language and [http://php.net/manual/en/function.urlencode.php URL-encoded strings]
# [[How to extract content from websites]]
# [[Data cleaning#Data_handling | Data cleaning]] issues, e.g. the [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking space] or other [https://en.wikipedia.org/wiki/Whitespace_character whitespace characters]
# Is the link a permanent link?
# Enabling/disabling CSS or JavaScript


<div class="table-responsive" style="width:100%;    min-height: .01%;    overflow-x: auto;">
<table>
  <tr>
    <th>Difficulty in implementing</th>
    <th>Description</th>
    <th>Approach</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>Easy</td>
    <td>Well-formatted HTML elements</td>
    <td>The URL is the resource of the dataset.</td>
    <td></td>
  </tr>
  <tr>
    <td>Advanced</td>
    <td>Interactive websites</td>
    <td>The URL is the resource of the dataset; requires simulating a POST form submit with the form data or a [[User agent|user agent]].</td>
    <td>Using [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
  </tr>
  <tr>
    <td>Advanced</td>
    <td>Interactive websites</td>
    <td>Requires simulating user behavior in the browser, such as clicking a button or submitting a form, to finally obtain the file.</td>
    <td>Using [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
  </tr>
  <tr>
    <td>Difficult</td>
    <td>Interactive websites</td>
    <td>AJAX</td>
    <td></td>
  </tr>
</table>
</div>
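The "simulate a POST form submit" row can be sketched with Python's standard library. The URL, form fields, and header values below are placeholders — substitute the real ones observed in the browser's network inspector. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder form fields for a hypothetical search endpoint.
form_data = urlencode({"query": "open data", "page": "1"}).encode()

req = Request(
    "https://example.com/search",  # hypothetical endpoint
    data=form_data,                # the presence of a body makes this a POST
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
        "Referer": "https://example.com/",  # some servers reject requests without it
        "Content-Type": "application/x-www-form-urlencoded",
    },
)

print(req.get_method())           # POST
print(req.get_header("Referer"))  # https://example.com/
# urllib.request.urlopen(req) would actually send the request.
```

The same request can be expressed in any HTTP client (Scrapy's <code>FormRequest</code>, PHP cURL with <code>CURLOPT_POSTFIELDS</code>); the key point is copying the form data and headers the browser sends.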


== Search keyword strategy ==
'''How do you find an unofficial (third-party) web crawler?''' Suggested search keyword strategies:
* ''target website'' + crawler site:github.com
* ''target website'' + scraper site:github.com
* ''target website'' + bot site:github.com
* ''target website'' + download / downloader site:github.com
* ''target website'' + browser client site:github.com
== Common web scraping issues and solutions ==
=== Complex webpage structure ===
A frequent challenge in web scraping is an overly complex webpage structure that is difficult to parse.

'''Solution: find alternative page versions'''

Look for simpler versions of the same webpage content:
# Mobile versions of the site
# AMP (Accelerated Mobile Pages) versions

'''Example:'''
* Standard webpage: <code>https://www.ettoday.net/news/20250107/2888050.htm</code>
* AMP version: <code>https://www.ettoday.net/amp/amp_news.php7?news_id=2888050&ref=mw&from=google.com</code>

The AMP version typically offers a more streamlined structure that is easier to parse, while containing the same core content.
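Once the AMP pattern for a site is known, the rewrite can be automated. The sketch below infers the pattern from the single example pair above — the trailing <code>ref</code>/<code>from</code> parameters look like tracking values and are omitted here, and other sites use entirely different AMP URL schemes, so treat this as an assumption to verify per site.

```python
import re

def to_amp_url(url):
    """Rewrite an ettoday news URL into its AMP form. Pattern inferred
    from one observed example pair; verify before relying on it."""
    m = re.match(r"https://www\.ettoday\.net/news/\d+/(\d+)\.htm$", url)
    if m is None:
        return None  # not a recognized news URL
    return f"https://www.ettoday.net/amp/amp_news.php7?news_id={m.group(1)}"

print(to_amp_url("https://www.ettoday.net/news/20250107/2888050.htm"))
# → https://www.ettoday.net/amp/amp_news.php7?news_id=2888050
```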


== Further reading ==
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
* [https://blog.gslin.org/archives/2022/08/22/10850/%e5%8d%97%e9%9f%93%e6%9c%80%e9%ab%98%e6%b3%95%e9%99%a2%e4%b9%9f%e5%b0%8d-web-scraping-%e7%b5%a6%e5%87%ba%e4%ba%86%e9%a1%9e%e4%bc%bc%e7%be%8e%e5%9c%8b%e7%9a%84%e5%88%a4%e4%be%8b/ South Korea's Supreme Court also issued a web-scraping ruling similar to the US precedent – Gea-Suan Lin's BLOG]
* [https://www.yzu.edu.tw/library/index.php/tw/news-tw/943-20181226-1 Yuan Ze University Library – Suspected copyright infringement: tips you should know]
* [http://www.naipo.com/Portals/1/web_tw/Knowledge_Center/Infringement_Case/IPNC_170125_0501.htm North America Intellectual Property Report, Issue 177: Big data and the fair use of copyright]


== References ==
<references />
