Web scrape troubleshooting
Latest revision as of 06:23, 23 December 2025
Before starting to write a web scraping script (crawler)
- Does the website offer datasets or files, e.g. open data?
- Does the website offer an API (application programming interface)?
List of technical issues
1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.
- Use multiple sources for the same column, e.g. different HTML DOM elements that carry the same column value.
- Back up the HTML text of the parent DOM element.
- (Optional) Back up the complete HTML file.
2. The IP was banned by the server
- Random delays: set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension (Scrapy 1.0.3 documentation) or Sleep random seconds in programming.
- The server responded with status 403 ('403 Forbidden') --> change the network IP.
- Smart retry: automatic retry or exponential backoff[1] on network errors.
3. CAPTCHA
4. AJAX
- Autoscroll on Chrome or Edge, written by Peter Legierski (@PeterLegierski)
5. The web page requires signing in
6. The server blocks requests that lack a Referer or other headers.
7. Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout defaults to 60 seconds[2][3].
8. Language and URL-encoded strings
9. How to extract content from websites
10. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
11. Is the link a permanent link?
12. Enabling or disabling CSS or JavaScript
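The random delay and exponential backoff ideas from item 2 can be combined into one retry helper. The following is a minimal sketch, not a definitive implementation: the function name `fetch_with_backoff` and its parameters are hypothetical, and it assumes the caller passes a `fetch` callable that raises `OSError` (or a subclass such as `TimeoutError` or `urllib.error.URLError`) on network errors.

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    All names here are illustrative assumptions, not part of any library.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            # Network errors (timeouts, refused connections) are OSError subclasses.
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Double the upper bound on each retry, capped at max_delay,
            # then sleep a random time below it so requests are not evenly spaced.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A caller might wrap a real request like `fetch_with_backoff(lambda: urllib.request.urlopen(url, timeout=10).read())`. The `sleep` parameter exists only so the waiting can be replaced in tests.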
| Difficulty of implementation | Description | Approach | Comments |
|---|---|---|---|
| Easy | Well-formatted HTML elements | The URL is the resource of the dataset. | |
| Advanced | Interactive websites | The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or user agent. | Use HTTP request and response data tool or PHP: cURL |
| More difficult | Interactive websites | Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file. | Use Selenium or Headless Chrome |
| Difficult | Interactive websites | Ajax | |
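The "Advanced" row (simulating a POST form submission with form data and a user agent) can be sketched with Python's standard library. This is only an illustration under stated assumptions: `build_form_request` is a hypothetical helper, and the URL, form fields and header values in the usage note are placeholders. The sketch builds the request object without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, form_data, referer, user_agent):
    """Build (but do not send) a POST request that mimics a browser form submit."""
    # urlencode() percent-encodes the form fields, including non-ASCII text.
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "Referer": referer,        # some servers reject requests without it
        "User-Agent": user_agent,  # identify as a browser-like client
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")
```

For example, `build_form_request("https://example.com/search", {"q": "open data", "page": 1}, referer="https://example.com/", user_agent="Mozilla/5.0 (compatible; example-crawler)")` returns a request that can then be passed to `urllib.request.urlopen`.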
Search keyword strategy
How to find an unofficial (third-party) web crawler? Suggested search keywords:
- target website + crawler site:github.com
- target website + scraper site:github.com
- target website + bot site:github.com
- target website + download / downloader site:github.com
- target website + browser client site:github.com
Common Web Scraping Issues and Solutions
Complex Webpage Structure
One frequent challenge in web scraping is dealing with overly complex webpage structures that are difficult to parse. Here's how to address this:
Solution: Find Alternative Page Versions
Look for simpler versions of the same webpage content through:
1. Mobile versions of the site
2. AMP (Accelerated Mobile Pages) versions
Example:
- Standard webpage: `https://www.ettoday.net/news/20250107/2888050.htm`
- AMP version: `https://www.ettoday.net/amp/amp_news.php7?news_id=2888050&ref=mw&from=google.com`
The AMP version typically offers a more streamlined structure that's easier to parse, while containing the same core content.
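One way to locate an AMP version programmatically is to look for a `<link rel="amphtml" href="...">` element in the standard page's HTML, which is how AMP pages are normally advertised. The sketch below assumes the target site publishes such a link; not every site does, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AmpLinkFinder(HTMLParser):
    """Record the href of the first <link rel="amphtml"> element seen."""

    def __init__(self):
        super().__init__()
        self.amp_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_url is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attr.get("rel") or "").lower().split()
        if "amphtml" in rel_values:
            self.amp_url = attr.get("href")


def find_amp_url(html_text):
    """Return the AMP URL advertised by the page, or None if there is none."""
    finder = AmpLinkFinder()
    finder.feed(html_text)
    return finder.amp_url
```

If `find_amp_url` returns `None`, the page does not advertise an AMP version this way, and a mobile version or the standard page has to be parsed instead.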
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
- Skill tree of web scraping
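- South Korea's Supreme Court also issued a ruling on web scraping similar to the US precedent - Gea-Suan Lin's BLOG
- Yuan Ze University Library - Suspected infringement: tips you need to know
- North America Intellectual Property Newsletter, Issue 177: Big data and fair use under copyright law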
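References
1. Exponential backoff - Wikipedia (https://en.wikipedia.org/wiki/Exponential_backoff): "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."
2. PHP: Runtime Configuration - Manual (https://www.php.net/manual/en/filesystem.configuration.php)
3. libcurl - Error Codes (https://curl.haxx.se/libcurl/c/libcurl-errors.html)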