Web scrape troubleshooting
Before starting to write a web scraping script (crawler), check:
- Does the website offer datasets or downloadable files? e.g. open data
- Does the website offer an API (Application Programming Interface)?
List of technical issues
1. The content of the web page was changed (revised): the expected content of the specified DOM element became empty.
- Use multiple sources for the same column, e.g. different HTML DOM elements that carry the same value.
- Back up the HTML text of the parent DOM element.
- (optional) Back up the complete HTML file.
2. The IP was banned by the server
- Random delays: set a sleep time between requests, e.g. PHP: sleep - Manual, AutoThrottle extension — Scrapy 1.0.3 documentation, or sleeping a random number of seconds.
- The server responded with a status of 403 ('403 Forbidden') --> change the network IP.
- Smart retry: automatic retry or exponential backoff[1] on network errors (up to 3 times).
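The random-delay and smart-retry advice above can be sketched in Python. This is a minimal illustration, not a real library API: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands in for whatever function actually performs the HTTP request.

```python
import random
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Call fetch() and retry on errors with exponential backoff.

    `fetch` is any zero-argument callable that performs the HTTP
    request and raises an exception on a network error; the name
    and interface are illustrative, not a real library API.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # give up after max_retries retries
            # Exponential backoff (1s, 2s, 4s, ...) plus a random
            # jitter so requests do not arrive at a fixed rhythm.
            time.sleep(base_delay * (2 ** attempt)
                       + random.uniform(0, base_delay))
```

The jitter term doubles as the "random delay" between requests, so the crawler never hits the server at a perfectly regular interval.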
3. CAPTCHA
4. AJAX
- Autoscroll on Chrome or Edge, written by Peter Legierski (@PeterLegierski) / Twitter
5. The web page requires signing in.
6. The server blocks requests that lack a Referer or other headers.
7. Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout defaults to 30 seconds[2][3].
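Issues 6 and 7 above can both be handled when building the request. A minimal sketch with Python's standard library (the target URL and header values are placeholders):

```python
from urllib.request import Request, urlopen

# Hypothetical target URL; the headers mimic an ordinary browser
# visit so the server does not reject the request (issue 6).
req = Request(
    "https://example.com/data",
    headers={
        "Referer": "https://example.com/",
        "User-Agent": "Mozilla/5.0 (compatible; my-crawler/1.0)",
    },
)
# The timeout bounds how long we wait for a response (issue 7):
# response = urlopen(req, timeout=30)  # give up after 30 seconds
```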
8. Language and URL-encoded strings
9. How to extract content from websites
10. Data cleaning issues, e.g. the non-breaking space or other whitespace characters
11. Is the link a permanent link (permalink)?
12. Enable/Disable the CSS or JavaScript
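Issues 8 and 10 above are both text-normalization problems. A short sketch using only the standard library (the sample strings are made up for illustration):

```python
from urllib.parse import unquote

# Issue 8: percent-decode a URL-encoded (UTF-8) string.
decoded = unquote("%E5%8F%B0%E5%8C%97%20101")  # -> "台北 101"

# Issue 10: scraped text often contains non-breaking spaces (U+00A0)
# and other stray whitespace; normalize them before storing.
def clean_text(text):
    return " ".join(text.replace("\xa0", " ").split())

cleaned = clean_text("Price:\xa0 100\n NTD")  # -> "Price: 100 NTD"
```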
| Implementation difficulty | Description | Approach | Comments |
|---|---|---|---|
| Easy | Well-formatted HTML elements | The URL itself is the dataset resource. | |
| Advanced | Interactive websites | The URL is the dataset resource, but you must simulate a POST form submission with the form data or a user agent. | Using an HTTP request/response inspection tool or PHP: cURL |
| More difficult | Interactive websites | Requires simulating user behavior in a browser, such as clicking buttons and submitting forms, to finally obtain the file. | Using Selenium or Headless Chrome |
| Difficult | Interactive websites | AJAX | |
Search keyword strategy
How to find an unofficial (third-party) web crawler? Suggested search keyword patterns:
- target website + crawler site:github.com
- target website + scraper site:github.com
- target website + bot site:github.com
- target website + download / downloader site:github.com
- target website + browser client site:github.com
Common Web Scraping Issues and Solutions
Complex Webpage Structure
One frequent challenge in web scraping is dealing with overly complex webpage structures that are difficult to parse. Here's how to address this:
Solution: Find Alternative Page Versions
Look for simpler versions of the same webpage content through:
1. Mobile versions of the site
2. AMP (Accelerated Mobile Pages) versions
Example:
- Standard webpage: `https://www.ettoday.net/news/20250107/2888050.htm`
- AMP version: `https://www.ettoday.net/amp/amp_news.php7?news_id=2888050&ref=mw&from=google.com`
The AMP version typically offers a more streamlined structure that's easier to parse, while containing the same core content.
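For this particular site, the AMP URL can be derived from the standard URL. The mapping below is inferred from the single example above (the news ID is the second number in the path, and the query parameters are copied verbatim), so it may not hold for other sections of the site:

```python
import re

def amp_url(standard_url):
    """Build the AMP URL from a standard ettoday news URL.

    Pattern inferred from one example; may not generalize.
    """
    m = re.search(r"/news/\d+/(\d+)\.htm", standard_url)
    if m is None:
        return None  # URL does not match the expected pattern
    news_id = m.group(1)
    return ("https://www.ettoday.net/amp/amp_news.php7"
            f"?news_id={news_id}&ref=mw&from=google.com")
```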
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
- Skill tree of web scraping
- South Korea's Supreme Court has also issued a ruling on web scraping similar to the US precedent – Gea-Suan Lin's BLOG
- Yuan Ze University (元智大學) - Library - Suspected copyright infringement: tips you must know
- North America Intellectual Property Report, Issue 177: Big Data and the Fair Use of Copyright
References
- ↑ Exponential backoff - Wikipedia: "Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate."
- ↑ PHP: Runtime Configuration - Manual
- ↑ libcurl - Error Codes
Troubleshooting of ...
- PHP, cURL, Python, Selenium, HTTP status code errors
- Database: SQL syntax debug, MySQL errors, MySQLTuner errors or PostgreSQL errors
- HTML/JavaScript: Troubleshooting of JavaScript, XPath
- Software: Mediawiki, Docker, FTP problems, online conference software
- Test connectivity for the web service, Web Ping, Network problem, Web user behavior, Web scrape troubleshooting