Web scrape troubleshooting
Revision as of 11:33, 10 January 2025
Before writing a web scraping (crawler) script
- Does the website offer datasets or files for download, e.g. open data?
- Does the website offer an API (application programming interface)?
List of technical issues
1. The content of the web page changed (revision): the expected content of the specified DOM element became empty.
- Use multiple sources for the same column, e.g. different HTML DOM elements that carry the same column value.
- Back up the HTML text of the parent DOM element.
- (Optional) Back up the complete HTML file.
2. The IP address was banned by the server
- Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension - Scrapy 1.0.3 documentation or Sleep random seconds in programming.
- The server responded with status 403 ('403 Forbidden') --> Change the network IP
3. CAPTCHA
4. AJAX
- Autoscroll on Chrome or Edge, written by Peter Legierski (@PeterLegierski) / Twitter
5. The web page requires signing in
6. The server blocks requests that lack a Referer or other headers.
7. Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout defaults to 60 seconds[1][2].
8. Language and URL-encoded strings
9. How to extract content from websites
10. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
11. Is the link a permanent link?
12. Enabling or disabling CSS or JavaScript
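Items 2, 6, 7 and 10 in the list above can be sketched together in a few lines of Python. This is only a minimal illustration using the standard library, not a complete crawler: the header values, the delay range, the 10-second timeout and the names `polite_fetch`, `BannedError` and `normalize_whitespace` are all assumptions made for the example.

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical header values; replace them with ones suited to the target site.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
    "Referer": "https://example.com/",
}


class BannedError(RuntimeError):
    """Raised on HTTP 403, which often means the IP address is blocked."""


def polite_fetch(url, min_delay=1.0, max_delay=3.0, timeout=10.0,
                 headers=None, opener=urllib.request.urlopen, sleep=time.sleep):
    """Wait a random interval, then fetch url with headers and a timeout."""
    # Item 2: sleep a random number of seconds before each request.
    sleep(random.uniform(min_delay, max_delay))
    # Item 6: send Referer and User-Agent, since some servers block bare requests.
    request = urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)
    try:
        # Item 7: set an explicit timeout instead of relying on a default.
        with opener(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        if error.code == 403:
            raise BannedError(f"403 Forbidden for {url}") from error
        raise


def normalize_whitespace(text):
    """Item 10: collapse non-breaking and other whitespace into single spaces."""
    # str.split() without arguments splits on any Unicode whitespace,
    # including U+00A0 (non-breaking space).
    return " ".join(text.split())
```

A call such as `normalize_whitespace(polite_fetch("https://example.com/"))` would fetch one page and clean its text. The `opener` and `sleep` parameters exist only so the helper can be exercised without network access or real waiting.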
| Implementation difficulty | Description | Approach | Comments |
|---|---|---|---|
| Easy | Well-formatted HTML elements | The URL is the resource of the dataset. | |
| Advanced | Interactive websites | The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or user agent. | Use an HTTP request and response data tool or PHP: cURL |
| More difficult | Interactive websites | Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file. | Use Selenium or Headless Chrome |
| Difficult | Interactive websites | Ajax | |
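The "Advanced" row in the table, simulating a POST form submission with form data and a user agent, might look like the following in Python. This is a sketch under stated assumptions: the endpoint, the field names and the user agent string are hypothetical, and `build_form_post` is a helper name made up for this example. It also touches on item 8, because `urllib.parse.urlencode` percent-encodes non-ASCII values.

```python
import urllib.parse
import urllib.request


def build_form_post(url, form_data, user_agent):
    """Return a POST request carrying URL-encoded form fields."""
    # urlencode percent-encodes keys and values (UTF-8 for non-ASCII text).
    body = urllib.parse.urlencode(form_data).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical endpoint and field names, for illustration only.
request = build_form_post(
    "https://example.com/search",
    {"keyword": "開放資料", "page": 1},
    "Mozilla/5.0 (compatible; example-scraper/0.1)",
)
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to inspect the encoded body and headers and compare them with what the browser's network panel shows for the real form.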
Search keyword strategy
How to find an unofficial (third-party) web crawler? The suggested search keyword strategies are as follows:
- target website + crawler site:github.com
- target website + scraper site:github.com
- target website + bot site:github.com
- target website + download / downloader site:github.com
- target website + browser client site:github.com
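The keyword patterns above can be generated for any target website with a small helper. The sketch below is only an illustration: the function names are made up for this example, and the Google search URL is an assumption about which search engine is used.

```python
import urllib.parse

# Keywords taken from the list above.
KEYWORDS = ["crawler", "scraper", "bot", "download", "downloader", "browser client"]


def build_queries(target, site="github.com"):
    """Return one search query per keyword, restricted to the given site."""
    return [f"{target} {keyword} site:{site}" for keyword in KEYWORDS]


def search_url(query, base="https://www.google.com/search"):
    """Return a search URL with the query percent-encoded in the q parameter."""
    return f"{base}?{urllib.parse.urlencode({'q': query})}"


# "example" stands in for the name of the target website.
for query in build_queries("example"):
    print(search_url(query))
```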
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
- Skill tree of web scraping
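- South Korea's Supreme Court also issued a ruling on web scraping similar to the US precedent - Gea-Suan Lin's BLOG
- Yuan Ze University - Library - Suspected infringement: tips you need to know
- North America Intellectual Property News, Issue 177: Big data and fair use under copyright law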
References