Web scrape troubleshooting
== Before you start writing the web scraping (crawler) script ==
* Does the website offer datasets or files, e.g. open data?
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (application programming interface)?

== List of technical issues ==
1. The content of the web page was changed (revision): the expected web content (of the specified DOM element) became empty.
* Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
* Back up the HTML text of the parent DOM element.
* (optional) Back up the complete HTML file.

2. The IP was banned by the server
* Set a delay (sleep time) between requests, e.g. [http://php.net/manual/en/function.sleep.php PHP: sleep - Manual], [http://doc.scrapy.org/en/1.0/topics/autothrottle.html#topics-autothrottle AutoThrottle extension - Scrapy 1.0.3 documentation] or [[Sleep | Sleep random seconds in programming]].
* The server responded with a status of 403 ('[https://zh.wikipedia.org/wiki/HTTP_403 403 forbidden]'): change the network IP.

3. [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA]

4. AJAX
* [https://chrome.google.com/webstore/detail/autoscroll/kgkaecolmndcecnchojbndeanmiokofl/related Autoscroll] on {{Chrome}} or {{Edge}}, written by [https://twitter.com/PeterLegierski Peter Legierski (@PeterLegierski) / Twitter]

5. The web page requires signing in.

6. The server blocks requests that lack {{kbd | key= Referer}} or other headers.

7. Connection timeout during an HTTP request, e.g. in PHP {{kbd | key=default_socket_timeout}} defaults to 60 seconds<ref>[https://www.php.net/manual/en/filesystem.configuration.php PHP: Runtime Configuration - Manual]</ref><ref>[https://curl.haxx.se/libcurl/c/libcurl-errors.html libcurl - Error Codes]</ref>.

8. Language and [http://php.net/manual/en/function.urlencode.php URL encoding of strings]

9. [[How to extract content from websites]]

10. [[Data cleaning#Data_handling | Data cleaning]] issues, e.g. [https://en.wikipedia.org/wiki/Non-breaking_space non-breaking space] or other [https://en.wikipedia.org/wiki/Whitespace_character whitespace characters]

11. Is the link a permanent link?

12. Enabling or disabling CSS or JavaScript

<div class="table-responsive" style="width:100%; min-height: .01%; overflow-x: auto;">
<table class="wikitable sortable" style="width:100%">
<tr>
<th>Difficulty in implementing</th>
<th>Description</th>
<th>Approach</th>
<th>Comments</th>
</tr>
<tr>
<td>Easy</td>
<td>Well-formatted HTML elements</td>
<td>The URL is the resource of the dataset.</td>
<td></td>
</tr>
<tr>
<td>Advanced</td>
<td>Interactive websites</td>
<td>The URL is the resource of the dataset. Requires simulating a POST form submission with the form data or a [[User agent|user agent]].</td>
<td>Using [[HTTP request and response data tool]] or [http://php.net/curl PHP: cURL]</td>
</tr>
<tr>
<td>More difficult</td>
<td>Interactive websites</td>
<td>Requires simulating user behavior in the browser, such as clicking a button, submitting a form and finally obtaining the file.</td>
<td>Using [https://www.seleniumhq.org/ Selenium] or [https://developers.google.com/web/updates/2017/04/headless-chrome Headless Chrome]</td>
</tr>
<tr>
<td>Difficult</td>
<td>Interactive websites</td>
<td>Ajax</td>
<td></td>
</tr>
</table>
</div>

=== Search keyword strategy ===
'''How to find an unofficial (3rd-party) web crawler?''' The suggested search keywords are as follows:

* ''target website'' + crawler site:github.com
* ''target website'' + scraper site:github.com
* ''target website'' + bot site:github.com
* ''target website'' + download / downloader site:github.com
* ''target website'' + browser client site:github.com

== Common Web Scraping Issues and Solutions ==
=== Complex Webpage Structure ===
One frequent challenge in web scraping is dealing with overly complex webpage structures that are difficult to parse.
Here's how to address this:

'''Solution: Find Alternative Page Versions'''

Look for simpler versions of the same webpage content through:

1. Mobile versions of the site

2. AMP (Accelerated Mobile Pages) versions

'''Example:'''
* Standard webpage: <code>https://www.ettoday.net/news/20250107/2888050.htm</code>
* AMP version: <code>https://www.ettoday.net/amp/amp_news.php7?news_id=2888050&ref=mw&from=google.com</code>

The AMP version typically offers a more streamlined structure that's easier to parse, while containing the same core content.

== Further reading ==
* Stateless: [https://stackoverflow.com/questions/13200152/why-say-that-http-is-a-stateless-protocol Why say that HTTP is a stateless protocol? - Stack Overflow]
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]
* [https://blog.gslin.org/archives/2022/08/22/10850/%e5%8d%97%e9%9f%93%e6%9c%80%e9%ab%98%e6%b3%95%e9%99%a2%e4%b9%9f%e5%b0%8d-web-scraping-%e7%b5%a6%e5%87%ba%e4%ba%86%e9%a1%9e%e4%bc%bc%e7%be%8e%e5%9c%8b%e7%9a%84%e5%88%a4%e4%be%8b/ South Korea's Supreme Court has also issued a web scraping ruling similar to the US precedent - Gea-Suan Lin's BLOG] (in Chinese)
* [https://www.yzu.edu.tw/library/index.php/tw/news-tw/943-20181226-1 Yuan Ze University - Library - Suspected infringement: tips you should know] (in Chinese)
* [http://www.naipo.com/Portals/1/web_tw/Knowledge_Center/Infringement_Case/IPNC_170125_0501.htm North America Intellectual Property News, issue 177: Big data and fair use under copyright law] (in Chinese)

== References ==
<references />

{{Template:Troubleshooting}}

[[Category:Programming]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]
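
'''Illustrative sketch (Python)'''

Several items in the list of technical issues (a random delay between requests, sending a {{kbd | key= Referer}} header, setting an explicit timeout, and cleaning non-breaking spaces) could be combined roughly as follows. This is only a minimal sketch using the Python standard library rather than the PHP functions linked above; the function names, the user agent string and the delay bounds are hypothetical choices for illustration, not taken from any particular project.

```python
import random
import re
import time
import urllib.request


def build_request(url, referer=None,
                  user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Build a request carrying headers that some servers require.

    The user agent string is a made-up placeholder.
    """
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Pick a random sleep time so requests are not sent at a fixed rate."""
    return random.uniform(min_seconds, max_seconds)


def fetch(url, referer=None, timeout=30):
    """Sleep a random interval, then download the page with a timeout."""
    time.sleep(random_delay())
    request = build_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def normalize_whitespace(text):
    """Collapse runs of whitespace, including non-breaking spaces, to one space."""
    return re.sub(r"\s+", " ", text).strip()
```

A call such as <code>normalize_whitespace(fetch(url, referer=site_root))</code> would then return the page text with whitespace normalized. The delay bounds and the timeout should be tuned to what the target site tolerates.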