Web scrape troubleshooting: Difference between revisions

From LemonWiki共筆
(12 intermediate revisions by the same user not shown)
Line 46: Line 46:
</div>
</div>


== Before starting to web scrape ==
== Before you start writing the web scraping (crawler) script ==


* Do they offer datasets?
* Does the website offer datasets or files? e.g. open data
* Do they offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)?
* Does the website offer an [https://en.wikipedia.org/wiki/Application_programming_interface API] (Application programming interface)?


'''How to find an unofficial (3rd-party) web crawler?''' Suggested search keyword strategies:
* ''target website'' + crawler
* ''target website'' + bot
* ''target website'' + download / downloader

== Skill tree of web scraping ==

Data extraction
* How the website is built & its [[Information Architecture | information architecture]]
** Understanding the navigation system ★★☆☆☆
*** Understanding the classification ★★☆☆☆
*** Parsing the sitemap XML file ★★☆☆☆

* Understanding the web technology
** HTTP GET/POST ★★☆☆☆
** CSS selectors and DOM (Document Object Model) elements ★★☆☆☆
** AJAX (Asynchronous JavaScript and XML) ★★★★☆
* Using an API to retrieve data ★★☆☆☆
* Parsing remote files to retrieve data ★★☆☆☆
* Using unofficial methods to retrieve data
** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆
** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
* Form submission
** Submitting the form without logging in ★★☆☆☆
** Submitting the form after logging in to an account ★★★☆☆
* Detection of abnormal data
** [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes HTTP status codes] ★★☆☆☆
** Data is wrong even though the response is HTTP 200 ★★★☆☆
* Etiquette of web scraping
** Limiting the rate of web requests ★★☆☆☆
* Tom and Jerry
** VPN and proxy ★★☆☆☆
** Decoding the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
** Decentralized web scraping ★★★★☆

Data transforming
* Character encoding ★★☆☆☆
* Data cleaning e.g. unprintable characters ★★★☆☆
* [https://en.wikipedia.org/wiki/Regular_expression Regular expression] ★★★☆☆
* Selection of database engine ★★★★☆
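
As a rough illustration of the sitemap item in the skill tree above, the following Python sketch extracts page URLs from a sitemap. It assumes the sitemap follows the sitemaps.org 0.9 schema; the function name <code>parse_sitemap</code> and the sample XML are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol (assumed to be used by the target site).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the page URLs found in the <loc> elements of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; a real crawler would download the sitemap first.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/page-1</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for url in parse_sitemap(SAMPLE):
        print(url)
```

Sitemap index files, which list other sitemaps instead of pages, would need one extra level of parsing that this sketch does not cover.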


== Further reading ==
Line 91: Line 60:
* Stateful: [http://www.webopedia.com/TERM/S/stateful.html What is stateful? Webopedia Definition]
* [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes List of HTTP status codes - Wikipedia]
* [[Skill tree of web scraping]]


== References ==
Line 99: Line 69:


[[Category:Programming]]
[[Category:Data science]]
[[Category:Data Science]]
[[Category:Data collecting]]
[[Category:web scraping]]

Revision as of 11:59, 20 November 2020

List of technical issues

  1. The content of the web page changed (revision): the expected content (of the specified DOM element) became empty.
    • Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
    • Back up the HTML text of the parent DOM element
    • (optional) Complete HTML file backup
  2. The IP address was banned by the server
  3. CAPTCHA
  4. AJAX
  5. The web page requires signing in
  6. Requests without a Referer or other headers are blocked.
  7. Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds[1][2].
  8. Language and URL-encoded strings
  9. Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
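
Items 8 and 9 above can be sketched in a few lines of Python. This is only a minimal sketch using the standard library; the helper names clean_text and encode_query_value are hypothetical, and which characters count as noise is an assumption that may differ per website.

```python
import re
import unicodedata
from urllib.parse import quote, unquote


def clean_text(raw):
    """Collapse non-breaking and other Unicode whitespace into single spaces
    and drop control/format characters such as the zero-width space."""
    kept = "".join(
        ch
        for ch in raw
        # Keep tab/newline/carriage return so they are collapsed as whitespace below.
        if ch in "\t\n\r" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # For str patterns, \s also matches U+00A0 (non-breaking space).
    return re.sub(r"\s+", " ", kept).strip()


def encode_query_value(value):
    """Percent-encode a value (UTF-8) so it is safe inside a URL query string."""
    return quote(value, safe="")


if __name__ == "__main__":
    print(clean_text("Price:\u00a0100\u200b \n"))
    encoded = encode_query_value("a b&c")
    print(encoded, unquote(encoded))
```

Cleaning is applied to the scraped value, while encoding is applied to values sent to the server, so the two helpers are usually called at opposite ends of a request.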
Difficulty in implementing | Approach | Comments
easy | The URL is the resource of the dataset. |
more difficult | The URL is the resource of the dataset, but it requires simulating a POST form submit with the form data or user agent. | Using HTTP request and response data tool or PHP: cURL
more difficult | Requires simulating user behavior in the browser, such as clicking a button, submitting the form and finally obtaining the file. | Using Selenium or Headless Chrome
difficult | Ajax |
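
The first "more difficult" approach above (simulating a POST form submit with form data and a user agent) might look roughly like the following Python sketch. It also sets a Referer header and an explicit timeout, which relate to issues 6 and 7 in the list of technical issues. The function names, the example URL, the form fields and the User-Agent string are all hypothetical placeholders, not values from any real website.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlencode

# Hypothetical User-Agent string; use one that honestly identifies your crawler.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_form_request(url, form_data, referer, user_agent=DEFAULT_USER_AGENT):
    """Build (but do not send) a POST request that mimics a browser form submit."""
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit(request, timeout=30):
    """Send the request; return (status, body), with status None on network errors."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as exc:
        # The server answered with an error status (4xx/5xx).
        return exc.code, exc.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # Connection failure or timeout: no HTTP status is available.
        return None, str(exc).encode("utf-8")


if __name__ == "__main__":
    req = build_form_request(
        "https://example.com/search",
        {"q": "open data", "page": 1},
        referer="https://example.com/",
    )
    print(req.get_method(), req.full_url, req.data)
```

Separating request construction from sending makes it possible to inspect or log the exact headers and body before anything reaches the server.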

Before you start writing the web scraping (crawler) script

  • Does the website offer datasets or files? e.g. open data
  • Does the website offer an API (Application programming interface)?

How to find an unofficial (3rd-party) web crawler? Suggested search keyword strategies:

  • target website + crawler
  • target website + bot
  • target website + download / downloader

Further reading

References

