Web scrape troubleshooting: Difference between revisions

Revision as of 17:44, 24 August 2020

List of technical issues

The content of the web page was changed (revised): the expected content of the specified DOM element became empty.
- Use multiple sources for the same column, e.g. different HTML DOM elements that hold the same column value.
- Back up the HTML text of the parent DOM element.
- (Optional) Back up the complete HTML file.
The IP address was banned by the server.
- Set a delay (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension — Scrapy 1.0.3 documentation, or Sleep random seconds in programming.
- The server responded with status 403 ('403 Forbidden') --> change the network IP address.
CAPTCHA
AJAX
The web page requires signing in.
The server blocks requests that lack a Referer or other headers.
Connection timeout during an HTTP request, e.g. in PHP the default value of default_socket_timeout is 60 seconds^[1]^[2].
Language and URL-encoded strings
Data cleaning issues, e.g. non-breaking spaces or other whitespace characters
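Several of the issues above (request throttling, missing Referer or other headers, connection timeouts, 403 responses, and stray non-breaking spaces) can be illustrated in one small helper. The following is a minimal sketch using only the Python standard library. The function names, the User-Agent string, and the delay and timeout values are assumptions chosen for illustration, not settings recommended by this page.

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical default headers for illustration; adjust for the target site.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
    "Accept-Language": "en",
}


def build_request(url, referer=None):
    """Build a GET request that carries a User-Agent and, optionally, a Referer."""
    headers = dict(DEFAULT_HEADERS)
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def fetch(url, referer=None, timeout=30):
    """Fetch one page with an explicit timeout and return (status, text)."""
    request = build_request(url, referer)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            text = response.read().decode(charset, errors="replace")
            return response.status, text
    except urllib.error.HTTPError as error:
        # A 403 here often means the IP address or the headers were rejected.
        return error.code, ""


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\u00a0", " ").split())


def crawl(urls, referer=None):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(random_delay())
        pages[url] = fetch(url, referer)
    return pages
```

A caller would pass a list of URLs to `crawl` and apply `clean_text` to each extracted field. Timeouts and other network failures surface as exceptions (for example `urllib.error.URLError`), which the caller can catch and retry after a longer pause.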

Implementation difficulty	Approach	Comments
easy	The URL is the direct source of the dataset.
more difficult	The URL is the source of the dataset, but the request must simulate a POST form submission with form data or a User-Agent header.	Use the HTTP request and response data tool or PHP: cURL.
more difficult	The scraper must simulate user behavior in a browser, such as clicking a button, submitting a form, and finally obtaining the file.	Use Selenium or Headless Chrome.
difficult	AJAX
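The second row of the table, simulating a POST form submission with form data and a User-Agent, can be sketched with the Python standard library as follows. The function name `build_form_post`, the example URL, and the form field names are hypothetical and only serve the illustration.

```python
import urllib.parse
import urllib.request


def build_form_post(url, form_data,
                    user_agent="Mozilla/5.0 (compatible; example-scraper/0.1)"):
    """Build a POST request whose body is the URL-encoded form data."""
    # urlencode percent-encodes non-ASCII values as UTF-8 and joins the pairs with "&".
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical form fields; a real form's field names come from its HTML source.
request = build_form_post("https://example.com/search",
                          {"q": "web scraping", "page": "2"})
```

The request object can then be sent with `urllib.request.urlopen(request, timeout=30)`. Pages that need real browser behavior (the third row) are outside what this sketch covers.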

Before starting to web scrape

Do they offer datasets?
Do they offer an API (application programming interface)?

Skill tree of web scraping

References

[1] PHP: Runtime Configuration - Manual

[2] url - Error Codes

@@ Line 53: / Line 53: @@
 == Skill tree of web scraping ==
-Data extraction
+[[Skill tree of web scraping]]
-* How they build the website & [[Information Architecture | information architecture]]
-** Understanding the navigation system ★★☆☆☆
-*** Understanding the classfication system ★★☆☆☆
-*** Parse the sitemap XML file ★★☆☆☆
-* Understnding the web technology
-** HTTP GET/POST ★★☆☆☆ [Related page: [HTTP request and response data tool]]
-** HTTP/CSS/Javascript ★★☆☆☆
-** CSS seletor and DOM (Document Object Model) elements ★★☆☆☆ Related page: [[Xpath tools]]
-** AJAX (Asynchronous JavaScript and XML) ★★★★☆
-* Using API to retrieve data ★★☆☆☆
-* Parse remote files to retrieve data ★★☆☆☆
-* Using unofficial method to retrieve data
-** [https://curl.haxx.se/ curl] command or [https://www.gnu.org/software/wget/manual/wget.html GNU Wget] command ★★☆☆☆ Related page: [[Troubleshooting of curl errors]]
-** [https://www.selenium.dev/ SeleniumHQ Browser Automation] ★★★☆☆
-* Forum submit
-** submit the from without loggin ★★☆☆☆
-** submit the from after logged the account ★★★☆☆
-* Detection of abnormal data
-** [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes HTTP status codes] ★★☆☆☆
-** Data is wrong even the server throw HTTP 200 status code ★★★☆☆
-* Etiquette of web scraping
-** Limit of web request ★★☆☆☆
-* Tom and Jerry
-** VPN and proxy ★★☆☆☆
-** Decode the [https://en.wikipedia.org/wiki/CAPTCHA CAPTCHA] ★★★★☆
-** Decentralized web scraping ★★★★☆
-Data transforming
-* Character encoding ★☆☆☆☆
-* Data cleaning e.g. unprintable characters ★★☆☆☆
-* [https://en.wikipedia.org/wiki/Regular_expression Regular expression]  ★★★☆☆
-* Selection of database engine ★★★★☆
 == Further reading ==
