Web scrape troubleshooting
Revision as of 12:15, 23 August 2020
List of technical issues
- The content of the web page was changed (site revision): the expected web content (of the specified DOM element) became empty.
- Multiple sources for the same column, e.g. different HTML DOM elements that carry the same column value.
- Back up the HTML text of the parent DOM element
- (optional) Back up the complete HTML file
- The IP was banned by the server
- Set a temporization (sleep time) between requests, e.g. PHP: sleep - Manual, AutoThrottle extension — Scrapy 1.0.3 documentation, or sleep a random number of seconds in your program.
- The server responded with a status of 403 ('403 Forbidden') --> change the network IP
- CAPTCHA
- AJAX
- The web page requires signing in
- The request is blocked when it lacks a Referer or other headers.
- Connection timeout during an HTTP request, e.g. in PHP default_socket_timeout is 30 seconds[1][2].
- Language and URL-encoded strings
- Data cleaning issues, e.g. non-breaking space or other whitespace characters
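Several of the issues above (IP bans, 403 responses) come down to requesting too fast. A minimal sketch of the random-sleep temporization mentioned in the list, in Python (the interval bounds are arbitrary defaults, not values from this article):

```python
import random
import time

def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep a random interval between requests so the traffic
    pattern looks less robotic and stays below rate limits."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call `polite_delay()` once between each request; frameworks such as Scrapy's AutoThrottle implement a smarter, feedback-driven version of the same idea.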
Difficulty of implementation | Approach | Comments
---|---|---
easy | URL is the resource of the dataset |
more difficult | URL is the resource of the dataset; requires simulating a POST form submit with the form data or a user agent | Using an HTTP request and response inspection tool, or PHP: cURL
more difficult | Requires simulating user behavior in the browser, such as clicking a button, submitting the form, and finally obtaining the file | Using Selenium or Headless Chrome
difficult | AJAX |
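The "simulate a POST form submit" row of the table can be sketched with Python's standard library alone; the URL and form fields below are placeholders, not from any real site, and the request is built but deliberately not sent so the example stays offline:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical target URL and form fields, for illustration only.
form_data = urlencode({"query": "books", "page": "1"}).encode("utf-8")

req = Request(
    "https://example.com/search",  # placeholder URL
    data=form_data,                # supplying data makes this a POST
    headers={
        # Some servers block requests missing these headers (see the
        # Referer issue in the list above).
        "User-Agent": "Mozilla/5.0 (compatible; scraper-demo)",
        "Referer": "https://example.com/",
    },
)
# urllib.request.urlopen(req) would actually send the request.
```

The same request can be made with `curl -d`, PHP: cURL, or the `requests` library; the essential parts are the encoded form body and the headers.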
Before starting to web scrape
- Do they offer datasets?
- Do they offer an API (application programming interface)?
Skill tree of web scraping
Data extraction
- How they build the website & information architecture
  - Understanding the navigation system ★★☆☆☆
    - Understanding the classification ★★☆☆☆
    - Parse the sitemap XML file ★★☆☆☆
- Understanding the web technology
  - HTTP GET/POST ★★☆☆☆
  - CSS selector and DOM (Document Object Model) elements ★★☆☆☆
  - AJAX (Asynchronous JavaScript and XML) ★★★☆☆
- Using an API to retrieve data ★★☆☆☆
- Parsing remote files to retrieve data ★★☆☆☆
- Using unofficial methods to retrieve data
  - curl command or GNU Wget command ★★☆☆☆
  - SeleniumHQ Browser Automation ★★★☆☆
- Form submit
  - Submit the form without logging in ★★☆☆☆
  - Submit the form after logging into the account ★★★☆☆
- Etiquette of web scraping
  - Limit the rate of web requests ★★☆☆☆
- Tom and Jerry
  - VPN and proxy ★★☆☆☆
  - Decode the CAPTCHA ★★★★☆
  - Decentralized web scraping ★★★★☆
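The "parse the sitemap XML file" skill above can be sketched with Python's standard library; the inline sitemap and example.com URLs are stand-ins for a real file fetched from the site's /sitemap.xml:

```python
import xml.etree.ElementTree as ET

# A minimal inline sitemap for illustration; a real scraper would
# fetch this document from the target site's /sitemap.xml.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

# Sitemap elements live in this XML namespace, so queries must use it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```

The resulting URL list is a ready-made crawl queue, which is why sitemap parsing sits so early in the skill tree.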
Data transforming
- Data cleaning e.g. unprintable characters ★★☆☆☆
- Selection of database engine ★★★☆☆
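The data-cleaning item above (and the non-breaking-space issue in the technical-issues list) can be sketched as a small normalization helper; the function name and the choice to collapse whitespace runs are this sketch's assumptions, not prescriptions from the article:

```python
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize scraped text: convert non-breaking spaces and other
    Unicode whitespace to plain spaces, then collapse runs of spaces."""
    normalized = "".join(
        " " if ch.isspace() or unicodedata.category(ch) == "Zs" else ch
        for ch in raw
    )
    return " ".join(normalized.split())
```

For example, `clean_text("price:\u00a01,234\u2002USD")` replaces the non-breaking space (U+00A0) and en space (U+2002) with ordinary spaces, which keeps string comparisons and database lookups from silently failing.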
Further reading
- Stateless: Why say that HTTP is a stateless protocol? - Stack Overflow
- Stateful: What is stateful? Webopedia Definition
- List of HTTP status codes - Wikipedia
References