COMP5348_EnterpriseArchetecture_2014Semester1_L10_ScalableAvailable(3)
› Large role for server apps
- What to do with duplicate requests?
- Try for idempotency (repeated txns OK)
- Or track and reject duplicates
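The "track duplicates" option can be sketched as follows. This is a minimal illustration, not a production design: the `PaymentService` class, the `request_id` parameter and the in-memory dictionary are all assumptions made for the example. A real server would keep the seen-request table in durable storage, and here a duplicate is answered by replaying the stored result rather than by returning an error.

```python
class PaymentService:
    """Toy server (hypothetical) that tracks request IDs to handle duplicates."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> result of the first execution

    def deposit(self, request_id, amount):
        # A duplicate (for example a client retry) gets the original result
        # back instead of having the deposit applied a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        result = self.balance
        self._seen[request_id] = result
        return result


service = PaymentService()
first = service.deposit("req-1", 100)
retry = service.deposit("req-1", 100)   # same ID: treated as a duplicate
second = service.deposit("req-2", 50)   # new ID: applied normally
print(first, retry, second, service.balance)
```

A deposit is not naturally idempotent, which is why the ID table is needed. An operation such as "set balance to X" could be repeated safely without it.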
What fails?
› Anything can fail!
- Dealing with a failure may cause more faults!
› Hardware
› Software
- OS, DBMS, individual process, language RTE, network stack, etc.
› Operations
› Maintenance
› Process
› Environment
Analyzing failure
› Root cause analysis
› Find the fault that led to failure
› Then find why the fault was made
- And why it wasn’t detected/fixed earlier
› See /wiki/5_Whys
My car will not start. (the problem)
1. Why? - The battery is dead. (first why)
2. Why? - The alternator is not functioning. (second why)
3. Why? - The alternator belt has broken. (third why)
4. Why? - The alternator belt was well beyond its useful service life and has never been replaced. (fourth why)
5. Why? - I have not been maintaining my car according to the recommended service schedule. (fifth why, root cause)
Terminology: Fault and Failure
› Any system or component has specified (desired) behavior, and observed behavior
› Failure: the observed behavior is different from the specified behavior
› Fault (defect): something gone/done wrongly
› Once the fault occurs, there is a “latent failure”; some time later the failure becomes effective (observed)
› Once the failure is effective, it may be reported and eventually repaired; from failure to the end of repair is a “service interruption”
› Recoverability (e.g. a database)
- the capability to reestablish performance levels and recover affected data after an application or system failure
- Time to detect failure
- Time to correct failure
- Time to restart application
- Also time for scheduled maintenance (unless done while on-line)
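The factors above can be read as terms of a sum: the length of one outage is roughly detection time plus correction time plus restart time. The sketch below feeds that sum into the standard steady-state estimate, availability = MTBF / (MTBF + MTTR). The function names and every number are hypothetical and chosen only for illustration. None of them come from the lecture.

```python
def outage_minutes(detect, correct, restart):
    """Length of one service interruption: notice it, fix it, restart."""
    return detect + correct + restart


def steady_state_availability(mtbf_hours, mttr_hours):
    """Standard estimate of the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: 5 min to detect, 20 min to correct, 5 min to restart,
# and one failure per 1000 hours of operation on average.
mttr_hours = outage_minutes(detect=5, correct=20, restart=5) / 60
estimate = steady_state_availability(mtbf_hours=1000, mttr_hours=mttr_hours)
print(f"repair time: {mttr_hours} h, estimated availability: {estimate:.4%}")
```

Shrinking any one of the three terms (faster detection, automated restart) raises the estimate without making failures any rarer.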
- Persistent queues of accepted requests
- Still a failure window though
› Large role for client apps/users
- Did the request get lost on failure?
- Retry on error?
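A client that retries on error might look like the sketch below. It assumes a hypothetical `TransientError` that means "no reply arrived, so the request may or may not have been applied", and a `send` callable standing in for the real network call. The same request ID is reused on every attempt, so a server that tracks IDs can recognise the retries as duplicates.

```python
import time


class TransientError(Exception):
    """Hypothetical error: no reply received, outcome of the request unknown."""


def call_with_retry(send, request_id, payload, attempts=3, delay=0.0):
    """Call send() up to `attempts` times, backing off between failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send(request_id, payload)
        except TransientError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_error


# Stand-in for a flaky server: fails twice, then succeeds.
calls = []

def flaky_send(request_id, payload):
    calls.append(request_id)
    if len(calls) < 3:
        raise TransientError("timeout")
    return "ok:" + payload


result = call_with_retry(flaky_send, "req-42", "transfer")
print(result, calls)
```

Retrying is only safe if the operation is idempotent or the server filters duplicates. Otherwise a request that succeeded but lost its reply would be applied twice.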
- Re-route requests on failure
- Continuous service (almost)
- Recover failed system while alternative handles workload
- May be some hand-over time (db recovery?)
- Active standby & log shipping reduce this
- At the expense of 2x system cost…
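Re-routing on failure can be sketched as a router that tries servers in priority order and falls through to the standby when the primary is down. The `Server` class, `ServerDown` exception and `route` function are hypothetical names for this example. A real deployment would also need failure detection, state hand-over and recovery of the failed node, all of which are left out here.

```python
class ServerDown(Exception):
    """Hypothetical error raised when a server cannot take the request."""


class Server:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            raise ServerDown(self.name)
        return f"{self.name} handled {request}"


def route(servers, request):
    """Send the request to the first server in the list that accepts it."""
    for server in servers:
        try:
            return server.handle(request)
        except ServerDown:
            continue  # fall through to the next (standby) server
    raise ServerDown("all servers failed")


primary = Server("primary")
standby = Server("standby")
before = route([primary, standby], "r1")
primary.up = False                      # simulate a failure of the primary
after = route([primary, standby], "r2")
print(before)
print(after)
```

The standby sits idle until the primary fails, which is the 2x cost the slide mentions.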
Outline
› Availability
› Types of Faults and Failures
› Estimating reliability and availability
› Techniques for dealing with failure
› Scalability
› Scale-up versus Scale-out
Database installed on cluster for high availability
Availability
› Often a question of application design and state management
- Stateful vs stateless
- What happens if a server fails?
- Can requests go to any server?
- Reduce dependency between components
- Failure tolerant designs
COMP5348
Enterprise Scale Software Architecture Semester 1, 2014 Lecture 10. Availability and Scalability
Based on material by Alan Fekete, Paul Greenfield, Uwe Roehm and from textbook by Gorton
How does it fail?
› Failstop
- System ceases to do any work
› Hard fault
- Leads to: System continues but keeps doing wrong things
› Intermittent (“soft”) fault
- Leads to: Temporary wrong thing then correct behavior resumes
› Timing failure
- Correct values, but system responds too late
› Redundancy is the key to availability
Available System
Web Clients
Web Server farm Load balanced using WLB or IP sprayer
App Servers farm using load balancing
Repeatability?
› If system is run the same way again, will the same failure behavior be observed?
- If several systems are run on the same data, will they all fail the same way?
› Heisenbugs: system does not fail the same way in different executions
- May be due to race conditions
- Very common in practice
- E.g. “just reboot/powercycle the system”
› “Bohrbugs”: reproducible
› What happens to in-flight work?
- State recovers by aborting in-flight ops & doing db recovery but …
Transaction Recovery
› Could be handled by middleware
- Synchronous method calls or asynchronous messaging?
- And manageability decisions to consider
Redundancy provides Availability
› Passive or active standby systems
› Related to an application’s reliability
- Unreliable applications suffer poor availability
Availability
› Period of loss of availability determined by:
Copyright warning
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further copying or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.
Availability
› Strategies for high availability:
- Eliminate single points of failure
- Replication and failover
- Automatic detection and restart
- No single points of failure
- Spare everything
- Disks, disk channels, processors, power supplies, fans, memory, ..
- Applications, databases, …
- Hot standby, quick changeover on failure but costs more
- Warm standby, slower changeover but easier
Availability
› Key requirement for most IT applications
› Measured by the proportion of the required time it is usable. E.g.
- 100% available during business hours
- No more than 2 hours scheduled downtime per week
- 24x7x52 (close to 100% availability)
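The "proportion of the required time" measure is simple arithmetic. As a rough sketch, assume the system is required around the clock (168 hours per week) and that the 2 hours of scheduled downtime count against that requirement. The slide does not say whether they do, and the `availability` function is a name chosen for this example.

```python
def availability(required_hours, downtime_hours):
    """Fraction of the required time during which the system is usable."""
    return (required_hours - downtime_hours) / required_hours


hours_per_week = 24 * 7  # 168 hours, assuming a 24x7 requirement
weekly = availability(hours_per_week, 2)
print(f"{weekly:.4f}")  # 166 / 168, about 0.9881
```

Under the business-hours target, the denominator would be only the hours in the business week, so downtime outside those hours would not reduce the figure at all.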