Partition tolerance (P) The system continues to function even if parts of it can't communicate between each other. MongoDB, ElasticSearch, Redis and others, all fall into this category. If you get a network partition, the smaller side of the partition stops working as it won't be able to elect a master.
The system goes into a read-only (or unavailable) mode to prevent inconsistencies until the partition heals. The way most caches work (like Memcached or Redis) or distributed message queues (like Amazon SQS). Consistency is not guaranteed but it will usually provide an answer (maybe old or incomplete) even during partition events. AP systems are the most common solution for actual distributed systems as downtime is not acceptable and failures will happen (and cause partitions) all the time. Consistency is the most common victim of distributed systems. How many 9's of availability can your system provide? How complete is the answer that you are providing to your user? Redefine what "correct behavior" means for your system.
Define your systems in terms of how you can degrade harvest or yield to fit it's usage and your customer's needs. Check http://tinyurl.com/yjxzqlk for a full Farmville writeup. Strategy 1: Trading Harvest for Yield — Probabilistic Availability. There is always a > 0% chance that something is broken since the network is based on best effort delivery. Can you distribute your replicas across availability zones so that one zone failing does not reduce harvest drastically? Strategy 2: Application Decomposition and Orthogonal Mechanisms. This is a 1999 paper and they already knew this and you're still asking yourself if you should use them or not. Push complexity to where it actually needs to live. One piece of the app can live as an AP system with an expiration based cache? Running on AWS with Elastic Load Balancer, Auto-Scaling groups, health checks and auto-termination? Even systems that were not originally made to be reliable can be made scalable by placing a "reliable" interface on top of them that provides these characteristics, like Netflix's Hystrix. Since it's publication in 1999, many of the services touted here have been implemented in what we now call the cloud, using these concepts of orthogonality and building blocks to get applications to be distributed. Unfortunately, some of the calls for research here are still unanswered for, specially the programming model that provides first-class abstractions for manupulating degraded results. So far, most of the solutions have been ad-hoc and application specific. Harvest, Yield, and Scalable Tolerant Systems: A Summary. This article presents a summary of the paper “Harvest, Yield, and Scalable Tolerant Systems” published by Eric Brewer & Amando Fox in 1999. The paper deals with the trade-offs between consistency and availability (CAP) for large systems. It’s very easy to point to CAP and assert that no system can have consistency and availability. As Coda Hale explains in his excellent blog post “You Can’t Sacrifice Partition Tolerance”: Of the CAP theorem’s Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. Instead of CAP, you should think about your availability in terms of yield (percent of requests answered successfully) and harvest (percent of required data actually included in the responses) and which of these two your system will sacrifice when failures happen. The paper focuses on increasing the availability of large scale systems by fault toleration, containment and isolation: We assume that clients make queries to servers, in which case there are at least two metrics for correct behavior: yield, which is the probability of completing a request, and harvest, which measures the fraction of the data reflected in the response, i.e. The two metrics, harvest and yield can be summarized as follows: Harvest : data in response/total data For example: If one of the nodes is down in a 100 node cluster, the harvest is 99% for the duration of the fault. Yield : requests completed with success/total number of requests Note: Yield is different from uptime.
Yield deals with the number of requests, not only the time the system wasn’t able to respond to requests. The paper argues that there are certain systems which require perfect responses to queries every single time. Also, there are systems that can tolerate imperfect answers once in a while. To increase the overall availability of our systems, we need to carefully think through the required consistency and availability guarantees it needs to provide. Trading Harvest for Yield — Probabilistic Availability. Nearly all systems are probabilistic whether they realize it or not. In particular, any system that is 100% available under single faults is probabilistically available overall (since there is a non-zero probability of multiple failures) The paper talks about understanding the probabilistic nature of availability.
This helps in understanding and limiting the impact of faults by making decisions about what needs to be available and what kind of faults the system can deal with. They outline the linear degradation of harvest in case of multiple node faults. The harvest is directly proportional to the number of nodes that are functioning correctly.