[Translation] Eventually Consistent

A few days ago I came across this article by Amazon CTO Werner Vogels on eventual consistency in distributed systems, and I translated it here to study it.

Eventually Consistent – Revisited

By Werner Vogels

http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

I wrote a first version of this posting on consistency models about a year ago, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article. This is that new version.

Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.

At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.

One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog in December 2007 and was greatly improved with the help of its readers.

Historical Perspective

In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late '70s. The best "period piece" on this topic is "Notes on Distributed Databases" by Bruce Lindsay et al. It lays out the fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency—that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.

In the mid-'90s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address to the PODC (Principles of Distributed Computing) conference in 2000. He presented the CAP theorem, which states that of three properties of shared-data systems—data consistency, system availability, and tolerance to network partition—only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch.

A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.

In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints. 

Consistency—Client and Server

There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second way is from the server side: how updates flow through the system and what guarantees systems can give with respect to updates.

Client-side Consistency

The client side has these components:

 • A storage system. For the moment we'll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.

 • Process A. This is a process that writes to and reads from the storage system.

 •  Processes B and C. These two processes are independent of process A and write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

Client-side consistency has to do with how and when observers (in this case the processes A, B, or C) see updates made to a data object in the storage systems. In the following examples illustrating the different types of consistency, process A has made an update to a data object:

 • Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.

 • Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.

 • Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.

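The three consistency types above can be illustrated with a toy sketch (my own illustration, not any real system's code): a write is acknowledged by a single replica, reads against other replicas may return stale values during the inconsistency window, and a lazy propagation pass eventually brings every replica to the last updated value.

```python
# Toy eventually consistent store: writes land on one replica and
# propagate lazily, so other replicas can serve stale reads until
# propagation closes the inconsistency window.
class EventuallyConsistentStore:
    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # updates not yet applied to all replicas

    def write(self, key, value, replica=0):
        # Acknowledged as soon as a single replica has it.
        self.replicas[replica][key] = value
        self.pending.append((key, value))

    def read(self, key, replica):
        # May return a stale value (or nothing) inside the window.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Lazy anti-entropy pass: apply every pending update to all
        # replicas, after which all accesses return the last value.
        for key, value in self.pending:
            for r in self.replicas:
                r[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("x", 1)          # acknowledged by replica 0 only
stale = store.read("x", 2)   # None: update has not reached replica 2
store.propagate()
fresh = store.read("x", 2)   # 1: all replicas now agree
```

DNS behaves analogously: the propagation step corresponds to configured distribution plus cache expiry, after which all clients see the update.
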
The eventual consistency model has a number of variations that are important to consider:

 • Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.

 • Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.

 • Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.

 • Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.       

 • Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.

A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventual consistency system, but not always required. These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability.

As you can see from these variations, quite a few different scenarios are possible. It depends on the particular applications whether or not one can deal with the consequences.

Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction. In asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also to support better scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees in which the inconsistency windows depend on the periodicity of the log shipping.

Server-side Consistency

On the server side we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Let's establish a few definitions before getting started:

N = the number of nodes that store replicas of the data

W = the number of replicas that need to acknowledge the receipt of the update before the update completes

R = the number of replicas that are contacted when a data object is accessed through a read operation

If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N=2, W=2, and R=1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N=2, W=1, and R=1. In this case R+W=N, and consistency cannot be guaranteed.

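The quorum arithmetic above can be sketched in a few lines (a hedged illustration; the predicate and the two example configurations come directly from the text):

```python
# With N replicas, W write acknowledgements required, and R replicas
# contacted per read, the read set and write set are guaranteed to
# overlap (strong consistency) exactly when W + R > N.
def is_strongly_consistent(n, w, r):
    return w + r > n

# Synchronous primary-backup replication: any read sees the write.
assert is_strongly_consistent(n=2, w=2, r=1)

# Asynchronous replication with reads from the backup enabled:
# W + R = N, so a read may miss the latest write.
assert not is_strongly_consistent(n=2, w=1, r=1)
```
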
The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only two nodes available, the system will have to fail the write.

In distributed-storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W=N for updates, which may decrease the probability of the write succeeding. A common configuration for these systems that are concerned about fault tolerance but not consistency is to run with W=1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.

How to configure N, W, and R depends on what the common case is and which performance path needs to be optimized. In R=1 and N=W we optimize for the read case, and in W=1 and R=N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2, there is the possibility of conflicting writes when the write sets do not overlap.

Weak/eventual consistency arises when W+R <= N, meaning that there is a possibility that the read and write set will not overlap. If this is a deliberate configuration and not based on a failure case, then it hardly makes sense to set R to anything but 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica's set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N, then the system is vulnerable to reading from nodes that have not yet received the updates.

Whether or not read-your-writes, session, and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.

Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.

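A minimal sketch of this client-side technique, assuming the store tags each value with a monotonically increasing version number (the class and method names here are hypothetical):

```python
# Client-side read-your-writes and monotonic reads: remember the
# highest version written or read so far, and discard any read whose
# version precedes it, even over an eventually consistent store.
class VersionedClient:
    def __init__(self):
        self.last_seen = 0  # highest version observed so far

    def on_write(self, version):
        self.last_seen = max(self.last_seen, version)

    def on_read(self, version, value):
        if version < self.last_seen:
            # Stale replica: discard; caller can retry elsewhere.
            return None
        self.last_seen = version
        return value

client = VersionedClient()
client.on_write(5)                # client wrote version 5
stale = client.on_read(3, "old")  # lagging replica: read discarded
fresh = client.on_read(5, "new")  # up-to-date replica: accepted
```
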
Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen frequently, but they do occur between data centers, as well as inside data centers.

In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. For example, within Amazon the shopping cart uses such a write-always system; in the case of partition, a customer can continue to put items in the cart even if the original cart lives on the other partitions. The cart application assists the storage system with merging the carts once the partition has healed.

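A hedged sketch of such a merge, assuming each partition's cart is a simple set of items (the real cart logic is more involved; quantities and deletions, which would need tombstones, are ignored here):

```python
# Write-always cart under partition: each side keeps accepting items
# into its own replica; when the partition heals, the application
# merges the divergent carts by taking the union of their items.
def merge_carts(cart_a, cart_b):
    # Union preserves every item added on either side of the partition.
    return cart_a | cart_b

side_a = {"book", "camera"}      # items added in one partition
side_b = {"book", "headphones"}  # items added in the other
merged = merge_carts(side_a, side_b)
# merged == {"book", "camera", "headphones"}
```
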
Amazon's Dynamo

A system that has brought all of these properties under explicit control of the application architecture is Amazon's Dynamo, a key-value storage system that is used internally in many services that make up the Amazon e-commerce platform, as well as Amazon's Web Services. One of the design goals of Dynamo is to allow the application service owner who creates an instance of the Dynamo storage system—which commonly spans multiple data centers—to make the trade-offs between consistency, durability, availability, and performance at a certain cost point.

Summary

Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.

Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer needs to be aware that consistency guarantees are provided by the storage systems and need to be taken into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected.

The goal of this article is to raise awareness about the complexity of engineering systems that need to operate at a global scale and that require careful tuning to ensure that they can deliver the durability, availability, and performance that their applications require. One of the tools the system designer has is the length of the consistency window, during which the clients of the systems are possibly exposed to the realities of large-scale systems engineering.

Tags: distributed systems, Amazon, eventual consistency
