[Translation] Cassandra Internal Architecture

I've recently wanted to dig into the Cassandra source code, so I went to their wiki and started by translating this overview as a kickoff, hoho.

Original article: http://wiki.apache.org/cassandra/ArchitectureInternals

General

Configuration file is parsed by DatabaseDescriptor (which also has all the default values, if any)
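
To make the "defaults live next to the parser" idea concrete, here is a minimal, hypothetical sketch in the spirit of DatabaseDescriptor; the property names and defaults are invented for illustration and this is not the real class:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Hypothetical stand-in for DatabaseDescriptor: parses a config file once
    // and exposes typed accessors that fall back to built-in defaults.
    public final class DescriptorSketch {
        private final Properties props = new Properties();

        public DescriptorSketch(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                props.load(in);
            }
        }

        // the default value lives right next to the accessor
        public int rpcPort() {
            return Integer.parseInt(props.getProperty("rpc_port", "9160"));
        }

        public String clusterName() {
            return props.getProperty("cluster_name", "Test Cluster");
        }
    }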

Thrift (an Apache cross-language RPC framework) generates an API interface in Cassandra.java; the implementation is CassandraServer, and CassandraDaemon ties it together.

CassandraServer turns thrift requests into the internal equivalents, then StorageProxy does the actual work, then CassandraServer turns it back into thrift again
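
A rough sketch of that round-trip, with invented types standing in for the Thrift-generated structs and the internal equivalents (this is not the actual CassandraServer code):

    import java.util.List;

    // Illustrative only: ThriftColumn stands in for a Thrift-generated struct,
    // InternalColumn for the server-side representation.
    record ThriftColumn(String name, byte[] value) {}
    record InternalColumn(String name, byte[] value) {}

    final class ServerSketch {
        // 1. translate the Thrift request into internal types,
        // 2. let a StorageProxy-like component do the real work,
        // 3. translate the result back into Thrift types for the client.
        List<ThriftColumn> getSlice(String key) {
            List<InternalColumn> internal = proxyRead(key);  // "StorageProxy does the actual work"
            return internal.stream()
                           .map(c -> new ThriftColumn(c.name(), c.value()))
                           .toList();
        }

        private List<InternalColumn> proxyRead(String key) {
            // placeholder for the StorageProxy call
            return List.of(new InternalColumn("example", new byte[0]));
        }
    }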

StorageService is kind of the internal counterpart to CassandraDaemon. It handles turning raw gossip into the right internal state.

AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas of each key range. Primary replica is always determined by the token ring (in TokenMetadata) but you can do a lot of variation with the others. RackUnaware just puts replicas on the next N-1 nodes in the ring. RackAware puts the first non-primary replica in the next node in the ring in ANOTHER data center than the primary; then the remaining replicas in the same as the primary.
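
The "next N-1 nodes on the ring" rule of RackUnaware is easy to picture in code. The following is a deliberately simplified, hypothetical version; the real strategy works against tokens in TokenMetadata rather than a plain node list:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified ring: nodes sorted by token; the primary replica owns the key's
    // token, further replicas are simply the next nodes walking clockwise.
    final class RackUnawareSketch {
        static List<String> getReplicas(List<String> ringOrderedNodes, int primaryIndex, int replicationFactor) {
            List<String> replicas = new ArrayList<>();
            for (int i = 0; i < replicationFactor && i < ringOrderedNodes.size(); i++) {
                // walk clockwise, wrapping around the end of the ring
                replicas.add(ringOrderedNodes.get((primaryIndex + i) % ringOrderedNodes.size()));
            }
            return replicas;
        }

        public static void main(String[] args) {
            List<String> ring = List.of("A", "B", "C", "D", "E");
            // primary is C (index 2); with RF=3 the replicas are C, D, E
            System.out.println(getReplicas(ring, 2, 3));
        }
    }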

MessagingService handles connection pooling and running internal commands on the appropriate stage (basically, a threaded executorservice). Stages are set up in StageManager; currently there are read, write, and stream stages. (Streaming is for when one node copies large sections of its sstables to another, for bootstrap or relocation on the ring.) The internal commands are defined in StorageService; look for registerVerbHandlers.
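
In spirit, a stage is just a named thread pool and a verb handler is a callback keyed by message type. A minimal sketch under those assumptions (the names here are invented, not the real StageManager/MessagingService API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    // Each stage is an ExecutorService; incoming messages are dispatched to the
    // handler registered for their verb, on the stage that verb belongs to.
    final class MessagingSketch {
        private final Map<String, ExecutorService> stages = new ConcurrentHashMap<>();
        private final Map<String, Consumer<byte[]>> verbHandlers = new ConcurrentHashMap<>();
        private final Map<String, String> verbToStage = new ConcurrentHashMap<>();

        void registerStage(String stage, int threads) {
            stages.put(stage, Executors.newFixedThreadPool(threads));
        }

        // analogous in spirit to registerVerbHandlers in StorageService
        void registerVerbHandler(String verb, String stage, Consumer<byte[]> handler) {
            verbHandlers.put(verb, handler);
            verbToStage.put(verb, stage);
        }

        void deliver(String verb, byte[] payload) {
            // assumes the verb and its stage were registered beforehand
            stages.get(verbToStage.get(verb))
                  .submit(() -> verbHandlers.get(verb).accept(payload));
        }
    }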

Write path

StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends RowMutation messages to them.

  • If nodes are changing position on the ring, "pending ranges" are associated with their destinations in TokenMetadata and these are also written to.
  • If nodes that should accept the write are down, but the remaining nodes can fulfill the requested ConsistencyLevel, the writes for the down nodes will be sent to another node instead, with a header (a "hint") saying that data associated with that key should be sent to the replica node when it comes back up. This is called HintedHandoff and reduces the "eventual" in "eventual consistency." Note that HintedHandoff is only an optimization; ArchitectureAntiEntropy is responsible for restoring consistency more completely. (A simplified coordinator-side sketch follows this list.)
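
A condensed, coordinator-side sketch of that decision, with invented helpers for failure detection and the hint header (not the real StorageProxy internals):

    import java.util.List;

    // Send the mutation to every replica that is up; for a replica that is down,
    // send the same mutation to a stand-in node with a "hint" naming the intended
    // target, so it can be replayed later when that node comes back.
    final class WritePathSketch {
        record Mutation(String key, byte[] data, String hintedTarget) {}

        void write(String key, byte[] data, List<String> replicas) {
            for (String replica : replicas) {
                if (isAlive(replica)) {
                    send(replica, new Mutation(key, data, null));
                } else {
                    // HintedHandoff: another live node stores the write plus the hint
                    send(pickLiveNode(), new Mutation(key, data, replica));
                }
            }
        }

        // placeholders for failure detection, node selection and network I/O
        private boolean isAlive(String node) { return true; }
        private String pickLiveNode() { return "some-live-node"; }
        private void send(String node, Mutation m) { /* network send elided */ }
    }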

On the destination node, RowMutationVerbHandler hands the write first to CommitLog.java, then to the Memtable for the appropriate ColumnFamily (through Table.apply).
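
The ordering is the important part: the write must be durable in the commit log before it reaches the in-memory structures. A minimal sketch of that sequence with hypothetical classes (not the real CommitLog/Table code):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Destination-node write sketch: append to the log first, then apply to the
    // in-memory, key-sorted memtable for the column family.
    final class LocalWriteSketch {
        private final FileOutputStream commitLog;
        private final ConcurrentSkipListMap<String, byte[]> memtable = new ConcurrentSkipListMap<>();

        LocalWriteSketch(String logPath) throws IOException {
            this.commitLog = new FileOutputStream(logPath, true); // append mode
        }

        void apply(String key, byte[] value) throws IOException {
            commitLog.write(value);   // 1. durability: commit log first
            commitLog.flush();
            memtable.put(key, value); // 2. then the memtable (sorted by key)
        }
    }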

When a Memtable is full, it gets sorted and written out as an SSTable asynchronously by ColumnFamilyStore.switchMemtable.

  • When enough SSTables exist, they are merged by ColumnFamilyStore.doFileCompaction.
  • Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old SSTables to finish before deleting them (since we can't know if they have actually started opening the file yet; if they have not and we delete the file first, they will error out). The approach we have settled on is to not actually delete old SSTables synchronously; instead we register a phantom reference with the garbage collector, so when no references to the SSTable exist it will be deleted. (We also write a compaction marker to the file system so if the server is restarted before that happens, we clean out the old SSTables at startup time.) A sketch of the phantom-reference idea follows this list.
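
The phantom-reference trick uses the standard java.lang.ref machinery. Here is a self-contained sketch of the idea, with the SSTable object and file handling reduced to stubs:

    import java.io.File;
    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Register a phantom reference per obsolete SSTable; once the GC proves that
    // no reader can still reach the in-memory SSTable object, the reference is
    // enqueued and the underlying file can be deleted safely.
    final class SSTableCleanupSketch {
        static final class SSTableHandle {}  // stand-in for the in-memory SSTable object

        private final ReferenceQueue<SSTableHandle> queue = new ReferenceQueue<>();
        private final Map<Reference<SSTableHandle>, File> pathsToDelete = new ConcurrentHashMap<>();

        void markCompacted(SSTableHandle table, File dataFile) {
            pathsToDelete.put(new PhantomReference<>(table, queue), dataFile);
        }

        // called periodically, or from a dedicated cleanup thread
        void deleteUnreferenced() {
            Reference<? extends SSTableHandle> ref;
            while ((ref = queue.poll()) != null) {
                File f = pathsToDelete.remove(ref);
                if (f != null) {
                    f.delete();  // no live readers remain, safe to remove the file
                }
            }
        }
    }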

See ArchitectureSSTable and ArchitectureCommitLog for more details

Read path

StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends read messages to them

  • This may be a SliceFromReadCommand, a SliceByNamesReadCommand, or a RangeSliceReadCommand, depending on the type of request.

On the data node, ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice and sends it back as a ReadResponse

  • For single-row requests, we use a QueryFilter subclass to pick the data from the Memtable and SSTables that we are looking for. The Memtable read is straightforward. The SSTable read is a little different depending on which kind of request it is:
  • If we are reading a slice of columns, we use the row-level column index to find where to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory.
  • If we are reading a group of columns by name, we still use the column index to locate each column, but first we check the row-level bloom filter to see if we need to do anything at all.
  • The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary.
  • Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a ReducingIterator, which takes an iterator of uncombined columns as input, and yields combined versions as output. (A toy version of this merge is sketched after this list.)
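
A toy version of that reduce step, assuming the input columns arrive sorted by name and that the newest timestamp wins; the class and field names are invented, not the real ReducingIterator:

    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;

    // A column as it comes out of a Memtable/SSTable reader, reduced to essentials.
    record Column(String name, long timestamp, byte[] value) {}

    // The input iterator yields columns from several sources, already sorted by
    // name; consecutive columns with the same name are reduced to the newest one.
    final class ReducingIteratorSketch implements Iterator<Column> {
        private final Iterator<Column> source;
        private Column pending;  // first column of the next group, read ahead

        ReducingIteratorSketch(Iterator<Column> source) {
            this.source = source;
            this.pending = source.hasNext() ? source.next() : null;
        }

        @Override public boolean hasNext() { return pending != null; }

        @Override public Column next() {
            if (pending == null) throw new NoSuchElementException();
            Column best = pending;
            pending = null;
            while (source.hasNext()) {
                Column c = source.next();
                if (!c.name().equals(best.name())) { pending = c; break; }  // next group starts
                if (c.timestamp() > best.timestamp()) best = c;             // newer version wins
            }
            return best;
        }

        public static void main(String[] args) {
            var merged = new ReducingIteratorSketch(List.of(
                    new Column("a", 1, new byte[0]),
                    new Column("a", 5, new byte[0]),
                    new Column("b", 2, new byte[0])).iterator());
            merged.forEachRemaining(c -> System.out.println(c.name() + "@" + c.timestamp()));
        }
    }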

If a quorum read was requested, StorageProxy waits for a majority of nodes to reply and makes sure the answers match before returning. Otherwise, it returns the data reply as soon as it gets it, and checks the other replies for discrepancies in the background in StorageService.doConsistencyCheck. This is called "read repair," and also helps achieve consistency sooner.

  • As an optimization, StorageProxy only asks the closest replica for the actual data; the other replicas are asked only to compute a hash of the data. (A sketch of the digest comparison follows below.)
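
A small sketch of the digest comparison a coordinator could do, assuming MD5 digests of the serialized row (the real message types and hashing details differ):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.List;

    // Ask the closest replica for the data and the others only for a digest;
    // if any digest disagrees with the data reply, a repair round is needed.
    final class QuorumReadSketch {
        static byte[] digest(byte[] data) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("MD5").digest(data);
        }

        // returns true when the data reply and all digest replies agree
        static boolean repliesConsistent(byte[] dataReply, List<byte[]> digestReplies)
                throws NoSuchAlgorithmException {
            byte[] expected = digest(dataReply);
            return digestReplies.stream().allMatch(d -> Arrays.equals(d, expected));
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] row = "row-contents".getBytes();
            List<byte[]> digests = List.of(digest(row), digest(row));
            // all replicas agree, so no read repair is triggered
            System.out.println(repliesConsistent(row, digests));
        }
    }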

 

Tags: Cassandra, architecture, translation, Chinese, wiki
