I recently read Amazon's Dynamo paper. I found its "always writeable" business goal, its use of DHTs, vector clocks, and Merkle trees, and its trade-off between consistency and high availability (following the CAP conjecture: partition tolerance is taken as a given, so only one of the remaining two can be kept) all very interesting. I recommend reading these notes alongside the original paper.
What is the problem that this paper tries to solve? How would you summarise its main idea in a few sentences? How does it work in more detail?
What is good about the paper? What is not good about the paper?
To what extent is the design of Dynamo inspired by Distributed Hash Tables (DHTs)? What are the advantages and disadvantages of such a design?
(part 3.3)
Dynamo can be described as a zero-hop DHT
P2P DHTs: routing is global, typically taking multiple hops to reach the right node
Dynamo: each node keeps enough routing information locally to forward a request to the appropriate node directly
How does the design of Dynamo compare to that of BigTable?
Dynamo: simple key-value store, gives up ACID transactions to stay highly available
BigTable: for structured data (a sparse, sorted, multi-dimensional map)
key point:
target: always writeable
consistency & availability (Dynamo): in conflict whenever network partitions are possible
Dynamo chooses weak consistency: eventual consistency
vector clocks
Dynamo
Requirements
- simple query model: read/write operations on a value identified by a unique key; no operations spanning multiple data items and no need for a relational schema
- consistency vs. availability: in conflict when network failures are possible
Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability.
- efficiency: runs on commodity hardware infrastructure and must meet stringent SLAs
- other: used only by internal services, so there are no security-related requirements such as authentication and authorization.
Target: meet SLA
Figure 1: Typically, the aggregator services are stateless, although they use extensive caching.
common industry approach: SLAs described using average, median and expected variance
Amazon instead: SLAs expressed and measured at the 99.9th percentile of the distribution
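To make the contrast concrete, here is a small Python sketch with made-up latency numbers: a slow tail of requests barely moves the mean or median but dominates the 99.9th percentile.

```python
import random
import statistics

# Made-up request latencies (ms): most are fast, a small tail is very slow.
random.seed(0)
latencies = [random.gauss(30, 5) for _ in range(10_000)] \
          + [random.gauss(400, 50) for _ in range(20)]

print(f"mean   = {statistics.mean(latencies):6.1f} ms")    # barely notices the tail
print(f"median = {statistics.median(latencies):6.1f} ms")  # ignores it entirely
print(f"p99.9  = {statistics.quantiles(latencies, n=1000)[-1]:6.1f} ms")  # exposes it
```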
design
it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously
conflict resolution: eventually consistent data store
An important design consideration is to decide when to perform the process of resolving update conflicts
e.g. whether conflicts should be resolved during writes (the traditional approach, where a write may be rejected if not enough replicas are reachable) or during reads (Dynamo's choice, which keeps the store "always writeable")
who performs the process of conflict resolution
- data store: limited to simple policies, e.g. "last write wins"
- application: more flexible, can merge versions in whatever way suits the user experience (e.g. merging shopping carts); a sketch of both options follows
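A minimal Python sketch of the two options; the Version type and the cart-merge function are illustrative stand-ins, not Dynamo's actual API.

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: set        # e.g. items in a shopping cart
    timestamp: float  # wall-clock time of the write (only LWW looks at this)

def resolve_last_write_wins(versions):
    """Data-store-side resolution: keep the version with the newest timestamp."""
    return max(versions, key=lambda v: v.timestamp).value

def resolve_cart_merge(versions):
    """Application-side resolution: union all carts so no added item is lost
    (the paper notes that deleted items can occasionally resurface)."""
    merged = set()
    for v in versions:
        merged |= v.value
    return merged

conflicting = [Version({"book", "pen"}, 10.0), Version({"book", "mug"}, 11.0)]
print(resolve_last_write_wins(conflicting))  # the 'pen' added at t=10.0 is lost
print(resolve_cart_merge(conflicting))       # all three items survive
```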
Other key principles:Incremental scalability, Symmetry, Decentralization, Heterogeneity
related work (omitted here)
P2P system
Architecture
partitioning, replication, versioning, membership, failure handling and scaling.
interface
get(key), put(key, context, object)
partitioning
problems of the basic consistent hashing algorithm (hash ring):
- non-uniform data and load distribution
- oblivious to the heterogeneity in node performance
improvement:
virtual node: A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node.
when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring.
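A rough Python sketch of a ring with virtual nodes. Hashing keys with MD5 follows the paper; the number of tokens per node and the way tokens are derived (hashing "node#i") are assumptions made just for this example.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Position on the ring (Dynamo also uses MD5 over the key)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Each physical node claims several tokens (virtual nodes) on the ring."""
    def __init__(self, nodes, tokens_per_node=8):
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def coordinator(self, key: str) -> str:
        """First node met walking clockwise from the key's position."""
        idx = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["A", "B", "C", "D"])
print(ring.coordinator("user:42"))   # whichever node owns that ring segment
```

Giving a better-provisioned machine more tokens is how the scheme accounts for heterogeneity.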
Replication
In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring.
e.g. in Figure 2: a key in B's range is stored at B (the coordinator) and replicated at C and D
with virtual nodes, the first N-1 successors could map to fewer than N-1 distinct machines, so the preference list skips positions in the ring (and holds more than N nodes to cover possible node failures) until it contains only distinct physical nodes; see the sketch below
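A sketch of building such a preference list; the ring layout and token positions are made up for the example.

```python
import bisect

def preference_list(ring, key_pos, n):
    """Walk clockwise from the key's position, skipping tokens whose physical
    node is already chosen, so the N replicas land on distinct machines."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_pos)
    chosen, seen = [], set()
    for step in range(len(ring)):                    # at most one full lap
        pos, node = ring[(start + step) % len(ring)]
        if node not in seen:
            seen.add(node)
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen

# Tokens as (position, physical node); A and B each own two virtual nodes.
ring = [(10, "A"), (25, "B"), (40, "A"), (55, "C"), (70, "B"), (85, "D")]
print(preference_list(ring, key_pos=12, n=3))   # ['B', 'A', 'C'], all distinct
```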
Data versioning(important for consistency)
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously.
(temporary inconsistencies)
thus multiple versions of the same object can be present in the system at the same time
vector clocks: capture causality between different versions of the same object
format: a list of (node, counter) pairs
data conflict (causally concurrent versions): return all versions to the client / application logic to reconcile
clock size restriction: node failures can cause writes to be handled by many different coordinators, growing the clock; Dynamo truncates it (dropping the oldest (node, counter) pair) once it passes a threshold; a small vector-clock sketch follows
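A small vector-clock sketch reproducing the paper's D1-D4 example with coordinators Sx, Sy, Sz; each clock is a plain dict mapping node name to counter.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is causally equal to or newer than clock `b`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"        # the older version can be forgotten
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"                # conflict: hand both versions to the client

def bump(clock: dict, coordinator: str) -> dict:
    """New version written through `coordinator`: copy the clock, bump its counter."""
    updated = dict(clock)
    updated[coordinator] = updated.get(coordinator, 0) + 1
    return updated

d1 = bump({}, "Sx")      # {'Sx': 1}
d2 = bump(d1, "Sx")      # {'Sx': 2}
d3 = bump(d2, "Sy")      # {'Sx': 2, 'Sy': 1}
d4 = bump(d2, "Sz")      # {'Sx': 2, 'Sz': 1}
print(compare(d2, d1))   # a supersedes b
print(compare(d3, d4))   # concurrent -> a read returns both, client reconciles
```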
Execution of operation: get() & put()
how does a client reach a node?
- route through a generic load balancer
- partition-aware client library
configurable values: R and W.
R is the minimum number of nodes that must participate in a successful read operation.
W is the minimum number of nodes that must participate in a successful write operation.
Setting R and W such that R + W > N yields a quorum-like system.
In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas.
For this reason, R and W are usually configured to be less than N, to provide better latency.
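A toy sketch of the R/W counting rule with in-memory stand-ins for the N replicas; there is no real networking or version reconciliation here, only the quorum arithmetic.

```python
class Replica:
    """In-memory stand-in for a storage node."""
    def __init__(self):
        self.store = {}
        self.alive = True
    def put(self, key, value, version):
        if not self.alive:
            return False
        self.store[key] = (value, version)
        return True
    def get(self, key):
        return self.store.get(key) if self.alive else None

N, R, W = 3, 2, 2                    # R + W > N gives the quorum-like overlap
replicas = [Replica() for _ in range(N)]
replicas[2].alive = False            # one replica is unreachable

def quorum_write(key, value, version):
    acks = sum(1 for r in replicas if r.put(key, value, version))
    return acks >= W                 # succeed once W replicas acknowledge

def quorum_read(key):
    responses = []
    for r in replicas:
        resp = r.get(key)
        if resp is not None:
            responses.append(resp)
    return responses if len(responses) >= R else None

print(quorum_write("cart:1", {"book"}, {"Sx": 1}))   # True: 2 acks >= W
print(quorum_read("cart:1"))                         # 2 responses >= R
```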
Handling Failures(temporary node failure): Hinted Handoff
sloppy quorum
handling the failure of an entire data center: each object is replicated across multiple data centers
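A sketch of the routing decision behind the sloppy quorum with hinted handoff; the node names, the healthy set, and the function name are hypothetical.

```python
def sloppy_write_targets(preference_list, healthy, n):
    """Pick the first N *healthy* nodes walking the preference list; a node that
    stands in for a down replica carries a hint naming the intended owner."""
    targets = []                     # (receiving node, hinted-for node or None)
    down = [node for node in preference_list[:n] if node not in healthy]
    for node in preference_list:
        if len(targets) == n:
            break
        if node not in healthy:
            continue
        hint = down.pop(0) if node not in preference_list[:n] and down else None
        targets.append((node, hint))
    return targets

prefs = ["A", "B", "C", "D", "E"]    # top N=3 plus extra candidates down the ring
print(sloppy_write_targets(prefs, healthy={"B", "C", "D", "E"}, n=3))
# [('B', None), ('C', None), ('D', 'A')] -> D stores A's replica plus a hint
# and hands the data back to A once A recovers.
```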
Handling Failures(permanent node failure): Replica synchronization
Merkle tree: To detect the inconsistencies between replicas faster and to minimize the amount of transferred data
hash the child nodes and build the tree bottom-up; replicas exchange and compare trees top-down as an anti-entropy protocol
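A compact Merkle-tree sketch: leaves are hashes of the key-value pairs in a range, parents are hashes of their children, and equal roots mean two replicas agree. For brevity the diff step compares the leaf level directly instead of descending level by level as a real anti-entropy exchange would.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaves):
    """Build the tree bottom-up; each level holds the hashes of the level below."""
    level = [h(leaf) for leaf in leaves]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:                        # duplicate the last hash if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                                   # tree[-1][0] is the root

def diff_leaves(tree_a, tree_b):
    """Equal roots: the replicas agree, nothing to transfer; otherwise report
    which leaves (key ranges) differ so only those need synchronization."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
print(diff_leaves(build_merkle(replica_a), build_merkle(replica_b)))   # [1]
```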
Ring Membership
how are virtual nodes mapped to physical nodes?
When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets.
Adding/Removing Storage Nodes
the new node takes over key ranges from existing nodes: those nodes transfer the keys in the new node's ranges to it, then stop serving (and can delete) them; see the sketch below
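A sketch of which key ranges change hands when a joining node's tokens are inserted into an existing ring; the token positions are made up and only the add case is shown.

```python
import bisect

def ranges_to_transfer(old_ring, new_tokens):
    """For each token the joining node claims, the key range (predecessor, token]
    moves to it from the node that currently owns that range (its clockwise
    successor on the old ring)."""
    ring = sorted(old_ring)                    # [(position, node), ...]
    positions = [pos for pos, _ in ring]
    moves = []
    for token in sorted(new_tokens):
        idx = bisect.bisect_left(positions, token)
        predecessor = positions[idx - 1]       # wraps around when idx == 0
        old_owner = ring[idx % len(ring)][1]   # clockwise successor = current owner
        moves.append(((predecessor, token), old_owner))
    return moves

old_ring = [(10, "A"), (40, "B"), (70, "C")]
print(ranges_to_transfer(old_ring, new_tokens=[25, 55]))
# [((10, 25), 'B'), ((40, 55), 'C')] -> B hands off (10, 25], C hands off (40, 55]
```

Removing a node is the mirror image: its ranges fall back to the clockwise successors.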
Implementation: all Java
- request coordination
- membership and failure detection
- local persistence engine
EXPERIENCES & LESSONS
Class discussion
internal service, so security problems (authentication, authorization) are out of scope
virtual-node idea -> better load balancing and flexibility: instead of one random position per machine, each physical node owns a set of tokens on the logical ring
other large-scale distributed systems:
blockchain: designed for security & anonymity
web3
how consistent hashing works: partitioning the ring
DHT (distributed hash table) ring: each node is responsible for the key range between its predecessor and itself
how data is located: walking clockwise along the ring, which can be done efficiently
gossip-based protocol: propagates membership changes and maintains an eventually consistent view of membership
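A toy gossip round over in-memory membership views (member -> version of its last announced change); real nodes exchange more state, this only shows why the views converge.

```python
import random

def gossip_round(views):
    """Every node contacts one random peer; both keep, per member, the entry
    with the highest version number, so news spreads epidemically."""
    nodes = list(views)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = dict(views[node])
        for member, version in views[peer].items():
            if version > merged.get(member, -1):
                merged[member] = version
        views[node] = views[peer] = merged

views = {
    "A": {"A": 1, "B": 1, "C": 1},
    "B": {"A": 1, "B": 1, "C": 1},
    "C": {"A": 1, "B": 1, "C": 1, "D": 1},   # only C has seen D join so far
}
for _ in range(3):
    gossip_round(views)
print(all("D" in view for view in views.values()))   # True: membership converged
```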
use binary search over the sorted token positions to find the destination node (as in the ring sketch above)
distinct physical nodes: the preference list skips particular positions in the ring
N counts virtual nodes, but several virtual nodes can sit on the same physical node, so positions that map to an already-chosen physical node are skipped
Brewer's conjecture: CAP Theorem
consistency, availability, and partition-tolerance: pick 2 out of 3!
Dynamo's design: network partitions are taken as a given, so it sacrifices strong consistency to gain high availability