本文重点关注了系统设计相关的内容,paper后半部分的具体应用此处没有过多涉及。从个人笔记修改而来,因此为英文版本。
Bigtable: A Distributed Storage System for Structured Data
Data model: not a relational data model
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. —— part2
How the map indexed?
(row:string, column:string, time:int64) → string
just like json format, eg:
table{
// ...
"aaaaa" : { //row
"A:foo" : { //col
15 : "y", //timestamp
4 : "m"
},
"A:bar" : { //col
15 : "d",
},
"B:" : { //col
6 : "w"
3 : "o"
1 : "w"
}
},
// ...
}
a particular table: webtable
- row(also called tablet): reversed URL
concurrent: single row key is atomic
lexicographic order - col: column families, contents
family:qualifier
Access control and both disk and memory accounting - timestamp
avoid collisions: unique timestamp, decreasing order
garbage-collection mechanism(eg.)
API
C++ read/write
MapReduce + Bigtable
Building Block
Google File System: store log and data files
distributed Google File System
Google SSTable file format: store Bigtable data
K-V map: iterate key/value pairs in a specified key range
- a sequence of blocks
- a block index
disk seek or memory seek?
Optionally, SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
Chubby: distributed lock service
5 active replicas: 1 master, 4 slave
Paxos algorithm: to keep its replicas consistent in the face of failure
namespace: including directory and small file, op r/w is atomic
session: when expires, lose locks and open handles
Implementation
consist:
- library(?) linked to every client
- 1 master server(schedule, garbage-collect......)
- many tablet server(10-1000 tablets)
As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes.
hierarchy (B+-tree)
Chubby file -> Root tablet -> other METADATA tablets -> UserTables
METADATA: many other things stored in it
Master: schedule & manage
Each tablet is assigned to one tablet server at a time. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers.
The essential point for distributed database: lock
The Bigtable is only a series of ops, real data is stored in GFS.(SSTable)
Tablet Representation
memtable: the recently committed updates are stored in memory in a sorted buffer
reconstruct: redo points in commit logs
Compactions
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
minor(memtable) -> major(SSTable) compaction
Refinement
locality group
Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.
This section describes portions of the implementation in more detail in order to highlight these refinements.
in-memory locality groups are loaded lazily
storage: compression
read performance: caching
Bloom filters
commit-log
Speeding up tablet recovery
Exploiting immutability
Performance Evaluation
Lesson
- large distributed systems are vulnerable to many types of failures
- it is important to delay adding new features until it is clear how the new features will be used
- the importance of proper system-level monitoring
- the value of simple designs