
MIT6.824_LEC3_GFS_Outline


Why are we reading the GFS paper?

  • distributed storage is a key abstraction
  • what should the interface/semantics look like?
  • how does it work internally?
  • the GFS paper touches on many of 6.824's themes
    • parallel performance
    • fault tolerance
    • replication
    • consistency
  • a good systems paper -- details all the way from the application down to the network
  • a successful design from industry
  • influential open-source descendant: Hadoop HDFS

Why is distributed storage hard?

  • high performance → shard data over many servers
  • many servers → constant faults
  • fault tolerance → replication
  • replication → potential inconsistencies
  • strong consistency → lower performance

What kind of consistency would we like?

  • ideally, the same behavior as a single server
  • the ideal server executes client requests one at a time, even when many clients send them concurrently
  • each read sees the most recent write, even if the server has crashed and restarted
  • all clients read the same data

Suppose C1 and C2 write x concurrently, and afterwards C3 and C4 read. Do they see 1 or 2?

C1: Wx1
C2: Wx2
C3:   Rx?
C4:     Rx?

Either 1 or 2 is possible, but C3 and C4 must see the same value. That is the strong consistency model. A single server, however, has poor fault tolerance.

Using replicas for fault tolerance makes strong consistency hard to implement.

A simple but broken replication scheme:
two replica servers, S1 and S2
clients send write requests to both servers in parallel; clients send read requests to either one

In the example above, C1's and C2's write requests may arrive at the two replicas in different orders.
If C3 reads S1, it may see 1.
If C4 reads S2, it may see 2. Or S1 receives a write, but the client crashes before sending the request to S2.
Then it is not strong consistency: S1 and S2 no longer hold the same data.
Ensuring strong consistency among replicas takes a lot of network communication and is slow.
There are intermediate designs between performance and consistency; we will look at some of them today.
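
To make the race concrete, here is a minimal Go sketch (a toy of mine, not from the lecture) of the broken scheme: two in-memory replicas, two clients writing to both, and nothing enforcing an order, so the replicas can end up disagreeing.

    package main

    import (
        "fmt"
        "sync"
    )

    // replica is a toy key/value server with no replication protocol at all.
    type replica struct {
        mu   sync.Mutex
        data map[string]int
    }

    func (r *replica) write(key string, val int) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.data[key] = val
    }

    func (r *replica) read(key string) int {
        r.mu.Lock()
        defer r.mu.Unlock()
        return r.data[key]
    }

    func main() {
        s1 := &replica{data: map[string]int{}}
        s2 := &replica{data: map[string]int{}}

        // C1 writes x=1 and C2 writes x=2, each to both replicas, concurrently.
        var wg sync.WaitGroup
        for _, v := range []int{1, 2} {
            v := v
            wg.Add(1)
            go func() {
                defer wg.Done()
                s1.write("x", v) // the two writes may reach S1 and S2
                s2.write("x", v) // in different orders
            }()
        }
        wg.Wait()

        // C3 reads S1, C4 reads S2: they may see different values.
        fmt.Println("C3 sees", s1.read("x"), "- C4 sees", s2.read("x"))
    }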

GFS

Background

Many Google services needed a big, fast, unified storage system
MapReduce, crawler/indexing, log storage and analysis
shared among those applications
each file automatically spread over many servers and disks

  • parallel performance from many clients -- MapReduce
  • huge files, bigger than a single server's disk can hold

automatic recovery from node failures
each deployment lives in a single data center
serves only Google applications and users
aimed at sequential reads of and appends to big files
not a low-latency system for small pieces of data

What was new about this paper in 2003? What convinced SOSP to accept it?

  • not the basic ideas of distribution, sharding, and fault tolerance
  • huge scale
  • real-world industrial experience
  • successful use of weak consistency

Overall structure

  • clients (a library, RPC -- but not visible as a UNIX file system)
  • coordinator manages file names (metadata)
  • chunkservers store 64 MB chunks
  • big files are split into 64 MB chunks and written to the chunkservers
  • each chunk has 3 replicas

Coordinator state

  • tables kept in RAM (for speed; they must stay small)
  • file name → list of chunk handles (nv)
  • chunk handle → version # (nv)
  • chunk handle → list of chunkservers holding it (v)
  • chunk handle → identity of the primary (v)
  • chunk handle → lease expiration (v)
    nv = non-volatile, also appended to a log on disk; v = volatile
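
A minimal Go sketch of how these tables might be laid out in memory; the field names are mine, not GFS's, and the nv fields are the ones that would also be appended to the on-disk log.

    package gfs

    import "time"

    // ChunkHandle is a hypothetical globally unique chunk identifier.
    type ChunkHandle uint64

    // CoordinatorState sketches the in-memory tables listed above.
    type CoordinatorState struct {
        // file name -> ordered list of chunk handles (nv: also logged to disk).
        Files map[string][]ChunkHandle
        // per-chunk state, keyed by chunk handle.
        Chunks map[ChunkHandle]*ChunkInfo
    }

    type ChunkInfo struct {
        Version     uint64    // nv: logged to disk
        Servers     []string  // v: rebuilt by polling the chunkservers
        Primary     string    // v
        LeaseExpiry time.Time // v
    }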

What are the steps when client C reads a file?

  1. C sends the file name and offset to the coordinator CO (if not cached)
  2. CO finds the chunk handle for that offset
  3. CO replies with the chunk handle and the list of chunkservers, only those with the latest version
  4. C caches the handle and the server list
  5. C sends a request to the nearest chunkserver: chunk handle and offset
  6. the chunkserver reads the chunk file from disk and returns the data to C
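
A sketch of the client library's read path, following the six steps above for a read that fits in one chunk; the two RPC helpers and their signatures are assumptions for illustration, not GFS's real interface.

    package gfs

    const chunkSize = 64 << 20 // 64 MB chunks

    // Stand-ins for the client library's RPCs to the coordinator and to a
    // chunkserver; their shapes are guesses, used only for this sketch.
    var (
        askCoordinator      func(file string, chunkIndex int64) (handle uint64, servers []string, err error)
        readFromChunkserver func(server string, handle uint64, offsetInChunk int64, n int) ([]byte, error)
    )

    // Read performs steps 1-6 above.
    func Read(file string, offset int64, n int) ([]byte, error) {
        // 1-3. ask the coordinator which chunk covers this offset and who
        //      holds the latest version (skipped when the answer is cached).
        handle, servers, err := askCoordinator(file, offset/chunkSize)
        if err != nil {
            return nil, err
        }
        // 4. caching of handle -> servers is omitted in this sketch.
        // 5-6. fetch the bytes from one (ideally the nearest) chunkserver.
        return readFromChunkserver(servers[0], handle, offset%chunkSize, n)
    }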

Clients only ask the coordinator for names and chunk handles, not for data

  • clients cache the name → chunk handle mapping
  • CO never handles data, so it is lightly loaded

How does CO know which chunkservers hold which chunks?

It polls the chunkservers at startup.
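
One plausible shape for what each chunkserver reports when polled (a made-up struct, not the paper's wire format):

    package gfs

    // ChunkReport is what a chunkserver might send the coordinator at startup
    // (and in later heartbeats): the chunks on its disk and their versions.
    type ChunkReport struct {
        Server   string            // the reporting chunkserver's address
        Versions map[uint64]uint64 // chunk handle -> version number held
    }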

What are the steps for a client record append?

  1. C asks CO for the file's last chunk
  2. CO returns the primary P and the secondary replicas to C
  3. C sends the data to all of the chunkservers (not yet appended) and waits for all of them to acknowledge
  4. C tells P to append
  5. P checks that its lease has not expired and that the chunk has enough space
  6. P picks the append offset, at the end of the chunk
  7. P writes the data into the chunk (a Linux file)
  8. P tells the secondaries the offset and tells them to append the data
  9. P waits for all secondaries to reply; a secondary may time out or reply "error", e.g. if out of disk space
  10. P tells C success or failure
  11. If failure, C retries
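
A sketch of the primary's part of steps 5 through 10, assuming hypothetical helpers for the local disk write and for the RPC to each secondary; it shows the control flow only.

    package gfs

    import (
        "errors"
        "time"
    )

    // Assumed helpers, not GFS's real interfaces.
    var (
        writeChunkLocally func(handle uint64, offset int64, data []byte) error
        appendOnSecondary func(server string, handle uint64, offset int64, data []byte) error
    )

    // appendAtPrimary sketches what the primary does for one record append.
    // nextOffset is the current end of the chunk; it returns the chosen offset.
    func appendAtPrimary(handle uint64, data []byte, leaseExpiry time.Time,
        nextOffset int64, secondaries []string) (int64, error) {

        // 5. refuse if the lease has expired or the 64 MB chunk would overflow.
        if time.Now().After(leaseExpiry) || nextOffset+int64(len(data)) > 64<<20 {
            return 0, errors.New("lease expired or chunk full")
        }
        // 6-7. pick the offset at the end of the chunk and write locally.
        offset := nextOffset
        if err := writeChunkLocally(handle, offset, data); err != nil {
            return 0, err
        }
        // 8-9. tell each secondary to append the data at the same offset;
        //      any timeout or error fails the whole append.
        for _, s := range secondaries {
            if err := appendOnSecondary(s, handle, offset, data); err != nil {
                return 0, err // 10-11. the client sees "error" and retries
            }
        }
        // 10. success: the record is at the same offset on every replica.
        return offset, nil
    }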

What consistency guarantees does GFS give its clients?

They need to be stated in some form that tells applications how to use GFS.

If P tells C that a record append succeeded, then every subsequent reader of the file will see the appended record somewhere in it.

That allows many anomalies: different clients may see the record at different places, may see duplicates, and may see data from failed writes; GFS applications have to be prepared for all of these.
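
As the paper suggests, applications typically cope by writing self-validating, self-identifying records: the writer embeds a checksum and a unique ID, and readers discard padding and duplicates. A toy reader-side sketch with a made-up record type:

    package gfs

    // Record is a hypothetical application-level record: a writer-chosen
    // unique ID plus a payload the application can validate (e.g. a checksum).
    type Record struct {
        ID      string
        Payload []byte
    }

    // valid stands in for whatever integrity check the writer embedded.
    func (r Record) valid() bool { return r.ID != "" }

    // dedupRecords keeps the first occurrence of each record ID and drops
    // padding, garbage, and duplicates -- filtering GFS leaves to applications.
    func dedupRecords(records []Record) []Record {
        seen := make(map[string]bool)
        var out []Record
        for _, r := range records {
            if !r.valid() || seen[r.ID] {
                continue
            }
            seen[r.ID] = true
            out = append(out, r)
        }
        return out
    }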

Let's think about how GFS achieves what it promises.

Look at how it handles various failures:
  crash, crash+reboot, crash+replacement, message loss, partition
  How does GFS handle each one?

  • What if a client fails while appending a record?
  • What if a client has cached the wrong primary for a chunk?
  • What if a client has cached a stale chunkserver list for a chunk?
  • Could a coordinator crash+reboot cause it to forget about a file? Or forget which chunkservers hold which chunks?
  • Two clients do a record append at exactly the same time. Will they overwrite each other's records?
  • Suppose a secondary never receives the append command from the primary
    • due to a temporary network failure
    • what happens if a client reads from that secondary?
  • What if the primary S1 is alive and serving clients, but the network between CO and S1 fails?
    • a network partition
    • will CO pick a new primary?
    • will there now be two primaries?
    • so an append could go to one primary and a read to the other -- doesn't that break the consistency guarantee?
    • this is "split brain"
  • What if the primary crashes before sending an append to its secondaries?
    • could a secondary that missed the append be chosen as the new primary?
  • Chunkserver S4, holding an old stale replica of a chunk, is offline. The primary and the live secondaries all crash.
    • S4 comes back to life before they do. Will CO pick S4 as the primary?
    • Is it better to choose the stale data as primary, or to have no replica at all?
  • How does the coordinator set up a primary for a chunk?
    If client wants to write, but no primary, or primary lease expired and dead.
    Coordinator has been polling chunkservers about what chunks/versions they have.
    (A sketch of these steps in code follows this list of questions.)

    1. if no chunkservers w/ latest version #, error
    2. pick primary P and secondaries from those w/ latest version #
    3. increment version #, write to disk
    4. tell P and secondaries who they are, and new version #
    5. replicas write new version # to disk
  • What should a primary do if a secondary always fails writes?
    e.g. dead, or out of disk space, or disk has broken.
    Keep failing client requests?
    Or ask coordinator to declare a new set of servers and new version?
      The paper does not describe this process.

  • If there's a partitioned primary serving client appends, and its
    lease expires, and the coordinator picks a new primary, will the new
    primary have the latest data as updated by partitioned primary?

  • What if the coordinator fails altogether.
    Will the replacement know everything the dead coordinator knew?
    E.g. each chunk's version number? primary? lease expiry time?

  • Who/what decides the coordinator is dead, and must be replaced?
    Could the coordinator replicas ping the coordinator, take over if no response?

  • What happens if the entire building suffers a power failure?
    And then power is restored, and all servers reboot.

  • Suppose the coordinator wants to create a new chunk replica.
    Maybe because too few replicas.
    Suppose it's the last chunk in the file, and being appended to.
    How does the new replica ensure it doesn't miss any appends?
      After all it is not yet one of the secondaries.

  • Is there any circumstance in which GFS will break the guarantee?
    i.e. append succeeds, but subsequent readers don't see the record.
    All coordinator replicas permanently lose state (permanent disk failure).
      Could be worse: result will be "no answer", not "incorrect data".
      "fail-stop"
    All chunkservers holding the chunk permanently lose disk content.
      again, fail-stop; not the worst possible outcome
    CPU, RAM, network, or disk yields an incorrect value.
      checksum catches some cases, but not all
    Time is not properly synchronized, so leases don't work out.
      So multiple primaries, maybe write goes to one, read to the other.
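
Returning to the earlier question of how the coordinator sets up a primary: a minimal sketch of those five steps, with made-up helpers for logging to disk and for notifying the replicas.

    package gfs

    import "errors"

    // Assumed helpers for this sketch; not GFS's real interfaces.
    var (
        logVersionToDisk func(handle uint64, version uint64) error
        notifyReplica    func(server string, handle uint64, version uint64,
            primary string, secondaries []string) error
    )

    // choosePrimary sketches steps 1-5: reported maps each chunkserver that
    // holds the chunk to the version it reported; latest is the version
    // number recorded by the coordinator. It returns the chosen primary.
    func choosePrimary(handle uint64, latest uint64, reported map[string]uint64) (string, error) {
        // 1. find the chunkservers holding the latest version; error if none.
        var upToDate []string
        for s, v := range reported {
            if v == latest {
                upToDate = append(upToDate, s)
            }
        }
        if len(upToDate) == 0 {
            return "", errors.New("no chunkserver has the latest version")
        }
        // 2. pick a primary and secondaries from the up-to-date set.
        primary, secondaries := upToDate[0], upToDate[1:]
        // 3. increment the version number and write it to disk first.
        latest++
        if err := logVersionToDisk(handle, latest); err != nil {
            return "", err
        }
        // 4-5. tell every replica its role and the new version number,
        //      which each replica writes to its own disk.
        for _, s := range upToDate {
            if err := notifyReplica(s, handle, latest, primary, secondaries); err != nil {
                return "", err
            }
        }
        return primary, nil
    }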

What would it take to have no anomalies -- strict consistency?
I.e. all clients see the same file content.
Too hard to give a real answer, but here are some issues.

  • All replicas should complete each write, or none.
      Perhaps tentative writes until all promise to complete it?
      Don't expose writes until all have agreed to perform them!
  • Primary should detect duplicate client write requests.
  • If primary crashes, some replicas may be missing the last few ops.
      New primary must talk to all replicas to find all recent ops,
      and sync with secondaries.
  • Clients must be prevented from reading from stale ex-secondaries;
      perhaps secondaries have leases, or clients know about chunk versions
      and get a lease on that version from coordinator.
    You'll see solutions in Labs 2 and 3!

Performance (Figure 3)
large aggregate throughput for read
  94 MB/sec total for 16 clients
   or 6 MB/second per client
   is that good?
   one disk sequential throughput was about 30 MB/s
   one NIC was about 10 MB/s
  Close to saturating inter-switch link's 125 MB/sec
  So: per-client performance is not huge
    but multi-client scalability is good
    which is more important?
  Table 3 reports 500 MB/sec for production GFS, which is a lot
writes to different files lower than possible maximum
  authors blame their network stack (but no detail)
concurrent appends to single file
  limited by the server that stores last chunk
hard to interpret after 15 years, e.g. how fast were the disks?

Random issues worth considering
What would it take to support small files well?
What would it take to support billions of files?
Could GFS be used as a wide-area file system?
  With replicas in different cities?
  All replicas in one datacenter is not very fault tolerant!
How long does GFS take to recover from a failure?
  Of a chunkserver?
  Of the coordinator?
How well does GFS cope with slow chunkservers?

Retrospective interview with GFS engineer:
http://queue.acm.org/detail.cfm?id=1594206
file count was the biggest problem
  eventual numbers grew to 1000x those in Table 2 !
  hard to fit in coordinator RAM
  coordinator scanning of all files/chunks for GC is slow
1000s of clients too much CPU load on coordinator
applications had to be designed to cope with GFS semantics
  and limitations
coordinator fail-over initially entirely manual, 10s of minutes
BigTable is one answer to many-small-files problem
and Colossus apparently shards coordinator data over many coordinators

Summary
case study of performance, fault-tolerance, consistency
  specialized for MapReduce applications
good ideas:
  global cluster file system as universal infrastructure
  separation of naming (coordinator) from storage (chunkserver)
  sharding for parallel throughput
  huge files/chunks to reduce overheads
  primary to sequence writes
  leases to prevent split-brain chunkserver primaries
not so great:
  single coordinator performance
   ran out of RAM and CPU
  chunkservers not very efficient for small files
  lack of automatic fail-over to coordinator replica
  maybe consistency was too relaxed
