CockroachDB: an open-source system similar to Spanner that uses RocksDB for underlying storage

Date: 2023-07-04 19:33:40
Tags: rocksdb distributed US range SQL spanner failures CockroachDB

Excerpted from: https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency, and survivability (hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies.

The entry point for database clients is the SQL interface. Every node in a CockroachDB cluster can act as a client SQL gateway. A SQL gateway transforms client SQL statements into key-value (KV) operations and executes them, distributing the operations across the cluster as necessary and returning results to the client. CockroachDB implements a single, monolithic sorted map from key to value, where both keys and values are byte strings.
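
To make the "sorted map of byte strings" abstraction concrete, here is a minimal in-memory sketch. It is only an illustration: the class name `SortedKV`, its methods, and the `user/` key layout are hypothetical and say nothing about how CockroachDB actually encodes keys or stores data.

```python
import bisect


class SortedKV:
    """Toy sorted map from byte-string keys to byte-string values."""

    def __init__(self):
        self._keys = []    # keys kept in ascending byte order
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, end: bytes):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


kv = SortedKV()
kv.put(b"user/2", b"bob")
kv.put(b"user/1", b"alice")
kv.put(b"zone/1", b"us-east")

# b"user0" is the first key after every key with the prefix b"user/",
# because the byte "0" (0x30) directly follows "/" (0x2f).
print(kv.scan(b"user/", b"user0"))
```

Because keys are ordered, a prefix scan is a contiguous slice of the keyspace, which is what makes it natural to cut the map into contiguous segments.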

The KV map is logically composed of smaller segments of the keyspace called ranges. Each range is backed by data stored in a local KV storage engine (we use RocksDB, a variant of LevelDB). Range data is replicated to a configurable number of additional CockroachDB nodes. Ranges are merged and split to maintain a target size, 64 MB by default. The relatively small size facilitates quick repair and rebalancing to address node failures, new capacity, and even read/write load. However, the size must be balanced against the pressure on the system from having more ranges to manage.
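
A rough sketch of how a key maps to a range, and of the trade-off between range size and range count, might look like the following. The function names, the split rule, and the reading of the default size as 64 MiB are assumptions made for illustration, not CockroachDB's actual policy.

```python
import bisect

# Assumption: the default target size in the text is taken to mean 64 MiB.
TARGET_RANGE_BYTES = 64 * 1024 * 1024


def find_range(split_keys, key):
    """Return the index of the range that holds `key`.

    `split_keys` is a sorted list of range boundaries. Range 0 covers keys
    below split_keys[0]; range i covers [split_keys[i-1], split_keys[i]);
    the last range covers everything from split_keys[-1] upward.
    """
    return bisect.bisect_right(split_keys, key)


def needs_split(range_bytes, target=TARGET_RANGE_BYTES):
    """Hypothetical policy: split once a range exceeds the target size."""
    return range_bytes > target


def ranges_needed(total_bytes, target=TARGET_RANGE_BYTES):
    """Minimum number of target-sized ranges needed to hold total_bytes."""
    return -(-total_bytes // target)  # ceiling division


split_keys = [b"g", b"n", b"t"]       # three boundaries, four ranges
print(find_range(split_keys, b"m"))   # key "m" falls in [g, n)
print(needs_split(80 * 1024 * 1024))  # an 80 MiB range is over the target
print(ranges_needed(2**40))           # ranges needed for 1 TiB of data
```

Under these assumptions, 1 TiB of data already implies 16,384 ranges to track, which is the management pressure the paragraph above refers to.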

CockroachDB achieves horizontal scalability:

  • adding more nodes increases the capacity of the cluster by the amount of storage on each node (divided by a configurable replication factor), theoretically up to 4 exabytes (4 EB) of logical data;
  • client queries can be sent to any node in the cluster, and queries can operate independently (without conflicts), meaning that overall throughput scales linearly with the number of nodes in the cluster;
  • queries are distributed (ref: distributed SQL) so that the overall throughput of single queries can be increased by adding more nodes.
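
The capacity claim in the first bullet is simple arithmetic: raw storage divided by the replication factor. A small sketch, with the function name and the example cluster sizes chosen purely for illustration:

```python
def logical_capacity(num_nodes, bytes_per_node, replication_factor):
    """Logical capacity when every byte is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return num_nodes * bytes_per_node // replication_factor


TIB = 2**40
print(logical_capacity(9, 2 * TIB, 3) // TIB)   # 9 nodes of 2 TiB, 3 replicas -> 6
print(logical_capacity(12, 2 * TIB, 3) // TIB)  # after adding 3 more nodes -> 8
```

In this example, three extra 2 TiB nodes add 6 TiB of raw storage but only 2 TiB of logical capacity, because each byte is kept three times.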

CockroachDB achieves strong consistency:

  • uses a distributed consensus protocol for synchronous replication of data in each key value range. We’ve chosen to use the Raft consensus algorithm; all consensus state is stored in RocksDB.
  • single or batched mutations to a single range are mediated via the range's Raft instance. Raft guarantees ACID semantics.
  • logical mutations which affect multiple ranges employ distributed transactions for ACID semantics. CockroachDB uses an efficient non-locking distributed commit protocol.

CockroachDB achieves survivability:

  • range replicas can be co-located within a single datacenter for low latency replication and survive disk or machine failures. They can be distributed across racks to survive some network switch failures.
  • range replicas can be located in datacenters spanning increasingly disparate geographies to survive ever-greater failure scenarios from datacenter power or networking loss to regional power failures (e.g. { US-East-1a, US-East-1b, US-East-1c }, { US-East, US-West, Japan }, { Ireland, US-East, US-West }, { Ireland, US-East, US-West, Japan, Australia }).

CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from a historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. CockroachDB implements a limited form of linearizability, providing ordering for any observer or chain of observers.
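
Write skew is easiest to see in a small simulation. The sketch below is a toy model, not CockroachDB's mechanism: `Store`, `Txn`, and `run_scenario` are hypothetical names, the SI mode checks only write-write conflicts, and the stricter mode uses plain read-set validation as a stand-in for SSI. Two transactions each read the other person's on-call status from the same snapshot and then take themselves off call; the intended invariant is that at least one person stays on call.

```python
class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.commit_log = []  # list of (commit_ts, set_of_written_keys)
        self.clock = 0

    def begin(self):
        # Each transaction works against a private snapshot of the data.
        return Txn(self, self.clock, dict(self.data))


class Txn:
    def __init__(self, store, start_ts, snapshot):
        self.store, self.start_ts, self.snapshot = store, start_ts, snapshot
        self.reads, self.writes = set(), {}

    def read(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self, check_reads):
        # Keys written by transactions that committed after this one began.
        concurrent = set()
        for ts, keys in self.store.commit_log:
            if ts > self.start_ts:
                concurrent |= keys
        conflict_set = set(self.writes)
        if check_reads:  # stricter mode: reads must also be unchanged
            conflict_set |= self.reads
        if conflict_set & concurrent:
            return False  # abort
        self.store.clock += 1
        self.store.data.update(self.writes)
        self.store.commit_log.append((self.store.clock, set(self.writes)))
        return True


def run_scenario(check_reads):
    store = Store({"alice": "on", "bob": "on"})
    t1, t2 = store.begin(), store.begin()
    for txn, me, other in ((t1, "alice", "bob"), (t2, "bob", "alice")):
        if txn.read(other) == "on":  # "someone else is on call, so I can leave"
            txn.write(me, "off")
    outcome = (t1.commit(check_reads), t2.commit(check_reads))
    return outcome, store.data


print(run_scenario(check_reads=False))  # SI-like: both commit, nobody on call
print(run_scenario(check_reads=True))   # stricter: second transaction aborts
```

With only write-write checks, both transactions commit because they write different keys, and the invariant is silently broken. With read validation, the second transaction aborts and would have to retry, which is the kind of cost a contended workload pays for the stronger guarantee.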

Similar to Spanner directories, CockroachDB allows configuration of arbitrary zones of data. This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don’t allow movement of fine grained data on the level of entity groups.

Architecture

CockroachDB implements a layered architecture. The highest level of abstraction is the SQL layer (currently unspecified in this document), which provides familiar relational concepts such as schemas, tables, columns, and indexes. The SQL layer in turn depends on the distributed key value store, which handles the details of range addressing to provide the abstraction of a single, monolithic key value store. The distributed KV store communicates with any number of physical cockroach nodes. Each node contains one or more stores, one per physical device.

Each store contains potentially many ranges, the lowest-level unit of key-value data. Ranges are replicated using the Raft consensus protocol. The diagram below is a blown-up version of stores from four of the five nodes in the previous diagram. Each range is replicated three ways using Raft. The color coding shows associated range replicas.

Each physical node exports two RPC-based key value APIs: one for external clients and one for internal clients (exposing sensitive operational features). Both services accept batches of requests and return batches of responses. Nodes are symmetric in capabilities and exported interfaces; each has the same binary and may assume any role.

Nodes and the ranges they provide access to can be arranged with various physical network topologies to make trade-offs between reliability and performance. For example, a triplicated (3-way replica) range could have each replica located on different:

  • disks within a server to tolerate disk failures.
  • servers within a rack to tolerate server failures.
  • servers on different racks within a datacenter to tolerate rack power/network failures.
  • servers in different datacenters to tolerate large scale network or power outages.

Up to F failures can be tolerated, where the total number of replicas N = 2F + 1 (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
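
The relationship N = 2F + 1 can be turned around to compute how many failures a given replica count tolerates. In the sketch below the function names are illustrative; the majority size is the standard quorum for a consensus protocol such as Raft.

```python
def tolerated_failures(num_replicas):
    """Largest F such that 2F + 1 <= num_replicas."""
    return (num_replicas - 1) // 2


def quorum_size(num_replicas):
    """Smallest majority of num_replicas."""
    return num_replicas // 2 + 1


for n in (3, 4, 5):
    print(n, tolerated_failures(n), quorum_size(n))
```

Note that an even replica count buys nothing extra: four replicas still tolerate only one failure, because a majority of four is three.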

From: https://blog.51cto.com/u_11908275/6624558
