Recently, PingCAP was invited to Percona Live Amsterdam 2016.
As the only company from Asia invited to the conference, PingCAP co-founder and CEO Liu Qi shared the team's experience in developing TiDB and its latest technical progress.
First, about me. I am an infrastructure engineer and I am also the CEO of PingCAP. Currently, my team and I are working on two open source projects: TiDB and TiKV. Ti is short for Titanium, which is a chemical element known for its corrosion resistance and it is widely used in high-end technologies.
So today we will cover the following topics:
Why another database?
What kind of database do we want to build?
How to design such a database, including the principles, the architecture, and the design decisions.
How to develop such a database, including the architecture and the core technologies for TiKV and TiDB.
How we test the database to ensure its quality and stability.
Before we start, let's go back to the very beginning and ask ourselves a question: why another database? We all know that there are many databases, such as traditional relational databases and NoSQL. So why another one?
Relational databases like MySQL, Oracle, and PostgreSQL are very difficult to scale. Even though there are sharding solutions such as YouTube/Vitess and MySQL Proxy, none of them supports distributed transactions or cross-node joins.
NoSQL systems like HBase, MongoDB, and Cassandra scale well, but they don't support SQL or consistent transactions.
NewSQL, represented by Google Spanner and F1, is as scalable as NoSQL systems while maintaining ACID transactions. That's exactly what we need. Inspired by Spanner and F1, we are building a NewSQL database. Of course, it's open source.
So we are building a NewSQL database with the following features:
First of all, it supports SQL. We have been using SQL for decades and many of our applications are using SQL. We cannot just give it up.
Second, it must be very easy to scale. You can easily increase the capacity or balance the load by adding more machines.
Third, it supports ACID transactions, which are one of the key features of relational databases. With a strong consistency guarantee, developers can write correct logic with less code.
Last, it is highly available in case of machine failures or even the downtime of an entire data center, and it can recover automatically.
In short, we want to build a distributed, consistent, scalable SQL database. We name it TiDB.
Now we have a clear picture of what kind of database we want to build. The next step is how: how to design it, how to develop it, and how to test it. In this section, I will introduce how we design TiDB, including the principles, the architecture, and the design decisions.
▌The principles or the philosophy
Before we design, we have several principles or philosophy in mind:
TiDB must be user-oriented.
○ It must ensure that no data is ever lost and that the system can automatically recover from machine failures or even the downtime of an entire datacenter.
○ It should be easy to use.
○ It should be cross-platform and able to run in any environment, whether on-premise, in the cloud, or in a container.
○ As an open source project, we are dedicated to being an important part of the big community through our active engagement, contribution and collaboration.
We need TiDB to be easy to maintain, so we chose a loosely coupled approach. We designed the database to be highly layered, with a SQL layer and a Key-Value layer. If there is a bug in the SQL layer, we can simply update the SQL layer.
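One practical consequence of this layering is that the SQL layer must map relational rows onto keys and values. Here is a rough sketch of that idea; the string-based encoding and the table ID below are made up for illustration, while the real TiDB encoding is a binary, memcomparable format:

```go
package main

import "fmt"

// encodeRowKey builds a key of the form "t_<tableID>_r_<rowID>".
// This is a simplified, illustrative scheme, not TiDB's actual
// key layout.
func encodeRowKey(tableID, rowID int64) string {
	return fmt.Sprintf("t_%d_r_%d", tableID, rowID)
}

func main() {
	// A hypothetical row (rowID=1) of a table whose internal ID is 42
	// becomes a single key-value pair in the Key-Value layer.
	kv := map[string]string{
		encodeRowKey(42, 1): "name=Alice,age=30",
	}
	fmt.Println(kv[encodeRowKey(42, 1)])
}
```

Because the SQL layer only ever sees keys and values, the storage layer underneath can be replaced or upgraded independently, which is exactly the point of the loose coupling.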
The alternatives: Although our project is inspired by Google Spanner and F1, we are different from those projects. When we design TiDB and TiKV, we have our own practices and decisions in choosing different technologies.
❖ Disaster recovery
The first and foremost design principle is to build a database where no data is ever lost. To ensure the safety of the data, we found that multiple replicas alone are not enough, so we also keep a binlog in both the SQL layer and the Key-Value layer. And of course, we must make sure that we always have a backup in case the entire cluster crashes.
❖ Easy to use
The second design principle is about usability. After years of struggling with different workarounds and trade-offs, we are fully aware of the users' pain points. So when it came to designing our own database, we decided to make it easy to use: no scary sharding keys, no partitioning, no explicit handmade local or global indexes, and scaling that is transparent to the users.
The database we are building also needs to be cross-platform. It can run on on-premise devices. Here is a picture of TiDB running on a Raspberry Pi cluster with 20 nodes.
It also supports popular containers such as Docker, and we are making it work with Kubernetes. And of course, it can run on any cloud platform, whether public, private, or hybrid.
❖ The community and ecosystem
The next design principle is about the community and ecosystem. We want to stand on the shoulders of giants instead of creating something new and scary. TiDB supports the MySQL protocol and is compatible with most MySQL drivers (ODBC, JDBC) and SQL syntax, with MySQL clients and ORMs, and with many MySQL management and benchmarking tools.
etcd is a great project. In our Key-Value store, TiKV, which I will dive deep into later, we have been working with the etcd team very closely. We share the Raft implementation, and we do code reviews on Raft module for each other.
RocksDB is also a great project. It's mature, fast, tunable, and widely used in very large-scale production environments, especially at Facebook. TiKV uses RocksDB as its local storage. While we were testing it in our system, we found some bugs, and the RocksDB team fixed them very quickly.
A few months ago, we needed a tool to simulate slow, unstable disks, and a team member found Namazu. At that time, Namazu didn't support hooking fsync. When we raised this request, the Namazu team responded immediately, implemented the feature in just a few hours, and has been very open to implementing other features as well. We are deeply impressed by their responsiveness and efficiency.
The Rust community is amazing. Besides the good development experience of using Rust, we also built the Prometheus driver in Rust to collect metrics.
We are so glad to be a part of this great family. So many thanks to the Rust team, gRPC, Prometheus and Grafana.
We are using the Spark connector in TiDB. TiDB is great for small or medium queries and Spark is better for complex queries with lots of data. We believe we can learn a lot from the Spark community too, and of course we would like to contribute as much as possible.
So overall, we’d like to be a part of the big open source community and would like to engage, contribute and collaborate to build great things together.
▌Loose coupling – the logical architecture
This diagram shows the logical architecture of the database.
As I mentioned earlier in the design principles, we are adopting a loosely coupled approach. From the diagram, we can see that the system is highly layered. TiDB works as the MySQL server, and TiKV works as the distributed Key-Value layer. Inside TiDB, we have the MySQL Server layer and the SQL layer. Inside TiKV, we have transactions, MVCC, Raft, and the local Key-Value storage, RocksDB.
For TiKV, we are using Raft to ensure data consistency and horizontal scalability. Raft is a consensus algorithm that is equal to Paxos in fault tolerance and performance. Our implementation is ported from etcd, which is widely used, very well tested, and highly stable. I will cover the technical details later.
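The heart of Raft's fault tolerance is simple majority arithmetic: a log entry is committed once a quorum of replicas has acknowledged it. A minimal sketch (the function names are mine, not from etcd or TiKV):

```go
package main

import "fmt"

// quorum returns the majority size for a Raft group of n replicas.
func quorum(n int) int { return n/2 + 1 }

// committed reports whether an entry acknowledged by `acks` replicas
// (including the leader) can be committed in a group of n replicas.
func committed(acks, n int) bool { return acks >= quorum(n) }

func main() {
	// With 3 replicas, 2 acknowledgements suffice, so the group
	// tolerates 1 failure; with 5 replicas, it tolerates 2.
	fmt.Println(quorum(3), committed(2, 3)) // 2 true
	fmt.Println(quorum(5), committed(2, 5)) // 3 false
}
```

This is why Raft groups usually have an odd number of replicas: going from 3 to 4 replicas raises the quorum without tolerating any additional failures.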
From the architecture, you can also see that we don’t have a distributed file system. We are using RocksDB as the local store.
In the next few slides, I am going to talk about design decisions about using the alternative technologies compared with Spanner and F1, as well as the pros and cons of these alternatives.
❖ Atomic clocks / GPS clocks VS Timestamp Allocator
If you’ve read the Spanner paper, you might know that Spanner has TrueTime API, which uses the atomic clocks and GPS receivers to keep the time consistent between different data centers.
The first alternative technology we chose is to replace the TrueTime API with the Timestamp Allocator. It goes without saying that time is important and that real time is vital in distributed systems. But can we get real time? What about clock drift?
The sad truth is that we can’t get real time precisely because of clock drift, even if we use GPS or Atomic Clocks.
In TiDB, we don’t have atomic clocks or GPS clocks. We are using the Timestamp Allocator introduced in Percolator, a paper published by Google in 2010.
The pros of using the Timestamp Allocator are its easy implementation and the absence of any dependency on special hardware. The disadvantage is that if there are multiple datacenters, especially geographically distributed ones, the latency of fetching timestamps is really high.
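A minimal single-process sketch of such an allocator, assuming a hybrid layout of wall-clock milliseconds plus an 18-bit logical counter (the constant and the struct are illustrative rather than TiDB's exact implementation, and overflow of the logical counter is ignored):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const logicalBits = 18 // low bits reserved for a per-millisecond counter

// TSO hands out strictly increasing timestamps by combining wall-clock
// milliseconds with a logical counter, so even requests arriving within
// the same millisecond get distinct, ordered timestamps.
type TSO struct {
	mu       sync.Mutex
	physical int64 // last observed physical time, in ms
	logical  int64 // counter within the current millisecond
}

func (t *TSO) Next() int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now().UnixMilli()
	if now > t.physical {
		t.physical, t.logical = now, 0
	} else {
		t.logical++ // same (or earlier) ms: bump the logical part
	}
	return t.physical<<logicalBits | t.logical
}

func main() {
	var tso TSO
	a, b := tso.Next(), tso.Next()
	fmt.Println(b > a) // timestamps are strictly increasing
}
```

Because every timestamp comes from one allocator, ordering never depends on how well the machines' clocks agree; the price is the round trip to that allocator, which is exactly the cross-datacenter latency problem mentioned above.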
❖ Distributed file system VS RocksDB
Spanner uses the Colossus File System, the successor to the Google File System (GFS), as its distributed file system. But in TiKV, we don’t depend on any distributed file system. We use RocksDB, an embeddable persistent key-value store for fast storage. The primary design point of RocksDB is great performance for server workloads, and it is easy to tune for read, write, and space amplification. The pros are that it’s very simple, very fast, and easy to tune. However, it’s not easy to make it work with Kubernetes properly.
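Conceptually, the layers above TiKV's local store only need a narrow embedded-KV surface from RocksDB. The interface and the in-memory stand-in below are a sketch of that idea, not TiKV's real storage API:

```go
package main

import "fmt"

// Engine is the narrow surface an embedded store such as RocksDB
// exposes to the layers above it. Both the interface and the
// map-backed stand-in are illustrative, not TiKV's actual API.
type Engine interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, bool)
}

// memEngine is a toy in-memory engine standing in for RocksDB.
type memEngine struct{ m map[string][]byte }

func newMemEngine() *memEngine { return &memEngine{m: map[string][]byte{}} }

func (e *memEngine) Put(key, value []byte) error {
	e.m[string(key)] = value
	return nil
}

func (e *memEngine) Get(key []byte) ([]byte, bool) {
	v, ok := e.m[string(key)]
	return v, ok
}

func main() {
	var eng Engine = newMemEngine()
	eng.Put([]byte("k"), []byte("v"))
	v, ok := eng.Get([]byte("k"))
	fmt.Println(ok, string(v)) // true v
}
```

Keeping the storage dependency behind an interface like this is what makes "no distributed file system, just a local store" workable: replication and distribution live above this line, in Raft.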
❖ Paxos VS Raft
The next choice we have made is to use the Raft consensus algorithm instead of Paxos. The key features of Raft are a strong leader, leader election, and membership changes. Our Raft implementation is ported from etcd. The pros are that it’s easy to understand and implement, widely used, and very well tested. As for the cons, we haven’t seen any real ones.
❖ C++ VS Go & Rust
As for the programming languages, we are using Go for TiDB and Rust for TiKV. We chose Go because it’s very good for fast development and concurrency, and Rust for its high quality and performance. As for the cons, there are not as many third-party libraries as there are for C++.
That’s all about how we design TiDB. I have introduced the principles, the architecture, and design decisions about using the alternative technologies. The next step is to develop TiDB.