TiDB

TiDB Weekly [2017.01.01]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 88 views • 2017-01-07 13:13


Weekly update in TiDB


Last week, we landed 28 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, we landed 19 PRs in the TiKV repositories.


Added




  • Move raw_get to thread pool.




  • Add ResourceKind for operator and don't resend duplicated AddPeer response when the peer is still pending, see #449.



  • Add member command for pd-ctl.


Fixed



Improved



Original link

TiDB Weekly [2016.12.26]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 50 views • 2017-01-07 13:12


New Release


TiDB RC1 is released!


Weekly update in TiDB


Last week, we landed 34 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, we landed 14 PRs in the TiKV repositories.


Added



Fixed



Improved



Original link

TiDB Weekly [2016.12.19]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 121 views • 2016-12-25 13:46


Weekly update in TiDB


Last week, we landed 32 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, we landed 11 PRs in the TiKV repositories.


Added



Fixed



Improved



Original link

TiDB Weekly [2016.12.12]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 129 views • 2016-12-17 14:14


Weekly update in TiDB


Last week, we landed 41 PRs in the TiDB repositories.


Added



Fixed



Improved



Weekly update in TiKV


Last week, we landed 34 PRs in the TiKV repositories.


Added



Fixed



Improved



Original link

TiDB Weekly [2016.12.05]

Article sharing · qiuyesuifeng published an article • 0 comments • 123 views • 2016-12-07 11:42


Weekly update in TiDB


Last week, we landed 48 PRs in the TiDB repositories, 6 PRs in the TiDB docs repositories.


Added



Fixed



Improved



Document change


The following guides are updated:



Weekly update in TiKV


Last week, we landed 22 PRs in the TiKV repositories.


Added



Fixed



Improved




  • Replace the score type with resource kind to calculate the scores more easily.




  • Replace origin concept balance with schedule and simplify configurations.



  • Use coordinator to control the speed of different schedulers.


Original link

TiDB Weekly [2016.11.28]

Article sharing · qiuyesuifeng published an article • 0 comments • 101 views • 2016-11-30 12:12


Weekly update in TiDB


Last week, we landed 44 PRs in the TiDB repositories, 3 PRs in the TiDB docs repositories.


Added



Fixed



  • Add sequence number in binlog to preserve the original mutation order.



Improved



Document change


Add the following new guides:



Weekly update in TiKV


Last week, we landed 20 PRs in the TiKV repositories.


Added



Fixed



Improved



Original link

Percolator and the TiDB Transaction Algorithm

Article sharing · qiuyesuifeng published an article • 1 comment • 124 views • 2016-11-24 11:24


This article first gives an overview of how Google Percolator works. Percolator is Google's previous-generation distributed transaction solution. It is built on top of BigTable and is used inside Google for incremental web-index updates; the original paper is here. The idea is fairly simple: overall it is an optimized implementation of two-phase commit with a two-level lock optimization. TiDB's transaction model follows the Percolator model.
The overall flow is as follows:


Read-write transactions


1) Before the transaction commits, all update/delete operations are buffered on the client.
2) Prewrite phase:


First, among all the rows to be written, one is chosen as the primary; the others are secondaries.


PrewritePrimary: write the L column of the primaryRow (take the lock); the L column records the start timestamp of this transaction. Before writing the L column, two checks are performed:



  1. Whether another client has already locked this row (Locking).

  2. Whether, by checking the W column, a write in the range [startTs, +Inf) has already been committed after this transaction started (Conflict).


In either of these two cases a transaction conflict is returned. Otherwise, the lock is taken successfully, and the row content is written to the data column with the timestamp set to startTs.


Once the primaryRow is locked, the prewrite flow for the secondaries proceeds:



  1. The locking flow is similar to that of the primaryRow, except that the lock content is the transaction start timestamp plus the information of the primaryRow's Lock.

  2. The checks are the same as for the primaryRow.


After the lock is written successfully, the row is written with the timestamp set to startTs.


3) If any step of the Prewrite flow above fails, the transaction is rolled back: the Lock is deleted, and the data versioned at startTs is deleted.


4) After Prewrite completes, the Commit phase begins. The current timestamp is commitTs, and commitTs > startTs:



  1. Commit primary: write a new record to the W column with the timestamp commitTs and the content startTs, indicating that the latest version of the data is the one written at startTs.

  2. Delete the L column.


If committing the primary row fails, the whole transaction is rolled back, with the same rollback logic as in prewrite. If commit primary succeeds, the secondaries can be committed asynchronously; the flow is the same as commit primary, and it does not matter if it fails.


Reads within a transaction



  1. Check whether the row has an L column with a timestamp in [0, startTs]. If it does, another transaction currently holds this row; if the lock has timed out, try to clean it up, otherwise wait for it to time out or for the other transaction to release it. Note that returning the old version of the data directly is not allowed here, otherwise phantom reads would occur.

  2. Read the latest data of the row as of startTs: read the W column with a timestamp in [0, startTs], take its value and convert it to a timestamp t, then read the data content of this row at version t.


Because the lock has two levels, primary and secondary, once the primary row lock is removed the transaction is considered successfully committed. The benefit is that the commit of the secondaries can be done asynchronously; the only cost is that while the asynchronous commit is in progress, a concurrent read request may need to do some lock cleanup.
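
The two-phase flow above maps naturally onto code. Below is a minimal, illustrative Go sketch of a Percolator-style prewrite/commit over an in-memory store; the row/kvStore types and the L/W column layout are hypothetical simplifications of the description above, not the actual TiDB/TiKV implementation.

package main

import (
	"errors"
	"fmt"
)

// One logical row with the three columns described above:
// data (versioned values), lock (L column) and write (W column).
type row struct {
	data  map[uint64][]byte // startTs -> value
	lock  *uint64           // startTs of the transaction holding the lock, if any
	write map[uint64]uint64 // commitTs -> startTs
}

type kvStore map[string]*row

func (s kvStore) getRow(k string) *row {
	r, ok := s[k]
	if !ok {
		r = &row{data: map[uint64][]byte{}, write: map[uint64]uint64{}}
		s[k] = r
	}
	return r
}

// prewrite locks one key and writes the value at startTs,
// failing on lock or write conflicts as in the two checks above.
func (s kvStore) prewrite(k string, v []byte, startTs uint64) error {
	r := s.getRow(k)
	if r.lock != nil {
		return errors.New("locked by another transaction")
	}
	for commitTs := range r.write {
		if commitTs >= startTs {
			return errors.New("write conflict after startTs")
		}
	}
	r.lock = &startTs
	r.data[startTs] = v
	return nil
}

// commit writes the W column record and removes the lock.
func (s kvStore) commit(k string, startTs, commitTs uint64) {
	r := s.getRow(k)
	r.write[commitTs] = startTs
	r.lock = nil
}

func main() {
	s := kvStore{}
	startTs, commitTs := uint64(10), uint64(11)
	keys := []string{"primary", "secondary1", "secondary2"}
	// Prewrite the primary first, then the secondaries.
	for _, k := range keys {
		if err := s.prewrite(k, []byte("v-"+k), startTs); err != nil {
			fmt.Println("rollback:", err) // real code would delete locks and the startTs data
			return
		}
	}
	// Commit the primary; the secondaries could be committed asynchronously.
	for _, k := range keys {
		s.commit(k, startTs, commitTs)
	}
	fmt.Println("committed at", commitTs)
}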


Original link

The MVCC (Multi-Version Concurrency Control) Mechanism in TiKV

Article sharing · qiuyesuifeng published an article • 0 comments • 121 views • 2016-11-24 11:20


A Brief Introduction to Concurrency Control


Transaction isolation plays a very important role in database systems, because the database must give each user the illusion of being the only one connected to it, which greatly simplifies application development. For the database system itself, however, many users may be connected at the same time, so concurrency problems such as data races must be solved. Against this background, a database management system (DBMS) has to guarantee that the results of concurrent operations are safe, which it does through serializability.


Although serializability is a great concept, it is hard to implement efficiently. A classic approach is two-phase locking (2PL). With 2PL, the DBMS maintains read and write locks to ensure that potentially conflicting transactions execute in a well-defined order, which guarantees serializability. However, this lock-based approach has some drawbacks:



  1. Read locks and write locks block each other.

  2. Most transactions are read-only and therefore harmless from a transaction-ordering point of view. With a lock-based isolation mechanism, a long-running read transaction prevents the object from being modified for its whole duration, and later transactions are blocked until it finishes. This has a big impact on concurrency.


Multi-Version Concurrency Control (MVCC) solves this problem in an elegant way. In MVCC, whenever a data object is to be changed or deleted, the DBMS does not delete or modify the existing object in place; instead it creates a new version of it, so concurrent readers can still read the old version while the write proceeds. The benefit of this model is that reads no longer block; in fact, no locks are needed at all. This is such an attractive property that many mainstream databases implement MVCC, such as PostgreSQL, Oracle, and Microsoft SQL Server.


MVCC in TiKV


Let's dive into MVCC in TiKV and see how it is implemented there.


Timestamp Oracle(TSO)


Because TiKV is a distributed storage system, it needs a global timestamp service, referred to below as TSO (Timestamp Oracle), to allocate monotonically increasing timestamps. In TiKV this service is provided by PD; in Google's Spanner it is provided by multiple atomic clocks and GPS.


Storage


In terms of the source code layout, src/storage is a very good entry point for understanding the MVCC part of TiKV. Storage is the struct that actually receives external commands.


pub struct Storage {
    engine: Box<Engine>,
    sendch: SendCh<Msg>,
    handle: Arc<Mutex<StorageHandle>>,
}

impl Storage {
    pub fn start(&mut self, config: &Config) -> Result<()> {
        let mut handle = self.handle.lock().unwrap();
        if handle.handle.is_some() {
            return Err(box_err!("scheduler is already running"));
        }

        let engine = self.engine.clone();
        let builder = thread::Builder::new().name(thd_name!("storage-scheduler"));
        let mut el = handle.event_loop.take().unwrap();
        let sched_concurrency = config.sched_concurrency;
        let sched_worker_pool_size = config.sched_worker_pool_size;
        let sched_too_busy_threshold = config.sched_too_busy_threshold;
        let ch = self.sendch.clone();
        let h = try!(builder.spawn(move || {
            let mut sched = Scheduler::new(engine,
                                           ch,
                                           sched_concurrency,
                                           sched_worker_pool_size,
                                           sched_too_busy_threshold);
            if let Err(e) = el.run(&mut sched) {
                panic!("scheduler run err:{:?}", e);
            }
            info!("scheduler stopped");
        }));
        handle.handle = Some(h);

        Ok(())
    }
}

The start function is a good illustration of how a Storage instance starts running.


Engine


First comes Engine. Engine is a trait describing the actual database that the storage system plugs into; the raftkv and rocksdb engines both implement this trait.


StorageHandle


StorageHandle is the struct that handles the commands received from sendch, using mio for I/O.


Next, Storage implements asynchronous functions such as async_get and async_batch_get. These functions send the corresponding commands into the channel, where they are picked up by the scheduler and executed asynchronously.


OK, now that we understand how the Storage struct is implemented, we can finally reach the MVCC layer that is invoked by the Scheduler.


When the storage receives a command from a client, it forwards it to the scheduler. The scheduler then runs the corresponding procedure or calls the corresponding asynchronous function. There are two kinds of operations in the scheduler: reads and writes. Reads are implemented in MvccReader, which is easy to understand and not covered here. The write part is the core of MVCC.


MVCC


Two-phase commit (2PC) is implemented in the MVCC layer and is the core of the whole TiKV transaction model. A transaction consists of two phases.


Prewrite

Pick one row as the primary row; the remaining rows are secondary rows.
Lock the primary row. Before locking, check whether another concurrent lock has already been placed on this row, or whether there is any commit after startTS. Both cases are conflicts, and once a conflict occurs the transaction is rolled back.
Repeat the above for the secondary rows.


Commit

Rollback is invoked if a conflict occurs during Prewrite.


Garbage Collector

It is easy to see that without a garbage collector (GC) to remove stale versions, the database would accumulate more and more MVCC versions. But we cannot simply remove every version before some safe point, because for a given key there may be only one version, and that version must be kept. In TiKV, if there is a Put or Delete record before the safe point, then all of the earlier writes for that key can be removed; otherwise only Delete, Rollback, and Lock records are removed.
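
As a rough illustration of the retention rule just described, here is a small Go sketch that decides which write records of one key survive a GC pass. This is only one possible reading of the rule; the writeRec type and the exact treatment of the newest Put/Delete before the safe point are assumptions for illustration, not TiKV's actual GC code.

package main

import "fmt"

// Kinds of records in the write column family, as described above.
type writeKind int

const (
	put writeKind = iota
	del
	rollback
	lock
)

type writeRec struct {
	commitTs uint64
	kind     writeKind
}

// gcKey scans the write records of one key, newest first: once a Put or
// Delete at or before the safe point is seen, every older record is dropped;
// otherwise only Rollback/Lock records at or before the safe point are dropped.
func gcKey(writes []writeRec, safePoint uint64) []writeRec {
	kept := make([]writeRec, 0, len(writes))
	removeAll := false
	for _, w := range writes { // writes are sorted by commitTs descending
		switch {
		case w.commitTs > safePoint:
			kept = append(kept, w) // newer than the safe point: always keep
		case removeAll:
			// hidden behind a newer Put/Delete before the safe point: drop
		case w.kind == put || w.kind == del:
			kept = append(kept, w) // newest Put/Delete before the safe point
			removeAll = true
		default:
			// Rollback or Lock record at or before the safe point: drop
		}
	}
	return kept
}

func main() {
	ws := []writeRec{{105, put}, {100, rollback}, {90, put}, {80, del}}
	fmt.Println(gcKey(ws, 102)) // keeps the writes at 105 and 90; drops 100 and 80
}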


TiKV-Ctl for MVCC


During development and debugging, we found that querying MVCC version information is a very frequent and important operation, so we built a new tool to query it. TiKV stores Key-Values, Locks, and Writes in CF_DEFAULT, CF_LOCK, and CF_WRITE respectively. They are encoded in the following format:

        default                          lock                                     write
key     z{encoded_key}{start_ts(desc)}   z{encoded_key}                           z{encoded_key}{commit_ts(desc)}
value   {value}                          {flag}{primary_key}{start_ts(varint)}    {flag}{start_ts(varint)}

Details can be found here.


Because all the MVCC information in RocksDB is stored as CF Key-Values, to query the version information of a Key we only need to encode it in the corresponding way and then look it up in the matching CF. This is how the CF Key-Values are represented.
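
To make the lookup idea concrete, here is a small Go sketch of how a user key plus timestamp could be laid out for each CF, following the table above. The 'z' prefix and column layout come from the table; the big-endian, bit-inverted timestamp encoding and the helper names are assumptions for illustration, not TiKV's exact byte format.

package main

import (
	"encoding/binary"
	"fmt"
)

// descTs encodes a timestamp so that larger timestamps sort first
// (descending order), as suggested by "(desc)" in the table above.
func descTs(ts uint64) []byte {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], ^ts) // bitwise NOT inverts the sort order
	return b[:]
}

// The three CF key layouts from the table: all start with 'z' + encoded key.
func defaultCFKey(encodedKey []byte, startTs uint64) []byte {
	return append(append([]byte{'z'}, encodedKey...), descTs(startTs)...)
}

func lockCFKey(encodedKey []byte) []byte {
	return append([]byte{'z'}, encodedKey...)
}

func writeCFKey(encodedKey []byte, commitTs uint64) []byte {
	return append(append([]byte{'z'}, encodedKey...), descTs(commitTs)...)
}

func main() {
	k := []byte("user_key")
	fmt.Printf("default: %x\n", defaultCFKey(k, 100))
	fmt.Printf("lock:    %x\n", lockCFKey(k))
	fmt.Printf("write:   %x\n", writeCFKey(k, 101))
}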


Original link

A Look Inside Syncer, the TiDB Online Data Replication Tool

Article sharing · qiuyesuifeng published an article • 3 comments • 192 views • 2016-11-24 11:02


TiDB is a fully distributed relational database. From day one we wanted it to be compatible with MySQL syntax, so that existing MySQL users (whether on a single MySQL instance or on a sharded multi-machine MySQL deployment) can, with essentially no code changes, keep their existing SQL and ACID transactions while also enjoying the high concurrency, high throughput, and MPP performance that a distributed system brings.


For users, ease of use is the most basic requirement for giving it a try. Thanks to the efforts of the community and the PingCAP team, we provide two one-click deployment options, one Binary-based and one Kubernetes-based, so that a distributed TiDB cluster can be deployed in a few minutes for a quick trial.
Of course, the best way to evaluate TiDB is to replicate a mirror of an existing MySQL database into TiDB and run comparison tests against it: it is simple, intuitive, and convincing. In fact, we already provide a full set of tools to help users replicate data online; see our earlier article, "Using TiDB as a MySQL Slave for Real-Time Replication", which we won't repeat here. Later, many community members wanted to understand the technical details of the key component, Syncer, and that is what this article is about.


First, let's look at the overall architecture diagram of Syncer to get an intuitive sense of its role and position.


[Figure: Syncer architecture]


As the architecture shows, Syncer works mainly by registering itself as a MySQL Slave, communicating with the MySQL Master, and continuously reading the MySQL Binlog, then parsing Binlog Events, applying filter rules, and replicating the data. From an engineering point of view it is actually quite simple; the tricky parts are Binlog Event parsing and handling all kinds of error cases, which is where the pitfalls are.


To fully explain Syncer's online replication implementation, we need to cover some background first.


MySQL Replication


Let's first look at MySQL's native replication scheme, which is quite simple in principle:


1) The MySQL Master records data changes in the Binlog (Binary Log).
2) The MySQL Slave's I/O Thread copies the Master's Binlog to a local Relay Log.
3) The MySQL Slave's SQL Thread reads the local Relay Log and applies the data changes to itself.


[Figure: MySQL replication flow]


MySQL Binlog


MySQL's Binlog comes in several different formats; let's go through them briefly and look at their pros and cons.


1) Row
The MySQL Master records the detailed change of every row of a table in the Binlog.
Pros: the row-level changes are recorded completely, with no dependency on stored procedures, functions, or triggers, so there are no master/slave inconsistencies caused by context-dependent features.

Cons: all data changes are recorded in full in the Binlog, which consumes more storage space.


2) Statement
The MySQL Master records every SQL statement that modifies data in the Binlog.

Pros: compared with Row mode, Statement mode does not need to record every row change, which saves storage and I/O and improves performance.

Cons: context-dependent features such as auto-increment ids, user-defined functions, and on update current_timestamp/now may cause data inconsistencies.


3) Mixed

The MySQL Master effectively combines the Row and Statement modes.

Pros: Row or Statement mode is chosen automatically per SQL statement, striking a good balance between data consistency, performance, and storage space.

Cons: mixing two different formats makes parsing and processing more complicated.


MySQL Binlog Event


Now that we understand MySQL Replication and the Binlog formats, we finally get to the most complicated part: parsing the MySQL Binlog Event protocol.


Before parsing MySQL Binlog Events, let's first see how a MySQL Slave interacts with the MySQL Master at the protocol level.


Binlog dump


First, we need to impersonate a Slave and register with the MySQL Master so that the Master will send Binlog Events. Registration is simple: send the COM_REGISTER_SLAVE command to the Master along with the Slave's information. Note that every server instance in a MySQL replication topology must be identified by a unique server id, so our fake slave also needs a unique server id.


Binlog Event


A Binlog Event consists of three parts: the header, the post-header, and the payload.

There are many versions of the MySQL Binlog Event format; we only care about v4, the version supported since MySQL 5.1.x, since hardly anyone uses anything older than that.


The header of a Binlog Event has the following format:


4 bytes timestamp
1 bytes event type
4 bytes server-id
4 bytes event-size
4 bytes log pos
2 bytes flags

The header has a fixed length of 19 bytes. event type identifies the type of the event, event size is the total length of the event including the header, and log pos is the position of the next event.
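
As a concrete illustration, here is a small Go sketch that decodes the fixed 19-byte header described above. The eventHeader struct and function names are hypothetical; the field widths follow the layout above and the little-endian byte order of the MySQL binary protocol.

package main

import (
	"encoding/binary"
	"fmt"
)

// eventHeader mirrors the 19-byte v4 binlog event header described above.
type eventHeader struct {
	Timestamp uint32
	EventType byte
	ServerID  uint32
	EventSize uint32
	LogPos    uint32 // position of the next event
	Flags     uint16
}

// parseEventHeader decodes the fixed 19-byte header.
func parseEventHeader(b []byte) (eventHeader, error) {
	if len(b) < 19 {
		return eventHeader{}, fmt.Errorf("need 19 bytes, got %d", len(b))
	}
	return eventHeader{
		Timestamp: binary.LittleEndian.Uint32(b[0:4]),
		EventType: b[4],
		ServerID:  binary.LittleEndian.Uint32(b[5:9]),
		EventSize: binary.LittleEndian.Uint32(b[9:13]),
		LogPos:    binary.LittleEndian.Uint32(b[13:17]),
		Flags:     binary.LittleEndian.Uint16(b[17:19]),
	}, nil
}

func main() {
	raw := make([]byte, 19)
	binary.LittleEndian.PutUint32(raw[9:13], 100) // event size = 100 for the demo
	h, err := parseEventHeader(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%d size=%d next=%d\n", h.EventType, h.EventSize, h.LogPos)
}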


This header is common to all events; next, let's look at the specific events.


FORMAT_DESCRIPTION_EVENT


In a v4 Binlog file, the first event is the FORMAT_DESCRIPTION_EVENT, with the following format:


2 bytes         binlog-version
string[50] mysql-server version
4 bytes create timestamp
1 byte event header length
string[p] event type header lengths

The field we need to care about is event type header lengths, which stores the post-header length of each event type. Usually we can ignore it, but when parsing the very important ROWS_EVENT later, we need it to determine the length of the TableID; more on that below.


ROTATE_EVENT


The end of a Binlog file, as long as the Master does not crash, is usually a ROTATE_EVENT, with the following format:


Post-header
8 bytes position

Payload
string[p] name of the next binlog

It simply records the binlog filename and position of the next event. Note that after the Slave sends a Binlog dump request, the Master first sends a ROTATE_EVENT to tell the Slave where the next event is, and only then sends the FORMAT_DESCRIPTION_EVENT.


As you can see, the Binlog Event format is quite simple and is documented in detail. Usually we only care about a few specific event types, so we only need to write parsing code for those and can simply skip the rest.


TABLE_MAP_EVENT


As mentioned above, Syncer uses Row-format Binlog. Insert, update, and delete operations correspond to the core ROWS_EVENT, which records the change of every row, and parsing that data is quite involved. Before going into ROWS_EVENT in detail, let's look at TABLE_MAP_EVENT, which records some information about a table, with the following format:


post-header:
if post_header_len == 6 {
4 bytes table id
} else {
6 bytes table id
}
2 bytes flags

payload:
1 byte schema name length
string schema name
1 byte [00]
1 byte table name length
string table name
1 byte [00]
lenenc-int column-count
string.var_len[length=$column-count] column-def
lenenc-str column-meta-def
n bytes NULL-bitmask, length: (column-count + 8) / 7

The byte length of table id is determined by post_header_len, which is stored in the FORMAT_DESCRIPTION_EVENT. Note that although a table id can refer to a particular table, the Master may change a table's table id because of Alter Table, Binlog rotation, and so on, so we cannot use the table id externally as a stable index for a table.


The most important part of TABLE_MAP_EVENT is the column meta information; when we parse the ROWS_EVENT later, we use it to handle the different data types. column def defines the type of each column.


ROWS_EVENT


ROWS_EVENT covers three kinds of events, insert, update, and delete, and comes in three versions: v0, v1, and v2.

The format of ROWS_EVENT is rather complex:


header:
if post_header_len == 6 {
4 table id
} else {
6 table id
}
2 flags
if version == 2 {
2 extra-data-length
string.var_len extra-data
}

body:
lenenc_int number of columns
string.var_len columns-present-bitmap1, length: (num of columns+7)/8
if UPDATE_ROWS_EVENTv1 or v2 {
string.var_len columns-present-bitmap2, length: (num of columns+7)/8
}

rows:
string.var_len nul-bitmap, length (bits set in 'columns-present-bitmap1'+7)/8
string.var_len value of each field as defined in table-map

if UPDATE_ROWS_EVENTv1 or v2 {
string.var_len nul-bitmap, length (bits set in 'columns-present-bitmap2'+7)/8
string.var_len value of each field as defined in table-map
}
... repeat rows until event-end

The table id in ROWS_EVENT works like the one in TABLE_MAP_EVENT. Even though the table id may change over time, the table ids of a ROWS_EVENT and its TABLE_MAP_EVENT are guaranteed to match, so that is how we find the corresponding TABLE_MAP_EVENT.

To save space, ROWS_EVENT uses bitmaps to describe the state of each column.


First we need to read the columns present bitmap, which describes the state of each column. If a column's bit is not set, i.e. it is 0, the ROWS_EVENT contains no data for that column, and the consumer can simply substitute null.


Next comes the null bitmap, which indicates which columns in an actual row are null. The big gotcha here is that the null-bitmap length is not computed as (num of columns+7)/8, the most common way MySQL computes bitmap lengths, but from the number of bits set in the columns present bitmap. Why design it this way? Probably mainly because MySQL 5.6 added the minimal and noblob Binlog Row Image formats. With minimal in particular, an update only records the changed columns. For example, if a row has 16 columns, 2 bytes are enough for the null bitmap, but if only the first column was updated, 1 byte suffices, because the remaining bits are guaranteed to be 0 and need no extra space. "Bits set" is easy to understand: it is the number of 1s in a byte's binary representation; the bits set of 1 is 1, of 3 is 2, and of 255 is 8.


With the present bitmap and the null bitmap in hand, we can actually parse the column data of the row. For each column, first check whether it is marked in the present bitmap; if the bit is 0, skip it and treat it as null. Then check whether it is marked in the null bitmap; if that bit is 1, the value is null. Only then do we parse the columns that actually carry data.
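
Here is a small Go sketch of the present-bitmap / null-bitmap logic just described, including the bits-set computation of the null-bitmap length. The helper names and the 16-column example are made up for illustration; this is not go-mysql's actual implementation.

package main

import (
	"fmt"
	"math/bits"
)

// bitsSet counts the 1 bits in a bitmap, which is how the null-bitmap
// length is derived from the columns-present bitmap, as described above.
func bitsSet(bitmap []byte) int {
	n := 0
	for _, b := range bitmap {
		n += bits.OnesCount8(b)
	}
	return n
}

// bitAt reports whether bit i is set in a little-endian bitmap.
func bitAt(bitmap []byte, i int) bool {
	return bitmap[i/8]&(1<<uint(i%8)) != 0
}

func main() {
	// 16 columns, but only the first column is present (minimal row image).
	present := []byte{0x01, 0x00}
	nullBitmapLen := (bitsSet(present) + 7) / 8 // 1 byte instead of 2
	nullBitmap := []byte{0x00}                  // the single present column is not null
	fmt.Println("null bitmap bytes:", nullBitmapLen)

	presentIdx := 0
	for col := 0; col < 16; col++ {
		if !bitAt(present, col) {
			fmt.Printf("col %d: not present -> null\n", col)
			continue
		}
		if bitAt(nullBitmap, presentIdx) {
			fmt.Printf("col %d: null\n", col)
		} else {
			fmt.Printf("col %d: parse value from the row payload\n", col)
		}
		presentIdx++ // the null bitmap is indexed by position among present columns
	}
}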


But since all we have is the binary stream of a row, how do we know how to parse each column? This is where the column def and meta from TABLE_MAP_EVENT come in.
column def defines the data type of the column. For some types, such as MYSQL_TYPE_LONG and MYSQL_TYPE_TINY, the length is fixed, so we can read that many bytes directly to get the value. Other types are not so simple, and we need the meta to help with the computation.


For example, with MYSQL_TYPE_BLOB, a meta of 1 means a tiny blob whose first byte is the blob length, 2 means a short blob whose first two bytes are the length, and so on, while for MYSQL_TYPE_VARCHAR the meta stores the string length. There are also the most complicated types, such as MYSQL_TYPE_NEWDECIMAL and MYSQL_TYPE_TIME2; parsing the different column types is complex enough to deserve a chapter of its own, so for reasons of space we won't expand on it here. See the official documentation for details.


With all of that done, we can finally parse a complete ROWS_EVENT :)


XID_EVENT

When a transaction commits, both Statement- and Row-format Binlogs append an XID_EVENT at the end to mark the end of the transaction; it contains the transaction ID.


QUERY_EVENT


QUERY_EVENT mainly records the actual SQL statement that was executed; all MySQL DDL operations are recorded in this event.


Syncer


Having covered MySQL Replication and MySQL Binlog Events, Syncer becomes much easier to understand, and its basic architecture and features were introduced above. To parse and replicate the MySQL Binlog, Syncer uses go-mysql, written by our chief architect Tang Liu, as its core library; this library is already used in production at GitHub and bilibili, so it is safe and reliable. We'll skip those details here; if you are interested, take a look at the open-source code on GitHub. Below we focus on a few core issues:


Choosing the MySQL Binlog format


The first design consideration for Syncer is reliability: even if Syncer exits abnormally, it should be possible to simply restart it without affecting the consistency of the online data. To achieve this, we have to make the data replication re-entrant.
In Mixed mode, an insert is recorded in the Binlog as an insert SQL statement. If Syncer exits abnormally before the Savepoint has been updated, after a restart it would re-apply the earlier insert SQL and hit primary key conflicts. We could of course rewrite the SQL and turn insert into replace, but that requires SQL parsing and transformation, which gets messy. Also, the latest MySQL 5.7 already uses Row mode as the default Binlog format. So in the Syncer implementation we naturally chose Row mode as the Binlog format for data replication.


Choosing the Savepoint


For Syncer itself, we mostly want it to be as simple and efficient as possible, so every time Syncer restarts it should resume, much like resuming a download, from the Binlog Pos it last replicated. How to choose the Savepoint then becomes a question worth considering.

For a DML operation (taking an Insert SQL as an example), the basic Binlog Events look roughly like this:


TABLE_MAP_EVENT
QUERY_EVENT → begin
WRITE_ROWS_EVENT
XID_EVENT

As we saw with MySQL Binlog Events, every Event tells us the Binlog Pos where the next Event starts, so in principle we only need to save that Pos. But we must keep in mind that the position of a TABLE_MAP_EVENT must not be saved as the Savepoint, because a WRITE_ROWS_EVENT basically cannot be parsed without its TABLE_MAP_EVENT. This is the main reason many people complain that the MySQL Binlog protocol is inflexible: neither TABLE_MAP_EVENT nor WRITE_ROWS_EVENT carries any schema information; that information has to be kept somewhere else, such as on the MySQL Slave. In other words, the MySQL Binlog is not self-describing.


DDL operations are simpler, of course: a DDL statement is itself a QUERY_EVENT.


So, for both performance and safety, Syncer saves the Savepoint periodically and whenever it encounters a DDL. You may also have noticed that the Savepoint is currently stored locally, which is a certain single point of failure; fixing that is still on our TODO list.


Resumable data replication


We raised this problem above. With Row-format MySQL Binlog, implementing resumable replication is relatively easy. For example, for a transaction that inserts 3 rows, the events look roughly like this:


TABLE_MAP_EVENT
QUERY_EVENT → begin
WRITE_ROWS_EVENT
WRITE_ROWS_EVENT
WRITE_ROWS_EVENT
XID_EVENT

So what Syncer does is straightforward: it combines each WRITE_ROWS_EVENT with its TABLE_MAP_EVENT to generate a replace into SQL statement. Why not insert? Mainly because replace into is re-entrant: executing it multiple times does not break data consistency.
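
As a sketch of this step, the following Go snippet turns one decoded row plus its table-map information into a parameterized replace into statement. The tableMap and buildReplace names are hypothetical and only illustrate the idea; Syncer's real code path differs.

package main

import (
	"fmt"
	"strings"
)

// tableMap holds the schema/table/column names that, as explained above,
// the binlog itself does not carry and must be kept externally.
type tableMap struct {
	Schema, Table string
	Columns       []string
}

// buildReplace turns one decoded row into a "replace into" statement with
// placeholders, so that re-applying it after a restart stays idempotent.
func buildReplace(t tableMap, row []interface{}) (string, []interface{}) {
	cols := strings.Join(t.Columns, ", ")
	holders := strings.TrimSuffix(strings.Repeat("?, ", len(row)), ", ")
	sql := fmt.Sprintf("REPLACE INTO %s.%s (%s) VALUES (%s)", t.Schema, t.Table, cols, holders)
	return sql, row
}

func main() {
	t := tableMap{Schema: "test", Table: "t1", Columns: []string{"id", "name"}}
	sql, args := buildReplace(t, []interface{}{1, "tidb"})
	fmt.Println(sql)  // REPLACE INTO test.t1 (id, name) VALUES (?, ?)
	fmt.Println(args) // [1 tidb]
}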

Another tricky issue is DDL. TiDB's DDL implementation is completely non-blocking, so depending on the TiDB lease it can take quite a while to finish, which makes DDL an expensive operation. Syncer uses the standard MySQL error returned by the DDL to decide whether the DDL needs to be executed again.


Of course, we have done a lot of other work in the replication process, including concurrent sync support, MySQL network reconnection, and DB/Table-based filtering rules. If you are interested, take a look at our open-source implementation in tidb-tools/syncer; we won't go into the details here.


If you are interested in the little Syncer project, feel free to discuss it with us on GitHub, and of course all kinds of PRs are even more welcome :)


Original link

TiDB Weekly [2016.11.21]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 106 views • 2016-11-21 19:20


Weekly update in TiDB


Last week, we landed 30 PRs in the TiDB repositories, 3 PRs in the TiDB docs repositories.


Added



Fixed



Improved



Document change


Add the following new guides:



Weekly update in TiKV


Last week, we landed 19 PRs in the TiKV repositories.


Fixed



Improved



Original link

A Deep Dive into TiKV

Article sharing · qiuyesuifeng published an article • 1 comment • 200 views • 2016-11-17 17:29


About TiKV


TiKV (The pronunciation is: /'taɪkeɪvi:/ tai-K-V, etymology: titanium) is a distributed Key-Value database which is based on the design of Google Spanner, F1, and HBase, but it is much simpler without dependency on any distributed file system.


Architecture





  • Placement Driver (PD): PD is the brain of the TiKV system which manages the metadata about Nodes, Stores, Regions mapping, and makes decisions for data placement and load balancing. PD periodically checks replication constraints to balance load and data automatically.




  • Node: A physical node in the cluster. Within each node, there are one or more Stores. Within each Store, there are many Regions.




  • Store: There is a RocksDB within each Store and it stores data in local disks.



  • Region: Region is the basic unit of Key-Value data movement and corresponds to a data range in a Store. Each Region is replicated to multiple Nodes. These multiple replicas form a Raft group. A replica of a Region is called a Peer.


Protocol


TiKV uses the Protocol Buffer protocol for interactions among different components. Because Rust doesn’t support gRPC for the time being, we use our own protocol in the following format:


Message: Header + Payload 

Header: | 0xdaf4(2 bytes magic value) | 0x01(version 2 bytes) | msg\_len(4 bytes) | msg\_id(8 bytes) |

The data of Protocol Buffer is stored in the Payload part of the message. At the Network level, we will first read the 16-byte Header. According to the message length (msg_len) information in the Header, we calculate the actual length of the message, and then read the corresponding data and decode it.
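For illustration, here is a hedged Go sketch that frames and unframes a message using the 16-byte header just described. The field layout follows the header format above; the big-endian byte order and the helper names are assumptions, not TiKV's actual codec.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	headerLen = 16
	magic     = 0xdaf4
	version   = 0x01
)

// msgHeader mirrors the 16-byte header described above:
// magic (2 bytes) | version (2 bytes) | msg_len (4 bytes) | msg_id (8 bytes).
type msgHeader struct {
	MsgLen uint32
	MsgID  uint64
}

// encodeHeader writes the header; big-endian is an assumption here.
func encodeHeader(h msgHeader) []byte {
	buf := make([]byte, headerLen)
	binary.BigEndian.PutUint16(buf[0:2], magic)
	binary.BigEndian.PutUint16(buf[2:4], version)
	binary.BigEndian.PutUint32(buf[4:8], h.MsgLen)
	binary.BigEndian.PutUint64(buf[8:16], h.MsgID)
	return buf
}

func decodeHeader(buf []byte) (msgHeader, error) {
	if len(buf) < headerLen || binary.BigEndian.Uint16(buf[0:2]) != magic {
		return msgHeader{}, fmt.Errorf("bad header")
	}
	return msgHeader{
		MsgLen: binary.BigEndian.Uint32(buf[4:8]),
		MsgID:  binary.BigEndian.Uint64(buf[8:16]),
	}, nil
}

func main() {
	payload := []byte("protobuf bytes go here")
	h := msgHeader{MsgLen: uint32(len(payload)), MsgID: 1}
	frame := append(encodeHeader(h), payload...)

	got, err := decodeHeader(frame)
	if err != nil {
		panic(err)
	}
	body := frame[headerLen : headerLen+int(got.MsgLen)]
	fmt.Println(bytes.Equal(body, payload)) // true
}
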


The interaction protocol of TiKV is in the kvproto project and the protocol to support push-down is in the tipb project. Here, let's focus on the kvproto project only.


About the protocol files in the kvproto project:



  • msgpb.proto: All the protocol interactions are in the same message structure. When a message is received, we will handle the message according to its MessageType.

  • metapb.proto: To define the public metadata for Store, Region, Peer, etc.

  • raftpb.proto: For the internal use of Raft. It is ported from etcd and needs to be consistent with etcd.

  • raft_serverpb.proto: For the interactions among the Raft nodes.

  • raft_cmdpb.proto: The actual command executed when Raft applies.

  • pdpb.proto: The protocol for the interaction between TiKV and PD.

  • kvrpcpb.proto: The Key-Value protocol that supports transactions.

  • mvccpb.proto: For internal Multi-Version Concurrency Control (MVCC).

  • coprocessor.proto: To support the Push-Down operations.


External applications can connect to TiKV in the following ways:



  • For the simple Key-Value features only, implement raft_cmdpb.proto.

  • For the Transactional Key-Value features, implement kvrpcpb.proto.

  • For the Push-Down features, implement coprocessor.proto. See tipb for detailed push-down protocol.


Raft


TiKV uses the Raft algorithm to ensure the data consistency in the distributed systems. For more information, see https://raft.github.io/.


The Raft in TiKV is completely migrated from etcd. We chose etcd Raft because it is very simple to implement, very easy to migrate and it is production proven.


The Raft implementation in TiKV can be used independently. You can apply it in your project directly.


See the following details about how to use Raft:



  1. Define your own storage and implement the Raft Storage trait. See the following Storage trait interface:


    // initial_state returns the information about HardState and ConfState in Storage
fn initial_state(&self) -> Result<RaftState>;

// return the log entries in the [low, high] range
fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>>;

// get the term of the log entry according to the corresponding log index
fn term(&self, idx: u64) -> Result<u64>;

// get the index from the first log entry at the current position
fn first_index(&self) -> Result<u64>;

// get the index from the last log entry at the current position
fn last_index(&self) -> Result<u64>;

// generate a current snapshot
fn snapshot(&self) -> Result<Snapshot>;



  2. Create a raw node object and pass the corresponding configuration and customized storage instance to it. In the configuration, pay attention to election_tick and heartbeat_tick: part of the Raft logic is driven by periodic ticks. On every Tick, the Leader checks whether the elapsed heartbeat ticks have reached heartbeat_tick; if so, it sends heartbeats to the Followers and resets the elapsed count. For a Follower, if the elapsed election ticks reach election_tick, it initiates an election.




  3. After a raw node is created, the tick interface of the raw node is called periodically (for example, every 100ms) to drive the internal Raft Step function.




  4. If data is to be written through Raft, call the Propose interface directly. The parameter of Propose is arbitrary binary data, which means that Raft does not care what the replicated data means; how to handle the data is completely up to the external logic.




  5. To process membership changes, call the propose_conf_change interface of the raw node to send a ConfChange object that adds or removes a certain node.




  6. After functions of the raw node such as Tick and Propose are called, Raft produces a Ready state. Here are some details of the Ready state:


    There are three parts in the Ready state:



    • The part that needs to be stored in Raft storage, which are entries, hard state and snapshot.

    • The part that needs to be sent to other Raft nodes, which are messages.

    • The part that needs to be applied to other state machines, which are committed_entries.




After handling the Ready status, the Advance function needs to be called to inform Raft to proceed to the next Ready.
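
Because the TiKV Raft implementation is ported from etcd, the driving loop looks very much like the etcd raft Go API. The following sketch uses that Go API (StartNode, Tick, Propose, Ready, Advance) to show the tick/propose/ready cycle described in the steps above; the import path and configuration values are illustrative and may differ across etcd versions, and this is not TiKV's Rust code.

package main

import (
	"context"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	// In-memory storage plays the role of the Storage trait shown above.
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10, // election_tick
		HeartbeatTick:   1,  // heartbeat_tick
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	node := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	ticker := time.NewTicker(100 * time.Millisecond) // the base Raft tick
	defer ticker.Stop()

	// Propose some data; Raft does not care what the bytes mean.
	go node.Propose(context.TODO(), []byte("put k1 v1"))

	for i := 0; i < 50; i++ {
		select {
		case <-ticker.C:
			node.Tick() // drives elections and heartbeats
		case rd := <-node.Ready():
			// 1. persist entries, hard state and snapshot
			_ = storage.Append(rd.Entries)
			// 2. rd.Messages would be sent to the other peers here
			// 3. apply rd.CommittedEntries to the state machine here
			_ = rd.CommittedEntries
			node.Advance() // tell Raft we are done with this Ready
		}
	}
	node.Stop()
}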


In TiKV, Raft is used through mio as in the following process:




  1. Register a base Raft tick timer (usually 100ms). Every time the timer timeouts, the Tick of the raw node is called and the timer is re-registered.




  2. Receive the external commands through the notify function in mio and call the Propose or the propose_conf_change interface.



  3. Decide if a Raft is ready in the mio tick callback (Note: The mio tick is called at the end of each event loop, which is different from the Raft tick.). If it is ready, proceed with the Ready process.


In the descriptions above, we covered how to use one Raft only. But in TiKV, we have multiple Raft groups. These Raft groups are independent to each other and therefore can be processed following the same approach.


In TiKV, each Raft group corresponds to a Region. At the very beginning, there is only one Region in TiKV which is in charge of the range (-inf, +inf). As more data comes in and the Region reaches its threshold (64 MB currently), the Region is split into two Regions. Because all the data in TiKV are sorted according to the key, it is very convenient to choose a Split Key to split the Region. See Split for the detailed splitting process.


Of course, where there is Split, there is Merge. If there is very little data in two adjacent Regions, they can be merged into one bigger Region. Region Merge is on the TiKV roadmap but is not implemented yet.


Placement Driver


Placement Driver (PD) is in charge of the managing and scheduling of the whole TiKV cluster. It is a central service and we have to ensure that it is highly available and stable.


The first issue to be resolved is the single point of failure of PD. Our solution is to start multiple PD servers. These servers elect a Leader through the election mechanism in etcd and the leader provides services to the outside. If the leader is down, there will be another election to elect a new leader to provide services.


The second issue is the consistency of the data stored in PD. If one PD is down, how do we ensure that the newly elected PD has consistent data? This is also resolved by putting the PD data in etcd. Because etcd is a distributed, consistent Key-Value store, it ensures the consistency of the data stored in it. When the new PD starts, it only needs to load the data from etcd.


At first, we used an independent external etcd service, but now we have embedded PD into etcd, which means PD itself is an etcd. Embedding makes deployment simpler because there is one less service, and it also makes it easier to customize PD and etcd together and therefore improve performance.


The current functions of PD are as follows:




  1. The Timestamp Oracle (TSO) service: to provide the globally unique timestamp for TiDB to implement distributed transactions.




  2. The generation of the globally unique ID: to enable TiKV to generate the unique IDs for new Regions and Stores.




  3. TiKV cluster auto-balance: In TiKV, the basic data movement unit is Region, so the PD auto-balance is to balance Region automatically. There are two ways to trigger the scheduling of a Region:


    1). The heartbeat triggering: Regions report their current state to PD periodically. If PD finds that a Region has too few or too many replicas, it tells this Region to initiate a membership change.


    2). The regular triggering: PD checks whether the whole system needs scheduling on a regular basis. If PD finds that there is not enough space on a certain Store, or that there are too many leader Regions on a certain Store and the load is too high, PD will select a Region from that Store and move its replica to another Store.




Transaction


The transaction model in TiKV is inspired by Google Percolator and Themis from Xiaomi with the following optimizations:




  1. For a system similar to Percolator, there needs to be a globally unique time service, called Timestamp Oracle (TSO), to allocate a monotonically increasing timestamp. In TiKV the TSO functions are provided by PD. Generating a TSO in PD is purely a memory operation, and the TSO information is persisted to etcd on a regular basis to ensure that TSO remains monotonically increasing even after PD restarts.




  2. Compared with Percolator where the information such as Lock is stored by adding extra column to a specific row, TiKV uses a column family (CF) in RocksDB to handle all the information related to Lock. For massive data, there aren’t many row Locks for simultaneous transactions. So the Lock processing speed can be improved significantly by placing it in an extra and optimized CF.



  3. Another advantage about using an extra CF is that we can easily clean up the remaining Locks. If the Lock of a row is acquired by a transaction but is not cleaned up because of crashed threads or other reasons, and there are no more following-up transactions to visit this Lock, the Lock is left behind. We can easily discover and clean up these Locks by scanning the CF.


The implementation of distributed transactions depends on the TSO service and on a client that encapsulates the corresponding transactional algorithm, which is implemented in TiDB. The monotonically increasing timestamp establishes the time order of concurrent transactions, and the external client acts as a coordinator to resolve conflicts and unexpected terminations of transactions.


Let’s see how a transaction is executed:




  1. The transaction starts. When the transaction starts, the client must obtain the current timestamp (startTS) from TSO. Because TSO guarantees the monotonic increasing of the timestamp, startTS can be used to identify the time series of the transaction.




  2. The transaction is in progress. During a transaction, all the read operations must carry startTS while they send RPC requests to TiKV and TiKV uses MVCC to make sure to return the data that is written before startTS. For the write operations, TiKV uses optimistic concurrency control which means the actual data is cached on the clients rather than written to the servers assuming that the current transaction doesn’t affect other transactions.




  3. The transaction commits. TiKV uses a 2-phase commit algorithm. Its difference from the common 2-phase commit is that there is no independent transaction manager. The commit state of a transaction is identified by the commit state of the PrimaryKey which is selected from one of the to-be-committed keys.


    1). During the Prewrite phase, the client submits the data to be written to multiple TiKV servers. When the data is stored on a server, the server sets the corresponding Key as Locked and records the PrimaryKey of the transaction. If there is any write conflict on any of the nodes, the transaction aborts and rolls back.


    2). When Prewrite finishes, a new timestamp is obtained from TSO and is set as commitTS.


    3). During the Commit phase, requests are sent to the TiKV servers with the PrimaryKey. TiKV handles the commit by cleaning up the Locks left from the Prewrite phase and writing the corresponding commit records with commitTS. Once the PrimaryKey commit finishes, the transaction is committed. The Locks that remain on other Keys can obtain the commit state and the corresponding commitTS by retrieving the state of the PrimaryKey. But to reduce the cost of cleaning up Locks afterwards, the usual practice is to commit all the Keys involved in the transaction asynchronously in the background.




Coprocessor


Similar to HBase, TiKV provides Coprocessor support. But for the time being, a Coprocessor cannot be dynamically loaded; it has to be statically compiled into the code.


Currently, the Coprocessor in TiKV is mainly used in two situations, Split and push-down, both to serve TiDB.




  1. For Split, before the Region split requests are truly proposed, the split key needs to be checked if it is legal. For example, for a Row in TiDB, there are many versions of it in TiKV, such as V1, V2, and V3, V3 being the latest version. Assuming that V2 is the selected split key, then the data of the Row might be split to two different Regions, which means the data in the Row cannot be handled atomically. Therefore, the Split Coprocessor will adjust the split key to V1. In this way, the data in this Row is still in the same Region during the splitting.



  2. For push-down, the Coprocessor is used to improve the performance of TiDB. For some operations like select count(*), there is no need for TiDB to get data from row to row first and then count. The quicker way is that TiDB pushes down these operations to the corresponding TiKV nodes, the TiKV nodes do the computing and then TiDB consolidates the final results.


Let’s take an example of select count(*) from t1 to show how a complete push-down process works:




  1. After TiDB parses the SQL statement, based on the range of the t1 table, TiDB finds out that all the data of t1 are in Region 1 and Region 2 on TiKV, so TiDB sends the push-down commands to Region 1 and Region 2.




  2. After Region 1 and Region 2 receive the push-down commands, they get a snapshot of their data separately by using the Raft process.




  3. Region 1 and Region 2 traverse their snapshots to get the corresponding data and calculate count(*).



  4. Each Region returns its count(*) result to TiDB, and TiDB consolidates and outputs the total result.
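
The push-down flow above can be sketched as a scatter/gather: each Region computes its partial count over its own snapshot, and TiDB sums the results. The region type and pushDownCount helper below are invented for illustration; they are not the actual TiDB/TiKV coprocessor API.

package main

import (
	"fmt"
	"sync"
)

// region is a toy stand-in for a TiKV Region that holds some rows of t1.
type region struct {
	name string
	rows []string
}

// pushDownCount models a Region-side coprocessor handling "select count(*)":
// it counts locally over its own snapshot instead of shipping rows back.
func (r region) pushDownCount() int64 {
	return int64(len(r.rows))
}

func main() {
	regions := []region{
		{name: "Region 1", rows: []string{"a", "b", "c"}},
		{name: "Region 2", rows: []string{"d", "e"}},
	}

	var wg sync.WaitGroup
	partial := make([]int64, len(regions))
	for i, r := range regions {
		wg.Add(1)
		go func(i int, r region) { // TiDB sends the push-down request to each Region
			defer wg.Done()
			partial[i] = r.pushDownCount()
		}(i, r)
	}
	wg.Wait()

	// TiDB consolidates the partial results into the final count.
	var total int64
	for _, c := range partial {
		total += c
	}
	fmt.Println("select count(*) from t1 =>", total) // 5
}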


Key processes analysis


Key-Value operation


When a request of Get or Put is sent to TiKV, how does TiKV process it?


As mentioned earlier, TiKV provides features such as simple Key-Value, transactional Key-Value and push-down. But no matter it’s transactional Key-Value or push-down, it will be transformed to simple Key-Value operations in TiKV. Therefore, let’s take an example of simple Key-Value operations to show how TiKV processes a request. As for how TiKV implements transaction Key-Value and push-down support, let’s cover that later.


Let’s take Put as an example to show how a complete Key-Value process works:




  1. The client sends a Put command to TiKV, such as put k1 v1. First, the client gets the Region ID for the k1 key and the leader of the Region peers from PD. Second, the client sends the Put request to the corresponding TiKV node.




  2. After the TiKV server receives the request, it notifies the internal RaftStore thread through the mio channel and takes a callback function with it.




  3. When the RaftStore thread receives the request, it first checks whether the request is valid, including whether the Region epoch in the request is legal. If the request is valid and the peer is the Leader of the Region, the RaftStore thread encodes the request into a binary array, calls Propose, and starts the Raft process.




  4. At the stage of handle ready, the newly generated entry will be first appended to the Raft log and sent to other followers at the same time.




  5. When the majority of the nodes of the Region have appended the entry to the log, the entry is committed. In the following Ready process, the entry can be obtained from the committed_entries, then decoded and the corresponding command can be executed. This is how the put k1 v1 command is executed in RocksDB.



  6. When the entry log is applied by the leader, the callback of the entry will be called and return the response to the client.


The same process also applies to Get, which means all the requests are not processed until they are replicated to the majority of the nodes by Raft. Of course, this is also to ensure the data linearizability in distributed systems.


Of course, we will optimize the reading requests for better performance in the following aspects:




  1. Introduce lease into the Leader. Within the lease, we can assume that the Leader is valid so that the Leader can provide the read service directly and there will be no need to go through Raft replicated log.



  2. The Follower provides the read service.


These optimizations are mentioned in the Raft paper and they have been supported by etcd. We will introduce them into TiKV as well in the future.


Membership Change


To ensure data safety, there are multiple replicas on different Stores, and each replica is a Peer of the others. If there aren't enough replicas for a certain Region, we add new replicas; on the contrary, if the number of replicas for a certain Region exceeds the threshold, we remove some replicas.


In TiKV, changes to the replicas of a Region are done through Raft Membership Change. But how and when a Region changes its membership is scheduled by PD. Let's take adding a replica as an example to show how the whole process works:




  1. A Region sends heartbeats to PD regularly. The heartbeats include the relative information about this Region, such as the information of the peers.




  2. When PD receives the heartbeats, it checks whether the number of replicas of this Region matches the setup. Assuming there are only two replicas in this Region but the setup requires three, PD will find an appropriate Store and return the ChangePeer command to the Region.



  3. After the Region receives the ChangePeer command, if it finds it necessary to add replica to another Store, it will submit a ChangePeer request through the Raft process. When the log is applied, the new peer information will be updated in the Region meta and then the Membership Change completes.


It should be noted that even if the Membership Change completes, it only means that the replica information has been added to the Region meta. Later, if the Leader finds that the new Follower has no data, it will send a snapshot to it.


It should also be noted that the Membership Change implementation in TiKV and etcd is different from what’s in the Raft paper. In the Raft paper, if a new peer is added, it is added to the Region meta at the Propose command. But to simplify, TiKV and etcd don’t add the peer information to the Region meta until the log is applied.


Split


At the very beginning, there is only one Region. As data grows, the Region needs to be split.


Within TiKV, if a Region splits, there will be two new Regions, which we call them the Left Region and the Right Region. The Left Region will use all the IDs of the old Region. We can assume that the Region just changes its range. The Right Region will get a new ID through PD. Here is a simple example:


Region 1 [a, c) -> Region 1 [a, b) + Region 2 [b, c)

The original range of Region 1 is [a, c). After splitting at the b point, the Left Region is still Region 1 but the range is now [a, b). The Right Region is a new Region, Region 2, and its range is [b, c).


Assuming the base size of Region 1 is 64MB, a complete split process is as follows:




  1. In a given period of time, if the accumulated size of the data in Region 1 exceeds the threshold (8MB for example), Region 1 notifies the split checker to check Region 1.




  2. The split checker scans Region 1 sequentially. When it finds that the accumulated size up to a certain key exceeds 64MB, it records this key and makes it the split key. Meanwhile, the split checker continues scanning, and if it finds that the accumulated size up to a certain key exceeds the threshold (96 MB for example), it considers that this Region should be split and notifies the RaftStore thread.




  3. When the RaftStore thread receives the message, it sends the AskSplit command to PD and requests PD to assign a new ID for the newly generated Region, Region 2 for example.




  4. When the ID is generated in PD, an Admin SplitRequest will be generated and sent to the RaftStore thread.




  5. Before RaftStore proposes the Admin SplitRequest, the Coprocessor will pre-process the command and decide if the split key is appropriate. If the split key is not appropriate, the Coprocessor will adjust the split key to an appropriate one.




  6. The Split request is submitted through the Raft process and then applied. For TiKV, splitting a Region means changing the range of the original Region and then creating another Region. All these changes involve only the Region meta; the real data under the hood is not moved, so Region splitting in TiKV is very fast.



  7. When the Splitting completes, TiKV sends the latest information about the Left Region and Right Region to PD.

TiDB Weekly [2016.11.14]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 108 views • 2016-11-16 22:53


Last week, we landed 25 PRs in the TiDB repositories, 5 PRs in the TiDB docs repositories.


Weekly update in TiDB


Added



Fixed



Improved



Document change



Weekly update in TiKV


Last week, we landed 23 PRs in the TiKV repositories.


Added




  • Resolve locks in batches to avoid generating a huge Raft log when a transaction rolls back.




  • Add applying snapshot count to enhance the Placement Driver (PD) balance, with PR 1278, 381.



  • Check the system configuration before startup.


Fixed



Improved



Original link

TiDB Weekly [2016.11.07]

Go open-source projects · qiuyesuifeng published an article • 2 comments • 139 views • 2016-11-07 19:17


Last week, we landed 42 PRs in the TiDB repositories and 29 PRs in the TiKV repositories.


New release


TiDB Beta 4


Weekly update in TiDB


Added



Fixed



Improved



Weekly update in TiKV


Added



Fixed



Improved



Original link

TiDB Weekly [2016.10.24]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 147 views • 2016-11-04 11:01


Last week, we landed 30 PRs in the TiDB repositories and 26 PRs in the TiKV repositories.


Notable changes to TiDB



Notable changes to TiKV



Notable changes to Placement Driver



Original link

TiDB Weekly [2016.10.31]

Go open-source projects · qiuyesuifeng published an article • 0 comments • 134 views • 2016-11-04 10:53


Last week, we landed 24 PRs in the TiDB repositories and 28 PRs in the TiKV repositories.


Notable changes to TiDB



Notable changes to TiKV



Notable changes to Placement Driver



  • Refactor store/region cache to make code clearer, including PR 353, 365, 366.

  • Support the GetPDMembers API.


New contributors



Original link

[Campus Recruiting][Internship] PingCAP is hiring front-end development engineers

Recruiting · qiuyesuifeng asked this question • 1 follower • 0 replies • 318 views • 2016-10-11 14:30

[Campus Recruiting][Internship] PingCAP is hiring back-end development engineers

Recruiting · qiuyesuifeng asked this question • 1 follower • 0 replies • 396 views • 2016-10-11 14:27

TiDB Weekly [2017.01.01]

Go开源项目qiuyesuifeng 发表了文章 • 0 个评论 • 88 次浏览 • 2017-01-07 13:13 • 来自相关话题

Weekly update in TiDB

Last week, we landed 查看全部

Weekly update in TiDB


Last week, we landed 28 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, We landed 19 PRs in the TiKV repositories.


Added




  • Move raw_get to thread pool.




  • Add ResourceKind for operator and don't resend duplicated AddPeer response when the peer is still pending, see #449.



  • Add member command for pd-ctl.


Fixed



Improved



原文链接

TiDB Weekly [2016.12.26]

Go开源项目qiuyesuifeng 发表了文章 • 0 个评论 • 50 次浏览 • 2017-01-07 13:12 • 来自相关话题

New Release

TiDB RC1 is released!

Weekly updat... 查看全部

New Release


TiDB RC1 is released!


Weekly update in TiDB


Last week, we landed 34 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, We landed 14 PRs in the TiKV repositories.


Added



Fixed



Improved



原文链接

TiDB Weekly [2016.12.19]

Go开源项目qiuyesuifeng 发表了文章 • 0 个评论 • 121 次浏览 • 2016-12-25 13:46 • 来自相关话题

Weekly update in TiDB

Last week, we landed 查看全部

Weekly update in TiDB


Last week, we landed 32 PRs in the TiDB repositories.


Added



Fixed



Improved



New contributor



Weekly update in TiKV


Last week, we landed 11 PRs in the TiKV repositories.


Added



Fixed



Improved



原文链接

TiDB Weekly [2016.12.12]

Go开源项目qiuyesuifeng 发表了文章 • 0 个评论 • 129 次浏览 • 2016-12-17 14:14 • 来自相关话题

Weekly update in TiDB

Last week, we landed 查看全部

Weekly update in TiDB


Last week, we landed 41 PRs in the TiDB repositories.


Added



Fixed



Improved



Weekly update in TiKV


Last week, we landed 34 PRs in the TiKV repositories.


Added



Fixed



Improved



原文链接

TiDB Weekly [2016.12.05]

文章分享qiuyesuifeng 发表了文章 • 0 个评论 • 123 次浏览 • 2016-12-07 11:42 • 来自相关话题

Weekly update in TiDB

Last week, we landed 查看全部

Weekly update in TiDB


Last week, we landed 48 PRs in the TiDB repositories, 6 PRs in the TiDB docs repositories.


Added



Fixed



Improved



Document change


The following guides are updated:



Weekly update in TiKV


Last week, we landed 22 PRs in the TiKV repositories.


Added



Fixed



Improved




  • Replace the score type with resource kind to calculate the scores more easily.




  • Replace origin concept balance with schedule and simplify configurations.



  • Use coordinator to control the speed of different schedulers.


原文链接

TiDB Weekly [2016.11.28]

文章分享qiuyesuifeng 发表了文章 • 0 个评论 • 101 次浏览 • 2016-11-30 12:12 • 来自相关话题

Weekly update in TiDB

Last week, we landed 查看全部

Weekly update in TiDB


Last week, we landed 44 PRs in the TiDB repositories, 3 PRs in the TiDB docs repositories.


Added



Fixed



+ Add sequence number in binlog to preserve the original mutation order.



Improved



Document change


Add the following new guides:



Weekly update in TiKV


Last week, we landed 20 PRs in the TiKV repositories.


Added



Fixed



Improved



原文链接

Percolator 和 TiDB 事务算法

文章分享qiuyesuifeng 发表了文章 • 1 个评论 • 124 次浏览 • 2016-11-24 11:24 • 来自相关话题

本文先概括的讲一下 Google Percolator 的大致流程。Percolator 是 Google 的上一代分布式事务解决方案,构建在 BigTable 之上,在 Google 内部 用于网页索引更新的业务,原始的论文查看全部

本文先概括的讲一下 Google Percolator 的大致流程。Percolator 是 Google 的上一代分布式事务解决方案,构建在 BigTable 之上,在 Google 内部 用于网页索引更新的业务,原始的论文在此。原理比较简单,总体来说就是一个经过优化的二阶段提交的实现,进行了一个二级锁的优化。TiDB 的事务模型沿用了 Percolator 的事务模型。
总体的流程如下:


读写事务


1) 事务提交前,在客户端 buffer 所有的 update/delete 操作。
2) Prewrite 阶段:


首先在所有行的写操作中选出一个作为 primary,其他的为 secondaries。


PrewritePrimary: 对 primaryRow 写入 L 列(上锁),L 列中记录本次事务的开始时间戳。写入 L 列前会检查:



  1. 是否已经有别的客户端已经上锁 (Locking)。

  2. 是否在本次事务开始时间之后,检查 W 列,是否有更新 [startTs, +Inf) 的写操作已经提交 (Conflict)。


在这两种种情况下会返回事务冲突。否则,就成功上锁。将行的内容写入 row 中,时间戳设置为 startTs。


将 primaryRow 的锁上好了以后,进行 secondaries 的 prewrite 流程:



  1. 类似 primaryRow 的上锁流程,只不过锁的内容为事务开始时间及 primaryRow 的 Lock 的信息。

  2. 检查的事项同 primaryRow 的一致。


当锁成功写入后,写入 row,时间戳设置为 startTs。


3) 以上 Prewrite 流程任何一步发生错误,都会进行回滚:删除 Lock,删除版本为 startTs 的数据。


4) 当 Prewrite 完成以后,进入 Commit 阶段,当前时间戳为 commitTs,且 commitTs> startTs :



  1. commit primary:写入 W 列新数据,时间戳为 commitTs,内容为 startTs,表明数据的最新版本是 startTs 对应的数据。

  2. 删除L列。


如果 primary row 提交失败的话,全事务回滚,回滚逻辑同 prewrite。如果 commit primary 成功,则可以异步的 commit secondaries, 流程和 commit primary 一致, 失败了也无所谓。


事务中的读操作



  1. 检查该行是否有 L 列,时间戳为 [0, startTs],如果有,表示目前有其他事务正占用此行,如果这个锁已经超时则尝试清除,否则等待超时或者其他事务主动解锁。注意此时不能直接返回老版本的数据,否则会发生幻读的问题。

  2. 读取至 startTs 时该行最新的数据,方法是:读取 W 列,时间戳为 [0, startTs], 获取这一列的值,转化成时间戳 t, 然后读取此列于 t 版本的数据内容。


由于锁是分两级的,primary 和 seconary,只要 primary 的行锁去掉,就表示该事务已经成功 提交,这样的好处是 secondary 的 commit 是可以异步进行的,只是在异步提交进行的过程中 ,如果此时有读请求,可能会需要做一下锁的清理工作。


原文链接

TiKV 的 MVCC(Multi-Version Concurrency Control)机制

文章分享qiuyesuifeng 发表了文章 • 0 个评论 • 121 次浏览 • 2016-11-24 11:20 • 来自相关话题

并发控制简介

事务隔离在数据库系统中有着非常重要的作用,因为对于用户来说数据库必须提供这样一个“假象”:当前只有这么一个用户连接到了数据库中,这样可以减轻应用层的开发难度。但是,对于数据库系统来说,因为同一时间可能会存在很多用户连接,那... 查看全部

并发控制简介


事务隔离在数据库系统中有着非常重要的作用,因为对于用户来说数据库必须提供这样一个“假象”:当前只有这么一个用户连接到了数据库中,这样可以减轻应用层的开发难度。但是,对于数据库系统来说,因为同一时间可能会存在很多用户连接,那么许多并发问题,比如数据竞争(data race),就必须解决。在这样的背景下,数据库管理系统(简称 DBMS)就必须保证并发操作产生的结果是安全的,通过可串行化(serializability)来保证。


虽然 Serilizability 是一个非常棒的概念,但是很难能够有效的实现。一个经典的方法就是使用一种两段锁(2PL)。通过 2PL,DBMS 可以维护读写锁来保证可能产生冲突的事务按照一个良好的次序(well-defined) 执行,这样就可以保证 Serializability。但是,这种通过锁的方式也有一些缺点:



  1. 读锁和写锁会相互阻滞(block)。

  2. 大部分事务都是只读(read-only)的,所以从事务序列(transaction-ordering)的角度来看是无害的。如果使用基于锁的隔离机制,而且如果有一段很长的读事务的话,在这段时间内这个对象就无法被改写,后面的事务就会被阻塞直到这个事务完成。这种机制对于并发性能来说影响很大。


多版本并发控制(Multi-Version Concurrency Control, 以下简称 MVCC)以一种优雅的方式来解决这个问题。在 MVCC 中,每当想要更改或者删除某个数据对象时,DBMS 不会在原地去删除或这修改这个已有的数据对象本身,而是创建一个该数据对象的新的版本,这样的话同时并发的读取操作仍旧可以读取老版本的数据,而写操作就可以同时进行。这个模式的好处在于,可以让读取操作不再阻塞,事实上根本就不需要锁。这是一种非常诱人的特型,以至于在很多主流的数据库中都采用了 MVCC 的实现,比如说 PostgreSQL,Oracle,Microsoft SQL Server 等。


TiKV 中的 MVCC


让我们深入到 TiKV 中的 MVCC,了解 MVCC 在 TiKV 中是如何实现的。


Timestamp Oracle(TSO)


因为TiKV 是一个分布式的储存系统,它需要一个全球性的授时服务,下文都称作 TSO(Timestamp Oracle),来分配一个单调递增的时间戳。 这样的功能在 TiKV 中是由 PD 提供的,在 Google 的 Spanner 中是由多个原子钟和 GPS 来提供的。


Storage


从源码结构上来看,想要深入理解 TiKV 中的 MVCC 部分,src/storage 是一个非常好的入手点。 Storage 是实际上接受外部命令的结构体。


pub struct Storage {
engine: Box<Engine>,
sendch: SendCh<Msg>,
handle: Arc<Mutex<StorageHandle>>,
}

impl Storage {
pub fn start(&mut self, config: &Config) -> Result<()> {
let mut handle = self.handle.lock().unwrap();
if handle.handle.is_some() {
return Err(box_err!("scheduler is already running"));
}

let engine = self.engine.clone();
let builder = thread::Builder::new().name(thd_name!("storage-scheduler"));
let mut el = handle.event_loop.take().unwrap();
let sched_concurrency = config.sched_concurrency;
let sched_worker_pool_size = config.sched_worker_pool_size;
let sched_too_busy_threshold = config.sched_too_busy_threshold;
let ch = self.sendch.clone();
let h = try!(builder.spawn(move || {
let mut sched = Scheduler::new(engine,
ch,
sched_concurrency,
sched_worker_pool_size,
sched_too_busy_threshold);
if let Err(e) = el.run(&mut sched) {
panic!("scheduler run err:{:?}", e);
}
info!("scheduler stopped");
}));
handle.handle = Some(h);

Ok(())
}
}

start 这个函数很好的解释了一个 storage 是怎么跑起来的。


Engine


首先是 EngineEngine 是一个描述了在储存系统中接入的的实际上的数据库的接口,raftkvEnginerocksdb 分别实现了这个接口。


StorageHandle


StorageHanle 是处理从sench 接受到指令,通过 mio 来处理 IO。


接下来在Storage中实现了async_getasync_batch_get等异步函数,这些函数中将对应的指令送到通道中,然后被调度器(scheduler)接收到并异步执行。


Ok,了解完Storage 结构体是如何实现的之后,我们终于可以接触到在Scheduler 被调用的 MVCC 层了。


当 storage 接收到从客户端来的指令后会将其传送到调度器中。然后调度器执行相应的过程或者调用相应的异步函数。在调度器中有两种操作类型,读和写。读操作在 MvccReader 中实现,这一部分很容易理解,暂且不表。写操作的部分是MVCC的核心。


MVCC


Ok,两段提交(2-Phase Commit,2PC)是在 MVCC 中实现的,整个 TiKV 事务模型的核心。在一段事务中,由两个阶段组成。


Prewrite

选择一个 row 作为 primary row, 余下的作为 secondary row。
对primary row 上锁. 在上锁之前,会检查是否有其他同步的锁已经上到了这个 row 上 或者是是否经有在 startTS 之后的提交操作。这两种情况都会导致冲突,一旦都冲突发生,就会回滚(rollback)
对于 secondary row 重复以上操作。


Commit

RollbackPrewrite 过程中出现冲突的话就会被调用。


Garbage Collector

很容易发现,如果没有垃圾收集器(Gabage Collector) 来移除无效的版本的话,数据库中就会存有越来越多的 MVCC 版本。但是我们又不能仅仅移除某个 safe point 之前的所有版本。因为对于某个 key 来说,有可能只存在一个版本,那么这个版本就必须被保存下来。在TiKV中,如果在 safe point 前存在Put 或者Delete,那么说明之后所有的 writes 都是可以被移除的,不然的话只有DeleteRollbackLock 会被删除。


TiKV-Ctl for MVCC


在开发和 debug 的过程中,我们发现查询 MVCC 的版本信息是一件非常频繁并且重要的操作。因此我们开发了新的工具来查询 MVCC 信息。TiKV 将 Key-Value,Locks 和Writes 分别储存在CF_DEFAULTCF_LOCKCF_WRITE中。它们以这样的格式进行编码

























default lock write
key z{encoded_key}{start_ts(desc)} z{encoded_key} z{encoded_key}{commit_ts(desc)}
value {value} {flag}{primary_key}{start_ts(varint)} {flag}{start_ts(varint)}

Details can be found here.


因为所有的 MVCC 信息在 Rocksdb 中都是储存在 CF Key-Value 中,所以想要查询一个 Key 的版本信息,我们只需要将这些信息以不同的方式编码,随后在对应的 CF 中查询即可。CF Key-Values 的表示形式


原文链接

解析 TiDB 在线数据同步工具 Syncer

文章分享qiuyesuifeng 发表了文章 • 3 个评论 • 192 次浏览 • 2016-11-24 11:02 • 来自相关话题

TiDB 是一个完全分布式的关系型数据库,从诞生的第一天起,我们就想让它来兼容 MySQL 语法,希望让原有的 MySQL 用户 (不管是单机的 MySQL,还是多机的 MySQL Sharding) 都可以在基本不修改代码的情况下,除了可以保留原有的 ... 查看全部

TiDB 是一个完全分布式的关系型数据库,从诞生的第一天起,我们就想让它来兼容 MySQL 语法,希望让原有的 MySQL 用户 (不管是单机的 MySQL,还是多机的 MySQL Sharding) 都可以在基本不修改代码的情况下,除了可以保留原有的 SQL 和 ACID 事务之外,还可以享受到分布式带来的高并发,高吞吐和 MPP 的高性能。


对于用户来说,简单易用是他们试用的最基本要求,得益于社区和 PingCAP 小伙伴们的努力,我们提供基于 Binary 和 基于 Kubernetes 的两种不同的一键部署方案来让用户可以在几分钟就可以部署起来一个分布式的 TiDB 集群,从而快速地进行体验。
当然,对于用户来说,最好的体验方式就是从原有的 MySQL 数据库同步一份数据镜像到 TiDB 来进行对于对比测试,不仅简单直观,而且也足够有说服力。实际上,我们已经提供了一整套的工具来辅助用户在线做数据同步,具体的可以参考我们之前的一篇文章:TiDB 作为 MySQL Slave 实现实时数据同步, 这里就不再展开了。后来有很多社区的朋友特别想了解其中关键的 Syncer 组件的技术实现细节,于是就有了这篇文章。


首先我们看下 Syncer 的整体架构图, 对于 Syncer 的作用和定位有一个直观的印象。


syncer


从整体的架构可以看到,Syncer 主要是通过把自己注册为一个 MySQL Slave 的方式,和 MySQL Master 进行通信,然后不断读取 MySQL Binlog,进行 Binlog Event 解析,规则过滤和数据同步。从工程的复杂度上来看,相对来说还是非常简单的,相对麻烦的地方主要是 Binlog Event 解析和各种异常处理,也是容易掉坑的地方。


为了完整地解释 Syncer 的在线同步实现,我们需要有一些额外的内容需要了解。


MySQL Replication


我们先看看 MySQL 原生的 Replication 复制方案,其实原理上也很简单:


1)MySQL Master 将数据变化记录到 Binlog (Binary Log),
2) MySQL Slave 的 I/O Thread 将 MySQL Master 的 Binlog 同步到本地保存为 Relay Log
3)MySQL Slave 的 SQL Thread 读取本地的 Relay Log,将数据变化同步到自身


mysql-replication


MySQL Binlog


MySQL 的 Binlog 分为几种不同的类型,我们先来大概了解下,也看看具体的优缺点。


1)Row
MySQL Master 将详细记录表的每一行数据变化的明细记录到 Binlog。
优点:完整地记录了行数据的变化信息,完全不依赖于存储过程,函数和触发器等等,不会出现因为一些依赖上下文信息而导致的主从数据不一致的问题。

缺点:所有的增删改查操作都会完整地记录在 Binlog 中,会消耗更大的存储空间。


2)Statement
MySQL Master 将每一条修改数据的 SQL 都会记录到 Binlog。

优点:相比 Row 模式,Statement 模式不需要记录每行数据变化,所以节省存储量和 IO,提高性能。

缺点:一些依赖于上下文信息的功能,比如 auto increment id,user define function, on update current_timestamp/now 等可能导致的数据不一致问题。


3)Mixed

MySQL Master 相当于 Row 和 Statement 模式的融合。

优点:根据 SQL 语句,自动选择 Row 和 Statement 模式,在数据一致性,性能和存储空间方面可以做到很好的平衡。

缺点:两种不同的模式混合在一起,解析处理起来会相对比较麻烦。


MySQL Binlog Event


了解了 MySQL Replication 和 MySQL Binlog 模式之后,终于进入到了最复杂的 MySQL Binlog Event 协议解析阶段了。


在解析 MySQL Binlog Eevent 之前,我们首先看下 MySQL Slave 在协议上是怎么和 MySQL Master 进行交互的。


Binlog dump


首先,我们需要伪造一个 Slave,向 MySQL Master 注册,这样 Master 才会发送 Binlog Event。注册很简单,就是向 Master 发送 COM_REGISTER_SLAVE 命令,带上 Slave 相关信息。这里需要注意,因为在 MySQL 的 replication topology 中,都需要使用一个唯一的 server id 来区别标示不同的 Server 实例,所以这里我们伪造的 slave 也需要一个唯一的 server id。


Binlog Event


对于一个 Binlog Event 来说,它分为三个部分,header,post-header 以及 payload。

MySQL 的 Binlog Event 有很多版本,我们只关心 v4 版本的,也就是从 MySQL 5.1.x 之后支持的版本,太老的版本应该基本上没什么人用了。


Binlog Event 的 header 格式如下:


4 bytes timestamp
1 bytes event type
4 bytes server-id
4 bytes event-size
4 bytes log pos
2 bytes flags

header 的长度固定为 19,event type 用来标识这个 event 的类型,event size 则是该 event 包括 header 的整体长度,而 log pos 则是下一个 event 所在的位置。


这个 header 对于所有的 event 都是通用的,接下来我们看看具体的 event。


FORMAT_DESCRIPTION_EVENT


在 v4 版本的 Binlog 文件中,第一个 event 就是 FORMAT_DESCRIPTION_EVENT,格式为:


2 bytes         binlog-version
string[50] mysql-server version
4 bytes create timestamp
1 byte event header length
string[p] event type header lengths

我们需要关注的就是 event type header length 这个字段,它保存了不同 event 的 post-header 长度,通常我们都不需要关注这个值,但是在解析后面非常重要的ROWS_EVENT 的时候,就需要它来判断 TableID 的长度了, 这个后续在说明。


ROTATE_EVENT


而 Binlog 文件的结尾,通常(只要 Master 不当机)就是 ROTATE_EVENT,格式如下:


Post-header
8 bytes position

Payload
string[p] name of the next binlog

它里面其实就是标明下一个 event 所在的 binlog filename 和 position。这里需要注意,当 Slave 发送 Binlog dump 之后,Master 首先会发送一个 ROTATE_EVENT,用来告知 Slave下一个 event 所在位置,然后才跟着 FORMAT_DESCRIPTION_EVENT


其实我们可以看到,Binlog Event 的格式很简单,文档都有着详细的说明。通常来说,我们仅仅需要关注几种特定类型的 event,所以只需要写出这几种 event 的解析代码就可以了,剩下的完全可以跳过。


TABLE_MAP_EVENT


上面我们提到 Syncer 使用 Row 模式的 Binlog,关于增删改的操作,对应于最核心的ROWS_EVENT ,它记录了每一行数据的变化情况。而如何解析相关的数据,是非常复杂的。在详细说明 ROWS_EVENT 之前,我们先来看看 TABLE_MAP_EVENT,该 event 记录的是某个 table 一些相关信息,格式如下:


post-header:
if post_header_len == 6 {
4 bytes table id
} else {
6 bytes table id
}
2 bytes flags

payload:
1 byte schema name length
string schema name
1 byte [00]
1 byte table name length
string table name
1 byte [00]
lenenc-int column-count
string.var_len[length=$column-count] column-def
lenenc-str column-meta-def
n bytes NULL-bitmask, length: (column-count + 8) / 7

table id 需要根据 post_header_len 来判断字节长度,而 post_header_len 就是存放到 FORMAT_DESCRIPTION_EVENT 里面的。这里需要注意,虽然我们可以用 table id 来代表一个特定的 table,但是因为 Alter Table 或者 Rotate Binlog Event 等原因,Master 会改变某个 table 的 table id,所以我们在外部不能使用这个 table id 来索引某个 table。


TABLE_MAP_EVENT 最需要关注的就是里面的 column meta 信息,后续我们解析 ROWS_EVENT 的时候会根据这个来处理不同数据类型的数据。column def 则定义了每个列的类型。


ROWS_EVENT


ROWS_EVENT 包含了 insert,update 以及 delete 三种 event,并且有 v0,v1 以及 v2 三个版本。

ROWS_EVENT 的格式很复杂,如下:


header:
if post_header_len == 6 {
4 table id
} else {
6 table id
}
2 flags
if version == 2 {
2 extra-data-length
string.var_len extra-data
}

body:
lenenc_int number of columns
string.var_len columns-present-bitmap1, length: (num of columns+7)/8
if UPDATE_ROWS_EVENTv1 or v2 {
string.var_len columns-present-bitmap2, length: (num of columns+7)/8
}

rows:
string.var_len nul-bitmap, length (bits set in 'columns-present-bitmap1'+7)/8
string.var_len value of each field as defined in table-map

if UPDATE_ROWS_EVENTv1 or v2 {
string.var_len nul-bitmap, length (bits set in 'columns-present-bitmap2'+7)/8
string.var_len value of each field as defined in table-map
}
... repeat rows until event-end

The table id in ROWS_EVENT is handled the same way as in TABLE_MAP_EVENT. Although the table id may change over time, the table ids of a ROWS_EVENT and its corresponding TABLE_MAP_EVENT are guaranteed to be consistent, so we use it to find the corresponding TABLE_MAP_EVENT.

To save space, ROWS_EVENT uses bitmaps to describe the state of the columns.


First we need the columns present bitmap, which describes whether each column is present. If a column's bit is not set, that is, the bit is 0, the ROWS_EVENT contains no data for this column and we simply treat the value as null.


Next comes the null bitmap, which indicates which columns in the actual row data are null. The trickiest part is that the length of the null bitmap is not computed as (num of columns + 7) / 8, the usual way MySQL computes a bitmap length, but from the number of bits set in the columns present bitmap, which is a big pitfall. Why is it designed this way? Probably because MySQL 5.6 added the minimal and noblob Binlog Row Image formats. With minimal in particular, an update only records the changed columns. For example, if a row has 16 columns, 2 bytes are enough for the null bitmap; but if only the first column is updated, 1 byte is enough, because the remaining bits would all be 0 and there is no need to store them. "Bits set" is simply the number of 1s in the binary representation of a byte: the bits set of 1 is 1, of 3 is 2, and of 255 is 8.
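
The bits-set counting and the resulting null bitmap length only take a few lines of Go; the helper names below are ours:

package binlog

import "math/bits"

// bitsSet counts the 1 bits in a bitmap.
func bitsSet(bitmap []byte) int {
    n := 0
    for _, b := range bitmap {
        n += bits.OnesCount8(b)
    }
    return n
}

// nullBitmapLen is derived from the number of bits set in the columns-present
// bitmap, not from the column count.
func nullBitmapLen(presentBitmap []byte) int {
    return (bitsSet(presentBitmap) + 7) / 8
}

// isBitSet reports whether bit i (counting from 0) is set in the bitmap.
func isBitSet(bitmap []byte, i int) bool {
    return bitmap[i/8]&(1<<uint(i%8)) != 0
}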


With the present bitmap and the null bitmap in hand, we can parse the column data of a row. For each column, we first check the present bitmap: if the bit is 0, we skip it and use null. Then we check the null bitmap: if the bit is 1, the value is null. Finally, we parse the columns that actually carry data.


But since what we get is the binary stream of a row, how do we know how to parse each column? This is where the column def and meta from TABLE_MAP_EVENT come in.
The column def defines the data type of the column. For some types, such as MYSQL_TYPE_LONG and MYSQL_TYPE_TINY, the length is fixed, so we can read the corresponding number of bytes directly to get the value. For other types it is not so simple, and we need the meta to help with the calculation.


For example, for MYSQL_TYPE_BLOB, a meta of 1 means a tiny blob whose first byte is the blob length, a meta of 2 means a short blob whose first two bytes are the blob length, and so on; for MYSQL_TYPE_VARCHAR, the meta stores the string length. There are also the most complex types such as MYSQL_TYPE_NEWDECIMAL and MYSQL_TYPE_TIME2. Parsing the different column types is fairly involved and deserves a chapter of its own, so we won't expand on it here; see the official documentation for details.
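
To make the idea concrete, here is a Go sketch of decoding a single column value from the column def and meta. It only covers a few easy types; the numeric type codes are the usual MySQL protocol values, and the function name is ours:

package binlog

import "encoding/binary"

// A small subset of the MySQL column type codes.
const (
    mysqlTypeTiny    = 1
    mysqlTypeLong    = 3
    mysqlTypeVarchar = 15
    mysqlTypeBlob    = 252
)

// decodeColumn decodes one column value according to the column def (type code)
// and meta from TABLE_MAP_EVENT, returning the value and the bytes consumed.
func decodeColumn(data []byte, def byte, meta uint16) (interface{}, int) {
    switch def {
    case mysqlTypeTiny:
        return int8(data[0]), 1
    case mysqlTypeLong:
        return int32(binary.LittleEndian.Uint32(data[0:4])), 4
    case mysqlTypeVarchar:
        // meta is the max string length; it decides whether the length prefix is 1 or 2 bytes
        if meta < 256 {
            l := int(data[0])
            return string(data[1 : 1+l]), 1 + l
        }
        l := int(binary.LittleEndian.Uint16(data[0:2]))
        return string(data[2 : 2+l]), 2 + l
    case mysqlTypeBlob:
        // meta is the width of the length prefix: 1 for tiny blob, 2 for short blob, ...
        switch meta {
        case 1:
            l := int(data[0])
            return data[1 : 1+l], 1 + l
        case 2:
            l := int(binary.LittleEndian.Uint16(data[0:2]))
            return data[2 : 2+l], 2 + l
        }
    }
    return nil, 0 // complex types such as NEWDECIMAL and TIME2 are omitted in this sketch
}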


Once all of this is handled, we can finally parse a complete ROWS_EVENT :)


XID_EVENT

When a transaction is committed, both Statement and Row mode Binlogs append an XID_EVENT at the end to mark the end of the transaction; it contains the transaction ID.


QUERY_EVENT


QUERY_EVENT mainly records the SQL statement that was actually executed; all MySQL DDL operations are recorded in this event.


Syncer


With MySQL Replication and MySQL Binlog Events covered, Syncer becomes much easier to understand; its basic architecture and features were introduced above. In Syncer, we parse and replicate the MySQL Binlog with go-mysql, written by our chief architect Tang Liu, as the core library. This library is open source on GitHub and has already been used in production, for example at bilibili, so it is safe and reliable. We will skip its internals here; if you are interested, take a look at the code on GitHub. Below we focus on a few core questions:


Choosing the MySQL Binlog mode


The first design consideration for Syncer is reliability: even if Syncer exits abnormally, it can simply be restarted without affecting the consistency of the online data. To achieve this, we must make the data replication re-entrant.
In Mixed mode, an insert operation is recorded in the Binlog as an insert SQL statement. If Syncer exits abnormally before the Savepoint has been updated, it will re-apply that insert SQL after restarting and run into primary key conflicts. We could of course rewrite the SQL and turn insert into replace, but that brings in SQL parsing and rewriting, which is rather troublesome. Besides, the latest MySQL 5.7 already uses Row mode as the default Binlog format. So in the Syncer implementation we naturally chose Row mode as the Binlog mode for data replication.


Choosing the Savepoint


For Syncer itself, we mostly want it to be as simple and efficient as possible, so every time Syncer restarts, it should resume from the Binlog position of the last synchronization, similar to resuming an interrupted download. How to choose the Savepoint is therefore something we need to think about.

For a DML operation (take an insert SQL statement as an example), the basic Binlog Events look roughly like this:


TABLE_MAP_EVENT
QUERY_EVENT → begin
WRITE_ROWS_EVENT
XID_EVENT

As we can see from the MySQL Binlog Events, every event carries the Binlog position at which the next event starts, so we only need to save this position. Note, however, that TABLE_MAP_EVENT must not be used as a savepoint, because a WRITE_ROWS_EVENT basically cannot be decoded without its TABLE_MAP_EVENT. This is also why many people complain that the MySQL Binlog protocol is not flexible: neither TABLE_MAP_EVENT nor WRITE_ROWS_EVENT carries any schema information, so that information has to be kept somewhere else, for example on the MySQL Slave. In other words, the MySQL Binlog is not self-describing.
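
A minimal sketch of this savepoint rule in Go (the type codes are the usual MySQL protocol values, everything else is a hypothetical name): persist the position after an XID_EVENT, ROTATE_EVENT or QUERY_EVENT, but never right after a TABLE_MAP_EVENT.

package syncer

// MySQL Binlog event type codes used below.
const (
    queryEvent    = 2
    rotateEvent   = 4
    xidEvent      = 16
    tableMapEvent = 19
)

// savePoint is a hypothetical local savepoint: the binlog file and position to
// resume from after a restart.
type savePoint struct {
    Name string
    Pos  uint32
}

// shouldSave reports whether it is safe to persist the position after this event.
// Saving right after a TABLE_MAP_EVENT would make the following ROWS_EVENT
// undecodable on restart.
func shouldSave(eventType byte) bool {
    switch eventType {
    case tableMapEvent:
        return false
    case xidEvent, rotateEvent, queryEvent:
        // end of a transaction, binlog rotation, or a DDL statement
        return true
    default:
        return false
    }
}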


DDL operations are simpler: a DDL statement is itself a QUERY_EVENT.


So, out of performance and safety considerations, Syncer saves the position periodically and whenever a DDL is encountered. You may have noticed that the Savepoint is currently stored locally, which is a single point of failure to some extent; fixing this is still on our TODO list.


Resumable data synchronization


We have already raised this problem above. With Row mode MySQL Binlog, it is relatively easy to solve. For example, for a transaction that inserts 3 rows, the events look roughly like this:


TABLE_MAP_EVENT
QUERY_EVENT → begin
WRITE_ROWS_EVENT
WRITE_ROWS_EVENT
WRITE_ROWS_EVENT
XID_EVENT

So what Syncer does is straightforward: it combines each WRITE_ROWS_EVENT with its TABLE_MAP_EVENT to generate a replace into SQL statement. Why not use insert here? Mainly because replace into is re-entrant: executing it multiple times does not break data consistency.
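
A rough Go sketch of this step (the tableInfo type and buildReplaceSQL are hypothetical names for illustration, not Syncer's actual code):

package syncer

import (
    "fmt"
    "strings"
)

// tableInfo is built from TABLE_MAP_EVENT: which schema and table the rows belong to.
type tableInfo struct {
    Schema string
    Name   string
}

// buildReplaceSQL turns one decoded row from a WRITE_ROWS_EVENT into a
// "replace into" statement with placeholders, so re-executing it is harmless.
func buildReplaceSQL(t tableInfo, row []interface{}) (string, []interface{}) {
    placeholders := strings.TrimSuffix(strings.Repeat("?,", len(row)), ",")
    sql := fmt.Sprintf("REPLACE INTO `%s`.`%s` VALUES (%s)", t.Schema, t.Name, placeholders)
    return sql, row
}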

Another tricky problem is DDL. TiDB's DDL implementation is completely non-blocking, so depending on the TiDB lease, a DDL statement can take quite a long time, which makes DDL an expensive operation. In Syncer, we use the standard MySQL error returned by the DDL statement to decide whether the DDL needs to be executed again.


Of course, we have also done a lot of other work for the replication process, including concurrent sync support, MySQL network reconnection, DB/Table based rule customization and so on. If you are interested, take a look at our open source implementation in tidb-tools/syncer; we won't expand on it here.


If you are interested in the little Syncer project, feel free to discuss it with us on GitHub, and of course all kinds of PRs are more than welcome :)


Original article link

TiDB Weekly [2016.11.21]



Weekly update in TiDB


Last week, we landed 30 PRs in the TiDB repositories, 3 PRs in the TiDB docs repositories.


Added



Fixed



Improved



Document change


Add the following new guides:



Weekly update in TiKV


Last week, we landed 19 PRs in the TiKV repositories.


Fixed



Improved



Original article link

A Deep Dive into TiKV



About TiKV


TiKV (The pronunciation is: /'taɪkeɪvi:/ tai-K-V, etymology: titanium) is a distributed Key-Value database which is based on the design of Google Spanner, F1, and HBase, but it is much simpler without dependency on any distributed file system.


Architecture





  • Placement Driver (PD): PD is the brain of the TiKV system which manages the metadata about Nodes, Stores, Regions mapping, and makes decisions for data placement and load balancing. PD periodically checks replication constraints to balance load and data automatically.




  • Node: A physical node in the cluster. Within each node, there are one or more Stores. Within each Store, there are many Regions.




  • Store: There is a RocksDB within each Store and it stores data in local disks.



  • Region: Region is the basic unit of Key-Value data movement and corresponds to a data range in a Store. Each Region is replicated to multiple Nodes. These multiple replicas form a Raft group. A replica of a Region is called a Peer.


Protocol


TiKV uses the Protocol Buffer protocol for interactions among different components. Because Rust doesn’t support gRPC for the time being, we use our own protocol in the following format:


Message: Header + Payload 

Header: | 0xdaf4(2 bytes magic value) | 0x01(version 2 bytes) | msg_len(4 bytes) | msg_id(8 bytes) |

The data of Protocol Buffer is stored in the Payload part of the message. At the Network level, we will first read the 16-byte Header. According to the message length (msg_len) information in the Header, we calculate the actual length of the message, and then read the corresponding data and decode it.
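
As an illustration only, reading one framed message might look like the Go sketch below. It assumes big-endian (network order) integers and that msg_len counts only the payload; the package and function names are ours:

package tikvclient

import (
    "encoding/binary"
    "errors"
    "io"
)

const (
    msgMagic   uint16 = 0xdaf4
    msgVersion uint16 = 0x01
    headerLen         = 16
)

// readMessage reads a 16-byte header followed by a Protocol Buffer payload.
func readMessage(r io.Reader) (msgID uint64, payload []byte, err error) {
    header := make([]byte, headerLen)
    if _, err = io.ReadFull(r, header); err != nil {
        return 0, nil, err
    }
    if binary.BigEndian.Uint16(header[0:2]) != msgMagic {
        return 0, nil, errors.New("invalid magic value")
    }
    // header[2:4] is the version, header[4:8] the message length, header[8:16] the message id
    msgLen := binary.BigEndian.Uint32(header[4:8])
    msgID = binary.BigEndian.Uint64(header[8:16])
    payload = make([]byte, msgLen)
    if _, err = io.ReadFull(r, payload); err != nil {
        return 0, nil, err
    }
    return msgID, payload, nil
}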


The interaction protocol of TiKV is in the kvproto project and the protocol to support push-down is in the tipb project. Here, let’s focus on the kvproto project only.


About the protocol files in the kvproto project:



  • msgpb.proto: All the protocol interactions are in the same message structure. When a message is received, we will handle the message according to its MessageType.

  • metapb.proto: To define the public metadata for Store, Region, Peer, etc.

  • raftpb.proto: For the internal use of Raft. It is ported from etcd and needs to be consistent with etcd.

  • raft_serverpb.proto: For the interactions among the Raft nodes.

  • raft_cmdpb.proto: The actual command executed when Raft applies.

  • pdpb.proto: The protocol for the interaction between TiKV and PD.

  • kvrpcpb.proto: The Key-Value protocol that supports transactions.

  • mvccpb.proto: For internal Multi-Version Concurrency Control (MVCC).

  • coprocessor.proto: To support the Push-Down operations.


There are following ways for external applications to connect to TiKV:



  • For the simple Key-Value features only, implement raft_cmdpb.proto.

  • For the Transactional Key-Value features, implement kvrpcpb.proto.

  • For the Push-Down features, implement coprocessor.proto. See tipb for detailed push-down protocol.


Raft


TiKV uses the Raft algorithm to ensure the data consistency in the distributed systems. For more information, see https://raft.github.io/.


The Raft in TiKV is completely migrated from etcd. We chose etcd Raft because it is very simple to implement, very easy to migrate and it is production proven.


The Raft implementation in TiKV can be used independently. You can apply it in your project directly.


See the following details about how to use Raft:



  1. Define your own storage and implement the Raft Storage trait. See the following Storage trait interface:


// initial_state returns the information about HardState and ConfState in Storage
fn initial_state(&self) -> Result<RaftState>;

// return the log entries in the [low, high] range
fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>>;

// get the term of the log entry according to the corresponding log index
fn term(&self, idx: u64) -> Result<u64>;

// get the index from the first log entry at the current position
fn first_index(&self) -> Result<u64>;

// get the index from the last log entry at the current position
fn last_index(&self) -> Result<u64>;

// generate a current snapshot
fn snapshot(&self) -> Result<Snapshot>;



  1. Create a raw node object and pass the corresponding configuration and the customized storage instance to it. In the configuration, we need to pay attention to election_tick and heartbeat_tick. Part of the Raft logic is driven by periodic ticks. On every tick, the Leader checks whether the elapsed heartbeat ticks have exceeded heartbeat_tick; if so, it sends heartbeats to the Followers and resets the counter. For a Follower, if the elapsed election ticks exceed election_tick, it initiates an election.




  2. After a raw node is created, the tick interface of the raw node will be called periodically (for example, every 100ms) to drive the internal Raft Step function.




  3. If data is to be written by Raft, the Propose interface is called directly. The parameter of the Propose interface is an arbitrary piece of binary data, which means that Raft doesn’t care about the exact content it replicates; it is completely up to the external logic to decide how to handle the data.




  4. If it is to process the membership changes, the propose_conf_change interface of the raw node can be called to send a ConfChange object to add/remove a certain node.




  5. After functions of the raw node like Tick and Propose are called, Raft will enter a Ready state. Here are some details of the Ready state:


    There are three parts in the Ready state:



    • The part that needs to be stored in Raft storage, which are entries, hard state and snapshot.

    • The part that needs to be sent to other Raft nodes, which are messages.

    • The part that needs to be applied to other state machines, which are committed_entries.




After handling the Ready state, the Advance function needs to be called to inform Raft of the next Ready process.


In TiKV, Raft is used through mio as in the following process:




  1. Register a base Raft tick timer (usually 100ms). Every time the timer times out, the Tick of the raw node is called and the timer is re-registered.




  2. Receive the external commands through the notify function in mio and call the Propose or the propose_conf_change interface.



  3. Decide whether a Raft node is ready in the mio tick callback (Note: The mio tick is called at the end of each event loop, which is different from the Raft tick.). If it is ready, proceed with the Ready process.
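
Because the Raft implementation in TiKV is ported from etcd, the shape of this driving loop is the same as with etcd's Go raft package. The sketch below shows that shape in Go (the import path depends on the etcd version, and persisting, sending and applying are left as comments); it is an illustration, not TiKV's actual Rust/mio code:

package raftloop

import (
    "context"
    "time"

    "go.etcd.io/etcd/raft/v3"
)

// run drives one raft.Node: a periodic tick, proposals coming in from a channel,
// and a Ready/Advance cycle, mirroring the three steps described above.
func run(n raft.Node, proposals <-chan []byte, done <-chan struct{}) {
    ticker := time.NewTicker(100 * time.Millisecond) // the base Raft tick
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            n.Tick()
        case data := <-proposals:
            // the payload is opaque to Raft; it is only decoded when applied
            _ = n.Propose(context.TODO(), data)
        case rd := <-n.Ready():
            // persist rd.HardState, rd.Entries and rd.Snapshot, then send rd.Messages (omitted)
            for range rd.CommittedEntries {
                // decode each committed entry and apply it to the state machine
            }
            n.Advance()
        case <-done:
            return
        }
    }
}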


In the descriptions above, we covered how to use one Raft group only. But in TiKV, we have multiple Raft groups. These Raft groups are independent of each other and therefore can be processed following the same approach.


In TiKV, each Raft group corresponds to a Region. At the very beginning, there is only one Region in TiKV which is in charge of the range (-inf, +inf). As more data comes in and the Region reaches its threshold (64 MB currently), the Region is split into two Regions. Because all the data in TiKV are sorted according to the key, it is very convenient to choose a Split Key to split the Region. See Split for the detailed splitting process.


Of course, where there is Split, there is Merge. If two adjacent Regions both contain very little data, they can be merged into one big Region. Region Merge is in the TiKV roadmap but it is not implemented yet.


Placement Driver


Placement Driver (PD) is in charge of the managing and scheduling of the whole TiKV cluster. It is a central service and we have to ensure that it is highly available and stable.


The first issue to be resolved is the single point of failure of PD. Our solution is to start multiple PD servers. These servers elect a Leader through the election mechanism in etcd and the leader provides services to the outside. If the leader is down, there will be another election to elect a new leader to provide services.


The second issue is the consistency of the data stored in PD. If one PD is down, how to ensure that the new elected PD has the consistent data? This is also resolved by putting PD data in etcd. Because etcd is a distributed consistent Key-Value store, it helps us ensure the consistency of the data stored in it. When the new PD is started, it only needs to load data from etcd.


At first, we used an independent external etcd service, but now we have embedded PD in etcd, which means PD itself is an etcd. The embedding makes deployment simpler because there is one less service to run. It also makes it more convenient to customize PD and etcd and therefore improve performance.


The current functions of PD are as follows:




  1. The Timestamp Oracle (TSO) service: to provide the globally unique timestamp for TiDB to implement distributed transactions.




  2. The generation of the globally unique ID: to enable TiKV to generate the unique IDs for new Regions and Stores.




  3. TiKV cluster auto-balance: In TiKV, the basic data movement unit is Region, so the PD auto-balance is to balance Region automatically. There are two ways to trigger the scheduling of a Region:


    1). The heartbeat triggering: Regions report the current state to PD periodically. If PD finds that there are not enough or too many replicas in one Region, PD informs this Region to initiate a membership change.


    2). The regular triggering: PD checks if the whole system needs scheduling on a regular basis. If PD finds out that there is not enough space on a certain Store, or that there are too many leader Regions on a certain Store and the load is too high, PD will select a Region from the Store and move its replicas to another Store.




Transaction


The transaction model in TiKV is inspired by Google Percolator and Themis from Xiaomi with the following optimizations:




  1. For a system that is similar to Percolator, there needs to be a globally unique time service, called Timestamp Oracle (TSO), to allocate a monotonically increasing timestamp. The TSO functions are provided by PD in TiKV. TSO generation in PD is a pure in-memory operation, and PD saves the TSO information to etcd on a regular basis to ensure that TSO is still monotonically increasing even after PD restarts.




  2. Compared with Percolator where the information such as Lock is stored by adding extra column to a specific row, TiKV uses a column family (CF) in RocksDB to handle all the information related to Lock. For massive data, there aren’t many row Locks for simultaneous transactions. So the Lock processing speed can be improved significantly by placing it in an extra and optimized CF.



  3. Another advantage about using an extra CF is that we can easily clean up the remaining Locks. If the Lock of a row is acquired by a transaction but is not cleaned up because of crashed threads or other reasons, and there are no more following-up transactions to visit this Lock, the Lock is left behind. We can easily discover and clean up these Locks by scanning the CF.


The implementation of the distributed transaction depends on the TSO service and the client that encapsulates corresponding transactional algorithm which is implemented in TiDB. The monotonic increasing timestamp can set the time series for concurrent transactions and the external clients can act as a coordinator to resolve the conflicts and unexpected terminations of the transactions.


Let’s see how a transaction is executed:




  1. The transaction starts. When the transaction starts, the client must obtain the current timestamp (startTS) from TSO. Because TSO guarantees the monotonic increasing of the timestamp, startTS can be used to identify the time series of the transaction.




  2. The transaction is in progress. During a transaction, all the read operations must carry startTS while they send RPC requests to TiKV and TiKV uses MVCC to make sure to return the data that is written before startTS. For the write operations, TiKV uses optimistic concurrency control which means the actual data is cached on the clients rather than written to the servers assuming that the current transaction doesn’t affect other transactions.




  3. The transaction commits. TiKV uses a 2-phase commit algorithm. Its difference from the common 2-phase commit is that there is no independent transaction manager. The commit state of a transaction is identified by the commit state of the PrimaryKey which is selected from one of the to-be-committed keys.


    1). During the Prewrite phase, the client submits the data that is to be written to multiple TiKV servers. When the data is stored in a server, the server sets the corresponding Key as Locked and records the PrimaryKey of the transaction. If there is any writing conflict on any of the nodes, the transaction aborts and rolls back.


    2). When Prewrite finishes, a new timestamp is obtained from TSO and is set as commitTS.


    3). During the Commit phase, requests are sent to the TiKV servers with PrimaryKey. The process of how TiKV handles commit is to clean up the Locks from the Prewrite phase and write corresponding commit records with commitTS. When the PrimaryKey commit finishes, the transaction is committed. The Locks that remain on other Keys can get the commit state and the corresponding commitTS by retrieving the state of the PrimaryKey. But in order to reduce the cost of cleaning up Locks afterwards, in practice all the Keys that are involved in the transaction are committed asynchronously in the background.
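
Putting the three steps together, the client-side commit path can be sketched roughly as follows. The kvClient interface, the tso type and twoPhaseCommit are hypothetical names for illustration, not the actual TiDB client API, and rollback handling is omitted:

package txn

import "context"

// kvClient is a hypothetical interface over a transactional Key-Value store.
type kvClient interface {
    Prewrite(ctx context.Context, keys [][]byte, primary []byte, startTS uint64) error
    Commit(ctx context.Context, keys [][]byte, startTS, commitTS uint64) error
}

// tso is a hypothetical function that fetches a timestamp from PD.
type tso func(ctx context.Context) (uint64, error)

func twoPhaseCommit(ctx context.Context, kv kvClient, getTS tso, primary []byte, secondaries [][]byte, startTS uint64) error {
    keys := append([][]byte{primary}, secondaries...)
    // Phase 1: prewrite every key, locking it and recording the primary key.
    if err := kv.Prewrite(ctx, keys, primary, startTS); err != nil {
        return err // a write conflict aborts and rolls back the transaction
    }
    // Obtain commitTS only after Prewrite succeeds.
    commitTS, err := getTS(ctx)
    if err != nil {
        return err
    }
    // Phase 2: commit the primary key; once it is committed, the transaction is committed.
    if err := kv.Commit(ctx, [][]byte{primary}, startTS, commitTS); err != nil {
        return err
    }
    // Commit the remaining keys asynchronously to clean up their locks.
    go func() {
        _ = kv.Commit(context.Background(), secondaries, startTS, commitTS)
    }()
    return nil
}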




Coprocessor


Similar to HBase, TiKV provides Coprocessor support. But for the time being, the Coprocessor cannot be loaded dynamically; it has to be statically compiled into the code.


Currently, the Coprocessor in TiKV is mainly used in two situations, Split and push-down, both to serve TiDB.




  1. For Split, before the Region split requests are truly proposed, the split key needs to be checked if it is legal. For example, for a Row in TiDB, there are many versions of it in TiKV, such as V1, V2, and V3, V3 being the latest version. Assuming that V2 is the selected split key, then the data of the Row might be split to two different Regions, which means the data in the Row cannot be handled atomically. Therefore, the Split Coprocessor will adjust the split key to V1. In this way, the data in this Row is still in the same Region during the splitting.



  2. For push-down, the Coprocessor is used to improve the performance of TiDB. For some operations like select count(*), there is no need for TiDB to get data from row to row first and then count. The quicker way is that TiDB pushes down these operations to the corresponding TiKV nodes, the TiKV nodes do the computing and then TiDB consolidates the final results.


Let’s take an example of select count(*) from t1 to show how a complete push-down process works:




  1. After TiDB parses the SQL statement, based on the range of the t1 table, TiDB finds out that all the data of t1 are in Region 1 and Region 2 on TiKV, so TiDB sends the push-down commands to Region 1 and Region 2.




  2. After Region 1 and Region 2 receive the push-down commands, they get a snapshot of their data separately by using the Raft process.




  3. Region 1 and Region 2 traverse their snapshots to get the corresponding data and calculate count().



  4. Each Region returns the result of count() to TiDB and TiDB consolidates and outputs the total result.


Key processes analysis


Key-Value operation


When a request of Get or Put is sent to TiKV, how does TiKV process it?


As mentioned earlier, TiKV provides features such as simple Key-Value, transactional Key-Value and push-down. But whether it is transactional Key-Value or push-down, it is transformed into simple Key-Value operations inside TiKV. Therefore, let's take simple Key-Value operations as an example to show how TiKV processes a request. As for how TiKV implements transactional Key-Value and push-down support, we will cover that later.


Let’s take Put as an example to show how a complete Key-Value process works:




  1. The client sends a Put command to TiKV, such as put k1 v1. First, the client gets the Region ID for the k1 key and the leader of the Region peers from PD. Second, the client sends the Put request to the corresponding TiKV node.




  2. After the TiKV server receives the request, it notifies the internal RaftStore thread through the mio channel and takes a callback function with it.




  3. When the RaftStore thread receives the request, it first checks whether the request is legal, including whether it carries a legal epoch. If the request is legal and the peer is the Leader of the Region, the RaftStore thread encodes the request into a binary array, calls Propose and begins the Raft process.




  4. At the stage of handle ready, the newly generated entry will be first appended to the Raft log and sent to other followers at the same time.




  5. When the majority of the nodes of the Region have appended the entry to the log, the entry is committed. In the following Ready process, the entry can be obtained from the committed_entries, then decoded and the corresponding command can be executed. This is how the put k1 v1 command is executed in RocksDB.



  6. When the entry log is applied by the leader, the callback of the entry will be called and the response is returned to the client.


The same process also applies to Get, which means all the requests are not processed until they are replicated to the majority of the nodes by Raft. Of course, this is also to ensure the data linearizability in distributed systems.


Of course, we will optimize the reading requests for better performance in the following aspects:




  1. Introduce lease into the Leader. Within the lease, we can assume that the Leader is valid so that the Leader can provide the read service directly and there will be no need to go through Raft replicated log.



  2. The Follower provides the read service.


These optimizations are mentioned in the Raft paper and they have been supported by etcd. We will introduce them into TiKV as well in the future.


Membership Change


To ensure the data safety, there are multiple replicas on different stores. Each replica is another replica’s Peer. If there aren’t enough replicas for a certain Region, we will add new replicas; on the contrary, if the number of replicas for a certain Region exceeds the threshold, we will remove some replicas.


In TiKV, changes to the Region replicas are completed by the Raft Membership Change. But how and when a Region changes its membership is scheduled by PD. Let’s take adding a replica as an example to show how the whole process works:




  1. A Region sends heartbeats to PD regularly. The heartbeats include the relative information about this Region, such as the information of the peers.




  2. When PD receives the heartbeats, it will check whether the number of replicas of this Region is consistent with the setup. Assuming there are only two replicas in this Region but the setup requires three replicas, PD will find an appropriate Store and return the ChangePeer command to the Region.



  3. After the Region receives the ChangePeer command, if it finds it necessary to add a replica on another Store, it will submit a ChangePeer request through the Raft process. When the log is applied, the new peer information is updated in the Region meta and then the Membership Change completes.


It should be noted that even if the Membership Change completes, it only means that the replica information has been added to the Region meta. Later, if the Leader finds that there is no data in the new Follower, it will send a snapshot to it.


It should also be noted that the Membership Change implementation in TiKV and etcd is different from what’s in the Raft paper. In the Raft paper, if a new peer is added, it is added to the Region meta at the Propose command. But to simplify, TiKV and etcd don’t add the peer information to the Region meta until the log is applied.


Split


At the very beginning, there is only one Region. As data grows, the Region needs to be split.


Within TiKV, if a Region splits, there will be two new Regions, which we call the Left Region and the Right Region. The Left Region will use all the IDs of the old Region; we can think of it as the old Region with only its range changed. The Right Region will get a new ID from PD. Here is a simple example:


Region 1 [a, c) -> Region 1 [a, b) + Region 2 [b, c)

The original range of Region 1 is [a, c). After splitting at the b point, the Left Region is still Region 1 but the range is now [a, b). The Right Region is a new Region, Region 2, and its range is [b, c).


Assuming the base size of Region 1 is 64MB, a complete split process is as follows:




  1. In a given period of time, if the accumulated size of the data in Region 1 exceeds the threshold (8MB for example), Region 1 notifies the split checker to check Region 1.




  2. The split checker scans Region 1 sequentially. When it finds that the accumulated size up to a certain key exceeds 64MB, it records this key as the split key. Meanwhile, the split checker continues scanning, and if it finds that the accumulated size up to a certain key exceeds the threshold (96MB for example), it decides that this Region should be split and notifies the RaftStore thread (a sketch of this accumulation logic follows at the end of this section).




  3. When the RaftStore thread receives the message, it sends the AskSplit command to PD and requests PD to assign a new ID for the newly generated Region, Region 2 in this example.




  4. When the ID is generated in PD, an Admin SplitRequest will be generated and sent to the RaftStore thread.




  5. Before RaftStore proposes the Admin SplitRequest, the Coprocessor will pre-process the command and decide if the split key is appropriate. If the split key is not appropriate, the Coprocessor will adjust the split key to an appropriate one.




  6. The Split request is submitted through the Raft process and then applied. For TiKV, splitting a Region only changes the range of the original Region and creates another Region. All these changes involve only the Region meta; the real data under the hood is not moved, so a Region split in TiKV is very fast.



  7. When the Splitting completes, TiKV sends the latest information about the Left Region and Right Region to PD.
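
The accumulation logic of the split checker in step 2 can be sketched as follows, using the example thresholds above (64MB for choosing the split key, 96MB for deciding to split); all names are hypothetical:

package splitcheck

// Example thresholds from the text above.
const (
    splitKeySize   = 64 << 20 // record the split key once the accumulated size passes 64MB
    splitWaterline = 96 << 20 // decide to split once the accumulated size passes 96MB
)

// kvSize is the size of one key/value pair, produced by scanning the Region in key order.
type kvSize struct {
    Key  []byte
    Size uint64
}

// checkSplit scans the sizes in key order and returns the chosen split key and
// whether the Region should be split.
func checkSplit(kvs []kvSize) (splitKey []byte, needSplit bool) {
    var total uint64
    for _, kv := range kvs {
        total += kv.Size
        if splitKey == nil && total >= splitKeySize {
            splitKey = kv.Key
        }
        if total >= splitWaterline {
            return splitKey, true
        }
    }
    return nil, false
}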

TiDB Weekly [2016.11.14]



Last week, we landed 25 PRs in the TiDB repositories, 5 PRs in the TiDB docs repositories.


Weekly update in TiDB


Added



Fixed



Improved



Document change



Weekly update in TiKV


Last week, we landed 23 PRs in the TiKV repositories.


Added




  • Resolve locks in batches to avoid generating a huge Raft log when a transaction rolls back.




  • Add applying snapshot count to enhance the Placement Driver (PD) balance, with PR 1278, 381.



  • Check the system configuration before startup.


Fixed



Improved



Original article link

TiDB Weekly [2016.11.07]



Last week, we landed 42 PRs in the TiDB repositories and 29 PRs in the TiKV repositories.


New release


TiDB Beta 4


Weekly update in TiDB


Added



Fixed



Improved



Weekly update in TiKV


Added



Fixed



Improved



Original article link

TiDB Weekly [2016.10.24]



Last week, we landed 30 PRs in the TiDB repositories and 26 PRs in the TiKV repositories.


Notable changes to TiDB



Notable changes to TiKV



Notable changes to Placement Driver



Original article link

TiDB Weekly [2016.10.31]



Last week, we landed 24 PRs in the TiDB repositories and 28 PRs in the TiKV repositories.


Notable changes to TiDB



Notable changes to TiKV



Notable changes to Placement Driver



  • Refactor store/region cache to make code clearer, including PR 353, 365, 366.

  • Support the GetPDMembers API.


New contributors



Original article link