[Original] Big Data Basics: ZooKeeper (3) Election Algorithm

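Any discussion of the ZooKeeper election algorithm has to start with Paxos, because the election algorithm can be viewed as a variant of Paxos.
The problem Paxos solves is this: a distributed network has many participants, none of them reliable (any of them may drop offline at any moment), so how do these participants reach agreement on some proposal?
Similar problems are common in everyday life. Suppose a team of 10 people is planning a team-building trip and everyone has a preferred destination: how does the team agree on where to go?
The simplest approach is to call the whole team into a meeting room, where majority rule quickly settles on a destination that most people are happy with.
Now change the problem: the 10 team members are travelling for work in 10 different parts of the world, keep different hours, and can only communicate by email. How do they settle on a destination now?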
1 Paxos Algorithm
Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
1.1 Roles
Paxos describes the actions of the processors by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.
Client
The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
Acceptor (Voters)
The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
Proposer
A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
Learner
Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
Leader
Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.
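There are many clients, and each client has many ideas, but in the end only one idea from one client is accepted by everyone. To get an idea accepted, a client first hands it to a proposer; the proposer then submits it to multiple acceptors for a vote, and if more than half of the acceptors approve, the idea is accepted. Learners are notified promptly of the acceptors' decision.
Because voting actually happens concurrently, the process is split into several phases and a version number is introduced, which is somewhat similar to optimistic locking.
A vivid illustration: in some corrupt country the government puts a project out to tender, and many companies (clients) want to win the project (idea). The award is decided by a group of senior officials inside the government (acceptors), but they stay out of sight, so a company has to go through well-connected brokers (proposers) to deliver bribes (versions) to the officials. After receiving a bribe, each official declares that he will no longer accept any bribe that is not higher than this one, and backs the company behind the current bribe. A company that manages to bribe a majority of the officials in the group wins the project.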
1.2 Basic Paxos
Phase 1a: Prepare
Phase 1b: Promise
Phase 2a: Accept Request
Phase 2b: Accepted
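First, the legislators are divided into the roles of proposers, acceptors, and learners (one legislator may hold several roles).

Proposers put forward proposals; a proposal consists of a proposal number and a proposed value. An acceptor that receives a proposal may accept it, and a proposal accepted by a majority of acceptors is said to be chosen. Learners can only learn proposals that have been chosen.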

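With the roles defined, the problem can be stated more precisely:
1. A value can only be chosen after it has been proposed by a proposer (a value that has not yet been chosen is called a "proposal");
2. In a single instance of the Paxos algorithm, only one value is chosen;
3. Learners can only learn a value that has been chosen.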

Lamport, the author of Paxos, derives the algorithm by progressively strengthening these three constraints (mainly the second one).

P1: An acceptor must accept the first proposal it receives.

P1a: An acceptor accepts a proposal numbered n if and only if it has not responded to a prepare request numbered greater than n.
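
P1a maps fairly directly onto the state an acceptor has to keep. The following is a minimal sketch of a single-decree acceptor, not code from ZooKeeper or any real Paxos library; the class name `SketchAcceptor` and all of its methods are hypothetical names chosen for illustration.

```java
// Hypothetical single-decree Paxos acceptor illustrating P1 and P1a.
class SketchAcceptor {
    private long promisedN = -1;   // highest prepare number answered so far (-1 = none)
    private long acceptedN = -1;   // number of the last accepted proposal (-1 = none)
    private String acceptedValue;  // value of the last accepted proposal

    // Phase 1b (Promise): answer a prepare request numbered n.
    // Granting it means proposals numbered below n will be rejected from now on.
    boolean prepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return true;
        }
        return false;
    }

    // Phase 2b (Accepted): accept (n, value) only if no prepare request
    // numbered greater than n has been answered, which is exactly P1a.
    boolean accept(long n, String value) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }

    long lastAcceptedNumber() { return acceptedN; }

    String lastAcceptedValue() { return acceptedValue; }
}
```

A fresh acceptor has answered no prepare request, so it accepts the first proposal it sees (P1); once it has promised number 5, a proposal numbered 3 is rejected while one numbered 5 is still accepted.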

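P2: Once a proposal with value v has been chosen, every proposal chosen afterwards must have value v.

P2a: Once a proposal with value v has been chosen, every proposal accepted afterwards by any acceptor must have value v.

P2b: Once a proposal with value v has been chosen, every proposal issued afterwards by any proposer must have value v.

P2c: If a proposal numbered n has value v, then there exists a majority of acceptors such that either none of them has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n that they have accepted.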
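
P2c is what tells a proposer which value it is allowed to propose. A minimal sketch of that rule follows, assuming the proposer has already collected promises from one majority of acceptors; `SketchProposer`, `chooseValue`, and the array-based representation of the promises are hypothetical and only serve the illustration.

```java
// Hypothetical helper showing how a proposer can pick a value that satisfies P2c.
class SketchProposer {
    // acceptedNumbers[i] is the number of the proposal acceptor i last accepted
    // (-1 if it has accepted nothing); acceptedValues[i] is the matching value.
    // Both arrays describe the promises received from one majority of acceptors.
    static String chooseValue(long[] acceptedNumbers, String[] acceptedValues, String ownValue) {
        long bestN = -1;
        String chosen = ownValue; // free choice if nobody in the majority accepted anything
        for (int i = 0; i < acceptedNumbers.length; i++) {
            if (acceptedNumbers[i] > bestN) {
                // Adopt the value of the highest-numbered accepted proposal seen so far.
                bestN = acceptedNumbers[i];
                chosen = acceptedValues[i];
            }
        }
        return chosen;
    }
}
```

If no acceptor in the majority has accepted anything, the proposer keeps its own value; otherwise it must re-propose the value attached to the highest-numbered accepted proposal.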

2 Zookeeper Leader Election
Every ZooKeeper server acts as client + proposer + acceptor + learner.
Vote (proposal number and proposed value): myid, zxid, epoch
Initial value:
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
Votes received from other servers:
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
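ZooKeeper's election algorithm makes many simplifications: it merges the Paxos value and version into a single Vote so that a new leader can be elected in the shortest possible time while avoiding data loss and unnecessary data synchronization. Consider the following cases:
First cluster start: every server has the same zxid and epoch, so once a majority (more than half) of the servers are up, the one with the largest myid becomes leader. For example, with 5 servers whose ids are 1/2/3/4/5: if you start them in the order 1/2/3/4/5, server 3 becomes leader as soon as it starts, and 4 and 5 join as followers; if you start them in the order 5/4/3/2/1, server 5 becomes leader as soon as 3 starts.
Cluster restart: once a majority of the servers are up, the one with the largest epoch becomes leader (the leader increments the epoch after each successful election); if the epochs are equal, the one with the largest zxid (that is, the most recent data) becomes leader.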
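
The two cases above can be reproduced with a small simulation. This is a deliberately simplified sketch, not ZooKeeper code: it assumes every running server hears every other server's vote, it ignores the network and timing, and it only models the moment a majority first forms (servers that join later simply follow the existing leader). The names `SketchElectionRound`, `outranks`, and `elect`, and the `{myid, zxid, epoch}` array layout, are hypothetical.

```java
// Hypothetical simulation of one election among the servers that are currently running.
class SketchElectionRound {
    // True when vote a = {leaderId, zxid, epoch} outranks vote b:
    // compare epoch first, then zxid, then server id.
    static boolean outranks(long[] a, long[] b) {
        if (a[2] != b[2]) return a[2] > b[2];
        if (a[1] != b[1]) return a[1] > b[1];
        return a[0] > b[0];
    }

    // servers[i] = {myid, zxid, epoch} for each running server.
    // Every server first votes for itself, then adopts any better vote it hears about.
    // Returns the elected myid, or -1 if the running servers are not a majority.
    static long elect(long[][] servers, int ensembleSize) {
        if (servers.length <= ensembleSize / 2) return -1;
        long[][] votes = new long[servers.length][];
        for (int i = 0; i < servers.length; i++) votes[i] = servers[i].clone();
        boolean changed = true;
        while (changed) { // repeat until no server changes its vote
            changed = false;
            for (int i = 0; i < votes.length; i++) {
                for (int j = 0; j < votes.length; j++) {
                    if (outranks(votes[j], votes[i])) {
                        votes[i] = votes[j].clone();
                        changed = true;
                    }
                }
            }
        }
        return votes[0][0]; // all votes agree at this point
    }
}
```

With 5 servers, starting only 1 and 2 yields no leader; starting 1, 2, 3 elects 3; starting 5, 4, 3 elects 5. After a restart in which servers 1 and 3 share the highest epoch, the one with the larger zxid wins.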
Core election classes and call flow:
org.apache.zookeeper.server.quorum.Election
org.apache.zookeeper.server.quorum.FastLeaderElection implements Election
    lookForLeader
    sendNotifications
    totalOrderPredicate
    termPredicate
org.apache.zookeeper.server.quorum.flexible.QuorumMaj
    containsQuorum
QuorumPeer.setPeerState
Majority check:
public boolean containsQuorum(HashSet<Long> set){
    return (set.size() > half);
}
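
To show how this check is used, here is a minimal sketch that tallies received votes and applies the same "more than half" test. It is not the ZooKeeper source: `SketchQuorum` and `hasMajorityFor` are hypothetical names, the received votes are reduced to a map from server id to the leader id that server votes for, and `half` is assumed to be the ensemble size divided by two using integer division.

```java
import java.util.HashSet;
import java.util.Map;

// Simplified sketch of a majority tally; all names here are hypothetical.
class SketchQuorum {
    private final int half; // assumed to be floor(n / 2) for n voting servers

    SketchQuorum(int ensembleSize) {
        this.half = ensembleSize / 2;
    }

    // Same test as the snippet above: strictly more than half of the ensemble.
    boolean containsQuorum(HashSet<Long> set) {
        return set.size() > half;
    }

    // recvset maps server id -> id of the leader that server currently votes for.
    // Returns true once a majority of the ensemble backs proposedLeader.
    boolean hasMajorityFor(Map<Long, Long> recvset, long proposedLeader) {
        HashSet<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, Long> e : recvset.entrySet()) {
            if (e.getValue() == proposedLeader) {
                supporters.add(e.getKey());
            }
        }
        return containsQuorum(supporters);
    }
}
```

For an ensemble of 5, `half` is 2, so two matching votes are not enough and the third one completes the quorum.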
Vote comparison:
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }
    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
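
The three cases in the comment amount to a lexicographic ordering on (epoch, zxid, server id). As a standalone illustration, the same ordering can be written with a `java.util.Comparator` chain. This sketch leaves out the weight check and the logging, and `SketchVoteOrder`, `newVoteWins`, and the `{id, zxid, epoch}` array layout are hypothetical names and conventions, not part of ZooKeeper.

```java
import java.util.Comparator;

// Hypothetical standalone version of the vote ordering.
class SketchVoteOrder {
    // A vote is modelled as {id, zxid, epoch}; this layout is an assumption of the sketch.
    static final Comparator<long[]> ORDER =
            Comparator.<long[]>comparingLong(v -> v[2]) // epoch first
                    .thenComparingLong(v -> v[1])       // then zxid
                    .thenComparingLong(v -> v[0]);      // then server id

    // True when newVote should replace curVote (weight check omitted).
    static boolean newVoteWins(long[] newVote, long[] curVote) {
        return ORDER.compare(newVote, curVote) > 0;
    }
}
```

A vote with a higher epoch wins regardless of zxid and id; with equal epochs the higher zxid wins; with equal epoch and zxid the higher server id wins; an identical vote does not replace the current one.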