ZooKeeper Leader Election Source Code Analysis
First, note that this analysis is based on ZooKeeper source version 3.5.5.
The entry point is QuorumPeerMain.main. When starting in distributed (quorum) mode, the path taken is QuorumPeerMain#runFromConfig:
quorumPeer = getQuorumPeer(); // create a QuorumPeer; think of a QuorumPeer as one zk server
quorumPeer.setTxnFactory(new FileTxnSnapLog(
        config.getDataLogDir(),
        config.getDataDir()));
quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
quorumPeer.enableLocalSessionsUpgrading(
        config.isLocalSessionsUpgradingEnabled());
//quorumPeer.setQuorumPeers(config.getAllMembers());
quorumPeer.setElectionType(config.getElectionAlg());
quorumPeer.setMyid(config.getServerId());
// ... various other properties and configuration are set here ...
quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
quorumPeer.initialize();
quorumPeer.start();
quorumPeer.join();
public class QuorumPeer extends ZooKeeperThread
QuorumPeer extends ZooKeeperThread, and ZooKeeperThread extends Thread, so the place to look is its run method.
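To make the Thread-subclass pattern concrete, here is a minimal standalone sketch (PeerLike is a made-up name, not ZooKeeper code; only the start/run/join relationship mirrors QuorumPeer):

```java
public class PeerLike extends Thread {
    private volatile boolean stop = false;

    @Override
    public void run() {
        // the server's main loop lives here, just as QuorumPeer's does
        while (!stop) {
            // ... look for a leader, then serve in the elected role ...
            stop = true; // single pass for this demo
        }
        System.out.println("run() finished");
    }

    public static void main(String[] args) throws InterruptedException {
        PeerLike p = new PeerLike();
        p.start(); // start() schedules the thread; the JVM then invokes run()
        p.join();  // analogous to quorumPeer.join() in runFromConfig
    }
}
```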
QuorumPeer.run
The core of run boils down to one line:
setCurrentVote(makeLEStrategy().lookForLeader());
The default Election implementation is FastLeaderElection. In practice almost nobody configures electionType in zoo.cfg; its default value is 3, which selects FastLeaderElection.
FastLeaderElection#lookForLeader()
public Vote lookForLeader() throws InterruptedException {
    ......
    try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
        int notTimeout = finalizeWait;
        synchronized (this) {
            logicalclock.incrementAndGet();
            // refresh this server's own proposal: epoch, zxid, myid
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        LOG.info("New election. My id = " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        sendNotifications();
updateProposal gets called more than once: whenever this node receives a vote naming a more suitable leader, it updates its own vote accordingly.
synchronized void updateProposal(long leader, long zxid, long epoch) {
    if (LOG.isDebugEnabled()) {
        LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
    }
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}
proposedLeader, proposedZxid, and proposedEpoch are member fields of FastLeaderElection. Together they represent the vote this node currently backs, i.e., who it thinks should be leader.
The node then broadcasts its vote to all zk servers via sendNotifications():
private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch, qv.toString().getBytes());
        if (LOG.isDebugEnabled()) {
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
                    Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get()) +
                    " (n.round), " + sid + " (recipient), " + self.getId() +
                    " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        sendqueue.offer(notmsg);
    }
}
A brief description of the send path:
1. FastLeaderElection has two queues, one for sending and one for receiving:
LinkedBlockingQueue<ToSend> sendqueue;
LinkedBlockingQueue<Notification> recvqueue;
2. FastLeaderElection also runs two worker threads, WorkerSender and WorkerReceiver; as the names say, one sends and one receives.
3. Both threads hold a QuorumCnxManager, the utility class that performs the actual network I/O.
4. Sending a message means placing it on sendqueue.
5. The sender thread loops, polling sendqueue with a 3-second timeout, and hands each message to QuorumCnxManager to put on the wire.
6. If a message is addressed to this node itself, QuorumCnxManager bypasses the network and drops it directly into FastLeaderElection's recvqueue.
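The steps above can be sketched with a plain LinkedBlockingQueue. This is a simplified stand-in, not ZooKeeper code: networkSend plays the role of QuorumCnxManager (including the self-delivery shortcut of step 6), and the demo poll timeout is shortened from the real 3 seconds:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SenderSketch {
    record ToSend(long sid, String payload) {}

    static final LinkedBlockingQueue<ToSend> sendqueue = new LinkedBlockingQueue<>();
    static final LinkedBlockingQueue<String> recvqueue = new LinkedBlockingQueue<>();
    static final long MY_SID = 1;

    // stand-in for QuorumCnxManager: messages addressed to ourselves
    // skip the network and go straight into recvqueue
    static void networkSend(ToSend m) throws InterruptedException {
        if (m.sid() == MY_SID) {
            recvqueue.put(m.payload());
        } else {
            System.out.println("over the wire to sid " + m.sid() + ": " + m.payload());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread workerSender = new Thread(() -> {
            try {
                while (true) {
                    // the real WorkerSender polls with a 3-second timeout and keeps looping;
                    // here we use 200 ms and exit on timeout so the demo terminates
                    ToSend m = sendqueue.poll(200, TimeUnit.MILLISECONDS);
                    if (m == null) break;
                    networkSend(m);
                }
            } catch (InterruptedException ignored) { }
        });
        workerSender.start();

        sendqueue.put(new ToSend(1, "vote")); // to self -> lands in recvqueue
        sendqueue.put(new ToSend(2, "vote")); // to a peer -> "network"
        workerSender.join();
        System.out.println("recvqueue: " + recvqueue);
    }
}
```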
Note the local variable in lookForLeader:
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
This structure is crucial: it determines when the election ends. Its key is a long, the voter's myid; the value is that server's Vote. A Vote carries three relevant fields: epoch, zxid, and myid. Votes are compared by epoch first, then zxid, then myid, and larger always wins.
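This comparison order can be sketched as a standalone function. It mirrors the core of FastLeaderElection.totalOrderPredicate (the real method also consults the quorum verifier's server weights, omitted here):

```java
public class VoteCompare {
    // compare epoch first, then zxid, then myid; larger wins
    static boolean totalOrderPredicate(long newEpoch, long newZxid, long newId,
                                       long curEpoch, long curZxid, long curId) {
        return (newEpoch > curEpoch)
            || (newEpoch == curEpoch && newZxid > curZxid)
            || (newEpoch == curEpoch && newZxid == curZxid && newId > curId);
    }

    public static void main(String[] args) {
        // same epoch: higher zxid beats higher myid
        System.out.println(totalOrderPredicate(1, 100, 1,  1, 99, 3));  // true
        // epoch and zxid tied: larger myid wins
        System.out.println(totalOrderPredicate(1, 100, 3,  1, 100, 2)); // true
        // an older epoch never wins, regardless of zxid
        System.out.println(totalOrderPredicate(1, 999, 9,  2, 0, 0));   // false
    }
}
```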
With the preparation covered, we can walk through the election logic itself. After broadcasting its vote to every zk node, the thread enters a while loop. We'll split it into two parts; the first is how incoming votes are handled.
while ((self.getPeerState() == ServerState.LOOKING) &&
        (!stop)) {
    /*
     * Remove next notification from queue, times out after 2 times
     * the termination time
     */
    Notification n = recvqueue.poll(notTimeout,
            TimeUnit.MILLISECONDS); // drain the receive queue in order; notTimeout starts at 200 ms

    /*
     * Sends more notifications if haven't received enough.
     * Otherwise processes new notification.
     */
    if (n == null) { // nothing received yet: either resend or reconnect
        if (manager.haveDelivered()) {
            sendNotifications();
        } else {
            manager.connectAll();
        }

        /*
         * Exponential backoff
         */
        int tmpTimeOut = notTimeout * 2;
        notTimeout = (tmpTimeOut < maxNotificationInterval ?
                tmpTimeOut : maxNotificationInterval);
        LOG.info("Notification time out: " + notTimeout);
    }
    // the core logic starts here
    else if (validVoter(n.sid) && validVoter(n.leader)) {
        /*
         * Only proceed if the vote comes from a replica in the current or next
         * voting view for a replica in the current or next voting view.
         */
        switch (n.state) {
        case LOOKING: // the sender is also in the LOOKING state
            // If notification > current, replace and send messages out
            if (n.electionEpoch > logicalclock.get()) { // the sender's election epoch is newer than ours
                logicalclock.set(n.electionEpoch); // catch up with the pack; logicalclock generates the election epoch
                recvset.clear(); // clear recvset, since votes will be resent under the new epoch
                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                        getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                    // totalOrderPredicate checks whether the received vote names a better
                    // leader candidate than our own; if so, adopt its three fields
                    updateProposal(n.leader, n.zxid, n.peerEpoch);
                } else { // the epoch changed, so re-propose our own candidate under the new epoch
                    updateProposal(getInitId(),
                            getInitLastLoggedZxid(),
                            getPeerEpoch());
                }
                // a node whose epoch lags behind updates it and resends its vote; by the same
                // token, peers whose epoch is smaller than ours will resend their votes to us
                sendNotifications();
            } else if (n.electionEpoch < logicalclock.get()) {
                // the sender's epoch is older than ours: do nothing, exit the switch, and go
                // back to the while loop to poll the receive queue again. The sender will
                // update its epoch and vote again, so we are guaranteed to re-enter this switch.
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                            + Long.toHexString(n.electionEpoch)
                            + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                }
                break;
            } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                    proposedLeader, proposedZxid, proposedEpoch)) {
                // same epoch, but the received vote has higher priority than ours
                updateProposal(n.leader, n.zxid, n.peerEpoch); // adopt its three fields
                sendNotifications(); // and resend our (updated) vote
            }

            if (LOG.isDebugEnabled()) {
                LOG.debug("Adding vote: from=" + n.sid +
                        ", proposed leader=" + n.leader +
                        ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                        ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
            }

            // don't care about the version if it's in LOOKING state
            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
The final recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch)) is the crux: recvset records every server's latest vote, including our own. Also note that recvset is updated on every received notification, because a new message means some server has discovered a possibly better leader candidate and voted again, so its entry must be overwritten.
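The overwrite semantics follow directly from using a HashMap keyed by sender sid. A tiny illustrative sketch (the Vote record here is a simplified stand-in for ZooKeeper's Vote class):

```java
import java.util.HashMap;

public class RecvsetSketch {
    record Vote(long leader, long zxid, long electionEpoch) {}

    public static void main(String[] args) {
        // recvset is keyed by sender sid: a later vote from the same server
        // simply replaces the earlier one
        HashMap<Long, Vote> recvset = new HashMap<>();
        recvset.put(2L, new Vote(2, 10, 1)); // server 2 first votes for itself
        recvset.put(2L, new Vote(3, 12, 1)); // then finds server 3 is better and revotes
        System.out.println(recvset.get(2L).leader()); // the latest vote wins
        System.out.println(recvset.size());           // still one entry per server
    }
}
```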
The second part checks whether the termination condition has been met:
if (termPredicate(recvset,
        new Vote(proposedLeader, proposedZxid,
                logicalclock.get(), proposedEpoch))) {

    // Verify if there is any change in the proposed leader
    while ((n = recvqueue.poll(finalizeWait,
            TimeUnit.MILLISECONDS)) != null) {
        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                proposedLeader, proposedZxid, proposedEpoch)) {
            recvqueue.put(n);
            break;
        }
    }

    /*
     * This predicate is true once we don't read any new
     * relevant message from the reception queue
     */
    if (n == null) {
        self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING : learningState());
        Vote endVote = new Vote(proposedLeader,
                proposedZxid, logicalclock.get(),
                proposedEpoch);
        leaveInstance(endVote);
        return endVote;
    }
}
protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
    SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
    voteSet.addQuorumVerifier(self.getQuorumVerifier());
    if (self.getLastSeenQuorumVerifier() != null
            && self.getLastSeenQuorumVerifier().getVersion() > self
                    .getQuorumVerifier().getVersion()) {
        voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
    }

    /*
     * First make the views consistent. Sometimes peers will have different
     * zxids for a server depending on timing.
     */
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        // votes is the recvset emphasized above: every received vote that matches
        // our own proposal is counted as an ack for our proposed leader
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }
    return voteSet.hasAllQuorums();
}
public boolean hasAllQuorums() {
    for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
        if (!qvAckset.getQuorumVerifier().containsQuorum(qvAckset.getAckset()))
            return false;
    }
    return true;
}
QuorumMaj#containsQuorum
public boolean containsQuorum(Set<Long> ackSet) {
    return (ackSet.size() > half);
}
Here half is the number of voting servers divided by 2 (integer division). With three servers, half is 1; and since the check is strictly greater-than, at least 2 acks are needed to satisfy the condition.
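The majority rule can be verified in isolation. A minimal sketch mirroring QuorumMaj#containsQuorum (votingMembers is a parameter here; in ZooKeeper, half is precomputed from the member list):

```java
import java.util.Set;

public class MajoritySketch {
    // strict majority of voting members: size must exceed half
    static boolean containsQuorum(Set<Long> ackSet, int votingMembers) {
        int half = votingMembers / 2; // integer division: 3 -> 1, 5 -> 2
        return ackSet.size() > half;
    }

    public static void main(String[] args) {
        System.out.println(containsQuorum(Set.of(1L), 3));     // false: 1 > 1 fails
        System.out.println(containsQuorum(Set.of(1L, 2L), 3)); // true: 2 > 1
        System.out.println(containsQuorum(Set.of(1L, 2L), 5)); // false: 2 > 2 fails
    }
}
```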
Back to the second part, to analyze what remains:
// Verify if there is any change in the proposed leader
while ((n = recvqueue.poll(finalizeWait,
        TimeUnit.MILLISECONDS)) != null) {
    // even after the termination condition holds, keep checking whether a new,
    // better vote has arrived; if so, requeue it and break back into the outer loop
    if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
            proposedLeader, proposedZxid, proposedEpoch)) {
        recvqueue.put(n);
        break;
    }
}

/*
 * This predicate is true once we don't read any new
 * relevant message from the reception queue
 */
if (n == null) { // reaching here means no new messages, and the election condition is satisfied
    // set our own role: leader if our vote names ourselves, otherwise follower/observer
    self.setPeerState((proposedLeader == self.getId()) ?
            ServerState.LEADING : learningState());
    Vote endVote = new Vote(proposedLeader,
            proposedZxid, logicalclock.get(),
            proposedEpoch);
    leaveInstance(endVote);
    return endVote;
}
Notice that once a leader emerges, every zk server sets its own role by itself; the elected leader does not send a second round of messages announcing "I am the leader." When each server's receive queue drains empty, every pending notification has been delivered, and at that point who the leader is has been settled.
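The whole convergence can be re-enacted in a toy model. This is a heavily simplified, hypothetical sketch, not ZooKeeper code: three synchronous "servers" exchange votes until all proposals agree (a stand-in for the notification queues), then each decides its own role locally, just as described above:

```java
import java.util.HashMap;
import java.util.Map;

public class ToyElection {
    record Vote(long epoch, long zxid, long myid) {}

    // epoch first, then zxid, then myid; larger wins
    static boolean better(Vote a, Vote b) {
        if (a.epoch() != b.epoch()) return a.epoch() > b.epoch();
        if (a.zxid() != b.zxid()) return a.zxid() > b.zxid();
        return a.myid() > b.myid();
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3};
        long[] zxids = {10, 12, 12}; // server 1 lags behind on zxid
        Map<Long, Vote> proposal = new HashMap<>();
        for (int i = 0; i < ids.length; i++)
            proposal.put(ids[i], new Vote(1, zxids[i], ids[i])); // everyone votes for itself first

        // exchange votes until no proposal changes any more
        boolean changed = true;
        while (changed) {
            changed = false;
            for (long from : ids)
                for (long to : ids)
                    if (better(proposal.get(from), proposal.get(to))) {
                        proposal.put(to, proposal.get(from)); // adopt the better vote
                        changed = true;
                    }
        }

        // all proposals now agree; each server sets its own role by itself
        for (long id : ids) {
            String role = proposal.get(id).myid() == id ? "LEADING" : "FOLLOWING";
            System.out.println("server " + id + ": " + role);
        }
    }
}
```

With equal epochs and a zxid tie between servers 2 and 3, the larger myid (server 3) wins, matching the comparison order from the walkthrough.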