Investigation of Solutions to the RabbitMQ Split-Brain Problem

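Symptom:
The RabbitMQ management UI shows:
Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data. Please read RabbitMQ documentation about network partitions and the possible solutions.
Cause:
A network problem caused a split brain in the cluster.
Temporary fix:
Pick the partition you trust less and restart the RabbitMQ application on each node in that partition:
On the affected node, run: sbin/rabbitmqctl stop_app
On the affected node, run: sbin/rabbitmqctl start_app
Note: never kill the processes of an MQ cluster with kill -9. If you do, producers and consumers cannot promptly detect that their connection to MQ has dropped, which disrupts their normal business processing.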

Rabbitmq network partition: detection and choosing a recovery strategy
RabbitMQ Network Partitions: detailed analysis and solutions
Clustering and Network Partitions
RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don’t. You should use federation or the shovel instead.
However, sometimes accidents happen. This page documents how to detect network partitions, some of the bad effects that may happen during partitions, and how to recover from them.
RabbitMQ stores information about queues, exchanges, bindings etc in Erlang’s distributed database, Mnesia. Many of the details of what happens around network partitions are related to Mnesia’s behaviour.
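Clusters and network partitions
RabbitMQ clusters do not "tolerate" network partitions well. If you are thinking of building a RabbitMQ cluster across a WAN, don't; use a plugin such as federation or shovel instead.
However, unexpected things sometimes happen. This article describes how a RabbitMQ cluster detects network partitions, what effects a partition has, and how to recover from one.
RabbitMQ stores information about queues, exchanges, bindings and so on in Mnesia, Erlang's distributed database, and many of the details surrounding network partitions come down to Mnesia's behaviour.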

Detecting network partitions
Mnesia will typically determine that a node is down if another node is unable to contact it for a minute or so (see the page on net_ticktime). If two nodes come back into contact, both having thought the other is down, Mnesia will determine that a partition has occurred. This will be written to the RabbitMQ log in a form like:
=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@smacmullen}
RabbitMQ nodes will record whether this event has ever occurred while the node is up, and expose this information through rabbitmqctl cluster_status and the management plugin.
rabbitmqctl cluster_status will normally show an empty list for partitions:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
{running_nodes,[rabbit@smacmullen,hare@smacmullen]},
{partitions,[]}]
...done.
However, if a network partition has occurred then information about partitions will appear there:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
{running_nodes,[rabbit@smacmullen,hare@smacmullen]},
{partitions,[{rabbit@smacmullen,[hare@smacmullen]},
{hare@smacmullen,[rabbit@smacmullen]}]}]
...done.
The management plugin API will return partition information for each node under partitions in /api/nodes. The management plugin UI will show a large red warning on the overview page if a partition has occurred.
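As a rough illustration of the API check described above, the sketch below scans an already-parsed /api/nodes response for non-empty partitions lists. It assumes each node object carries a name field next to partitions, and the sample payloads are hand-written to mirror the cluster_status example rather than captured from a real broker.

```python
from __future__ import annotations

from typing import Any


def partitioned_nodes(nodes: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each node name to the peers it reports as partitioned from it.

    `nodes` is the decoded JSON list returned by /api/nodes. Nodes whose
    `partitions` list is empty or missing are left out of the result.
    """
    return {
        node["name"]: list(node["partitions"])
        for node in nodes
        if node.get("partitions")
    }


# Hypothetical payloads mirroring the cluster_status output shown earlier.
partitioned_sample = [
    {"name": "rabbit@smacmullen", "partitions": ["hare@smacmullen"]},
    {"name": "hare@smacmullen", "partitions": ["rabbit@smacmullen"]},
]
healthy_sample = [
    {"name": "rabbit@smacmullen", "partitions": []},
    {"name": "hare@smacmullen", "partitions": []},
]

if __name__ == "__main__":
    print(partitioned_nodes(partitioned_sample))
    print(partitioned_nodes(healthy_sample))
```

An empty result corresponds to the healthy case, where every node shows an empty partitions list.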
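Detecting network partitions
If one node cannot contact another for about a minute (see net_ticktime), Mnesia will usually conclude that the other node is down.
If the two nodes later come back into contact (translator's note: presumably meaning the network link is reachable again) while each still believes the other was down, Mnesia determines that a network partition has occurred.
This is recorded in the RabbitMQ log in a form like: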
=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@smacmullen}
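A RabbitMQ node records whether a network partition has occurred while it has been up, and you can see this through the rabbitmqctl cluster_status command or the management plugin.
Normally the partitions entry in the output of rabbitmqctl cluster_status is empty, like this: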
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
{running_nodes,[rabbit@smacmullen,hare@smacmullen]},
{partitions,[]}]
...done.
However, once a network partition has occurred it looks like this:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
{running_nodes,[rabbit@smacmullen,hare@smacmullen]},
{partitions,[{rabbit@smacmullen,[hare@smacmullen]},
{hare@smacmullen,[rabbit@smacmullen]}]}]
...done.
The management plugin API returns each node's partition information under partitions in /api/nodes.
In the web UI, the Overview page shows a large red warning.
During a network partition
While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with both sides thinking the other has crashed. Queues, bindings, exchanges can be created or deleted separately. Mirrored queues which are split across the partition will end up with one master on each side of the partition, again with both sides acting independently. Other undefined and weird behaviour may occur.
It is important to understand that when network connectivity is restored, this state of affairs persists. The cluster will continue to act in this way until you take action to fix it.
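During a network partition
While a cluster is partitioned it splits into two (or more) parts that each go their own way, every side believing that the nodes on the other side have crashed. Queues, bindings and exchanges are created and deleted within one partition only, independently of the others.
If the original cluster has mirrored queues that span nodes in two (or more) partitions, a master appears in each partition (translator's note: with a newer RabbitMQ version and enough nodes in the partition, new slaves may appear as well), and the queue then behaves independently in each partition.
Other undefined and weird behaviour may occur as well.
When network connectivity is restored, the partitioned state persists until you take action to fix it.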

Partitions caused by suspend / resume
While we refer to “network” partitions, really a partition is any case in which the different nodes of a cluster can have communication interrupted without any node failing. In addition to network failures, suspending and resuming an entire OS can also cause partitions when used against running cluster nodes - as the suspended node will not consider itself to have failed, or even stopped, but the other nodes in the cluster will consider it to have done so.
While you could suspend a cluster node by running it on a laptop and closing the lid, the most common reason for this to happen is for a virtual machine to have been suspended by the hypervisor. While it’s fine to run RabbitMQ clusters in virtualised environments, you should make sure that VMs are not suspended while running. Note that some virtualisation features such as migration of a VM from one host to another will tend to involve the VM being suspended.
Partitions caused by suspend and resume will tend to be asymmetrical - the suspended node will not necessarily see the other nodes as having gone down, but will be seen as down by the rest of the cluster. This has particular implications for pause_minority mode.
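Partitions caused by suspend / resume
Although we speak of "network" partitions, a partition is really any situation in which communication between different nodes of a cluster is interrupted without any node having failed.
Besides network failures, suspending and resuming the operating system can therefore also partition the nodes of a cluster, because the suspended node does not consider itself to have failed or stopped, while the other nodes in the cluster do.
You could suspend a node by running it on a laptop and closing the lid. The more common case is a node running in a virtual machine that the hypervisor suspends.
Partitions caused by suspend and resume tend to be asymmetrical: the suspended node does not necessarily see the other nodes as gone, but the rest of the cluster sees it as down. This has particular implications for pause_minority mode (covered below).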

Recovering from a network partition
To recover from a network partition, first choose one partition which you trust the most. This partition will become the authority for the state of Mnesia to use; any changes which have occurred on other partitions will be lost.
Stop all nodes in the other partitions, then start them all up again. When they rejoin the cluster they will restore state from the trusted partition.
Finally, you should also restart all the nodes in the trusted partition to clear the warning.
It may be simpler to stop the whole cluster and start it again; if so make sure that the first node you start is from the trusted partition.
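The manual recovery order above can be sketched as a small planner that only builds command strings and runs nothing. It assumes that "stop" and "start" map to rabbitmqctl stop_app and start_app (as in the temporary fix at the top of this document) and that nodes are addressed with rabbitmqctl's -n option; recovery_commands and the node names are hypothetical.

```python
from __future__ import annotations


def recovery_commands(partitions: list[list[str]], trusted: int) -> list[str]:
    """Return rabbitmqctl commands, in order, for a manual recovery.

    `partitions` lists the node names on each side of the split and
    `trusted` is the index of the partition whose state should win.
    """
    if not 0 <= trusted < len(partitions):
        raise ValueError("trusted must index one of the partitions")

    untrusted = [
        node
        for index, side in enumerate(partitions)
        if index != trusted
        for node in side
    ]
    # 1. Stop every node outside the trusted partition.
    commands = [f"rabbitmqctl -n {node} stop_app" for node in untrusted]
    # 2. Start them again so they restore state from the trusted partition.
    commands += [f"rabbitmqctl -n {node} start_app" for node in untrusted]
    # 3. Restart the trusted nodes to clear the warning. Doing this one
    #    node at a time is an assumption made for this sketch.
    for node in partitions[trusted]:
        commands.append(f"rabbitmqctl -n {node} stop_app")
        commands.append(f"rabbitmqctl -n {node} start_app")
    return commands


if __name__ == "__main__":
    for line in recovery_commands([["rabbit@a"], ["hare@b"]], trusted=0):
        print(line)
```

Keeping the planner free of side effects lets an operator review the order before running anything against a live cluster.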
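Recovering from a network partition
To recover from a network partition, first choose the partition you trust. That partition becomes the authority for the contents of Mnesia; changes made in the other partitions are not kept in Mnesia and are lost.
Stop the nodes in the other partitions and then start them again. When they rejoin the cluster they restore their state from the trusted partition.
Finally, restart all the nodes in the trusted partition to clear the warning.
Alternatively, you can simply stop every node in the cluster and start them all again, making sure that the first node you start belongs to the trusted partition.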

Automatically handling partitions
RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. (The default behaviour is referred to as ignore mode).
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer than or equal to half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
You can enable any of these modes by setting the configuration parameter cluster_partition_handling for the rabbit application in your configuration file to:
● pause_minority
● {pause_if_all_down, [nodes], ignore | autoheal}
● autoheal
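The autoheal winner rule described above (most client connections, then most nodes, then unspecified) can be modelled in a few lines. This is an illustrative model only, not RabbitMQ's implementation; the Partition class, pick_winner and nodes_to_restart are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    nodes: tuple[str, ...]
    client_connections: int


def pick_winner(partitions: list[Partition]) -> Partition:
    """Most client connections wins; ties go to the side with more nodes."""
    # max() returns the first maximal element, which stands in here for the
    # "unspecified" final tie-break. RabbitMQ makes no such ordering promise.
    return max(partitions, key=lambda p: (p.client_connections, len(p.nodes)))


def nodes_to_restart(partitions: list[Partition]) -> list[str]:
    """Every node outside the winning partition is restarted."""
    winner = pick_winner(partitions)
    return [node for p in partitions if p is not winner for node in p.nodes]


if __name__ == "__main__":
    big_but_idle = Partition(("n1", "n2"), client_connections=5)
    small_but_busy = Partition(("n3",), client_connections=9)
    print(nodes_to_restart([big_but_idle, small_but_busy]))
```

In the usage example the single-node side wins because it holds more client connections, so the two nodes on the other side are the ones restarted.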
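Automatically handling partitions
RabbitMQ provides three ways to handle network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. (The default is ignore mode.)
In pause-minority mode, as the name suggests, a node that sees other nodes "disappear" checks whether it is now in a minority (half of the cluster's nodes or fewer), and RabbitMQ automatically pauses nodes that are.
In CAP-theorem terms, this chooses P, partition tolerance, over availability.
It ensures that during a network partition the nodes of at most one partition keep running.
Nodes in the minority pause when the partition starts and start again when it ends.
In pause-if-all-down mode, RabbitMQ pauses a cluster node only when that node cannot reach any of the listed nodes, that is, the [nodes] in {pause_if_all_down, [nodes], ignore | autoheal}.
In other words, a cluster node is paused only when every listed node is down.
This mode resembles pause-minority mode, but it lets the administrator designate the preferred nodes instead of relying on the context.
For example, take a four-node cluster with two nodes in rack A and two in rack B. If the link between the racks is lost, pause-minority mode pauses all four nodes.
In autoheal mode, once a partition is deemed to have occurred RabbitMQ automatically decides on a winning partition and restarts every node that is not in it.
The winning partition is the one with the most client connections. (If that is a draw, that is, two or more partitions have the same number of client connections, the partition with the most nodes wins; if the node counts are also equal, the winner is chosen in an unspecified way.)
You can enable any of these modes by setting the cluster_partition_handling parameter in the RabbitMQ configuration file to one of: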
pause_minority
{pause_if_all_down, [nodes], ignore | autoheal}
autoheal
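To make the pause-if-all-down rule concrete, here is a minimal model of the decision each node takes, applied to the two-rack example. It is a sketch, not RabbitMQ code: should_pause and the node names a1, a2, b1, b2 are hypothetical, and the set of reachable nodes is assumed to include the node itself.

```python
from __future__ import annotations


def should_pause(reachable: set[str], listed: set[str]) -> bool:
    """A node pauses only when it can reach none of the listed nodes."""
    return not (reachable & listed)


# Two nodes in rack A, two in rack B, and the link between racks is lost.
rack_a = {"a1", "a2"}
rack_b = {"b1", "b2"}

if __name__ == "__main__":
    listed = {"a1", "a2"}  # the administrator prefers rack A
    print("rack A pauses:", should_pause(rack_a, listed))
    print("rack B pauses:", should_pause(rack_b, listed))

    split_list = {"a1", "b1"}  # listed nodes end up on both sides
    print("rack A pauses:", should_pause(rack_a, split_list))
    print("rack B pauses:", should_pause(rack_b, split_list))
```

With the listed nodes split across both sides, neither rack pauses, which is the case the extra ignore | autoheal argument exists to handle.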
Which mode should I pick?
It’s important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use federation or the shovel.
With that said, you might wish to pick a recovery mode as follows:
● ignore - Your network really is reliable. All your nodes are in a rack, connected with a switch, and that switch is also the route to the outside world. You don’t want to run any risk of any of your cluster shutting down if any other part of it fails (or you have a two node cluster).
● pause_minority - Your network is maybe less reliable. You have clustered across 3 AZs in EC2, and you assume that only one AZ will fail at once. In that scenario you want the remaining two AZs to continue working and the nodes from the failed AZ to rejoin automatically and without fuss when the AZ comes back.
● autoheal - Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two node cluster.
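Which mode should I pick?
It is important to understand that letting RabbitMQ handle network partitions automatically does not make them less of a problem.
Network partitions always cause problems for RabbitMQ clusters; you only get some choice over which kind of problems you get.
As stated at the start of this article, if you want to connect RabbitMQ clusters over an unreliable network, use the federation or shovel plugin.
With that said, you might pick a recovery mode as follows:
ignore: your network really is reliable. All your nodes are in one rack, connected to the same switch, and that switch is also the route to the outside world. You do not want to risk any part of the cluster shutting down (or you have a two-node cluster).
pause_minority: your network is somewhat less reliable. For example, you have clustered across three AZs in EC2 and assume that only one AZ fails at a time. In that case the remaining two AZs keep working, and the nodes from the failed AZ rejoin the cluster once it recovers.
autoheal: your network may not be reliable, and you care more about continuity of service than about data integrity. It also suits a two-node cluster.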

More about pause-minority mode
The Erlang VM on the paused nodes will continue running but the nodes will not listen on any ports or do any other work. They will check once per second to see if the rest of the cluster has reappeared, and start up again if it has.
Note that nodes will not enter the paused state at startup, even if they are in a minority then. It is expected that any such minority at startup is due to the rest of the cluster not having been started yet.
Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster - i.e. containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes since in the event of any network partition or node failure, both nodes will pause. However, pause_minority mode is likely to be safer than ignore mode for clusters of more than two nodes, especially if the most likely form of network partition is that a single minority of nodes drops off the network.
Finally, note that pause_minority mode will do nothing to defend against partitions caused by cluster nodes being suspended. This is because the suspended node will never see the rest of the cluster vanish, so will have no trigger to disconnect itself from the cluster.
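The strict-majority rule is easy to get wrong for even-sized clusters, so a small worked check may help. The functions below are a hypothetical model of the rule as stated above, counting a node itself among the nodes it can see.

```python
def in_strict_majority(visible_nodes: int, cluster_size: int) -> bool:
    """True when the visible group holds more than half of all nodes."""
    return 2 * visible_nodes > cluster_size


def pauses(visible_nodes: int, cluster_size: int) -> bool:
    """In pause_minority mode a node pauses unless it is in a strict majority."""
    return not in_strict_majority(visible_nodes, cluster_size)


if __name__ == "__main__":
    # (nodes visible to this node including itself, total cluster size)
    for visible, total in [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]:
        print(f"{visible} of {total} visible -> pauses: {pauses(visible, total)}")
```

The (1, 2) and (2, 4) cases show why a two-node cluster or an even split leaves no side running: exactly half is not a strict majority.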
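More about pause-minority mode
The Erlang VM on a paused RabbitMQ node keeps running, but the node does not listen on any ports or do any other work.
Paused nodes check once per second whether the rest of the cluster has reappeared, and start up again if it has.
Note that nodes do not enter the paused state at startup, even if they are in a minority at that point. Such a minority at startup is expected to mean that the rest of the cluster has simply not been started yet.
Also note that RabbitMQ pauses any node that is not in a strict majority, that is, a group containing more than half of all nodes.
It is therefore not a good idea to use pause-minority mode on a two-node cluster, because any network partition or node failure pauses both nodes.
For clusters of more than two nodes, however, pause_minority mode is likely to be safer than ignore mode, especially when the most likely form of partition is a single minority of nodes dropping off the network.
Finally, note that pause_minority mode does nothing to defend against partitions caused by cluster nodes being suspended.
This is because a suspended node never sees the rest of the cluster vanish, so nothing triggers it to disconnect itself from the cluster.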