
Analysis of a Simulated NHB (Network Heartbeat) Loss

Source: collected by IT165  Published: 2016-04-20 21:21:50

Environment: Oracle 11.2.0.4 RAC on RHEL 6.5, two nodes

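Problem: deliberately unplug the network heartbeat (private interconnect) cable and follow what each of the two nodes goes through.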

Analysis:

1. Unplug the heartbeat cable

2. Check ocssd.log on both nodes
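
The excerpts that follow are a small selection from ocssd.log (in 11.2 this file typically lives under $GRID_HOME/log/<hostname>/cssd/). A minimal Python sketch for pulling similar lines out of a log, assuming the function names quoted in this article are the ones of interest. The helper name filter_css_lines is hypothetical, and the second sample line is a placeholder, not real CSSD output:

```python
# Function names taken from the log excerpts quoted in this article.
KEYWORDS = (
    "clssnmPollingThread",
    "clssnmvDHBValidateNcopy",
    "clssnmMarkNodeForRemoval",
    "clssnmCheckDskInfo",
    "clssnmCheckSplit",
    "clssnmrRemoveNode",
)

def filter_css_lines(lines, keywords=KEYWORDS):
    """Return only the lines that mention one of the given function names."""
    return [line.rstrip("\n") for line in lines if any(k in line for k in keywords)]

sample = [
    "2016-04-19 00:19:59.407: [    CSSD][299706112]clssnmPollingThread: "
    "node rac2 (2) at 50% heartbeat fatal, removal in 14.440 seconds",
    "placeholder line that is not CSSD output",
]
for line in filter_css_lines(sample):
    print(line)
```

To run it against a real log, pass an open file object instead of the sample list, for example filter_css_lines(open(path)) with a path of your choosing.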

Node 1:

 

2016-04-19 00:19:59.407: [    CSSD][299706112]clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.440 seconds
2016-04-19 00:19:59.407: [    CSSD][299706112]clssnmPollingThread: node rac2 (2) is impending reconfig, flag 2229260, misstime 15560
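Node 1 sees that node 2 has been missing network heartbeats for a sustained period (misstime 15560, apparently in milliseconds); the cluster will reconfigure in 14.440 seconds.

Node 2: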

 

 

2016-04-19 00:19:59.349: [    CSSD][3818866432]clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.230 seconds
2016-04-19 00:19:59.349: [    CSSD][3818866432]clssnmPollingThread: node rac1 (1) is impending reconfig, flag 2491406, misstime 15770
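Node 2 likewise sees that node 1 has been missing network heartbeats for a sustained period (misstime 15770); the cluster will reconfigure in 14.230 seconds.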
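
In each pair of lines above, the time already missed plus the time left before removal gives the total heartbeat timeout. A small Python sketch of that arithmetic, assuming misstime is in milliseconds (the log does not state the unit); the regular expressions and the helper name implied_timeout_ms are illustrative only:

```python
import re

# Patterns for the two clssnmPollingThread message shapes quoted above.
REMOVAL_RE = re.compile(
    r"node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, removal in ([\d.]+) seconds"
)
MISSTIME_RE = re.compile(
    r"node (\S+) \((\d+)\) is impending reconfig, flag \d+, misstime (\d+)"
)

def implied_timeout_ms(removal_line, misstime_line):
    """Add the remaining time (seconds) to the already-missed time (assumed ms)."""
    remaining_ms = round(float(REMOVAL_RE.search(removal_line).group(4)) * 1000)
    missed_ms = int(MISSTIME_RE.search(misstime_line).group(3))
    return missed_ms + remaining_ms

node1_view = implied_timeout_ms(
    "clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.440 seconds",
    "clssnmPollingThread: node rac2 (2) is impending reconfig, flag 2229260, misstime 15560",
)
node2_view = implied_timeout_ms(
    "clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.230 seconds",
    "clssnmPollingThread: node rac1 (1) is impending reconfig, flag 2491406, misstime 15770",
)
print(node1_view, node2_view)  # 30000 30000
```

Both nodes arrive at 30000 ms. That is consistent with a CSS misscount of 30 seconds, the usual 11.2 default on Linux; the configured value is not shown in the article, so treat it as an assumption.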

 

Node 1:

 

2016-04-19 00:20:06.409: [    CSSD][299706112]clssnmPollingThread: node rac2 (2) at 75% heartbeat fatal, removal in 7.420 seconds
2016-04-19 00:20:06.410: [    CSSD][315549440]clssnmvDHBValidateNcopy: node 2, rac2, has a disk HB, but no network HB, DHB has rcfg 356437458, wrtcnt, 169517, LATS 39917994, lastSeqNo 169514, uniqueness 1460995653, timestamp 1460996406/44133904
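Node 1 is now at 75% of the heartbeat timeout, and reconfiguration is imminent. Note that it still sees node 2's disk heartbeat (DHB), just no network heartbeat.

Node 2: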

 

 

2016-04-19 00:20:06.353: [    CSSD][3818866432]clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, removal in 7.200 seconds
2016-04-19 00:20:06.353: [    CSSD][4030301952]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 356437458, wrtcnt, 164891, LATS 44133864, lastSeqNo 164888, uniqueness 1460956156, timestamp 1460996406/39917744
Node 2 reports the same: 75%, with node 1's disk heartbeat still arriving.

 

Node 1:

 

2016-04-19 00:20:13.831: [    CSSD][299706112]clssnmPollingThread: Removal started for node rac2 (2), flags 0x22040c, state 3, wt4c 0
2016-04-19 00:20:13.831: [    CSSD][299706112]clssnmMarkNodeForRemoval: node 2, rac2 marked for removal
2016-04-19 00:20:13.831: [    CSSD][299706112]clssnmDiscHelper: rac2, node(2) connection failed, endp (0x1dae5a), probe(0x7f2b00000000), ninf->endp 0x1dae5a
2016-04-19 00:20:13.831: [    CSSD][299706112]clssnmDiscHelper: node 2 clean up, endp (0x1dae5a), init state 5, cur state 5

Node 1 starts the removal and cleans up its connection to node 2.

Node 2:

 

 

2016-04-19 00:20:13.556: [    CSSD][3818866432]clssnmPollingThread: Removal started for node rac1 (1), flags 0x26040e, state 3, wt4c 0
2016-04-19 00:20:13.556: [    CSSD][3818866432]clssnmMarkNodeForRemoval: node 1, rac1 marked for removal
2016-04-19 00:20:13.556: [    CSSD][3818866432]clssnmDiscHelper: rac1, node(1) connection failed, endp (0x5577), probe(0x7f4e00000000), ninf->endp 0x5577
2016-04-19 00:20:13.556: [    CSSD][3818866432]clssnmDiscHelper: node 1 clean up, endp (0x5577), init state 5, cur state 5

Node 2 also sets out to remove node 1!

Node 1:

 

 

2016-04-19 00:20:13.833: [    CSSD][296552192]clssnmCheckDskInfo: Checking disk info...
2016-04-19 00:20:13.833: [    CSSD][296552192]clssnmCheckSplit: Node 2, rac2, is alive, DHB (1460996413, 44140634) more than disk timeout of 27000 after the last NHB (1460996383, 44111344)

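Node 1 checks the disk information and finds that node 2 is still alive: its disk heartbeat continued after the last network heartbeat.

Node 2: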

 

 

2016-04-19 00:20:13.558: [    CSSD][3815712512]clssnmCheckDskInfo: Checking disk info...
2016-04-19 00:20:13.558: [    CSSD][3815712512]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1460996413, 39924894) more than disk timeout of 27000 after the last NHB (1460996383, 39895134)

Node 2 checks the disk information and finds that node 1 is still alive as well.

Each node considers the other alive but unreachable over the network: a split-brain is about to happen!
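
The "is alive" verdict in the clssnmCheckSplit lines comes down to comparing the gap between the latest disk heartbeat (DHB) and the last network heartbeat (NHB) with the disk timeout. A Python sketch of that comparison, assuming the first number in each pair is Unix epoch seconds and the second is a millisecond counter (the log does not label them); the helper name dhb_nhb_gap is hypothetical:

```python
import re

SPLIT_RE = re.compile(
    r"DHB \((\d+), (\d+)\) more than disk timeout of (\d+) "
    r"after the last NHB \((\d+), (\d+)\)"
)

def dhb_nhb_gap(line):
    """Return (gap in seconds, gap in assumed ms, disk timeout) for one clssnmCheckSplit line."""
    dhb_sec, dhb_ms, timeout, nhb_sec, nhb_ms = map(int, SPLIT_RE.search(line).groups())
    return dhb_sec - nhb_sec, dhb_ms - nhb_ms, timeout

# Node 1's view of node 2, and node 2's view of node 1, copied from the log above.
node1_line = ("clssnmCheckSplit: Node 2, rac2, is alive, DHB (1460996413, 44140634) "
              "more than disk timeout of 27000 after the last NHB (1460996383, 44111344)")
node2_line = ("clssnmCheckSplit: Node 1, rac1, is alive, DHB (1460996413, 39924894) "
              "more than disk timeout of 27000 after the last NHB (1460996383, 39895134)")

for line in (node1_line, node2_line):
    sec_gap, ms_gap, timeout = dhb_nhb_gap(line)
    print(sec_gap, ms_gap, ms_gap > timeout)
# 30 29290 True
# 30 29760 True
```

On both nodes the peer kept writing disk heartbeats for roughly 30 seconds after its last network heartbeat, which exceeds the 27000 disk timeout, so each side concludes the other is still running.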

 

At this point:

Node 1:

 

2016-04-19 00:20:13.833: [    CSSD][296552192]clssnmCheckDskInfo: My cohort: 1
2016-04-19 00:20:13.833: [    CSSD][296552192]clssnmRemove: Start
2016-04-19 00:20:13.833: [    CSSD][296552192](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, rac2, from the cluster in incarnation 356437458, node birth incarnation 356437457, death incarnation 356437458, stateflags 0x224000 uniqueness value 1460995653

So node 1 evicts node 2 from the cluster.

Node 2:

 

 

2016-04-19 00:20:13.558: [    CSSD][3815712512]clssnmCheckDskInfo: My cohort: 2
2016-04-19 00:20:13.558: [    CSSD][3815712512]clssnmCheckDskInfo: Surviving cohort: 1
2016-04-19 00:20:13.558: [    CSSD][3815712512](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2

Node 2 aborts itself to avoid a split-brain and leaves the cluster.

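Conclusion: when the two cohorts are the same size, as here, the cohort containing the lowest-numbered node survives!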
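
The decision logged by node 2 ("Cohort of 1 nodes with leader 2 ... is smaller than cohort of 1 nodes led by node 1") can be modelled in a few lines of Python. This is a simplified sketch of the rule as it appears in this log, not Oracle's actual implementation; the function name surviving_cohort and the three-node example are hypothetical:

```python
def surviving_cohort(cohorts):
    """Pick the largest cohort; on a size tie, the one holding the lowest node number."""
    return max(cohorts, key=lambda nodes: (len(nodes), -min(nodes)))

# The two-node case from this article: each node is a cohort of one.
print(surviving_cohort([[1], [2]]))     # [1]

# Hypothetical three-node case: the larger cohort wins even without node 1.
print(surviving_cohort([[1], [2, 3]]))  # [2, 3]
```

The sort key compares cohort size first and only falls back to the lowest node number when the sizes are equal, which is the situation in the two-node test above.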

 
