Written By: Gary Tefft
Manufacturer: Avaya
Product: Communication Manager
Version: Applicable to any version of Communication Manager on VMware
Patch Information: N/A
Ticket Number(s):
Description:
ARB alarm occurred and CM interchanged (Active to Standby).
Applicable to any version of Communication Manager on VMware.
Related Articles:
Problem Clarification:
Server Interchange.
init@cm-server1> restartcause
RESTART CAUSES for cm-server1
Cause Action Escalated Mode Time
Interchange 1 (WARM) no Active 08/29 21:01
Interchange 4 (RELOAD) no Standby 08/31 21:06
Internal Request 1 (WARM) no Standby 08/31 21:06
Internal Request 1 (WARM) no Standby 08/31 21:07
Interchange 1 (WARM) no Active 09/05 21:00
Interchange-Craft 4 (RELOAD) no Standby 09/07 11:44
Internal Request 1 (WARM) no Standby 09/07 11:49
Internal Request 1 (WARM) no Standby 09/07 11:49
Interchange-Craft 1 (WARM) no Active 09/07 12:00
Interchange-Craft 4 (RELOAD) no Standby 09/07 12:28
Internal Request 1 (WARM) no Standby 09/07 12:29
Internal Request 1 (WARM) no Standby 09/07 12:29
Interchange 1 (WARM) no Active 07/02 22:13
Interchange 4 (RELOAD) no Standby 11/22 5:33
Internal Request 1 (WARM) no Standby 11/22 5:33
Internal Request 1 (WARM) no Standby 11/22 5:34
Heartbeat timeout, dual-active detection, dual-active backoff, and backed-up cur_max messages appear in the ECS logs of both the active and standby servers.
20161122:053322133:197114881:Arbiter(4500):MED:[Heartbeat Timeout from alt side]------------->Standby lost heartbeat to Active
20161122:053322133:197114882:Arbiter(4500):MED:[ with errno=Interrupted system call[4]]
20161122:053322133:197114883:Arbiter(4500):HIGH:[Heartbeat Timeout from ACTIVE]
20161122:053322133:197114884:Arbiter(4500):MED:[STANDBY->ACTIVE :other side reset]
20161122:053322133:197114885:Arbiter(4500):MED:[finite_state_machine():State Transition: STANDBY to ACTIVE ]
20161122:053322133:197114886:Arbiter(4500):MED:[reset_active_timer: bucket 16.0 to 0]
20161122:053322133:197114887:Arbiter(4500):HIGH:[Interchange STANDBY/AUXSTBY TO ACTIVE]--------------> Becomes Active
20161122:053322134:197114888:Arbiter(4500):MED:[finite_state_machine(): Bumping escalation level from 0 to 1 due to req_esc_lvl]
20161122:053322134:197114889:Arbiter(4500):MED:[finite_state_machine(): Interchange Reason: Hung Server]
ECS logs from cm-server1 (previously active, now standby) after the server came out of its hung state and found itself in a dual-active state:
20161122:053340228:56375800:Arbiter(4513):MED:[dual-active detection: me 0@16 vs msg 0@0] -------------> Active/Active State
20161122:053340228:56375801:Arbiter(4513):MED:[update: Received IPAT_ACTV message from the Standby server]
20161122:053340228:56375802:Arbiter(4513):MED:[remot: gmm 0300, pcd 00/00, dup a30, wd 01, hmm 00, pe 0304, actv 000]
20161122:053340228:56375803:Arbiter(4513):MED:[State of other side is ACTIVE ]
20161122:053340228:56375804:Arbiter(4513):MED:[ACTIVE ->STANDBY:dual-active backoff] -------------> Dual Active resolution.
20161122:053340228:56375805:Arbiter(4513):MED:[finite_state_machine():State Transition: ACTIVE to STANDBY]
20161122:053340228:56375806:Arbiter(4513):HIGH:[Both Servers thought they were active]
On the current standby server:
20161122:053340227:56375799:tim(11098):MED:[def_main: backed up cur_max 1083 max 1083]
On the current active server:
20160702:221353596:197113268:tim(8443):MED:[def_main: backed up cur_max 76 max 215]
20160702:225352126:197114439:tim(13646):MED:[def_main: backed up cur_max 157 max 157]
20160702:225355264:197114440:tim(13646):MED:[def_main: caught up cur_max 157 max 157]
Note the "cur_max" value shows the amount of time (in 20ms intervals) the server was backed up/hung at the time of the log entry, and the "max" value shows the largest amount of time (in 20ms intervals) the server has been backed up/hung historically. For example, the "cur_max 63" entry means the server was backed up/hung for 63 x 20ms = 1,260 ms or 1.26 seconds.
Cause:
Scenario 1:
The performance reports from the VMware logs indicate high CPU ready times on the CM1 and CM2 virtual machines in general. With no reservations on the CM virtual machines, and with high CPU ready times indicating that the hosts are experiencing latency while trying to provide CPU resources to the CM VMs, we have seen this cause interchanges and dual-active conditions on CM. CM is a real-time application that requires CPU/memory/storage resources at the moment it needs them.
The CPU reservation had to have been changed by someone after the VM was deployed. By default the VM deploys with reservations, but checking with vmware-toolbox-cmd showed no reservations. Some customers have an IT-department requirement that no reservations are allowed; if that is the case here, then something must be done to give the CM VMs priority (this is done by increasing the number of shares a VM has compared to other VMs within the same host or cluster).

In our lab models we see less than 0.3%-0.4% ready time on the duplex CM VMs. Ready time is a performance statistic for VMs that indicates what percentage of time a virtual machine's CPU is waiting for its turn to run on a physical CPU of the ESXi host. Some operating systems and applications can tolerate this kind of latency far better than real-time, latency-sensitive applications can (such as the CM duplex arbiter process, which is very sensitive to detecting that something is wrong and instigating interchanges). The Real-Time performance graph of the CM virtual machine for CPU Usage and Ready Time summation shows how many milliseconds the VM spent waiting for CPU. In this case the CM duplex VM has 3 vCPUs, and the real-time graphs sample at 20-second intervals.
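For reference, the Ready Time summation value from the real-time chart can be turned into a percentage with the standard conversion. The Python sketch below is a minimal illustration, assuming the 20-second real-time sampling interval and the 3 vCPUs mentioned above; the 600 ms sample value is made up for the example, not taken from this case.

# Convert a vCenter real-time "CPU Ready" summation (milliseconds) into a percentage.
SAMPLE_INTERVAL_MS = 20000  # real-time performance charts sample every 20 seconds

def cpu_ready_percent(ready_summation_ms: float, num_vcpus: int) -> float:
    """Percentage of the sample interval the VM spent waiting for a physical CPU."""
    return ready_summation_ms / (SAMPLE_INTERVAL_MS * num_vcpus) * 100.0

# Hypothetical sample: a 3-vCPU CM duplex VM reporting 600 ms of ready time in one
# 20-second sample is at 1% ready time -- well above the 0.3%-0.4% seen on healthy
# lab duplex VMs.
print(cpu_ready_percent(600, 3))  # 1.0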
In a different case where similar ECS log excerpts were found on CM in VMware, the root cause turned out to be a possible network issue: a large number of CRC errors was found on the switch port of the storage VM "acm1b", which interchanged/rebooted around the time the fault was reported.
smb-n5548-1# sh int Eth 1/25 | grep -i crc
0 runts 0 giants 978085 CRC 0 no buffer
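To confirm whether the CRC counter on a suspect port is still climbing, the counter can be extracted from two samples of the "show interface" output taken some time apart. The Python helper below is hypothetical (it is not part of the switch CLI) and only assumes the counter line format shown above; the second sample value is made up for illustration.

import re

def crc_count(counter_line: str) -> int:
    """Extract the CRC error counter from a 'show interface' error-counter line."""
    match = re.search(r"(\d+)\s+CRC", counter_line)
    if match is None:
        raise ValueError("no CRC counter found in line")
    return int(match.group(1))

# Two samples of the same counter line taken a few minutes apart
earlier = "0 runts 0 giants 978085 CRC 0 no buffer"
later = "0 runts 0 giants 981212 CRC 0 no buffer"  # illustrative second reading

print(crc_count(later) - crc_count(earlier))  # a growing delta points to a bad cable/SFP/port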
Scenario 2:
For the server interchange: this is a known issue with CM running on VMware; see PSN020151u for the fix on the VMware side.
For the phones rebooting: a network issue (network fluctuations) or a power/voltage issue at the site. From the messages logs we found that only one particular port network was affected, and when we checked a particular station where the issue was occurring we found error type 1537.
Error Type 1537 indicates that the link has gone down between the terminal and its gateway to the switch. This likely means that the terminal has unregistered (IP terminal only).
Scenario 3:
- As per Tier 4, this is the first time they have seen an issue where a CD-ROM change caused CM to hang.
- The logs below from ESXi for CM-B show a user connection and changes to the CD-ROM (note that the ESXi time is 6 hours ahead of CM):
2018-11-19T15:44:10.240Z| mks| I125: SOCKET 5 (144) ===> Creating VNC remote connection.
2018-11-19T15:44:17.990Z| vmx| I125: [msg.cdromdisconnect.locked] ===> The guest operating system has locked the CD-ROM door and is probably using the CD-ROM, which can prevent the guest from recognizing media changes.
2018-11-19T15:44:39.624Z| vcpu-0| I125: Tools: ===>Tools heartbeat timeout.
2018-11-19T15:45:23.145Z| vmx| I125: VigorTransport_ServerSendResponse opID=98BDAE1200000186-35-a4-93c2 seq=1191154: ===> Completed Disk request
2018-11-19T15:45:54.388Z| mks| I125: SOCKET 5 (144) recv error 104: ===>Connection reset by peer.
2018-11-19T15:45:54.388Z| mks| I125: SOCKET 5 (144) ===>VNC Remote Disconnect
The logs below explain the issue in detail.
1. CM-A, which was active at the time, logged the messages below when it detected that CM-B was hung (the hang was caused by the CD-ROM changes shown in the ESXi logs above):
20181119:094412965:[s1 Heartbeat Timeout from alt side]
20181119:094420007:[s1 pe2_soh_tst: /usr/sbin/arping -I 'eth0' -f -c 3 -w 1 '10.80.139.52'FAILED eth0]
2. CM-A detected a dual-active condition and decided to back off, i.e. it went to STANDBY mode by doing a RELOAD:
20181119:094518334:[s1 dual-active detection: me 0@16 vs msg 0@17]
20181119:094518337:[s1 ACTIVE ->STANDBY:dual-active backoff]
20181119:094518338:[s1 Have border do spontaneous RELOAD]
3. CM-B logged the following messages when it came out of the hung state; 4068 ticks is approximately 81.36 seconds (4068 x 20 ms):
20181119:094523248:[s2 def_main: ACHTUNG! DELTA 4068 ticks! WON'T PLAY CATCH-UP]
20181119:094527249:[s2 Have border do spontaneous RELOAD]
4. CM-B identified CM-A as hung (server mode); this is because CM-A was doing a RELOAD (as explained above). The difference here is that CM-A appeared hung because it was reloading, while the events on the ESXi host caused CM-B to go into a genuinely hung state:
20181119:094527249:[s2 Heartbeat Timeout from ACTIVE]
20181119:094527249:[s2 finite_state_machine(): Interchange Reason: Hung Server]
20181119:094527251:[s2 restart_req: source interchange, request RELOAD, current escalation level 0, doing RELOAD]
5. CM-A came out of the "RELOAD" state:
20181119:094531121:[s1 Release information: R017x.01.0.532.0:cm71integ:bambooagent:/usr/add-on/field_base4/cm7.1/SP/cm7-532-24811SP.pj@07/23/18 09:52:39 AM]
20181119:094531137:[s1 begin: restartcause level RELOAD (0x484), cause "interchange" (0xdb00), mode "Standby" (1)]
6. CM-B did a RELOAD to go active because it thought CM-A was hung:
20181119:094539500:[s2 begin: restartcause level RELOAD (0x484), cause "interchange" (0xdb00), mode "Active" (0)]
From these logs we can see that the active and standby servers lost heartbeat and CM then went into a dual-active state, which caused the server interchange. This is a known issue related to the VMware virtualized environment.
Solution:
Addressed in PSN020151u: https://support.avaya.com/css/public/documents/101001901
Manufacturer Release notes:
Disclaimer: intlx Solutions Knowledge Base
The information contained in this knowledge base ("Content") is provided for informational purposes only and is intended to be a general resource. intlx Solutions does not guarantee the accuracy, completeness, or timeliness of the Content.
Use at Your Own Risk: By accessing and using the Content, you agree that you do so at your own risk. intlx Solutions assumes no responsibility for any errors or omissions in the Content, nor for any damages or losses you may suffer arising out of or related to the use of the Content.
Current Customers: If you are a current intlx Solutions customer and have questions or require further clarification on any information presented here, please do not hesitate to contact our support team directly. They are available to assist you and ensure you have the most up-to-date information specific to your needs.
Not a Customer? If you are not currently an intlx Solutions customer but are interested in learning more about our solutions and how we can help your business, please click here. We look forward to the opportunity to discuss your needs and explore how a partnership with intlx Solutions can benefit you.
Thank you for your understanding.
This article contains data that is aimed at helping fix an issue with a product. Please use this information at your own risk as intlx Solutions is not responsible for actions taken by the steps or procedures shown in these articles.