Cisco Unity Connection Cluster Operation and Troubleshooting
This document discusses Unity Connection clustering, the redundancy option for the Unity Connection voicemail system. Before proceeding, please first read through the Cisco Unity Connection Cluster Configuration and Administration Guide.
Introduction
The Cisco Unity Connection (CUC) 7.0 failover feature uses Informix Dynamic Server (IDS) Enterprise Replication (ER) at the database layer. It uses a multimaster database replication topology for all databases except the mailbox databases. To avoid the possibility of a message being received simultaneously on two servers (and one message effectively overwriting the other), only one copy of the mailbox database is writable at a time. The directory database, which remains writable on both nodes, can also encounter such collisions, so a "last write wins" mechanism is employed there.
In addition, Unity Connection has two roles, Primary and Secondary, which are independent of a server's publisher/subscriber status in the cluster. In normal operation, the Publisher should be the Primary and the Subscriber the Secondary. The Primary server is responsible for running the services that can only run on one node of a cluster at a time (the so-called "singleton" processes), such as the Notifier, the Message Transfer Agent (MTA), and System Agent tasks. The Primary handles all writes to the message store database(s), and certain master files, such as the encryption key and the certificates, are managed on the Primary and then replicated to the Secondary. The benefit of this approach is that the system remains completely manageable and allows configuration changes and user updates even if the Publisher node is offline. Because singleton processes run on only one server, notifications for MWI or to outside numbers, as well as outbound VPIM traffic and email notifications, always originate from the Primary node. Inbound VPIM traffic must always reach the Primary server. This is an important consideration for any design.
These roles are managed by the Server Role Manager (SRM). This service, which runs on both nodes, implements the failover logic and determines which side is the Primary. It maintains a heartbeat between the servers and is responsible for running Split-Brain Recovery (SBR), which occurs when both servers have assumed the Primary role (for example, they could not talk to one another, each became Primary, and now they can communicate again). SRM talks to the Service Manager (ServM) to determine the status of critical processes and requests that ServM start the singleton processes after a failover.
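To make the role logic easier to picture, here is a minimal Python sketch of the behavior just described: missed heartbeats cause the Secondary to take over, and a reconnect that finds both sides holding the Primary role triggers SBR. This is purely conceptual; it is not the actual SRM code, and the class and constant names (and the missed-beat tolerance) are assumptions made for the example.

# Conceptual sketch only -- not the real SRM implementation. Names are illustrative.
HEARTBEAT_INTERVAL_SECS = 5   # SRM exchanges keepalives every 5 seconds
MISSED_BEATS_LIMIT = 3        # assumption: a small tolerance before declaring the peer dead

class SrmNode:
    def __init__(self, is_publisher):
        self.is_publisher = is_publisher
        self.role = "PRIMARY" if is_publisher else "SECONDARY"
        self.missed_beats = 0

    def on_heartbeat_timeout(self):
        """Called each time a keepalive from the peer is missed."""
        self.missed_beats += 1
        if self.missed_beats >= MISSED_BEATS_LIMIT and self.role == "SECONDARY":
            # Peer is presumed down: take over the Primary role and ask
            # ServM to start the singleton services (Notifier, MTA, ...).
            self.role = "PRIMARY"
            self.start_singleton_services()

    def on_peer_reconnected(self, peer_role):
        """Called when the heartbeat session is re-established."""
        self.missed_beats = 0
        if self.role == "PRIMARY" and peer_role == "PRIMARY":
            # Both sides acted as Primary while disconnected: run SBR.
            self.run_split_brain_recovery()

    def start_singleton_services(self):
        pass  # placeholder: request ServM to activate MTA, Notifier, etc.

    def run_split_brain_recovery(self):
        pass  # placeholder: reconcile the directory and mailbox databases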
Besides database redundancy, Connection uses a file replication mechanism (the "Connection File Syncer", or CuFileSync) to replicate messages and audio files, which are not stored in the database. This is done via SFTP. The following files must be replicated (a conceptual sketch of a sync pass appears after the list):
- Message audio files (/var/opt/cisco/connection/mail/*)
- Spoken names/greeting (/var/opt/cisco/connection/lib/*)
- Certificates (/var/opt/cisco/connection/security/*)
- CuEncrypt key files (/opt/cisco/connection/.security/*)
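As a rough picture of what one synchronization pass over these directories could look like, the following Python sketch uses the paramiko SFTP client to copy everything to the peer. It is not the CuFileSync implementation; the peer hostname, credentials, and the brute-force "copy every file" approach are assumptions made only for illustration.

# Minimal one-way SFTP copy sketch -- NOT the actual CuFileSync service.
# Host, credentials, and the "copy everything" approach are assumptions
# made purely for illustration.
import os
import paramiko

REPLICATED_DIRS = [
    "/var/opt/cisco/connection/mail",      # message audio files
    "/var/opt/cisco/connection/lib",       # spoken names / greetings
    "/var/opt/cisco/connection/security",  # certificates
    "/opt/cisco/connection/.security",     # CuEncrypt key files
]

def sync_to_peer(peer_host, username, password):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(peer_host, username=username, password=password)
    sftp = client.open_sftp()
    try:
        for base in REPLICATED_DIRS:
            for dirpath, _dirnames, filenames in os.walk(base):
                for name in filenames:
                    local_path = os.path.join(dirpath, name)
                    # assumes the same directory already exists on the peer
                    sftp.put(local_path, local_path)
    finally:
        sftp.close()
        client.close()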
This document is not intended to go over all aspects of database replication and file synchronization, but rather to look at the operation of the SRM, since it triggers many of these other functions.
UC Failover in Operation
The Connection Server Role Manager (SRM) runs on both servers and keeps track of each server's current role. It uses a 5-second keepalive mechanism. As of UC7.1.3, this communication is over TCP: the subscriber chooses a random source port and connects to port 22001 on the publisher.
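When SRM heartbeat problems are suspected, a basic TCP reachability test of the publisher's SRM port can help rule out simple network issues. The sketch below is illustrative only (the hostname is a placeholder); a successful connect proves only that the port answers, not that SRM itself is healthy.

# Quick TCP reachability probe of the SRM listener (port 22001 on the publisher).
# "cuc-pub.example.com" is a placeholder hostname.
import socket

def srm_port_reachable(publisher_host, port=22001, timeout=5.0):
    try:
        with socket.create_connection((publisher_host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "cuc-pub.example.com"
    print(f"{host}:22001 reachable: {srm_port_reachable(host)}")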
From the Serviceability page, one can access Tools > Cluster Management to view the cluster status, or issue the 'show cuc cluster status' command from the CLI.
Cluster Status
NOTE: These show commands are taken from UC7.1.3. Prior releases did not indicate the replication status (and in those releases there is no easy way to find the replication status of the CUC databases from the CLI).
admin:show cuc cluster status

Server Name   Member ID  Server State  Internal State  Reason
-----------   ---------  ------------  --------------  ------
cuc1a         0          Primary       Pri Active      Normal
cuc1b         1          Secondary     Sec Active      Normal

SERVER               ID   STATE   STATUS     QUEUE    CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a   100  Active  Local      0
g_ciscounity_cuc1b   101  Active  Connected  0        Jul 14 15:13:14
This output is from the "cuc1a" node (note that its replication status is "Local"). We see that it is the Primary and is the publisher, since its Member ID is 0. Furthermore, replication is active and it is connected to the subscriber's database. The cluster status (Primary/Secondary) is learned via replication, so if replication is not working, this output may not be correct.
For example, let's say the network between the two servers is interrupted. Now the status may appear as follows. Note that this command shows what the local server (the one where the command is run) is seeing. The status of the other side is learned via replication, so during a network disconnect that information is obviously not being updated.
admin:show cuc cluster status

Server Name   Member ID  Server State  Internal State            Reason
-----------   ---------  ------------  -----------------------   ------
cuc1a         0          Primary       Pri Active Disconnected   Normal
cuc1b         1          Secondary     Sec Active                Normal

SERVER               ID   STATE   STATUS      QUEUE    CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a   100  Active  Local       0
g_ciscounity_cuc1b   101  Active  Connecting  2259     Jul 29 10:29:52
This output is from the publisher again. Note that its internal state is Pri Active Disconnected, but the state it reports for the subscriber is essentially stale and incorrect. The replication status is "Connecting", so it is clearly not receiving updates. When we look at the same command on the secondary, we see:
Server Name   Member ID  Server State  Internal State                 Reason
-----------   ---------  ------------  ----------------------------   ------
cuc1a         0          Primary       Pri Active                     Normal
cuc1b         1          Primary       Sec Act Primary Disconnected   Normal

SERVER               ID   STATE   STATUS      QUEUE    CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a   100  Active  Connecting  2938     Jul 29 10:29:59
g_ciscounity_cuc1b   101  Active  Local       0
Both servers show the Primary status.
Once connectivity is restored, one may initially see the following on the Primary:
admin:show cuc cluster status

Server Name   Member ID  Server State            Internal State  Reason
-----------   ---------  ----------------------  --------------  ------
cuc1a         0          Split Brain Resolution  Pri SBR         Normal
cuc1b         1          Secondary               Sec Active      Normal

SERVER               ID   STATE   STATUS   QUEUE    CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a   100  Active  Local    0
g_ciscounity_cuc1b   101  Active  Dropped  5554742  Jul 14 15:08:14
Here the Publisher/Primary has kicked off SBR. Replication is still down, but as things start working again, both servers' states will become 'Split Brain Resolution'.
admin:show cuc cluster status

Server Name   Member ID  Server State            Internal State  Reason
-----------   ---------  ----------------------  --------------  ------
cuc1a         0          Split Brain Resolution  Pri SBR         Normal
cuc1b         1          Split Brain Resolution  Sec SBR         Normal

SERVER               ID   STATE   STATUS     QUEUE    CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a   100  Active  Local      0
g_ciscounity_cuc1b   101  Active  Connected  17905    Jul 15 09:49:58
There are several intermediate states before the cluster returns to normal; however, these should complete fairly quickly.
SRM Traces
The first thing to check when troubleshooting failover-related issues is the SRM logs. The following is a portion of the trace from the Primary/Publisher, which was disconnected from the network:
07/15/2009 10:36:14.268 |18039,,,SRM,3,<Timer-241> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.322 |15252,,,-1,-1,alarm id 9202|
07/15/2009 10:36:14.322 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1a, AlarmMsg = NoConnectionToPeer AppID : CuSrm ClusterID : NodeID : cuc1a |
07/15/2009 10:36:14.322 |15252,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1b.vnt.cisco.com in the cluster. The remote server may be down.|
...
07/15/2009 10:36:14.631 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [disconnect] [PUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Done!! - PUBLISHER_PRIMARY [ 0secs. ]|
07/15/2009 10:36:15.011 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...
On the subscriber it looks like this. Note that it starts the MTA and Notifier services and makes itself primary.
07/15/2009 10:36:14.964 |28031,,,SRM,3,<Timer-9> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.975 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1b, AlarmMsg = NoConnectionToPeer AppID : CuSrm ClusterID : NodeID : cuc1b |
...
07/15/2009 10:36:14.975 |14161,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1a.vnt.cisco.com in the cluster. The remote server may be down.|
07/15/2009 10:36:15.078 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [disconnect] [SUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:15.095 |14161,,,SRM,5,<evt> Entering SUB_PRIMARY_DISCONNECTED without token|
07/15/2009 10:36:15.101 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Start - SUBSCRIBER_PRIMARY|
07/15/2009 10:36:15.267 |13952,,,SRM,5,<dbmon> Cluster information has been changed in the database. ClusterMasterObjectId: c435dcac-d7b8-45ba-b879-785db246f060|
07/15/2009 10:36:15.499 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection Notifier, Connection Groupware Caching Service, |
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Done!! - SUBSCRIBER_PRIMARY [ 50secs. ]|
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...
When the network connectivity is restored, the primary shows the following:
07/15/2009 10:43:14.233 |15254,,,SRM,5,<listener> Received connection from /14.84.158.21:57590|
07/15/2009 10:43:14.864 |29106,,,SRM,5,<sess-241> [rcv] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |29106,,,SRM,5,<sess-241> [snd] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [sbr_required] [PUB_SBR]|
07/15/2009 10:43:16.397 |15252,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.397 |15252,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.398 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.565 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Start - PUBLISHER_PRIMARY_SBR|
07/15/2009 10:43:16.871 |15252,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection Message Transfer Agent, Connection System Agent, |
07/15/2009 10:44:06.674 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Done!! - PUBLISHER_PRIMARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
...
07/15/2009 11:01:33.670 |15048,,,SRM,5,<CM> Command: /opt/cisco/connection/bin/sbrscript execution completed successfully|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [PUB_SBR] [sbr_done] [PUB_CHOOSE_ROLE]|
07/15/2009 11:01:33.671 |15252,,,-1,-1,alarm id 9217|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: SbrComplete|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [PUB_CHOOSE_ROLE] [go_primary] [PUB_PRIMARY]|
07/15/2009 11:01:34.477 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 11:01:34.638 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.991 |15252,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection System Agent, Connection File Syncer, |
07/15/2009 11:02:19.993 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Done!! - PUBLISHER_PRIMARY [ 45secs. ]|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_initializing] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:33.172 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.232 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.234 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_in_db_sync] ignored|
07/15/2009 11:02:40.721 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.800 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:03:30.832 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: SECONDARY|
07/15/2009 11:03:30.833 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_secondary] [PUB_PRIMARY]|
The Secondary shows the following:
07/15/2009 10:43:14.238 |14162,,,SRM,5,<session> Connection to cuc1a.vnt.cisco.com/14.84.158.20:22001|
07/15/2009 10:43:14.828 |14162,,,SRM,5,<session> [snd] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.402 |14162,,,SRM,5,<session> [rcv] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.404 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [sbr_required] [SUB_SBR]|
07/15/2009 10:43:16.414 |14161,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.414 |14161,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.436 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Start - SUBSCRIBER_SECONDARY_SBR|
07/15/2009 10:43:16.675 |14161,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection System Agent, Connection Groupware Caching Service, Connection Message Transfer Agent, Connection Notifier, |
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Done!! - SUBSCRIBER_SECONDARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
07/15/2009 10:44:06.694 |14161,,,-1,-1,alarm id 9224|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_primary] ignored|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
...
07/15/2009 10:44:06.928 |14162,,,SRM,5,<session> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.928 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_in_sbr] ignored|
07/15/2009 10:44:07.090 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 10:44:07.100 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 11:01:34.420 |14162,,,SRM,5,<session> [rcv] Type: SbrComplete|
07/15/2009 11:01:34.420 |14161,,,SRM,5,<evt> [SUB_SBR] [sbr_done] [SUB_CHOOSE_ROLE]|
07/15/2009 11:01:34.421 |14161,,,-1,-1,alarm id 9217|
07/15/2009 11:01:34.421 |14161,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.421 |14162,,,SRM,5,<session> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.615 |14161,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.616 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_initializing] ignored|
07/15/2009 11:01:34.650 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.663 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.675 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:02:20.153 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 11:02:20.174 |14161,,,-1,-1,alarm id 9224|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_primary] [DB_SYNC]|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [DB_SYNC] [db_sync_unnecessary] [SUB_SECONDARY]|
07/15/2009 11:02:40.258 |14161,,,SRM,5,<evt> Cluster information has been changed in the database. ClusterMasterObjectId: ece97bfd-f8b1-409a-8dca-4402311555cb|
07/15/2009 11:02:40.264 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Start - SUBSCRIBER_SECONDARY|
07/15/2009 11:02:40.778 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection System Agent, Connection File Syncer, |
07/15/2009 11:03:30.784 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Done!! - SUBSCRIBER_SECONDARY [ 50secs. ]|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [snd] Type: State State: SECONDARY|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [db_state_change] [SUB_SECONDARY]|
Split-brain Recovery (SBR)
SBR is the process that runs when the publisher and subscriber discover that they have both been acting as Primary (that is, they were unable to communicate with one another). A script, /opt/cisco/connection/bin/sbrscript, is run; it calls a check-aborted-transactions script followed by another script that resolves the mailbox databases. The whole process can take anywhere from a few minutes up to an hour. After an hour, the process is interrupted. If the process takes the full hour (due to a defect or some other issue), the FsmTimer expires and the SRM logs look like this:
07/29/2009 07:15:52.949 |9380,,,SRM,5,Thread=EH [9216] Split-brain resolution procedure has been initiated.|
...
07/29/2009 08:16:58.318 |9379,,,SRM,5,Thread=FsmTimer [9217] Split-brain resolution procedure run successfully.|
Note that it's clearly an hour later and the FsmTimer expired. When it works normally, it will appear as "SRM,5,<evt> [9217] Split-brain resolution procedure run successfully."
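The one-hour limit behaves like a watchdog around the resolution work. The following sketch is only an illustration of that idea (it is not how SRM actually invokes sbrscript): a long-running command is abandoned once it exceeds the allotted hour.

# Illustration of a one-hour watchdog around a long-running script.
# This is NOT how SRM invokes sbrscript; it only shows the idea that the
# work is abandoned if it exceeds the allotted hour.
import subprocess

SBR_TIMEOUT_SECS = 60 * 60  # the one-hour cap described above

def run_with_watchdog(command):
    # command is a list, e.g. ["/opt/cisco/connection/bin/sbrscript"]
    try:
        subprocess.run(command, check=True, timeout=SBR_TIMEOUT_SECS)
        return "sbr_done"            # normal completion
    except subprocess.TimeoutExpired:
        return "sbr_timed_out"       # analogous to the FsmTimer firing
    except subprocess.CalledProcessError:
        return "sbr_failed"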
check-aborted-transactions
The bulk of the time spent in SBR is spent running this script. Logs for it are written to the aborted-transactions-resolution-<date/time> logs (file list activelog cuc/aborted-transactions-resolution*). These logs do not currently have timestamps for the individual commands, only the timestamp in the log file name. If there are aborted transactions, they are written and compressed into aborted-transactions-<date/time>.tgz. So what does this script do?
- It waits for CDR to process its replication queues
- Checks for aborted transactions. You'll see in the logs something like "Transactions have been aborted. Processing 959 ATS files."
- It tars and compresses those aborted transactions into the aborted-transactions-<date/time>.tgz file.
- Next, it repairs the databases. This is the longest step, during which the databases are merged. Upon completion, a "Repair successful" message is logged; this is the end of the script. For each database, the repair:
- disables the database constraints
- runs cdr check replset, which compares the data on the servers and repairs any inconsistencies. For each, you get a report showing any inconsistencies found.
- re-enables constraints
This entire process is logged in a file named "aborted-transactions-resolution-yy-mm-ddT<timestamp>". A conceptual sketch of this repair pass follows.
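Put together, the script's behavior can be summarized by the following conceptual sketch. All of the helper functions are hypothetical stand-ins that mirror the steps listed above; they are not real commands or Unity Connection APIs.

# Conceptual outline of the SBR database repair pass described above.
# Every helper here is a hypothetical stub used purely for illustration;
# on the appliance the real work is done by the Informix/CDR tooling.
def wait_for_replication_queues_to_drain(): pass
def collect_aborted_transaction_files(): return []          # the "ATS files", if any
def archive_as_tgz(files): pass                              # aborted-transactions-<date/time>.tgz
def disable_constraints(db): pass
def compare_and_repair(db): return "no inconsistencies"      # stand-in for the cdr check/repair step
def enable_constraints(db): pass

def resolve_aborted_transactions(databases):
    wait_for_replication_queues_to_drain()
    ats_files = collect_aborted_transaction_files()
    if ats_files:
        archive_as_tgz(ats_files)
    for db in databases:
        disable_constraints(db)
        print(db, compare_and_repair(db))    # per-database inconsistency report
        enable_constraints(db)
    # a "Repair successful" style message marks the end of the pass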
SBR Errors during mailbox reconciliation
At the end of SBR, the mailbox databases must be consolidated to ensure that no messages are lost and that they are numbered properly. The process follows this logic for each mailbox database (a conceptual sketch follows the list):
- Find the time of the last known good communication
- Find all mailstores
- Unmount the first one
- Connect to the mailstore
- Renumber messages
- Refresh mail counts and mailbox sizes
- Remount mailstore
- Repeat for other mailbox store databases
- Initiate a full MWI resync
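The per-mailstore portion of that sequence can be pictured with the sketch below. As in the earlier sketches, the helpers are hypothetical stubs that simply mirror the listed steps; they are not Unity Connection APIs.

# Conceptual sketch of the mailbox-store reconciliation loop during SBR.
# All helpers are hypothetical stubs mirroring the steps listed above.
def find_mailstores(): return ["UnityMbxDb1", "UnityMbxDb2"]
def last_known_good_time(): return "2009-07-15 10:36:14"
def unmount(store): pass
def connect(store): pass
def renumber_messages(store, since): pass
def refresh_counts_and_sizes(store): pass
def remount(store): pass
def full_mwi_resync(): pass

def reconcile_mailstores():
    since = last_known_good_time()
    for store in find_mailstores():
        unmount(store)                  # users on this store briefly hear
        connect(store)                  # "your messages are currently not available"
        renumber_messages(store, since)
        refresh_counts_and_sizes(store)
        remount(store)
    full_mwi_resync()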
Obviously, while a mailbox database is unmounted, the users homed on it cannot access their messages; they hear "your messages are currently not available" and similar prompts. Even on large databases, this typically does not take longer than about a minute or two.
If there are errors during the mailbox resolution process, in UC7.1.3 they are written to the mbxsbr.txt file. Errors could potentially look something like this:
Error resolving splitbrain, database: UnityMbxDb1 objectid: dfacd9cf-2fad-4b86-8a04-31daae6b61d8
SQLCODE -458 in EXECUTE: IX000: Long transaction aborted. IX000: RSAM error: Long transaction detected.

Error resolving splitbrain, database: UnityMbxDb2 objectid: 09bef2f6-31b7-416f-adbc-89b70b835f54
SQLCODE -387 in CONNECT: IX000: No connect permission. IX000: ISAM error: no record found.
Other Conditions
Replication Queue Filling up
By default, if the servers are disconnected for so long that the replication queue grows to 90% of its maximum threshold, replication is stopped. The SRM diagnostics will show something like:
07/28/2009 23:53:56.159 |10334,,,SRM,3,<Timer-0> Replication queue size: 93.0 has exceeded the maximum threshold value. Stopping replication.|
...
07/28/2009 23:54:04.452 |10333,,,SRM,5,<CM> Command: /opt/cisco/connection/lib/config-modules/dbscripts/repl-helper stop_replication execution completed successfully|
07/28/2009 23:54:06.161 |10334,,,SRM,5,<Timer-0> DB replication has been stopped.|
07/28/2009 23:54:06.224 |10334,,,SRM,5,<Timer-0> [9222] Database replication queue size has exceeded the maximum threshold. Replication between redundant servers has been stopped.|
An alarm is also sent, so this condition is visible in RTMT. After this, replication has to be set up again from the Serviceability page by activating the subscriber under Tools > Cluster Management, or with the utils cuc cluster activate CLI command.
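The safeguard itself is essentially a periodic threshold check, along the lines of this small sketch. It is illustrative only; the 90% figure comes from the behavior described above, and stop_replication here stands in for the repl-helper script that SRM actually calls.

# Illustrative threshold check only -- not the real SRM timer code.
QUEUE_STOP_THRESHOLD_PCT = 90.0

def check_replication_queue(queue_pct_of_max, stop_replication):
    """stop_replication is a stand-in for SRM invoking the repl-helper script."""
    if queue_pct_of_max >= QUEUE_STOP_THRESHOLD_PCT:
        stop_replication()
        return "replication stopped; reactivate via Tools > Cluster Management"
    return "replication still running"

# Example: a queue at 93% of its maximum (as in the trace above) trips the check.
print(check_replication_queue(93.0, stop_replication=lambda: None))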
Drives filling up
Any time the drives fill up, it is a problem. RTMT generates alerts for this, and from the CLI, show diskusage <partition> also helps identify the issue.
Replication Processes hung or coredump
In some instances, the replication processes themselves may hang or produce a core dump. A core dump (from CDR, for example) will be in /var/log/active/core, the same place as other cores. It is possible for a cdr process to hang while SRM continues to work fine, so it is important to look for hung processes (ps auxf | grep cdr will list the cdr-related processes). If a process has been running far longer than expected, chances are it is hung.
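To make the "running far longer than expected" check a little more concrete, the following sketch shells out to ps and flags cdr-related processes with a large elapsed time. It is illustrative only; the 24-hour cutoff is an arbitrary example, not a Cisco-defined threshold.

# Flag cdr-related processes that have been running for an unusually long time.
# Illustrative only; the 24-hour cutoff is an arbitrary example, not a Cisco value.
import subprocess

def etime_to_seconds(etime):
    """Parse the ps etime format: [[dd-]hh:]mm:ss."""
    if "-" in etime:
        days, rest = etime.split("-", 1)
    else:
        days, rest = "0", etime
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds

def long_running_cdr_processes(cutoff_secs=24 * 3600):
    out = subprocess.run(["ps", "-eo", "pid,etime,args"],
                         capture_output=True, text=True).stdout
    suspects = []
    for line in out.splitlines()[1:]:
        pid, etime, args = line.split(None, 2)
        if "cdr" in args and etime_to_seconds(etime) > cutoff_secs:
            suspects.append((pid, etime, args))
    return suspects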
utils cuc cluster CLI commands
- utils cuc cluster activate - described earlier
- utils cuc cluster deactivate - disables replication and forces the node into a deactivated state. Sometimes used when replacing the publisher.
- utils cuc cluster makeprimary - run from the node that is currently the primary. Makes the other server the primary.
- utils cuc cluster overwritedb - copies data from the publisher to the subscriber
- utils cuc cluster renegotiate - used when a publisher is being replaced to join the new publisher to the cluster and then copy over the database from the subscriber to the publisher
RTMT
Alarms
In many cases, the only indication that something went wrong is when RTMT generates an alarm on each node ("NoConnectionToPeer") from CuSrm (obviously, a failed node may not be in a position to report a failure). There should also be a system-level ServerDown alarm, and the subscriber will indicate an AutoFailoverSucceeded alarm. Keep in mind that if the nodes are not communicating with one another, RTMT traffic is affected too, so often only information from the node being monitored will be visible. There is currently no alarm indicating that SBR has initiated or completed.
Performance Counters
The replication state for the UC databases is not tracked in RTMT. There is a counter ("Number of Replicates Created and State of Replication\Replicate_State"), but it refers to the CUCM database and is therefore not relevant to CUC. There is also no way to directly see the number of messages in the drop folder (which is where messages are queued before entering the MTA queue to be processed), although one can monitor \CUC Message Store\Queued Messages Current on the primary. This is the queue that messages are moved into when they are ready to be delivered, and it is drained at 100 messages per minute. So while the drop folder is backed up, this queue will stay at around 300 (its maximum) for the duration; for example, a backlog of 3,000 messages would take roughly 30 minutes to work off at that rate. Once the drop folder is cleared, the counter drops back down to a small number, depending on the load on the system.