Monday, March 30, 2015

Basic SIP Call Flows & Troubleshooting Commands

Introduction

This document explains the basic SIP Call flow between the PBX, Gateways and SIP Phones in detail. Idea of creating this document is to help the beginners to understand the Various SIP Call flows and messages. Also this document covers the SIP Troubleshooting commands.

Prerequisites

Basic knowledge about the SIP Protocol and the call flow Messages.

Call Flow Examples

1. Call Flow between PBX to Cisco SIP IP Phone—Successful Setup and Disconnect

Below diagram illustrates a successful gateway-to-Cisco SIP IP phone call setup and disconnect. In this scenario, the two end users are User A and User B. User A is located at PBX A. PBX A is connected to Gateway 1 (SIP Gateway) via a T1/E1. User B is located at a Cisco SIP IP phone. Gateway 1 is connected to the Cisco SIP IP phone over an IP network.

The call flow is as follows:

1. User A calls User B.

2. User B answers the call.

3. User B disconnects the call.

Step	Action	Description
1.	Setup—PBX A to Gateway 1	Call Setup is initiated between PBX A and Gateway 1. The Call Setup includes the standard transactions that take place as User A attempts to call User B.
2.	INVITE—Gateway 1 to Cisco SIP IP phone	Gateway 1 maps the SIP URL phone number to a dial peer. The dial peer includes the IP address and the port number of the SIP-enabled entity to contact. Gateway 1 sends a SIP INVITE request to the address it receives as the dial peer, which, in this scenario, is the Cisco SIP IP phone. In the INVITE request: The IP address of the Cisco SIP IP phone is inserted in the Request-URI field. PBX A is identified as the call session initiator in the From field. A unique numeric identifier is assigned to the call and is inserted in the Call-ID field. The transaction number within a single call leg is identified in the CSeq field. The media capability User A is ready to receive is specified. The port on which the Gateway is prepared to receive the RTP data is specified.
3.	Call Proceeding—Gateway 1 to PBX A	Gateway 1 sends a Call Proceeding message to PBX A to acknowledge the Call Setup request.
4.	100 Trying—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 100 Trying response to Gateway 1. The 100 Trying response indicates that the INVITE request has been received by the Cisco SIP IP phone.
5.	180 Ringing—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 180 Ringing response to Gateway 1. The 180 Ringing response indicates that the user is being alerted.
6.	Alerting—Gateway 1 to PBX A	Gateway 1 sends an Alert message to User A. The Alert message indicates that Gateway 1 has received a 180 Ringing response from the Cisco SIP IP phone. User A hears the ringback tone that indicates that User B is being alerted.
7.	200 OK—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 200 OK response to Gateway 1. The 200 OK response notifies Gateway 1 that the connection has been made.
8.	Connect—Gateway 1 to PBX A	Gateway 1 sends a Connect message to PBX A. The Connect message notifies PBX A that the connection has been made.
9.	Connect ACK—PBX A to Gateway 1	PBX A acknowledges Gateway 1's Connect message.
10.	ACK—Gateway 1 to Cisco SIP IP phone	Gateway 1 sends a SIP ACK to the Cisco SIP IP phone. The ACK confirms that Gateway 1 has received the 200 OK response. The call session is now active.
11.	BYE—Cisco SIP IP phone to Gateway 1	User B terminates the call session at his Cisco SIP IP phone and the phone sends a SIP BYE request to Gateway 1. The BYE request indicates that User B wants to release the call.
12.	Disconnect—Gateway 1 to PBX A	Gateway 1 sends a Disconnect message to PBX A.
13.	Release—PBX A to Gateway 1	PBX A sends a Release message to Gateway 1.
14.	200 OK—Gateway 1 to Cisco SIP IP phone	Gateway 1 sends a SIP 200 OK response to the Cisco SIP IP phone. The 200 OK response notifies the phone that Gateway 1 has received the BYE request.
15.	Release Complete—Gateway 1 to PBX A	Gateway 1 sends a Release Complete message to PBX A and the call session is terminated.

2. Call flow between Gateway-to-Cisco SIP IP Phone Call—Successful Call Setup and Call Hold

Below diagram illustrates a successful gateway-to-Cisco SIP IP phone call setup and call hold. In this scenario, the two end users are User A and User B. User A is located at PBX A. PBX A is connected to Gateway 1 (SIP Gateway) via a T1/E1. User B is located at a Cisco SIP IP phone. Gateway 1 is connected to the Cisco SIP IP phone over an IP network.

The call flow is as follows:

1. User A calls User B.

2. User B answers the call.

3. User B puts User A on hold.

4. User B takes User A off hold.

Step	Action	Description
1.	Setup—PBX A to Gateway 1	Call setup is initiated between PBX A and Gateway 1. The call setup includes the standard transactions that take place as User A attempts to call User B.
2.	INVITE—Gateway 1 to Cisco SIP IP phone	Gateway 1 maps the SIP URL phone number to a dial peer. The dial peer includes the IP address and the port number of the SIP enabled entity to contact. Gateway 1 sends a SIP INVITE request to the address it receives as the dial peer, which, in this scenario, is the Cisco SIP IP phone. In the INVITE request: The IP address of the Cisco SIP IP phone is inserted in the Request-URI field. PBX A is identified as the call session initiator in the From field. A unique numeric identifier is assigned to the call and is inserted in the Call-ID field. The transaction number within a single call leg is identified in the CSeq field. The media capability User A is ready to receive is specified. The port on which the gateway is prepared to receive the RTP data is specified.
3.	Call Proceeding—Gateway 1 to PBX A	Gateway 1 sends a Call Proceeding message to PBX A to acknowledge the Call Setup request.
4.	100 Trying—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 100 Trying response to Gateway 1. The 100 Trying response indicates that the INVITE request has been received by the Cisco SIP IP phone.
5.	180 Ringing—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 180 Ringing response to Gateway 1. The 180 Ringing response indicates that the user is being alerted.
6.	Alerting—Gateway 1 to PBX A	Gateway 1 sends an Alert message to User A. The Alert message indicates that Gateway 1 has received a 180 Ringing response from the Cisco SIP IP phone. User A hears the ringback tone that indicates that User B is being alerted.
7.	200 OK—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP 200 OK response to Gateway 1. The 200 OK response notifies Gateway 1 that the connection has been made.
8.	Connect—Gateway 1 to PBX A	Gateway 1 sends a Connect message to PBX A. The Connect message notifies PBX A that the connection has been made.
9.	ACK—Gateway 1 to Cisco SIP IP phone	Gateway 1 sends a SIP ACK to the Cisco SIP IP phone. The ACK confirms that User A has received the 200 OK response. The call session is now active.
10.	Connect ACK—PBX A to Gateway 1	PBX A acknowledges Gateway 1's Connect message.
11.	INVITE—Cisco SIP IP phone to Gateway 1	User B puts User A on hold. The Cisco SIP IP phone sends a SIP INVITE request to Gateway 1.
12.	200 OK—Gateway 1 to Cisco SIP IP phone	Gateway 1 sends a SIP 200 OK response to the Cisco SIP IP phone. The 200 OK response notifies the Cisco SIP IP phone that the INVITE was successfully processed.
13.	ACK—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP ACK to Gateway 1. The ACK confirms that the Cisco SIP IP phone has received the 200 OK response. The call session is now temporarily inactive. No RTP packets are being sent.
14.	INVITE—Cisco SIP IP phone to Gateway 1	User B takes User A off hold. The Cisco SIP IP phone sends a SIP INVITE request to Gateway 1.
15.	200 OK—Gateway 1 to Cisco SIP IP phone	Gateway 1 sends a SIP 200 OK response to the Cisco SIP IP phone. The 200 OK response notifies the Cisco SIP IP phone that the INVITE was successfully processed.
16.	ACK—Cisco SIP IP phone to Gateway 1	The Cisco SIP IP phone sends a SIP ACK to Gateway 1. The ACK confirms that the Cisco SIP IP phone has received the 200 OK response. The call session is now active.

3. Call flow between Cisco SIP IP Phone-to-Cisco SIP IP Phone Simple Call Hold

Below diagram illustrates a successful call between Cisco SIP IP phones in which one of the participants places the other on hold and then returns to the call. In this call flow scenario, the two end users are User A and User B. User A and User B are both using Cisco SIP IP phones, which are connected via an IP network.

The call flow scenario is as follows:

1. User A calls User B.

2. User B answers the call.

3. User B places User A on hold.

4. User B takes User A off hold.

5. The call continues.

Step	Action	Description
1.	INVITE—Cisco SIP IP phone A to Cisco SIP IP phone B	Cisco SIP IP phone A sends a SIP INVITE request to Cisco SIP IP phone B. The INVITE request is an invitation to User B to participate in a call session. In the INVITE request: The phone number of User B is inserted in the Request-URI field in the form of a SIP URL. The SIP URL identifies the address of User B and takes a form similar to an e-mail address (user@host, where user is the telephone number and host is either a domain name or a numeric network address). For example, the Request-URI field in the INVITE request to User B appears as "INVITE sip:555-0002@companyb.com; user=phone." The "user=phone" parameter distinquishes that the Request-URI address is a telephone number rather than a username. Cisco SIP IP phone A is identified as the call session initiator in the From field. A unique numeric identifier is assigned to the call and is inserted in the Call-ID field. The transaction number within a single call leg is identified in the CSeq field. The media capability User A is ready to receive is specified.
2.	180 Ringing—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a SIP 180 Ringing response to Cisco SIP IP phone A.
3.	200 OK—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a SIP 200 OK response to Cisco SIP IP phone A. The 200 OK response notifies Cisco SIP IP phone A that the connection has been made. If Cisco SIP IP phone B supports the media capability advertised in the INVITE message sent by Cisco SIP IP phone A, it advertises the intersection of its own and Cisco SIP IP phone A's media capability in the 200 OK response. If Cisco SIP IP phone B does not support the media capability advertised by Cisco SIP IP phone A, it sends back a 400 Bad Request response with a 304 Warning header field.
4.	ACK—Cisco SIP IP phone A to Cisco SIP IP phone B	Cisco SIP IP phone A sends a SIP ACK to Cisco SIP IP phone B. The ACK confirms that Cisco SIP IP phone A has received the 200 OK response from Cisco SIP IP phone B. The ACK might contain a message body with the final session description to be used by Cisco SIP IP phone B. If the message body of the ACK is empty, Cisco SIP IP phone B uses the session description in the INVITE request.
A two-way RTP channel is established between Cisco SIP IP phone A and Cisco SIP IP phone B.
5.	INVITE—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a mid-call INVITE to Cisco SIP IP phone A with new Session Description Protocol (SDP) session parameters (IP address), which are used to place the call on hold. Call_ID=1 SDP: c=IN IP4 0.0.0.0 The c= SDP field of the SIP INVITE contains an 0.0.0.0. This places the call in hold.
6.	200 OK—Cisco SIP IP phone A to Cisco SIP IP phone B	Cisco SIP IP phone A sends a SIP 200 OK response to Cisco SIP IP phone B.
7.	ACK—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a SIP ACK to Cisco SIP IP phone A. The ACK confirms that Cisco SIP IP phone B has received the 200 OK response from Cisco SIP IP phone A.
The RTP channel between Cisco SIP IP phone A and Cisco SIP IP phone B is torn down.
8.	INVITE—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a mid-call INVITE to Cisco SIP IP phone A with the same call ID as the previous INVITE and new SDP session parameters (IP address), which are used to reestablish the call. Call_ID=1 SDP: c=IN IP4 181.23.250.2 To reestablish the call between phone A and phone B, the IP address of phone B is inserted into the c= SDP field.
9.	200 OK—Cisco SIP IP phone A to Cisco SIP IP phone B	Cisco SIP IP phone A sends a SIP 200 OK response to Cisco SIP IP phone B.
10.	ACK—Cisco SIP IP phone B to Cisco SIP IP phone A	Cisco SIP IP phone B sends a SIP ACK to Cisco SIP IP phone A. The ACK confirms that Cisco SIP IP phone B has received the 200 OK response from Cisco SIP IP phone A.
A two-way RTP channel is reestablished between IP phone A and IP phone B.

Defining SIP Port in Cisco Unified Communications Manager

SIP Troubleshooting

On Unified Communications Manager use RTMT to check SIP traces in UC Manager
"debug ccsip calls"
"debug ccsip all"
"debug ccsip error"
"debug ccsip events"
"debug ccsip messages"
"debug ccsip states"
"show sip-ua call"
- Displays active UAC and user agent server (UAS) information on sip calls
" show call active voice brief"
- Displays active call information for voice calls or fax transmission in progress

Check MTP configuration
- MTP is required for Early offer
- MTP is required on older UC Manager versions to provide supplementary services
Check layer 2/3 variables

Understanding SIP Traces

More details on the SIP Trace is:

https://supportforums.cisco.com/docs/DOC-27105

Related Information

Thursday, March 19, 2015

Nature of calls during split brain

Effects on Calls in Progress When Server Status Changes

When the status of a Cisco Unity Connection server changes, the effects on calls in progress depend on the final status of the server that is handling a call and on the condition of the network. Table 3-3 describes the effects.

Table 3-3 Effects on Calls in Progress When Server Status Changes

Status Change	Effects
Primary to Secondary	When the status change is initiated manually, calls in progress are not affected. When the status change is automatic, effects on calls in progress depend on the critical service that stopped.
Secondary to Primary	When the status change is initiated manually, calls in progress are not affected. When the status change is automatic, effects on calls in progress depend on the critical service that stopped.
Secondary to Deactivated	Calls in progress are dropped. To prevent dropped calls, on the Cluster Management page in Cisco Unity Connection Serviceability, click Stop Taking Calls for the server, wait until all calls have ended, and deactivate the server.
Primary or Secondary to Replicating Data	Calls in progress are not affected.
Primary or Secondary to Split Brain Recovery	Calls in progress are not affected.

SPLIT BRAIN

Cisco Unity Connection Cluster Operation and Troubleshooting

This document discusses Unity Connection Clustering, the Unity Connection voicemail system's redundancy option. Before proceeding to this, please first read through the Cisco Unity Connection Cluster Configuration and Administration Guide.

Introduction

The Cisco Unity Connection (CUC) 7.0 failover feature--at the database layer--uses IDS' Enterprise Replication (ER). It utilizes a multimaster database replication topology for all databases except for the mailbox databases. To avoid the possibility of a message being received simultaneously on two servers (and one message effectively overwriting the other), only one copy of the mailbox database will be writable at a time. Since the directory database could also encounter such collisions, however, a "last write wins" mechanism is employed.

On top of that, UC has two roles, the Primary and Secondary that are independent of the server's pub/sub status in the cluster. In normal operation, the Publisher should be the Primary and the Subscriber the Secondary. The Primary server is responsible for running services that can only run on one node in a cluster at a time (the so-called "singleton" processes), such as Notifier, MTA, and SysAgent-tasks. The Primary handles all writes to message store database(s) and certain master files such as the encrypt key and certificates are managed in the primary and then replicated to secondary. The benefit is that in this method the system is completely manageable and allows for changes and user updates, even if the Publisher node is offline. Since singleton processes only run on one server, notifications for MWI or other outside numbers as well as outbound VPIM traffic or email notifications will always originate from the Primary node. Inbound VPIM traffic must always reach the Primary server. This is an important consideration for any design.

These roles are managed via the Server Redundancy Manager (SRM). This service, which runs on both nodes, implements the failover logic and figures out which side is the Primary. It maintains a heartbeat between the servers and is responsible for running Split-brain recovery (SBR), which occurs when both servers assumed the Primary role (e.g. they couldn't talk to one another; became Primary; and now they can again). SRM talks to ServM to determine the status of critical processes and requests ServM to start singleton processes after a failover.

Besides database redundancy, Connection uses a file replication mechanism (the "Connection File Syncer" or CuFileSync) to replicate messages and audio files, which are not stored in the database. This is done via SFTP. The following files must be replicated:

Message audio files (/var/opt/cisco/connection/mail/*)
Spoken names/greeting (/var/opt/cisco/connection/lib/*)
Certificates (/var/opt/cisco/connection/security/*)
CuEncrypt key files (/opt/cisco/connection/.security/*)

This document is not intended to go over all aspects of database replication and file synchronization, but rather to look at the operation of the SRM, since it triggers many of these other functions.

UC Failover in Operation

The Connection Server Role Manager (SRM) is the service that runs on both servers that keeps track of the current role of each server. It has a 5-second keepalive mechanism. The communication as per UC7.1.3 is via TCP where the subscriber chooses a random source port and connects to port 22001 of the publisher.

From the Serviceability page, one can access Tools > Cluster Management to view the cluster status or issue the 'show cluster status' command from CLI.

Cluster Status

NOTE: These show commands are taken from UC7.1.3. Prior releases did not indicate the replication status (and it is not possible to easily find the replication status of the CUC databases from the CLI).

admin:show cuc cluster status

Server Name  Member ID  Server State  Internal State  Reason
-----------  ---------  ------------  --------------  ------
cuc1a     0          Primary       Pri Active      Normal
cuc1b     1          Secondary     Sec Active      Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connected       0 Jul 14 15:13:14

This output is from the "cuc1a" node (note the replication status is "Local"). We see that it is the primary and is the publisher, since it its Member ID is 0. Furthermore, replication is active and it is connected to the sub's database. The cluster status (Primary/Secondary is learned via replication, so if replication is not working, this output may not be correct).

For example, let's say the network between the two servers is interrupted. Now the status may appear as follows. Note that this command what the local server where the command is run is seeing. The status from the other side is learned via replication, so if there's a network disconnect that's obviously not happening.

admin:show cuc cluster status

Server Name  Member ID  Server State  Internal State           Reason
-----------  ---------  ------------  -----------------------  ------
cuc1a     0          Primary       Pri Active Disconnected  Normal
cuc1b     1          Secondary     Sec Active               Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connecting   2259 Jul 29 10:29:52

This output is from the publisher again. Note that its internal state is Primary Active Disconnected, but the state about the subscriber is basically incorrect. The replication status is "Connection", so it's clear it isn't getting updates. When we look at the same command on the secondary, we see:

Server Name  Member ID  Server State  Internal State                Reason
-----------  ---------  ------------  ----------------------------  ------
cuc1a     0          Primary       Pri Active                    Normal
cuc1b     1          Primary       Sec Act Primary Disconnected  Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Connecting   2938 Jul 29 10:29:59
g_ciscounity_cuc1b  101 Active   Local           0

Both servers show the Primary status.

Once things come back, one may initially see the following on the Primary:

admin:show cuc cluster status

Server Name  Member ID  Server State            Internal State  Reason
-----------  ---------  ----------------------  --------------  ------
cuc1a     0          Split Brain Resolution  Pri SBR         Normal
cuc1b     1          Secondary               Sec Active      Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Dropped    5554742 Jul 14 15:08:14

There the Pub/Primary has kicked off SBR. Replication is still down, but as things start working, both server's state will become 'Split Brain Resolution'.

admin:show cuc cluster status

Server Name  Member ID  Server State            Internal State  Reason
-----------  ---------  ----------------------  --------------  ------
cuc1a     0          Split Brain Resolution  Pri SBR         Normal
cuc1b     1          Split Brain Resolution  Sec SBR         Normal

SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_ciscounity_cuc1a  100 Active   Local           0                
g_ciscounity_cuc1b  101 Active   Connected   17905 Jul 15 09:49:58

There are several intermediate states before returning back to normal, however these should be completed fairly quickly.

SRM Traces

First thing to look at when troubleshooting failover-related issues is to look at the SRM logs. The following is a portion of the trace from the Primary/Publisher, which was disconnected from the network:

07/15/2009 10:36:14.268 |18039,,,SRM,3,<Timer-241> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.322 |15252,,,-1,-1,alarm id 9202|
07/15/2009 10:36:14.322 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1a, AlarmMsg = NoConnectionToPeer
AppID : CuSrm
ClusterID :
NodeID : cuc1a
|
07/15/2009 10:36:14.322 |15252,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1b.vnt.cisco.com in the cluster. The remote server may be down.|
...
07/15/2009 10:36:14.631 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [disconnect] [PUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 10:36:14.711 |15252,,,SRM,5,<evt> PUB_PRIMARY_DISCONNECTED - Role_activate Done!! - PUBLISHER_PRIMARY [ 0secs. ]|
07/15/2009 10:36:15.011 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...

On the subscriber it looks like this. Note that it starts the MTA and Notifier services and makes itself primary.

07/15/2009 10:36:14.964 |28031,,,SRM,3,<Timer-9> Ending session due to missing heartbeats|
...
07/15/2009 10:36:14.975 |GenAlarm: AlarmName = NoConnectionToPeer, DeviceName = cuc1b, AlarmMsg = NoConnectionToPeer
AppID : CuSrm
ClusterID :
NodeID : cuc1b
|
...
07/15/2009 10:36:14.975 |14161,,,SRM,5,<evt> [9202] Lost communication with the remote server cuc1a.vnt.cisco.com in the cluster. The remote server may be down.|
07/15/2009 10:36:15.078 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [disconnect] [SUB_PRIMARY_DISCONNECTED]|
07/15/2009 10:36:15.095 |14161,,,SRM,5,<evt> Entering SUB_PRIMARY_DISCONNECTED without token|
07/15/2009 10:36:15.101 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Start - SUBSCRIBER_PRIMARY|
07/15/2009 10:36:15.267 |13952,,,SRM,5,<dbmon> Cluster information has been changed in the database.  ClusterMasterObjectId: c435dcac-d7b8-45ba-b879-785db246f060|
07/15/2009 10:36:15.499 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection Notifier, Connection Groupware Caching Service, |
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> SUB_PRIMARY_DISCONNECTED - Role_activate Done!! - SUBSCRIBER_PRIMARY [ 50secs. ]|
07/15/2009 10:37:05.504 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [db_state_change] ignored|
...

When the network connectivity is restored, the primary shows the following:

07/15/2009 10:43:14.233 |15254,,,SRM,5,<listener> Received connection from /14.84.158.21:57590|
07/15/2009 10:43:14.864 |29106,,,SRM,5,<sess-241> [rcv] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |29106,,,SRM,5,<sess-241> [snd] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.396 |15252,,,SRM,5,<evt> [PUB_PRIMARY_DISCONNECTED] [sbr_required] [PUB_SBR]|
07/15/2009 10:43:16.397 |15252,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.397 |15252,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.398 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.565 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Start - PUBLISHER_PRIMARY_SBR|
07/15/2009 10:43:16.871 |15252,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection Message Transfer Agent, Connection System Agent, |
07/15/2009 10:44:06.674 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> PUB_SBR - Role_activate Done!! - PUBLISHER_PRIMARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.882 |15252,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
...
07/15/2009 11:01:33.670 |15048,,,SRM,5,<CM> Command: /opt/cisco/connection/bin/sbrscript execution completed successfully|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [PUB_SBR] [sbr_done] [PUB_CHOOSE_ROLE]|
07/15/2009 11:01:33.671 |15252,,,-1,-1,alarm id 9217|
07/15/2009 11:01:33.671 |15252,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: SbrComplete|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.409 |15252,,,SRM,5,<evt> [PUB_CHOOSE_ROLE] [go_primary] [PUB_PRIMARY]|
07/15/2009 11:01:34.477 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Start - PUBLISHER_PRIMARY|
07/15/2009 11:01:34.638 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.991 |15252,,,SRM,5,<evt> Activate requested for the following service(s): Connection Message Transfer Agent, Connection System Agent, Connection File Syncer, |
07/15/2009 11:02:19.993 |15252,,,SRM,5,<evt> PUB_PRIMARY - Role_activate Done!! - PUBLISHER_PRIMARY [ 45secs. ]|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_initializing] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:20.143 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:33.172 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.232 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.234 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_in_db_sync] ignored|
07/15/2009 11:02:40.721 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:02:40.800 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [db_state_change] ignored|
07/15/2009 11:03:30.832 |29106,,,SRM,5,<sess-241> [rcv] Type: State State: SECONDARY|
07/15/2009 11:03:30.833 |15252,,,SRM,5,<evt> [PUB_PRIMARY] [peer_secondary] [PUB_PRIMARY]|

The Secondary shows the following:

07/15/2009 10:43:14.238 |14162,,,SRM,5,<session> Connection to cuc1a.vnt.cisco.com/14.84.158.20:22001|
07/15/2009 10:43:14.828 |14162,,,SRM,5,<session> [snd] Type: Register ID: c435dcac-d7b8-45ba-b879-785db246f060 HavePrimaryToken: false NeedSbr: true Version: 7.1.2.39000-35|
07/15/2009 10:43:16.402 |14162,,,SRM,5,<session> [rcv] Type: Register ID: ece97bfd-f8b1-409a-8dca-4402311555cb HavePrimaryToken: true NeedSbr: false Version: 7.1.2.39000-35|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [snd] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 10:43:16.403 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [connect] ignored|
07/15/2009 10:43:16.404 |14161,,,SRM,5,<evt> [SUB_PRIMARY_DISCONNECTED] [sbr_required] [SUB_SBR]|
07/15/2009 10:43:16.414 |14161,,,-1,-1,alarm id 9216|
07/15/2009 10:43:16.414 |14161,,,SRM,5,<evt> [9216] Split-brain resolution procedure has been initiated.|
07/15/2009 10:43:16.436 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Start - SUBSCRIBER_SECONDARY_SBR|
07/15/2009 10:43:16.675 |14161,,,SRM,5,<evt> Deactivate requested for the following service(s): Connection File Syncer, Connection System Agent, Connection Groupware Caching Service, Connection Message Transfer Agent, Connection Notifier, |
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> SUB_SBR - Role_activate Done!! - SUBSCRIBER_SECONDARY_SBR [ 50secs. ]|
07/15/2009 10:44:06.679 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_SBR|
07/15/2009 10:44:06.694 |14161,,,-1,-1,alarm id 9224|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_primary] ignored|
07/15/2009 10:44:06.694 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
...
07/15/2009 10:44:06.928 |14162,,,SRM,5,<session> [rcv] Type: State State: IN_SBR|
07/15/2009 10:44:06.928 |14161,,,SRM,5,<evt> [SUB_SBR] [peer_in_sbr] ignored|
07/15/2009 10:44:07.090 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 10:44:07.100 |14161,,,SRM,5,<evt> [SUB_SBR] [db_state_change] ignored|
07/15/2009 11:01:34.420 |14162,,,SRM,5,<session> [rcv] Type: SbrComplete|
07/15/2009 11:01:34.420 |14161,,,SRM,5,<evt> [SUB_SBR] [sbr_done] [SUB_CHOOSE_ROLE]|
07/15/2009 11:01:34.421 |14161,,,-1,-1,alarm id 9217|
07/15/2009 11:01:34.421 |14161,,,SRM,5,<evt> [9217] Split-brain resolution procedure run successfully.|
07/15/2009 11:01:34.421 |14162,,,SRM,5,<session> [rcv] Type: State State: INITIALIZING|
07/15/2009 11:01:34.615 |14161,,,SRM,5,<evt> [snd] Type: State State: INITIALIZING|
07/15/2009 11:01:34.616 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_initializing] ignored|
07/15/2009 11:01:34.650 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.663 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:01:34.675 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [db_state_change] ignored|
07/15/2009 11:02:20.153 |14162,,,SRM,5,<session> [rcv] Type: State State: PRIMARY|
07/15/2009 11:02:20.174 |14161,,,-1,-1,alarm id 9224|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [9224] Regained communication with the remote server cuc1a.vnt.cisco.com in the cluster.|
07/15/2009 11:02:20.174 |14161,,,SRM,5,<evt> [SUB_CHOOSE_ROLE] [peer_primary] [DB_SYNC]|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [snd] Type: State State: IN_DB_SYNC|
07/15/2009 11:02:40.199 |14161,,,SRM,5,<evt> [DB_SYNC] [db_sync_unnecessary] [SUB_SECONDARY]|
07/15/2009 11:02:40.258 |14161,,,SRM,5,<evt> Cluster information has been changed in the database.  ClusterMasterObjectId: ece97bfd-f8b1-409a-8dca-4402311555cb|
07/15/2009 11:02:40.264 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Start - SUBSCRIBER_SECONDARY|
07/15/2009 11:02:40.778 |14161,,,SRM,5,<evt> Activate requested for the following service(s): Connection System Agent, Connection File Syncer, |
07/15/2009 11:03:30.784 |14161,,,SRM,5,<evt> SUB_SECONDARY - Role_activate Done!! - SUBSCRIBER_SECONDARY [ 50secs. ]|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [snd] Type: State State: SECONDARY|
07/15/2009 11:03:30.800 |14161,,,SRM,5,<evt> [SUB_SECONDARY] [db_state_change] [SUB_SECONDARY]|

Split-brain Recovery (SBR)

SBR is the process that runs when both publisher and subscriber have figured out that they are both primary (i.e. they weren't able to communicate to one another). A script is run, /opt/cisco/connection/bin/sbrscript which calls a check-aborted-transaction script followed by another script to resolve the mailbox databases. This whole process can take a few minutes up to an hour. After an hour, the process is interrupted. If the process (due to a defect or some other issue) takes the full hour, you'll see the FsmTimer and the SRM logs will look like this:

07/29/2009 07:15:52.949 |9380,,,SRM,5,Thread=EH  [9216] Split-brain resolution procedure has been initiated.|
...
07/29/2009 08:16:58.318 |9379,,,SRM,5,Thread=FsmTimer  [9217] Split-brain resolution procedure run successfully.|

Note that it's clearly an hour later and the FsmTimer expired. When it works normally, it will appear as "SRM,5,<evt> [9217] Split-brain resolution procedure run successfully."

check-aborted-transactions

The bulk of time spent in SBR is running this script. Logs for this are written to the aborted-transactions-resolution-<date/time> logs (file list activelog cuc/aborted-transactions-resolution*). They do not currently have timestamps for the commands other than the log file name itself. If there are aborted transactions, they are written and compressed to aborted-transactions-<date/time>.tgz. So what does this script do?

It waits for CDR to process its replication queues
Checks for aborted transactions. You'll see in the logs something like "Transactions have been aborted. Processing 959 ATS files."
It tar's and zip's up those aborted transactions and puts them into the aborted-transactions-<date/time>.tgz file.
Next it repairs the databases. This is the longest step and the databases are merged. Upon completion, a "Repair successful" message is logged. This is the end of the script. For each database, the repair
1. disables the database constraints
2. runs a cdr check replset which compares data on the servers and repairs any inconsistencies. For each,you'll get a report showing any inconsistencies.
3. re-enables constraints

This entire process is logged in a file ""aborted-transactions-resolution-yy-mm-ddT<timestamp>""

SBR Errors during mailbox reconciliation

At the end of SBR, the mailbox databases must be consolidated to ensure no messages are lost and they are numbered properly. It follows the following logic for each mailbox database:

Find the time of the last known good communication
Find all mailstores
Unmount the first one
Connect to the mailstore
Renumber messages
Refresh mail counts and mailbox sizes
Remount mailstore
Repeat for other mailbox store databases
Initiate a full MWI resync

Obviously, when the mailbox database is down, a user won't be able to access their messages. Even on large databases, this typically doesn't take longer than about a minute or two. The user will hear "your messages are currently not available" and similar prompts.

If there are errors during the mailbox resolution process, in UC7.1.3 they are written to the mbxsbr.txt file. Errors could potentially look something like this:

Error resolving splitbrain, database: UnityMbxDb1 objectid: dfacd9cf-2fad-4b86-8a04-31daae6b61d8
SQLCODE -458 in EXECUTE:
IX000: Long transaction aborted.
IX000: RSAM error: Long transaction detected.

Error resolving splitbrain, database: UnityMbxDb2 objectid: 09bef2f6-31b7-416f-adbc-89b70b835f54
SQLCODE -387 in CONNECT:
IX000: No connect permission.
IX000: ISAM error:  no record found.

Other Conditions

Replication Queue Filling up

By default, if the servers are disconnected so long that the replication queue grows to 90% of the maximum threshold, then replication is stopped. The SRM diagnostics will show something like:

07/28/2009 23:53:56.159 |10334,,,SRM,3,<Timer-0> Replication queue size: 93.0 has exceeded the maximum threshold value. Stopping replication.|
...
07/28/2009 23:54:04.452 |10333,,,SRM,5,<CM> Command: /opt/cisco/connection/lib/config-modules/dbscripts/repl-helper stop_replication execution completed successfully|
07/28/2009 23:54:06.161 |10334,,,SRM,5,<Timer-0> DB replication has been stopped.|
07/28/2009 23:54:06.224 |10334,,,SRM,5,<Timer-0> [9222] Database replication queue size has exceeded the maximum threshold. Replication between redundant servers has been stopped.|

An alarm is also sent so this condition is visible in RTMT. After this, the replication has to be set up again by accessing the Serviceability page and activating the subscriber again under Tools > Cluster Management or with the utils cuc cluster activate command.

Drives filling up

Any time the drives fill up, it's a bad thing. RTMT generates alerts for this. Also, from the CLI, show diskusage <partition> will also help with identify this issue.

Replication Processes hung or coredump

In some instances, the replication processes themselves may get hung or coredump. A core dump (from CDR, for example), will be in /var/log/active/core, the same place as other cores. It is possible for a cdr process to hang and SRM to continue working fine, so it is often important to look for hung processes (ps auxf | grep cdr will grab cdr-related stuff). If something's been running for ages, then chances are, it's hung.

utils cuc cluster CLI commands

utils cuc cluster activate - described earlier
utils cuc cluster deactivate - disables replication and forces node to deactivated. Sometimes used when replacing the publisher.
utils cuc cluster makeprimary - run from the node that is currently the primary. Makes the other server the primary.
utils cuc cluster overwritedb - copies data from the publisher to the subscriber
utils cuc cluster renegotiate - used when a publisher is being replaced to join the new publisher to the cluster and then copy over the database from the subscriber to the publisher

RTMT

Alarms

In many cases, the only indication that something went wrong is when RTMT generates an alarm on on each node ("No Connection To Peer") from CuSrm (obviously the failed node may not be in a position to report a failure). There should also be a system-level ServerDown alarm. The subscriber will also indicate an AutoFailoverSucceeded alarm. Keep in mind that if the nodes are not communicating with one another, then RTMT traffic will be affected, too. So often only information from the node being monitored will be visible. There is currently no alarm that SBR has initiated or completed.

Performance Counters

The Replication state for UC databases is not tracked in RTMT. Currently the counter (""Number of Replicates Created and State of ReplicationCount)\Replicate_State"") exists, however it refers to the CUCM database and therefore not relevant to CUC. There is also no way to directly see the number of messages in the drop folder (which is where messages get queued before entering the MTA queue to be processed), although one can monitor \CUC Message Store\Queued Messages Current on the primary. This is the queue where messages are moved to when they're ready to be delivered (and is drained at 100 messages/minute). So when the drop folder is full, this queue will stay around 300 (its max) for the period. Once the drop folder is cleared, then this will drop down to typically some small number, depending on the load on the system.