System Failover Strategies
InterSystems IRIS® data platform provides several high availability (HA) solutions, and easily integrates with all common HA configurations supplied by operating system providers.
The primary mechanism for maintaining high system availability is called failover. Under this approach, a failed primary system is replaced by a backup system; that is, processing fails over to the backup system. Many HA configurations also provide mechanisms for disaster recovery, which is the resumption of system availability when failover mechanisms have been unable to keep the system available.
There are five general approaches to InterSystems IRIS instance failover for HA (including not implementing an HA strategy). This chapter provides an overview of these approaches, while the remainder of this guide provides procedures for implementing them.
It is important to remember that in all of these approaches except mirroring, a single storage failure can be disastrous. For this reason, disk redundancy, database journaling as described in the “Journaling” chapter of the Data Integrity Guide, and good backup procedures, as described in the “Backup and Restore” chapter of the Data Integrity Guide, must always be part of your approach, as they are vital to mitigating the consequences of disk failure.
If you require detailed information to help you develop failover and disaster recovery strategies tailored to your environment, or to review your current practices, please contact the InterSystems Worldwide Response Center (WRC).
No Failover Strategy
The integrity of your InterSystems IRIS database is always protected from production system failure by the features described in the Data Integrity Guide. Structural database integrity is maintained by InterSystems write image journal (WIJ) technology, while logical integrity is maintained through journaling and transaction processing. Automatic WIJ and journal recovery are fundamental components of the InterSystems “bulletproof” database architecture.
With no failover strategy in place, however, a failure can result in significant down time, depending on the cause of the failure and your ability to isolate and resolve it. For many applications that are not business-critical, this risk may be acceptable.
Customers that adopt this approach share the following traits:
Clear and detailed operational recovery procedures, including journaling and backup and restore
Disk redundancy (RAID and/or disk mirroring)
Ability to replace hardware quickly
24x7 maintenance contracts with all vendors
Management acceptance and application user tolerance of moderate downtime caused by failures
A common approach to achieving HA is the failover cluster, in which the primary production system is supplemented by a (typically identical) standby system, with shared storage and a cluster IP address that follows the active member. In the event of a production system failure, the standby assumes the production workload, taking over the programs and services formerly running on the failed primary, including InterSystems IRIS.
InterSystems IRIS is designed to integrate easily with failover solutions provided at the operating system level, such as Microsoft Windows Server Clusters, IBM PowerHA SystemMirror, Red Hat Enterprise Linux HA, and others. A single instance of InterSystems IRIS is installed on the shared storage device so that both cluster members recognize the instance, then added to the failover cluster configuration so it will be started automatically as part of failover. If the active node becomes unavailable for a specified period of time, the failover technology transfers control of the cluster IP address and shared storage to the standby and restarts InterSystems IRIS on the new primary. On restart, the system automatically performs the normal startup recovery, with WIJ, journaling, and transaction processing maintaining structural and data integrity exactly as if InterSystems IRIS had been restarted on the failed system.
The standby server must be capable of handling normal production workloads for as long as it may take to restore the failed primary. Optionally, the standby can become the primary, with the failed primary becoming the standby once it is restored.
Under this approach, failure of the shared storage device is disastrous. For this reason, disk redundancy, journaling and good backup and restore procedures are critically important in providing adequate recovery capability.
Virtualization platforms generally provide HA capabilities, which typically monitor the status of both the guest operating system and the hardware it is running on. On the failure of either, the virtualization platform automatically restarts the failed virtual machine, on alternate hardware as required. When the InterSystems IRIS instance restarts, it automatically performs the normal startup recovery, with WIJ, global journaling, and transaction processing maintaining structural and data integrity as if InterSystems IRIS had been restarted on a physical server.
In addition, virtualization platforms allow the relocation of virtual machines to alternate hardware for maintenance purposes, enabling upgrade of physical servers, for example, without any down time. Virtualization HA shares the major disadvantage of the failover cluster and concurrent cluster, however: failure of shared storage is disastrous.
InterSystems IRIS Mirroring
InterSystems IRIS database mirroring with automatic failover provides an effective and economical high availability solution for planned and unplanned outages. Mirroring relies on data replication rather than shared storage, avoiding significant service interruptions due to storage failures.
An InterSystems IRIS mirror consists of two physically independent InterSystems IRIS systems, called failover members. Each failover member maintains a copy of each mirrored database in the mirror; application updates are made on the primary failover member, while the backup failover member’s databases are kept synchronized with the primary through the application of journal files from the primary. (See the “Journaling” chapter of the Data Integrity Guide for information about journaling.)
The mirror automatically assigns the role of primary to one of the two failover members, while the other failover member automatically becomes the backup system. When the primary InterSystems IRIS instance fails or otherwise becomes unavailable, the backup automatically and rapidly takes over and becomes primary.
A third system, called the arbiter, maintains continuous contact with the failover members, providing them with the context needed to safely make failover decisions when they cannot communicate directly. Agent processes running on each failover system host, called ISCAgents, also help with automatic failover logic. The backup cannot take over unless it can confirm that the primary is really down or unavailable and will not attempt to operate as primary. Between the arbiter and the ISCAgents, this can be accomplished under almost every outage scenario.
Alternatively, when using a hybrid virtualization and mirroring HA approach (as discussed later in this section), the virtualization platform can restart the failed host system, allowing mirroring to determine the status of the former primary instance and proceed as required.
When the mirror is configured to use a virtual IP address (VIP), redirection of application connections to the new primary is transparent. If connections are by ECP, they are automatically reset to the new primary. Other mechanisms for redirection of application connections are available.
When the primary instance is restored to operation, it automatically becomes the backup. Operator-initiated failover can also be used to maintain availability during planned outages for maintenance or upgrades.
The use of mirroring in a virtualized environment creates a hybrid high availability solution combining the benefits of both. While the mirror provides the immediate response to planned or unplanned outages through automatic failover, virtualization HA software automatically restarts the virtual machine hosting a mirror member following an unplanned machine or OS outage. This allows the failed member to quickly rejoin the mirror to act as backup (or to take over as primary if necessary).
For complete information about InterSystems IRIS mirroring, see the “Mirroring” chapter of this guide.
Using Distributed Caching with a Failover Strategy
Whatever approach you take to HA, a distributed cache cluster enabled by the Enterprise Cache Protocol (ECP) can be used to provide a layer of insulation between the users and the database server. The application servers in a distributed cache cluster are designed to preserve the state of the running application across a failover of the data server. Users remain connected to the application servers when the data server fails, and user sessions that are actively accessing the database during the outage pause until the data server becomes available again through either completion of failover. Depending on the nature of the application activity and the failover mechanism, some users may experience a pause until failover completes, but can then continue operating without interrupting their workflow.
Data servers in a distributed cache cluster can be mirrored for high availability in the same way as a stand-alone InterSystems IRIS instance, and application servers can be set to automatically redirect connections to the backup in the event of failover. For detailed information about the use of mirroring in a distributed cache cluster, see Configuring ECP Connections to a Mirror in the “Mirroring” chapter of this guide.
The other failover strategies detailed in this chapter can also be used in a distributed cache cluster. Regardless of the failover strategy employed for the data server, the application servers reconnect and recover their states following a failover, allowing application processing to continue where it left off prior to the failure.
Bear in mind, however, that the primary purpose of distributed caching is horizontal scaling; deploying a cluster simply as a component of your HA strategy can add costs, such as increased complexity and additional points of failure, as well as benefits.
For information about distributed caching, see the “Horizontally Scaling for User Volume with Distributed Caching” chapter of the Scalability Guide.