Overview of Distributed Caching

When vertical scaling alone proves insufficient for scaling your InterSystems IRIS® data platform to meet your workload’s requirements, you can consider distributed caching, an architecturally straightforward, application-transparent, low-cost approach to horizontal scaling. Distributed caching relies on ECP (Enterprise Cache Protocol), which is a core component of InterSystems IRIS.

The InterSystems IRIS distributed caching architecture scales horizontally for user volume by distributing both application logic and caching across a tier of application servers sitting in front of a data server, enabling partitioning of users across this tier. Each application server handles user requests and maintains its own database cache, which is automatically kept in sync with the data server, while the data server handles all data storage and management. Interrupted connections between application servers and data server are automatically recovered or reset, depending on the length of the outage.

Distributed caching allows each application server to maintain its own, independent working set of the data, which avoids the expensive necessity of having enough memory to contain the entire working set on a single server and lets you add inexpensive application servers to handle more users. Distributed caching can also help when an application is limited by available CPU capacity; again, capacity is increased by adding commodity application servers rather than obtaining an expensive processor for a single server.

Distributed Cache Cluster

3 groups of users connect to 3 application servers, each with its own cache, that query a single data server

This architecture is enabled by the use of ECP for communication between the application servers and the data server.

The distributed caching architecture and application server tier are entirely transparent to the user and to application code. You can easily convert an existing standalone InterSystems IRIS instance that is serving data into the data server of a cluster by adding application servers.

The following sections provide more details about distributed caching:

Distributed Caching Architecture

To better understand distributed caching architecture, review the following information about how data is stored and accessed by InterSystems IRIS:

InterSystems IRIS stores data in a file in the local operating system called a database. An InterSystems IRIS instance may (and usually does) have multiple databases.
InterSystems IRIS applications access data by means of a namespace, which provides a logical view of the data stored in one or more databases. A InterSystems IRIS instance may (and usually does) have multiple namespaces.
Each InterSystems IRIS instance maintains a database cache — a local shared memory buffer used to cache data retrieved from the databases, so that repeated instances of the same query can retrieve results from memory rather than storage, providing a very significant performance benefit.

The architecture of a distributed cache cluster is conceptually simple, using these elements in the following manner:

An InterSystems IRIS instance becomes an application server by adding another instance as a remote server, and then adding any or all of its databases as remote databases. This makes the second instance a data server for the first instance.
Local namespaces on the application server are mapped to remote databases on the data server in the same way they are mapped to local databases. The difference between local and remote databases is entirely transparent to an application querying a namespace on the application server.
The application server maintains its own database cache in the same manner as it would if using only local databases. ECP efficiently shares data, locks, and executable code among multiple InterSystems IRIS instances, as well as synchronizing the application server caches with the data server.

In practice, a distributed cache cluster of multiple application servers and a data server works as follows:

The data server continues to store, update, and serve the data. The data server also synchronizes and maintains the coherency of the application servers’ caches to ensure that users do not receive or keep stale data, and manages locks across the cluster.
Each query against the data is made in a namespace on one of the various application servers, each of which uses its own individual database cache to cache the results it receives; as a result, the total set of cached data is distributed across these individual caches. If there are multiple data servers, the application server automatically connects to the one storing the requested data. Each application server also monitors its data server connections and, if a connection is interrupted, attempts to recover it.
User requests can be distributed round-robin across the application servers by a load balancer, but the most effective approach takes full advantage of distributed caching by directing users with similar requests to the same application server, increasing cache efficiency. For example, a health care application might group clinical users who run one set of queries on one application server and front-desk staff running a different set on another. If the cluster handles multiple applications, each application's users can be directed to a separate application server. The illustrations that follow compare a single InterSystems IRIS instance to a cluster in which user connections are distributed in this manner. (Load balancing user requests can even be detrimental in some circumstances; for more information see Evaluate the Effects of Load Balancing User Requests.)
The number of application servers in a cluster can be increased (or reduced) without requiring other reconfiguration of the cluster or operational changes, so you can easily scale as user volume increases.

Local databases mapped to local namespaces on a single InterSystems IRIS instance

People using different applications query different namespaces and databases on a single server with a single cache

Remote databases on a data server mapped to namespaces on application servers in a distributed cache cluster

People using different applications query namespaces on separate application servers, each with its own cache

In a distributed cache cluster, the data server is responsible for the following:

Storing data in its local databases.
Synchronizing the application server database caches with the databases so the application servers do not see stale data.
Managing the distribution of locks across the network.
Monitoring the status of all application servers connections and taking action if a connection is interrupted for a specific amount of time (see ECP Recovery).

In a distributed cache cluster, each application server is responsible for the following:

Establishing connections to a specific data server whenever an application requests data that is stored on that server.
Maintaining, in its cache, data retrieved across the network.
Monitoring the status of all connections to the data server and taking action if a connection is interrupted for a specific amount of time (see ECP Recovery).

Note:

A distributed cache cluster can include more than one data server (although this is uncommon). An InterSystems IRIS instance can simultaneously act as both a data server and an application server, but cannot act as a data server for the data it receives as an application server.

ECP Features

ECP supports the distributed cache architecture by providing the following features:

Automatic, fail-safe operation. Once configured, ECP automatically establishes and maintains connections between application servers and data servers and attempts to recover from any disconnections (planned or unplanned) between application server and data server instances (see ECP Recovery). ECP can also preserve the state of a running application across a failover of the data server (see Distributed Caching and High Availability).
Along with keeping data available to applications, these features make a distributed cache cluster easier to manage; for example, it is possible to temporarily take a data server offline or fail over as part of planned maintenance without having to perform any operations on the application server instances.
Heterogeneous networking. InterSystems IRIS systems in a distributed cache cluster can run on different hardware and operating system platforms. ECP automatically manages any required data format conversions.
A robust transport layer based on TCP/IP. ECP uses the standard TCP/IP protocol for data transport, making it easy to configure and maintain.
Efficient use of network bandwidth. ECP takes full advantage of high-performance networking infrastructures.

ECP Recovery

ECP is designed to automatically recover from interruptions in connectivity between an application server and the data server. In the event of such an interruption, ECP executes a recovery protocol that differs depending on the nature of the failure and on the configured timeout intervals. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or reset, forcing transaction rollback and rebuilding of the application processes.

For more information on ECP connections, see Monitoring Distributed Applications; for more information on ECP recovery, see ECP Recovery Protocol and ECP Recovery Process, Guarantees, and Limitations.

Distributed Caching and High Availability

While ECP recovery handles interrupted application server connections to the data server, the application servers in a distributed cache cluster are also designed to preserve the state of the running application across a failover of the data server. Depending on the nature of the application activity and the failover mechanism, some users may experience a pause until failover completes, but can then continue operating without interrupting their workflow.

Data servers can be mirrored for high availability in the same way as a stand-alone InterSystems IRIS instance, and application servers can be set to automatically redirect connections to the backup in the event of failover. (It is not necessary or even possible to mirror an application server, as it does not store any data.) For detailed information about the use of mirroring in a distributed cache cluster, see Configuring ECP Connections to a Mirror.

The other failover strategies detailed in Failover Strategies for High Availability can also be used in a distributed cache cluster. Regardless of the failover strategy employed for the data server, the application servers reconnect and recover their states following a failover, allowing application processing to continue where it left off prior to the failure.

Overview of Distributed Caching

Distributed Caching Architecture

ECP Features

ECP Recovery

Distributed Caching and High Availability

See Also