Scalability Guide
Horizontally Scaling Systems for User Volume with InterSystems Distributed Caching
When vertical scaling alone proves insufficient for scaling your InterSystems IRIS data platform to meet your workload’s requirements, you can consider distributed caching, an architecturally straightforward, application-transparent, low-cost approach to horizontal scaling.
Overview of Distributed Caching
The InterSystems IRIS™ distributed caching architecture scales horizontally for user volume by distributing both application logic and caching across a tier of application servers sitting in front of a data server, enabling partitioning of users across this tier. Each application server handles user requests and maintains its own database cache, which is automatically kept in sync with the data server, while the data server handles all data storage and management. Interrupted connections between application servers and data server are automatically recovered or reset, depending on the length of the outage.
Distributed caching allows each application server to maintain its own, independent working set of the data, which avoids the expensive necessity of having enough memory to contain the entire working set on a single server and lets you add inexpensive application servers to handle more users. Distributed caching can also help when an application is limited by available CPU capacity; again, capacity is increased by adding commodity application servers rather than obtaining an expensive processor for a single server.
Distributed Cache Cluster
This architecture is enabled by the use of the Enterprise Cache Protocol (ECP), a core component of InterSystems IRIS® Data Platform™, for communication between the application servers and the data server.
The distributed caching architecture and application server tier are entirely transparent to the user and to application code. You can easily convert an existing standalone InterSystems IRIS instance that is serving data into the data server of a cluster by adding application servers.
The following sections provide more details about distributed caching:
Distributed Caching Architecture
To better understand distributed caching architecture, review the following information about how data is stored and accessed by InterSystems IRIS:
The architecture of a distributed cache cluster is conceptually simple, using these elements in the following manner:
In practice, a distributed cache cluster of multiple application servers and a data server works as follows:
Local databases mapped to local namespaces on a single InterSystems IRIS instance
Remote databases on a data server mapped to namespaces on application servers in a distributed cache cluster
In a distributed cache cluster, the data server is responsible for the following:
In a distributed cache cluster, each application server is responsible for the following:
Note:
A distributed cache cluster can include more than one data server (although this is uncommon). An InterSystems IRIS instance can simultaneously act as both a data server and an application server, but cannot act as a data server for the data it receives as an application server.
ECP Features
ECP supports the distributed cache architecture by providing the following features:
ECP Recovery
ECP is designed to automatically recover from interruptions in connectivity between an application server and the data server. In the event of such an interruption, ECP executes a recovery protocol that differs depending on the nature of the failure and on the configured timeout intervals. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or reset, forcing transaction rollback and rebuilding of the application processes.
For more information on ECP connections, see Monitoring Distributed Cache Applications; for more information on ECP recovery, see ECP Recovery Protocol and ECP Recovery Process, Guarantees, and Limitations.
Distributed Caching and High Availability
While ECP recovery handles interrupted application server connections to the data server, the application servers in a distributed cache cluster are also designed to preserve the state of the running application across a failover of the data server. Depending on the nature of the application activity and the failover mechanism, some users may experience a pause until failover completes, but can then continue operating without interrupting their workflow.
Data servers can be mirrored for high availability in the same way as a stand-alone InterSystems IRIS instance, and application servers can be set to automatically redirect connections to the backup in the event of failover. (It is not necessary, or even possible, to mirror an application server, as it does not store any data.) For detailed information about the use of mirroring in a distributed cache cluster, see Configuring ECP Connections to a Mirror in the “Mirroring” chapter of the High Availability Guide.
The other failover strategies detailed in the System Failover Strategies chapter of the High Availability Guide can also be used in a distributed cache cluster. Regardless of the failover strategy employed for the data server, the application servers reconnect and recover their states following a failover, allowing application processing to continue where it left off prior to the failure.
Deploying a Distributed Cache Cluster
An InterSystems IRIS distributed cache cluster consists of a data server providing data to one or more application servers, which in turn provide it to the application. This section describes procedures for deploying a distributed cache cluster.
The recommended method for deploying InterSystems IRIS Data Platform is InterSystems Cloud Manager (ICM). By combining plain text declarative configuration files, a simple command line interface, the widely-used Terraform infrastructure-as-code tool, and InterSystems IRIS deployment in Docker containers, ICM provides you with a simple, intuitive way to provision cloud or virtual infrastructure and deploy the desired InterSystems IRIS architecture on that infrastructure, along with other services. ICM can deploy distributed cache clusters and other InterSystems IRIS configurations on Amazon Web Services, Google Cloud Platform, Microsoft Azure, or VMware vSphere. ICM can also deploy InterSystems IRIS on an existing physical or virtual cluster.
Deploy the Cluster with InterSystems Cloud Manager offers an overview of the process of using ICM to deploy the distributed cache cluster.
Note:
For a brief introduction to ICM including a hands-on exploration of deploying InterSystems IRIS, see First Look: InterSystems Cloud Manager. For complete ICM documentation, see the InterSystems Cloud Manager Guide.
You can also deploy a distributed cluster by using the management portal to configure existing or newly installed InterSystems IRIS instances; instructions for this procedure are provided in Deploy the Cluster Using the Management Portal.
Information about securing the cluster after deployment is provided in Distributed Cache Cluster Security.
Note:
The most typical distributed cache cluster configuration involves one InterSystems IRIS instance per system, and one cluster role per instance — that is, either data server or application server. When deploying using ICM, this configuration is the only option. The provided procedure for using the management portal assumes this configuration as well.
Data Server/Application Server Compatibility
While the data server and application server hosts can be of different operating systems and/or endianness, all InterSystems IRIS instances in a distributed cache cluster must use the same locale (see Using the NLS Settings Page of the management portal in the “Configuring InterSystems IRIS” chapter of the System Administration Guide).
Deploy the Cluster with InterSystems Cloud Manager
This section reviews the process of using ICM to deploy an InterSystems IRIS distributed cache cluster in five stages: launching ICM, defining the deployment, provisioning the infrastructure, deploying and managing services, and unprovisioning the infrastructure. Each stage is briefly outlined below and covered in full detail in the InterSystems Cloud Manager Guide.
Launch ICM
ICM is provided as a Docker image. Everything required by ICM to carry out its provisioning, deployment, and management tasks is included in the ICM container. To launch ICM, on a system on which Docker is installed, you use the docker run command with the ICM image from the InterSystems repository to start the ICM container.
The ICM container includes a /Samples directory that provides you with samples of the elements required by ICM for provisioning, configuration, and deployment. Eventually, however, you will probably want to use locations outside the container to store these elements, your InterSystems IRIS licenses, and your ICM configuration files and state directory. For this purpose, you can use the --volume option with the docker run command, which lets you make persistent storage outside the container available inside the container.
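For example, a launch command might look like the following sketch, in which the host paths and the image repository and tag are placeholders for your own storage locations and for the ICM image you obtained from InterSystems:

  docker run --name icm -it \
    --volume /home/user1/icm/config:/config \
    --volume /home/user1/icm/license:/license \
    intersystems/icm:stable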
Of course, the ICM image can also be run by custom tools and scripts, and this can help you accomplish goals such as making external locations available within the container to store your configuration files and your state directory.
For detailed information about launching ICM, see Launch ICM in the “Using ICM” chapter of the InterSystems Cloud Manager Guide.
Note:
Before defining your deployment, you must obtain security-related files including cloud provider credentials and keys for SSH and SSL/TLS; for more information, see Obtain Security-Related Files in the “Using ICM” chapter.
Define the Deployment
ICM uses JSON files as both input and output. To provide the needed parameters to ICM, you must represent your target configuration and the platform on which it is to be deployed in two of ICM’s configuration files. A limited number of parameters can also be specified on the ICM command line.
Note:
To deploy the cluster in a public cloud, you must have an account with the provider and credentials for that account. To deploy on VMware vSphere, you must have the appropriate administrator credentials.
ICM Input Files
Two user-defined JSON-format input files are required for ICM deployment:
Some fields in the defaults and definitions files can be overridden by command-line options. In addition, most fields have defaults.
ICM Node Types
Part of the definition of an ICM deployment is the number of each type of node to be included. The following table briefly lists the types of node (determined by the Role field in the definitions file) that ICM can provision, configure, and deploy services on, with the primary distributed cache cluster roles highlighted; other node types, including AR, WS, and LB, can also be part of a distributed cache configuration. For detailed descriptions of the node types, see ICM Node Types in the “ICM Reference” chapter of the InterSystems Cloud Manager Guide.
ICM Node Types
Node Type    Cluster Role
DM           Data server (can also serve as shard master data server in a sharded cluster, and as a stand-alone instance)
AM           Application server (can also serve as shard master application server in a sharded cluster)
DS           Shard data server
QS           Shard query server
AR           Mirror arbiter (for mirrored deployments)
LB           Load balancer (can be automatically provisioned with AM, WS, and GG nodes)
WS           Web server
VM           Virtual machine (general purpose)
Sample Configuration Files for Distributed Cache Cluster Deployment
You can create your defaults and definitions files by adapting the template defaults.json and definitions.json files provided with ICM in the /Samples directory (/Samples/AWS for AWS deployments).
To illustrate their use, the “Sample Defaults File” and “Sample Definitions File” tables list the contents of defaults and definitions files you might use for a distributed cache cluster comprising the data server and three application servers, provisioned in the Amazon Web Services (AWS) public cloud.
For information about the listed fields, see ICM Configuration Parameters in the “ICM Reference” chapter of the InterSystems Cloud Manager Guide:
Sample Defaults File for Distributed Cache Cluster Deployment
Requirement Definition Field: Value
Provisioning platform
Platform details
Amazon Web Services.
Required details.
"Provider": "AWS"
"Zone": "us-west-1c"
"Region": "us-west-1"
Default machine image, sizing, and ports
Default AWS AMI and instance type to provision.
Default OS volume size.
Default data volume size.
Default superserver port.
Default web server port.
"AMI": "ami-d1315fb1"
"InstanceType": "m4.4xlarge"
"OSVolumeSize": "10"
"DataVolumeSize": "10"
"SuperServerPort": "51773"
"WebServerPort": "52773"
SSHUser
Nonroot account with sudo access used by ICM for access to provisioned nodes.
"SSHUser": "ec2-user"
Locations of credentials
Security-related files.
"Credentials": "/Samples/AWS/credentials"
"SSHKey": "/Samples/ssh/insecure-ssh2.pub"
"KeyFile": "/Samples/ssh/insecure"
"TLSKeyDir": "/Samples/tls"
Monitoring
Install Weave Scope.
"Monitor": "scope"
Image to deploy and repository credentials
Latest InterSystems IRIS image.
Credentials to log into InterSystems Docker repository.
"DockerImage": "iscrepo/iris:stable"
"DockerUsername": "..."
"DockerPassword": "..."
Note:
The pathnames provided in the fields specifying security files in this sample defaults file assume you have placed your AWS credentials in the /Samples/AWS directory, and used the keygen*.sh scripts to generate the keys, as described in Obtain Security-Related Files in the “Using ICM” chapter of the InterSystems Cloud Manager Guide. If you have generated or obtained your own keys, these may be internal paths to external storage volumes mounted when the ICM container is run, as described in Launch ICM in the “Using ICM” chapter. For information about these files, see Security-Related Parameters in the “ICM Reference” chapter.
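Assembled into JSON, the defaults shown in the preceding table might look like the following sketch (the Docker repository credentials are elided, as in the table):

  {
      "Provider": "AWS",
      "Zone": "us-west-1c",
      "Region": "us-west-1",
      "AMI": "ami-d1315fb1",
      "InstanceType": "m4.4xlarge",
      "OSVolumeSize": "10",
      "DataVolumeSize": "10",
      "SuperServerPort": "51773",
      "WebServerPort": "52773",
      "SSHUser": "ec2-user",
      "Credentials": "/Samples/AWS/credentials",
      "SSHKey": "/Samples/ssh/insecure-ssh2.pub",
      "KeyFile": "/Samples/ssh/insecure",
      "TLSKeyDir": "/Samples/tls",
      "Monitor": "scope",
      "DockerImage": "iscrepo/iris:stable",
      "DockerUsername": "...",
      "DockerPassword": "..."
  }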
Sample Definitions File for Distributed Cache Cluster Deployment
Requirement Definition Field: Value
Nodes for target InterSystems IRIS configuration, including provider-specific characteristics
Data server using a standard InterSystems IRIS license.
Data volume size overrides defaults file.
"Role": "DM"
"ISCLicense": "/License/Standard/iris.key"
"DataVolumeSize": "60"
 
Three application servers, using separate standard InterSystems IRIS licenses.
Load balancer is provisioned automatically when LoadBalancer is True.
"Role": "AM"
"Count": "3"
"StartCount": "2"
"ISCLicense": "/License/Standard/iris1.key,/License/Standard/iris2.key,/License/Standard/iris3.key"
"LoadBalancer": "true"
Naming scheme for provisioned nodes
ECP-role-TEST-NNNN
"Label": "ECP"
"Tag": "TEST"
For a complete list of the fields you can include in these files, see ICM Configuration Parameters in the “ICM Reference” chapter of the InterSystems Cloud Manager Guide.
For more detailed information about the definition phase, see Define the Deployment in the “Using ICM” chapter.
Deploying a Mirrored Data Server with ICM
To deploy two mirrored instances as the cluster’s data server and configure the application server connections to the data server to automatically redirect to the new primary following failover, simply do the following:
ICM does all of the needed configuration automatically.
Provision the Infrastructure
When your definitions files are complete, begin the provisioning phase by issuing the command icm provision on the ICM command line. This command allocates and configures the nodes specified in the definitions file. At completion, ICM also provides a summary of the compute nodes and associated components that have been provisioned, and outputs a command line which can be used to delete the infrastructure at a later date, for example:
Machine           IP Address      DNS Name                      
-------           ----------      --------
ECP-DM-TEST-0001  00.53.183.209   ec2-00-53-183-209.us-west-1.compute.amazonaws.com
ECP-AM-TEST-0002  00.56.59.42     ec2-00-56-59-42.us-west-1.compute.amazonaws.com
ECP-AM-TEST-0003  00.67.1.11      ec2-00-67-1-11.us-west-1.compute.amazonaws.com
ECP-AM-TEST-0004  00.193.117.217  ec2-00-193-117-217.us-west-1.compute.amazonaws.com
ECP-LB-TEST-0000  (virtual AM)    ECP-AM-TEST-1546467861.amazonaws.com
To destroy: icm unprovision -stateDir /Samples/AWS/ICM-8620265620732464265 -provider AWS [-cleanUp] [-force]
Once your infrastructure is provisioned, you can use the following infrastructure management commands:
For more detailed information about the provisioning phase, see Provision the Infrastructure in the “Using ICM” chapter of the InterSystems Cloud Manager Guide.
Deploy and Manage Services
ICM carries out deployment of InterSystems IRIS and other software services using Docker images, which it runs as containers by making calls to Docker. Containerized deployment supports ease of use and DevOps adaptation while avoiding the risks of manual upgrade. In addition to Docker, ICM also carries out some InterSystems IRIS-specific configuration over JDBC.
There are many container management tools available that can be used to extend ICM’s deployment and management capabilities.
The icm run command downloads, creates, and starts the specified container on the provisioned nodes. The command has a number of useful options and also lets you specify Docker options to be included, so the exact command line depends on your needs; two illustrative invocations are sketched below.
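For illustration only (the -options flag shown here passes options through to Docker; treat the specifics as hypothetical and consult The icm run Command for the actual options and syntax):

  # deploy the image specified by the DockerImage field in the defaults file
  icm run
  # hypothetical: pass an additional Docker option through to docker run
  icm run -options "--volume /shared:/host"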
For a full discussion of the use of the icm run command, see The icm run Command in the “Using ICM” chapter of the InterSystems Cloud Manager Guide.
At deployment completion, ICM sends a link to the appropriate node’s management portal, for example:
Management Portal available at: http://ec2-00-153-49-109.us-west-1.compute.amazonaws.com:52773/csp/sys/UtilHome.csp 
In the case of a distributed cache cluster, the provided link is for the data server instance.
Once your containers are deployed, you can use the following container management commands. Most can be run on all containers in the deployment or restricted to specific container types, specific roles or nodes, or a combination.
You can also use the following service management commands. A significant feature of ICM is the ability it provides, through the icm ssh (see Provision the Infrastructure), icm exec, and icm session commands, to interact with the nodes of your deployment on several levels: with the node itself, with the container deployed on it, and with the running InterSystems IRIS instance inside the container.
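For example, the following sketch uses a machine name from the sample provisioning output above; the option names are typical of ICM but should be treated as assumptions and checked against the ICM Reference:

  # run a shell command on the node itself
  icm ssh -command "df -k" -machine ECP-DM-TEST-0001
  # run a command inside the container deployed on the node
  icm exec -command "iris list" -machine ECP-DM-TEST-0001
  # open an interactive session with the InterSystems IRIS instance in the container
  icm session -machine ECP-DM-TEST-0001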
For more detailed information about the service deployment and management phase, see Deploy and Manage Services in the “Using ICM” chapter of the InterSystems Cloud Manager Guide.
Unprovision the Infrastructure
Because public cloud platform instances continually generate charges and unused instances in private clouds consume resources to no purpose, it is important to unprovision infrastructure in a timely manner. The icm unprovision command deallocates the provisioned infrastructure based on the state files created during provisioning. As described in Provision the Infrastructure, you must specify the state directory with this command; the needed command line is provided when the provisioning phase is complete, and is also contained in the ICM log file, for example:
To destroy: icm unprovision -stateDir /Samples/AWS/ICM-8620265620732464265 -provider AWS [-cleanUp] [-force]
For more detailed information about the unprovisioning phase, see Unprovision the Infrastructure in the “Using ICM” chapter of the InterSystems Cloud Manager Guide.
Deploy the Cluster Using the Management Portal
Once you have installed or identified the InterSystems IRIS instances you intend to include, and arranged network access of sufficient bandwidth among their hosts, configuring a distributed cache cluster using the management portal involves the following steps:
You perform these steps on the ECP Settings page of the management portal (System Administration > Configuration > Connectivity > ECP Settings).
Preparing the Data Server
An InterSystems IRIS instance cannot actually operate as the data server in a distributed cache cluster until it is configured as such on the application servers. The procedure for preparing the instance to be a data server, however, includes one required action and two optional actions.
To prepare an instance to be a data server, navigate to the ECP Settings page by selecting System Administration on the management portal home page, then Configuration, then Connectivity, then ECP Settings, then do the following:
The data server is now ready to accept connections from valid application servers.
Configuring an Application Server
Configuring an InterSystems IRIS instance as an application server in a distributed cache cluster involves two steps:
To add the data server to the application server, do the following:
  1. As described for the data server in Preparing the Data Server, navigate to the ECP Settings page and enable the ECP service.
  2. Click Data Servers to display the ECP Data Servers page and click Add Server.
  3. In the ECP Data Server dialog, enter the following information for the data server:
  4. Click Save. The data server appears in the data server list; you can remove or edit the data server definition, or change its status (see Monitoring Distributed Cache Applications) using the available links.
To add each desired database on the data server as a remote database on the application server, you must create a namespace on the application server and map it to that database, as follows:
  1. Navigate to the Namespaces page by selecting System Administration on the management portal home page, then Configuration, then System Configuration, then Namespaces. Click Create New Namespace to display the New Namespace page.
  2. Enter a name for the new namespace, which typically reflects the name of the remote database it is mapped to.
  3. At The default database for Globals in this namespace is a, select Remote Database, then select Create New Database to open the Create Remote Database dialog. In this dialog,
  4. At The default database for Routines in this namespace is a, select Remote Database, then select the database you just created from the drop-down.
  5. The namespace does not need to be an interoperability-enabled namespace; to save time, you can clear the Make this a production-enabled namespace check box.
  6. Select Save. The new namespace now appears in the list on the Namespaces page.
Once you have added a data server database as a remote database on the application server, applications can query that database through the namespace it is mapped to on the application server.
Note:
Remember that even though a namespace on the application server is mapped to a database on the data server, changes to the namespace mapped to that database on the data server are unknown to the application server. For example, suppose the namespace DATA on the data server has the default globals database DATA; on the application server, the namespace REMOTEDATA is mapped to the same (remote) database, DATA. If you create a mapping in the DATA namespace on the data server that maps the global ^DATA2 to the DATA2 database (see Global Mappings in the “Configuring InterSystems IRIS” chapter of the System Administration Guide), this mapping is not propagated to the application server. Therefore, if you do not add DATA2 as a remote database on the application server and create the same mapping in the REMOTEDATA namespace, queries the application server receives will not be able to read the ^DATA2 global.
Distributed Cache Cluster Security
All InterSystems instances in a distributed cache cluster need to be within the secured InterSystems IRIS perimeter (that is, within an externally secured environment). This is because ECP is a basic security service, rather than a resource-based service, so there is no way to regulate which users have access to it. (For more information on basic and resource-based services, see the Available Services section of the “Services” chapter of the Security Administration Guide.)
However, access to the data on the data server through the application servers can be restricted in two ways:
Note:
When databases are encrypted on the data servers, you should also encrypt the IRISTEMP database on all connected application servers. The same or different keys can be used. For more information on database encryption, see the Managed Key Encryption chapter of the Security Administration Guide.
Restricting Application Server Access
By default, any InterSystems IRIS instance on which the data server instance is configured as a data server (as described in the previous section) can connect to the data server. However, you can restrict which instances can act as application servers for the data server by specifying the hosts from which incoming connections are allowed; if you do this, hosts not explicitly listed cannot connect to the data server. Do this by performing the following steps on the data server:
  1. On the Services page (from the portal home page, select Security and then Services), click %Service_ECP. The Edit Service dialog displays.
  2. By default, the Allowed Incoming Connections box is empty, which means any application server can connect to this instance if the ECP service is enabled; click Add and enter a single IP address (such as 192.9.202.55) or fully-qualified domain name (such as mycomputer.myorg.com), or a range of IP addresses (for example, 8.61.202-210.* or 18.68.*.*). Once there are one or more entries on the list and you click Save in the Edit Service dialog, only the hosts specified by those entries can connect.
You can always access the list as described and use the Delete link to remove a host from the list, or the Edit link to specify the roles associated with the host, as described in Controlling Access with Roles and Privileges.
Controlling Access with Roles and Privileges
InterSystems uses a security model in which assets, including databases, are assigned to resources, and resources are assigned permissions, such as read and write. A combination of a resource and a permission is called a privilege. Privileges are assigned to roles, to which users can belong. In this way, roles are used to control user access to resources. For information about this model, see Authorization: Controlling User Access in the “About InterSystems Security” chapter of the Security Administration Guide.
To be granted access to a database on the data server, the role held by the user initiating the process on the application server and the role set for the ECP connection on the data server must both include permissions for the same resource representing that database. For example, if a user belongs to a role on an application server that grants the privilege of read permission for a particular database resource, and the role set for the ECP connection on the data server also includes this privilege, the user can read data from the database on the application server.
By default, InterSystems IRIS grants ECP connections on the data server the %All privilege when the data server runs on behalf of an application server. This means that whatever privileges the user on the application server has are matched on the data server, and access is therefore controlled only on the application server. For example, a user on the application server who has privileges only for the %DB_USER resource but not the %DB_IRISLIB resource can access data in the USER database on the data server, but attempting to access the IRISLIB database on the data server results in a <PROTECT> error. If a different user on the application server has privileges for the %DB_IRISLIB resource, the IRISLIB database is available to that user.
Note:
InterSystems recommends the use of an LDAP server to implement centralized security, including user roles and privileges, across the application servers of a distributed cache cluster. For information about using LDAP with InterSystems IRIS, see the Using LDAP chapter of the Security Administration Guide.
However, you can also restrict the roles available to ECP connections on the data server based on the application server host. For example, on the data server you can specify that when interacting with a specific application server, the only available role is %DB_USER. In this case, users on the application server granted the %DB_USER role can access the USER database on the data server, but no users on the application server can access any other database on the data server regardless of the roles they are granted.
Caution:
InterSystems strongly recommends that you secure the cluster by specifying available roles for all application servers in the cluster, rather than allowing the data server to continue to grant the %All privilege to all ECP connections.
The following are exceptions to this behavior:
To specify the available roles for ECP connections from a specific application server on the data server, do the following:
  1. Go to the Services page (from the portal home page, select Security and then Services) and click %Service_ECP to display the Edit Service dialog.
  2. Click the Edit link for the application server host you want to restrict to display the Select Roles area.
  3. To specify roles for the host, select roles from those listed under Available and click the right arrow to add them to the Selected list.
  4. To remove roles from the Selected list, select them and then click the left arrow.
  5. To add all roles to the Selected list, click the double right arrow; to remove all roles from the Selected list, click the double left arrow.
  6. Click Save to associate the roles with the IP address.
By default, a listed host holds the %All role, but if you specify one or more other roles, these roles are the only roles that the connection holds. Therefore, a connection from a host or IP range with the %Operator role has only the privileges associated with that role, while a connection from a host with no associated roles (and therefore %All) has all privileges.
Changes to the roles available to application server hosts and to the public permissions on resources on the data server require a restart of InterSystems IRIS before taking effect.
Security-Related Error Reporting
The behavior of security-related error reporting with ECP varies depending on whether the check fails on the application server or the data server and the type of operation:
Monitoring Distributed Cache Applications
A running distributed cache cluster consists of a data server instance (the data provider) connected to one or more application server systems (the data consumers). Between each application server and the data server there is an ECP connection: a TCP/IP connection that ECP uses to send data and commands.
You can monitor the status of the servers and connections in a distributed cache cluster on the ECP Settings page (System Administration > Configuration > Connectivity > ECP Settings).
The ECP Settings page has two subsections:
  1. This System as an ECP Data Server displays settings for the data server as well as the status of the ECP service.
  2. This System as an ECP Application Server displays settings for the application server.
The following sections describe status information for connections:
ECP Connection Information
Click the Data Servers button on the ECP Settings page (System Administration > Configuration > Connectivity > ECP Settings) to display the ECP Data Servers page, which lists the current data server connections on the application server. The ECP Application Servers page, which you can display by clicking the Application Servers button on the ECP Settings page, contains a list of the current application server connections on the data server.
Data Server Connections
The ECP Data Servers page displays the following information for each data server connection:
Server Name
The logical name of the data server system on this connection, as entered when the server was added to the application server configuration.
Host Name
The host name of the data server system, as entered when the server was added to the application server configuration.
IP Port
The IP port number used to connect to the data server.
Status
The current status of this connection. Connection states are described in the ECP Connection States section.
Edit
If the current status of this connection is Not Connected or Disabled, you can edit the port and host information of the data server.
Change Status
From each data server row you can change the status of an existing ECP connection with that data server; see the ECP Connection Operations section for more information.
Delete
You can delete the data server information from the application server.
Application Server Connections
Click ECP Application Servers on the ECP Settings page (System Administration > Configuration > Connectivity > ECP Settings) to view the ECP Application Servers page with a list of application server connections on this data server:
Client Name
The logical name of the application server on this connection.
Status
The current status of this connection. Connection states are described in the ECP Connection States section.
Client IP
The host name or IP address of the application server.
IP Port
The port number used to connect to the application server.
ECP Connection States
In an operating cluster, an ECP connection can be in one of the following states:
ECP Connection States
State Description
Not Connected The connection is defined but has not been used yet.
Connection in Progress The connection is in the process of establishing itself. This is a transitional state that lasts only until the connection is established.
Normal The connection is operating normally and has been used recently.
Trouble The connection has encountered a problem. If possible, the connection automatically corrects itself.
Disabled The connection has been manually disabled by a system administrator. Any application making use of this connection receives a <NETWORK> error.
The following sections describe each connection state as it relates to application servers or the data server:
Application Server Connection States
The following sections describe the application server side of each of the connection states:
Application Server Not Connected State
An application server-side ECP connection starts out in the Not Connected state. In this state, there are no ECP daemons for the connection. If an application server process makes a network request, daemons are created for the connection and the connection enters the Connection in Progress state.
Application Server Connection in Progress State
In the Connection in Progress state, a network daemon exists for the connection and actively tries to establish a connection to the data server; when the connection is established, it enters the Normal state. While the connection is in the Connection in Progress state, the user process must wait for up to 20 seconds for it to be established. If the connection is not established within that time, the user process receives a <NETWORK> error.
The application server ECP daemon attempts to create a new connection to the data server in the background. If no connection is established within 20 minutes, the connection returns to the Not Connected state and the daemon for the connection goes away.
Application Server Normal State
After a connection completes, it enters the Normal (data transfer) state. In this state, the application server-side daemons exist and actively send requests and receive answers across the network. The connection stays in the Normal state until the connection becomes unworkable or until the application server or the data server requests a shutdown of the connection.
Application Server Trouble State
If the connection from the application server to the data server encounters problems, the application server ECP connection enters the Trouble state. In this state, application server ECP daemons exist and actively try to restore the connection. An underlying TCP connection may or may not still exist. The recovery method is similar whether the underlying TCP connection has been reset and must be recreated or has merely stopped working temporarily.
During the application server Time to wait for recovery timeout (default of 20 minutes), the application server attempts to reconnect to the data server to perform ECP connection recovery. During this interval, existing network requests are preserved, but the originating application server-side user process blocks new network requests, waiting for the connection to resume. If the connection returns within the Time to wait for recovery timeout, it returns to the Normal state and the blocked network requests proceed.
For example, if a data server goes offline, any application server connected to it has its state set to Trouble until the data server becomes available. If the problem is corrected gracefully, a connection’s state reverts to Normal; otherwise, if the trouble state is not recovered, it reverts to Not Connected.
Applications continue running until they require network access. All locally cached data is available to the application while the server is not responding.
Application Server Transitional Recovery States
Transitional recovery states are part of the Trouble state. If there is no current TCP connection to the data server, and a new connection is established, the application server and data server engage in a recovery protocol which flushes the application server cache, recovers transactions and locks, and returns to the Normal state.
Similarly, if the data server shuts down, either gracefully or as a result of a crash, and then restarts, it enters a short period (approximately 30 seconds) during which it allows application servers to reconnect and recover their existing sessions. Once again, the application server and the data server engage in the recovery protocol.
If connection recovery is not complete within the Time to wait for recovery timeout, the application server gives up on connection recovery. Specifically, the application server returns errors to all pending network requests and changes the connection state to Not Connected. If it has not already done so, the data server rolls back all the transactions and releases all the locks from this application server the next time this application server connects to the data server.
If the recovery is successful, the connection returns to the Normal state and the blocked network requests proceed.
Application Server Disabled State
An ECP connection is marked Disabled if an administrator declares that it is disabled. In this state, no daemons exist and any network requests that would use that connection immediately receive <NETWORK> errors.
Data Server Connection States
The following sections describe the data server side of each of the connection states:
Data Server Free States
When an ECP server instance starts up, all incoming ECP connections are in an initial “unassigned” Free state and are available for connections from any application server that is listed in the connection access control list. If a connection from an application server previously existed and has since gone away, but does not require any recovery steps, the connection is placed in the “idle” Free state. The only difference between these two states is that in the idle state, this connection block is already assigned to a particular application server, rather than being available for any application server that passes the access control list.
Data Server Normal State
In the data server Normal state, the application server connection is normal. At any point in the processing of incoming connections, whenever the application server disconnects from the data server (except as part of the data server’s own shutdown sequence), the data server rolls back any pending transactions and releases any incoming locks from that application server, and places the application server connection in the “idle” Free state.
Data Server Trouble States
If the application server is not responding, the data server shows a Trouble state. If the data server crashes or shuts down, it remembers the connections that were active at the time of the crash or shutdown. After restarting, the data server waits for a brief time (usually 30 seconds) for application servers to reclaim their sessions (locks and open transactions). If an application server does not complete recovery during this awaiting recovery interval, all pending work on that connection is rolled back and the connection is placed in the “idle” state.
Data Server Recovering State
The data server connection is in a recovery state for a very short time while the application server is reclaiming its session. The data server keeps the connection in the Trouble state for the Time interval for Troubled state timeout (60 seconds by default) to give the application server time to reclaim the connection; if the application server does not do so, the data server releases its resources (rolls back all open transactions and releases locks) and then sets the state to Free.
ECP Connection Operations
On the ECP Data Servers page (System Administration > Configuration > Connectivity > ECP Settings, then click the Data Servers button) on an application server, you can change the status of the ECP connection. In each data server row, click Change Status to display the connection information and select the appropriate action from the following choices:
Change to Disabled
Set the state of this connection to Disabled. This releases any locks held for the application server, rolls back any open transactions involving this connection, and purges cached blocks from the data server. If this is an active connection, the change in status sends an error to all applications waiting for network replies from the data server.
Change to Normal
Set the state of this connection to Normal.
Change to Not Connected
Set the state of this connection to Not Connected. As with changing the state to Disabled, this releases any locks held for the application server, rolls back any open transactions involving this connection, and purges cached blocks from the data server. If this is an active connection, the change in status sends an error to all applications waiting for network replies from the data server.
Developing Distributed Cache Applications
This chapter discusses application development and design issues that are helpful if you would like to deploy your application on a distributed cache cluster, either as an option or as its primary configuration.
With InterSystems IRIS, the decision to deploy an application as a distributed system is primarily a runtime configuration issue (see Deploying a Distributed Cache Cluster). Using InterSystems IRIS configuration tools, map the logical names of your data (globals) and application logic (routines) to physical storage on the appropriate system.
This chapter discusses the following topics:
ECP Recovery Protocol
ECP is designed to automatically recover from interruptions in connectivity between an application server and the data server. In the event of such an interruption, ECP executes a recovery protocol that differs depending on the nature of the failure. The result is that the connection is either recovered, allowing the application processes to continue as though nothing had happened, or reset, forcing transaction rollback and rebuilding of the application processes. The main principles are as follows:
The ECP timeout settings are shown in the following table. Each is configurable on the ECP Settings page of the Management Portal (System Administration > Configuration > Connectivity > ECP Settings), or in the ECP section of the iris.cpf file (see ECP in the Configuration Parameter File Reference).
ECP Timeout Values
Management Portal Setting | CPF Setting | Default | Range | Description
Time between reconnections | ClientReconnectInterval | 5 seconds | 1-60 seconds | The interval at which an application server attempts to reconnect to the data server.
Time interval for Troubled state | ServerTroubleDuration | 60 seconds | 20-65535 seconds | The length of time for which the data server waits for contact from the application server before resetting an interrupted connection.
Time to wait for recovery | ClientReconnectDuration | 1200 seconds (20 minutes) | 10-65535 seconds | The length of time for which an application server continues attempting to reconnect to the data server before resetting an interrupted connection.
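In the iris.cpf file, these parameters appear in the [ECP] section. A minimal sketch showing the default values from the table:

  [ECP]
  ClientReconnectDuration=1200
  ClientReconnectInterval=5
  ServerTroubleDuration=60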
The default values are intended to do the following:
ECP relies on the TCP physical connection to detect the health of the instance at the other end without using too much of its capacity. On most platforms, you can adjust the TCP connection failure and detection behavior at the system level.
While an application server connection is inactive, the data server maintains an active daemon that waits for new requests to arrive on the connection, or for a new connection to be requested by the application server. If the old connection returns, it can immediately resume operation without recovery. When the underlying heartbeat mechanism indicates that the application server is completely unavailable due to a system or network failure, the underlying TCP connection is quickly reset. Thus, an extended period without a response from an application server generally indicates some kind of problem on the application server that caused it to stop functioning, but without interfering with its connections.
If the underlying TCP connection is reset, the data server puts the connection in an “awaiting reconnection” state in which there is no active ECP daemon on the data server. A new pair of data server daemons are created when the next incoming connection is requested by the application server.
Collectively, the nonresponsive state and the awaiting reconnection state are known as the data server Trouble state. The recovery required in both cases is very similar.
If the data server fails or shuts down, it remembers the connections that were active at the time of the crash or shutdown. After restarting, the data server has a short window (usually 30 seconds) during which it places these interrupted connections in the awaiting reconnection state. In this state, the application server and data server can cooperate together to recover all the transaction and lock states as well as all the pending Set and Kill transactions from the moment of the data server shutdown.
During the recovery of an ECP-configured instance, InterSystems IRIS guarantees a number of recoverable semantics, and also specifies limitations to these guarantees. ECP Recovery Process, Guarantees, and Limitations describes these in detail, as well as providing additional details about the recovery process.
Forced Disconnects
By default, ECP automatically manages the connection between an application server and a data server. When an ECP-configured instance starts up, all connections between application servers and data servers are in the Not Connected state (that is, the connection is defined, but not yet established). As soon as an application server makes a request (for data or code) that requires a connection to the data server, the connection is automatically established and the state changes to Normal. The network connection between the application server and data server is kept open indefinitely.
In some applications, you may wish to close open ECP connections. For example, suppose you have a system, configured as an application server, that periodically (a few times a day) needs to fetch data stored on a data server system, but does not need to keep the network connection with the data server open afterwards. In this case, the application server system can issue a call to the SYS.ECP.ChangeToNotConnected method to force the state of the ECP connection to Not Connected.
For example:
 Do OperationThatUsesECP()
 Do SYS.ECP.ChangeToNotConnected("ConnectionName")
The ChangeToNotConnected method does the following:
  1. Completes sending any data modifications to the data server and waits for acknowledgment from the data server.
  2. Removes any locks on the data server that were opened by the application server.
  3. Rolls back the data server side of any open transactions. The application server side of the transaction goes into a “rollback only” condition.
  4. Completes pending requests with a <NETWORK> error.
  5. Flushes all cached blocks.
After completion of the state change to Not Connected, the next request for data from the data server automatically reestablishes the connection.
Note:
See Data Server Connections for information about changing data server connection status from the management portal.
Performance and Programming Considerations
To achieve the highest performance and reliability from distributed cache cluster-based applications, you should be aware of the following issues:
Do Not Use Multiple ECP Channels
InterSystems strongly discourages establishing multiple duplicate ECP channels between an application server and a data server to try to increase bandwidth. You run the dangerous risk of having locks and updates for a single logical transaction arrive out-of-sync on the data server, which may result in data inconsistency.
Increase Data Server Database Caches for ECP Control Structures
In addition to buffering the blocks that are served over ECP, data servers use global buffers to store various ECP control structures. There are several factors that go into determining how much memory these structures might require, but the most significant is a function of the aggregate sizes of the clients' caches. To roughly approximate the requirements, so you can adjust the data server’s database caches if needed, use the following guidelines:
Database Block Size Recommendation
8 KB 50 MB plus 1% of the sum of the sizes of all of the application servers’ 8 KB database caches
16 KB (if enabled) 0.5% of the sum of the sizes of all of the application servers’ 16 KB database caches
32 KB (if enabled) 0.25% of the sum of the sizes of all of the application servers’ 32 KB database caches
64 KB (if enabled) 0.125% of the sum of the sizes of all of the application servers’ 64 KB database caches
For example, if the 16 KB block size is enabled in addition to the default 8 KB block size, and there are six application servers, each with an 8 KB database cache of 2 GB and a 16 KB database cache of 4 GB, you should adjust the data server’s 8 KB database cache to ensure that 173 MB (50 MB + [12 GB * .01 = 123 MB]) is available for control structures, and the 16 KB cache to ensure that 123 MB (24 GB * .005) is available for control structures (rounding up in both cases).
For information about allocating memory to database caches, see Memory and Startup Settings in the “Configuring InterSystems IRIS” chapter of the System Administration Guide.
Evaluate the Effects of Load-balancing User Requests
Connecting users to application servers in a round-robin or load-balancing scheme may diminish the benefit of caching on the application server. This is particularly true if users work in functional groups that have a tendency to read the same data. As these users are spread among application servers, each application server may end up requesting exactly the same data from the data server, which not only diminishes the efficiency of distributed caching by using multiple caches for the same data, but can also lead to increased block invalidation as blocks are modified on one application server and refreshed across other application servers. This tradeoff is somewhat subjective, and someone very familiar with the application’s data access patterns should evaluate whether it applies.
Restrict Transactions to a Single Data Server
Restrict updates within a single transaction to either a single remote data server or the local server. When a transaction includes updates to more than one server (including the local server) and the TCommit cannot complete successfully, some servers that are part of the transaction may have committed the updates while others may have rolled them back. For details, see Commit Guarantee in “ECP Recovery Guarantees and Limitations”.
Note:
Updates to IRISTEMP are not considered part of the transaction for the purpose of rollback, and, as such, are not included in this restriction.
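For example, the following ObjectScript sketch keeps all updates in a transaction on a single data server; the global names are illustrative and are assumed to be mapped, through the current namespace, to databases on the same remote data server:

  Set id = 100
  TSTART
  Set ^Order(id) = "widget"       // remote database on the single data server
  Set ^OrderLine(id,1) = "qty=2"  // same data server, so commit/rollback is a unit
  TCOMMIT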
Locate Temporary Globals on the Application Server
Temporary (scratch) globals should be local to the application server, assuming they do not contain data that needs to be globally shared. Often, temporary globals are highly active and write-intensive. If temporary globals are located on the data server, this may penalize other application servers sharing the ECP connection.
Avoid Repeated References to Undefined Globals
Repeated references to a global that is not defined (for example, $Data(^x(1)) where ^x is not defined) always require a network operation to test whether the global is defined on the data server.
By contrast, repeated references to undefined nodes within a defined global (for example, $Data(^x(1)) where any other node in ^x is defined) do not require a network operation once the relevant portion of the global (^x) is in the application server cache.
This behavior differs significantly from that of a non-networked application. With local data, repeated references to the undefined global are highly optimized to avoid unnecessary work. Designers porting an application to a networked environment may wish to review the use of globals that are sometimes defined and sometimes not. Often it is sufficient to make sure that some other node of the global is always defined.
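The difference can be seen in a sketch like the following, assuming ^x is mapped to a remote database (all names illustrative):

  Kill ^x                     // ^x is now wholly undefined
  For i=1:1:1000 {
      Set d = $Data(^x(1))    // each test requires a network operation
  }
  Set ^x = ""                 // define some node of ^x
  For i=1:1:1000 {
      Set d = $Data(^x(1))    // answered from the application server cache
  }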
Use the $Increment Function for Application Counters
A common operation in online transaction processing systems is generating a series of unique values for use as record numbers or the like. In a typical relational application, this is done by defining a table that contains a “next available” counter value. When the application needs a new identifier, it locks the row containing the counter, increments the counter value, and releases the lock. Even on a single-server system, this becomes a concurrency bottleneck: application processes spend more and more time waiting for the locks on this common counter to be released. In a networked environment, the bottleneck is even more severe.
InterSystems IRIS addresses this by providing the $Increment function, which automatically increments a counter value (stored in a global) without any need of application-level locking. Concurrency for $Increment is built into the InterSystems IRIS database engine as well as ECP, making it very efficient for use in single-server as well as in distributed applications.
Applications built using the default structures provided by InterSystems IRIS objects (or SQL) automatically use $Increment to allocate object identifier values. $Increment is a synchronous operation involving journal synchronization when executed over ECP. For this reason, $Increment over ECP is a relatively slow operation, especially compared to others which may or may not already have data cached (in either the application server database cache or the data server database cache). The impact of this may be even greater in a mirrored environment due to network latency between the failover members. For this reason, it may be useful to redesign an application to assign a batch of new values to each application server and use $Increment with that local batch within each application server, involving the data server only when a new batch of values is needed. (This approach cannot be used, however, when consecutive application counter values are required.) The $Sequence function can also be helpful in this context, as an alternative to or used in combination with $Increment.
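The batch approach might look like the following ObjectScript sketch, which assumes ^Counter resides in a remote database on the data server while ^LocalId is in a database local to the application server; all names are illustrative:

  ClassMethod NextId() As %Integer
  {
      Lock +^LocalId
      // When the local batch is exhausted (or on first use), reserve
      // 1000 new values from the data server with a single $Increment.
      If $Get(^LocalId("next"),1) > $Get(^LocalId("last"),0) {
          Set ^LocalId("last") = $Increment(^Counter, 1000)
          Set ^LocalId("next") = ^LocalId("last") - 1000 + 1
      }
      Set id = ^LocalId("next")
      Set ^LocalId("next") = id + 1
      Lock -^LocalId
      Quit id
  }

Because ^LocalId is stored locally, the Lock and the bookkeeping updates never leave the application server; only the occasional $Increment on ^Counter crosses the ECP connection.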
ECP-related Errors
There are several runtime errors that can occur on a system using ECP. An ECP-related error may occur immediately after a command is executed or, in the case of commands that are asynchronous in nature, such as Kill, the error occurs a short time after the command completes.
<NETWORK> Errors
A <NETWORK> error indicates an error that could not be handled by the normal ECP recovery mechanism.
In an application, it is always acceptable to halt a process or roll back any pending work whenever a <NETWORK> error is received. Some <NETWORK> errors are essentially fatal error conditions. Others indicate a temporary condition that might clear up soon; however, the expected programming practice is to always roll back any pending work in response to a <NETWORK> error and start the current transaction over from the beginning.
A <NETWORK> error on a get-type request such as $Data or $Order can often be retried rather than immediately rolling back the transaction. ECP tries to avoid returning a <NETWORK> error that would lose data, but returns errors more freely for read-only requests.
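A defensive pattern, sketched below with hypothetical names, is to catch the error, roll back any pending work, and then either retry the transaction from the beginning or surface the failure:

    try {
        tstart
        set ^Orders(id) = data       // update shipped to the data server
        tcommit
    } catch ex {
        // ex.Name holds the error name, for example "<NETWORK>"
        if ex.Name = "<NETWORK>" {
            if $tlevel { trollback } // roll back pending work, as recommended
            // retry the whole transaction from the beginning, or report failure
        } else {
            throw ex
        }
    }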
Rollback Only Condition
The application-side rollback-only condition occurs when the data server detects a network failure during a transaction initiated by the application server; the application server connection then enters a state in which all network requests are met with errors until the transaction is rolled back.
ECP Recovery Process, Guarantees, and Limitations
The ECP recovery protocol is summarized in ECP Recovery Protocol. This section describes ECP recovery in detail, including its guarantees and limitations.
The simplest case of ECP recovery is a temporary network interruption that is long enough to be noticed, but short enough that the underlying TCP connection stays active during the outage. During the outage, the application server notices that the connection is nonresponsive and blocks new network requests for that connection. Once the connection resumes, processes that were blocked are able to send their pending requests.
If the underlying TCP connection is reset, the data server waits for a reconnection for the Time interval for Troubled state setting (one minute by default). If the application server does not succeed in reconnecting during that interval, the data server resets its connection, rolls back its open transactions, and releases its locks. Any subsequent connection from that application server is converted into a request for a brand new connection and the application server is notified that its connection is reset.
The application server keeps a queue of locks to remove and transactions to roll back once the connection is reestablished. By keeping this queue, processes on the application server can always halt, whether or not the data server on which they have pending transactions and locks is currently available. ECP recovery completes any pending Set and Kill operations that had been queued for the data server before the network outage was detected, and only then completes the release of locks.
Any time a data server learns that an application server has reset its own connection (due to an application server restart, for example), even within the Time interval for Troubled state, the data server resets the connection immediately, rolling back transactions and releasing locks on behalf of that application server. Since the application server's state was reset, there is no longer any state for the data server to maintain on its behalf.
The final case is when the data server shuts down, either gracefully or as a result of a crash. The application server maintains the application state and tries to reconnect to the data server for the Time to wait for recovery setting (20 minutes by default). The data server remembers the application server connections that were active at the time of the crash or shutdown; after restarting, it waits up to thirty seconds for those application servers to reconnect and recover their connections. Recovery involves several steps on the data server, some of which rely heavily on the data server journal file. The result of these steps is that:
During the recovery of an ECP-configured system, InterSystems IRIS guarantees a number of recoverable semantics, which are described in detail in ECP Recovery Guarantees. Limitations to these guarantees are described in detail in ECP Recovery Limitations.
ECP Recovery Guarantees
During the recovery of an ECP-configured system, InterSystems IRIS guarantees the following recoverable semantics:
In the description of each guarantee, the first paragraph describes a specific condition, and the subsequent paragraphs describe the data guarantee applicable to that situation.
In these descriptions, Process A, Process B, and so on refer to processes attempting to update globals on a data server. These processes may originate on the same or different application servers, or on the data server itself; in some cases the origins of the processes are specified, in others they are not germane.
In-order Updates Guarantee
Process A updates two data elements sequentially, first global ^x and next global ^y, where ^x and ^y are located on the same data server.
If Process B sees the change to ^y, it also sees the change to ^x. This guarantee applies whether or not Process A and Process B are on the same application server as long as the two data items are on the same data server and the data server remains up.
Process B’s ability to view the data modified by Process A does not ensure that Set operations from Process B are restored after the Set operations from Process A. Only a Lock or a $Increment operation can ensure proper ordering of competing Set and Kill operations from two different processes during cluster failover or cluster recovery.
See the Loose Ordering in Cluster Failover or Restore limitation regarding the order in which competing Set and Kill operations from separate processes are applied during cluster dejournaling and cluster failover.
Important:
This guarantee does not apply if the data server crashes, even if ^x and ^y are journaled. See the Dirty Data Reads in ECP Without Locking limitation for a case in which processes that fit this description can see dirty data that never becomes durable before the data server crash.
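A minimal sketch of this guarantee, using hypothetical globals that reside on the same data server:

    // Process A: two sequential updates
    set ^x(1) = "first"
    set ^y(1) = "second"

    // Process B (on any application server): if the later update is visible...
    if $get(^y(1)) = "second" {
        write ^x(1)     // ...the earlier one is guaranteed to be visible: "first"
    }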
ECP Lock Guarantee
Process B on DataServer S acquires a lock on global ^x, which was once locked by Process A.
Process B can see all updates on DataServer S done by Process A (while holding a lock on ^x). Also, if Process C sees the updates done by Process B on DataServer S (while holding a lock on ^x), Process C is guaranteed to also see the updates done by Process A on DataServer S (while holding a lock on ^x).
Serializability is guaranteed whether or not Process A, Process B, and Process C are located on the same application server or on DataServer S itself, as long as DataServer S stays up throughout.
Important:
The lock and the data it protects must reside on the same data server.
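For example (hypothetical names; note that the lock global ^x and the data global ^Data must both reside on DataServer S):

    // Process A: update under protection of a lock
    lock +^x(1)
    set ^Data(1) = "written by A"
    lock -^x(1)

    // Process B: acquiring the same lock guarantees visibility of A's update
    lock +^x(1)
    write ^Data(1)      // "written by A"
    lock -^x(1)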
Clusters Lock Guarantee
Process B on a cluster member acquires a lock on global ^x in a clustered database, a lock once held by Process A.
Process B sees all updates to any clustered database done by Process A (while holding a lock on ^x).
Additionally, if Process C on a cluster member sees the updates on a clustered database made by Process B (while holding a lock on ^x), Process C also sees the updates made by Process A on any clustered database (while holding a lock on ^x).
Serializability is guaranteed whether or not Process A, Process B, and Process C are located on the same cluster member, and whether or not any cluster member crashes.
Important:
See the Dirty Data Reads When Cluster Slave Crashes limitation regarding transactions on one cluster member seeing dirty data from a transaction on a cluster member that crashes.
Rollback Guarantee
Process A executes a TStart command, followed by a series of updates, and either halts before issuing a TCommit, or executes a TRollback before executing a TCommit.
All the updates made by Process A as part of the transaction are rolled back in the reverse order in which they originally occurred.
Important:
See the rollback-related limitations: Conflicting, Non-Locked Change Breaks Rollback, Journal Discontinuity Breaks Rollback, and Asynchronous TCommit Converts to Rollback for more information.
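A minimal illustration of the guarantee, using a hypothetical global:

    tstart
    set ^Acct(1) = 100      // update 1
    kill ^Acct(2)           // update 2
    trollback               // update 2 is undone first, then update 1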
Commit Guarantee
Process A makes a series of updates on DataServer S and halts after starting the execution of a TCommit.
On each data server that is part of the transaction, the data modifications are either committed or rolled back. If the process that executes the TCommit has the Perform Synchronous Commit property turned on (SynchCommit=1 in the configuration file) and the TCommit operation returns without errors, the transaction is guaranteed to have durably committed on all the data servers that are part of the transaction.
Important:
If the transaction includes updates to more than one server (including the local server) and the TCommit cannot complete successfully, some servers that are part of the transaction may have committed the updates while others may have rolled them back.
Transactions and Locks Guarantee
Process A executes a TStart for Transaction T, locks global ^x on DataServer S, and unlocks ^x (unlock does not specify the “immediate unlock” lock type).
InterSystems IRIS guarantees that the lock on ^x is not released until the transaction has been either committed or rolled back. No other process can acquire a lock on ^x until Transaction T either commits or rolls back on DataServer S.
Once Transaction T commits on DataServer S, a Process B that acquires a lock on ^x sees the changes on DataServer S made by Process A during Transaction T. Any other process that sees changes on DataServer S made by Process B (while holding a lock on ^x) also sees changes on DataServer S made by Process A (while executing Transaction T). Conversely, if Transaction T rolled back on DataServer S, a Process B that acquires a lock on ^x sees none of the changes made by Process A on DataServer S.
Important:
See the Conflicting, Non-Locked Change Breaks Rollback limitation for more information.
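The default unlock behavior inside a transaction can be sketched as follows (hypothetical names):

    tstart
    lock +^x(1)
    set ^Data(1) = "new value"
    lock -^x(1)     // not an immediate unlock: the lock is retained internally
                    // (the immediate form would be: lock -^x(1)#"I")
    // ... further transactional work ...
    tcommit         // only now can another process acquire a lock on ^x(1)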
ECP Rollback Only Guarantee
Process A on AppServer C makes changes on DataServer S that are part of a Transaction T, and DataServer S unilaterally rolls back those changes (which can happen in certain network outages or data server outages).
All subsequent network requests to DataServer S by Process A are rejected with <NETWORK> errors until Process A explicitly executes a TRollback command.
Additionally, if any process on AppServer C completes a network request to DataServer S between the rollback on DataServer S and the TCommit of Transaction T (AppServer C finds out about the rollback-only condition before the TCommit), Transaction T is guaranteed to roll back on all data servers that are part of Transaction T.
ECP Transaction Recovery Guarantee
A data server crashes in the middle of an application server transaction, restarts, and completes recovery within the application server recovery timeout interval.
The transaction can be completed normally without violating any of the described guarantees. The data server does not perform any data operations that violate the ordering constraints defined by lock semantics. The only exception is the $Increment function (see the ECP and Clusters $Increment Limitation section for more information). Any transactions that cannot be recovered are rolled back in a way that preserves lock semantics.
Important:
InterSystems IRIS expects, but does not guarantee, that in the absence of continuing faults (whether in the network, the data server, or the application server hardware or software), all or most of the transactions pending on a data server at the time of a data server outage are recovered.
ECP Lock Recovery Guarantee
DataServer S has an unplanned shutdown, restarts, and completes recovery within the recovery interval.
The ECP Lock Guarantee still applies as long as all the modified data is journaled. If data is not being journaled, updates made to the data server before the crash can disappear without notice to the application server. InterSystems IRIS no longer guarantees that a process that acquires the lock sees all the updates that were made earlier by other processes while holding the lock.
If DataServer S shuts down gracefully, restarts, and completes recovery within the recovery interval, the ECP Lock Guarantee still applies whether or not data is being journaled.
Updates that are part of a transaction are always journaled; the ECP Transaction Recovery Guarantee applies in a stronger form. Other updates may or may not be journaled, depending on whether or not the destination global in the destination database is marked for journaling.
$Increment Ordering Guarantee
The $Increment function induces a loose ordering on a series of Set and Kill operations from separate processes, even if those operations are not protected by a lock.
For example: Process A performs some Set and Kill operations on DataServer S and performs a $Increment operation to a global ^x on DataServer S. Process B performs a subsequent $Increment to the same global ^x. Any process, including Process B, that sees the result of Process B incrementing ^x, sees all changes on DataServer S that Process A made before incrementing ^x.
Important:
See the ECP and Clusters $Increment Limitation section for more information.
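For example (hypothetical names; note that the Set in Process A is not protected by any lock):

    // Process A:
    set ^Data("a") = 1              // unprotected update
    set seqA = $increment(^x)       // induces ordering

    // Process B, afterward:
    set seqB = $increment(^x)       // seqB > seqA
    // any process that observes Process B's increment of ^x is also
    // guaranteed to see ^Data("a") = 1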
ECP Sync Method Guarantee
Process A updates a global located on DataServer S and issues a $system.ECP.Sync() to DataServer S. Process B then issues a $system.ECP.Sync() to DataServer S. Process B can see all updates performed by Process A on DataServer S prior to Process A's $system.ECP.Sync() call.
$system.ECP.Sync() is relevant only for processes running on an application server. If either Process A or Process B is running on DataServer S itself, that process does not need to issue a $system.ECP.Sync(). If both are running on DataServer S, neither needs $system.ECP.Sync(); this is simply the statement that global updates are immediately visible to processes running on the same server.
Important:
$system.ECP.Sync() does not guarantee durability; see the Dirty Data Reads in ECP Without Locking limitation.
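A minimal sketch of the guarantee (hypothetical global; the no-argument form of $system.ECP.Sync() is shown, as in the description above; consult the %SYSTEM.ECP class reference for the exact signature):

    // Process A, on an application server:
    set ^Data(1) = "updated"
    do $system.ECP.Sync()

    // Process B, on another application server, afterward:
    do $system.ECP.Sync()
    write ^Data(1)      // guaranteed to print "updated"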
ECP Recovery Limitations
During the recovery of an ECP-configured system, the following limitations to the InterSystems IRIS guarantees apply:
ECP and Clusters $Increment Limitation
If a data server crashes while the application server has a $Increment request outstanding to the data server and the global is journaled, InterSystems IRIS attempts to recover the $Increment results from the journal; it does not re-increment the reference.
ECP Cache Liveness Limitation
In the absence of continuing faults, application servers observe data that is no more than a few seconds out of date, but this is not guaranteed. Specifically, if an ECP connection to the data server becomes nonfunctional (because of network problems, a data server shutdown, a data server backup operation, and so on), the user process may observe data that is arbitrarily stale, up to an application server connection-timeout value. To ensure that data is not stale, use the Lock command around the data-fetch operation, or use $system.ECP.Sync(). Any network request that makes a round trip to the data server updates the contents of the application server's ECP network cache.
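For example, a lock forces a round trip to the data server, so the value fetched under the lock is current (hypothetical global):

    lock +^Config("timeout")
    set val = $get(^Config("timeout"))      // refreshed from the data server
    lock -^Config("timeout")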
ECP Routine Revalidation Limitation
If an application server downloads routines from a data server and the data server restarts (planned or unplanned), the routines downloaded from the data server are marked as if they had been edited.
Additionally, if the connection to the data server suffers a network outage (neither application server nor data server shuts down), the routines downloaded from the data server are marked as if they had been edited. In some cases, this behavior causes spurious <EDITED> errors as well as <ERRTRAP> errors.
Conflicting, Non-Locked Change Breaks Rollback
In InterSystems IRIS, the Lock command is only advisory. If Process A starts a transaction that is updating global ^x under protection of a lock on global ^y, and another process modifies ^x without the protection of a lock on ^y, the rollback of ^x does not work.
On the rollback of Set and Kill operations, if the current value of the data item is what the operation set it to, the value is reset to what it was before the operation. If the current value is different from what the specific Set or Kill operation set it to, the current value is left unchanged.
If a data item is sometimes modified inside a transaction, and sometimes modified outside of a transaction and outside the protection of a Lock command, rollback is not guaranteed to work. To be effective, locks must be used everywhere a data item is modified.
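The discipline can be sketched as follows: every process that modifies the protected data item takes the same lock, whether or not it is inside a transaction (hypothetical names):

    lock +^Bal(acct)
    tstart
    set ^Bal(acct) = $get(^Bal(acct)) - amt
    tcommit
    lock -^Bal(acct)

Rollback of ^Bal(acct) is reliable only if every writer follows this pattern.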
Journal Discontinuity Breaks Rollback
Rollback depends on the reliability and completeness of the journal. If something interrupts the continuity of the journal data, rollbacks do not succeed past the discontinuity. InterSystems IRIS silently ignores this type of transaction rollback.
A journal discontinuity can be caused by executing ^JRNSTOP while InterSystems IRIS is running, by deleting the Write Image Journal (WIJ) file after an InterSystems IRIS shutdown and before restart, or by an I/O error during journaling on a system that is not set to freeze the system on journal errors.
ECP Can Miss Error After Recovery
A Set or Kill operation completes on a data server but receives an error. The data server crashes after completing that packet, but before delivering it to the application server.
ECP recovery does not replay this packet, and the application server never finds out about the error; as a result, the application server misses Set or Kill operations on the data server.
Partial Set or Kill Leads to Journal Mismatch
There are certain cases in which a Set or Kill operation can be journaled successfully but receive an error before actually modifying the database. Given the particular way rollback of a data item is defined, this should never break transaction rollback, but the state of a database after a journal restore may not match the state of that database before the restore.
Loose Ordering in Cluster Failover or Restore
Cluster dejournaling is loosely ordered. The journal files from the separate cluster members are only synchronized wherever a lock, a $Increment, or a journal marker event occurs. This affects the database state after either a cluster failover or a cluster crash where the entire cluster must be brought down and restored. The database may be restored to a state that is different from the state just before the crash. The $Increment Ordering Guarantee places additional constraints on how different the restored database can be from its original form before the crash.
Process B’s ability to view the data modified by Process A does not ensure that Set operations from Process B are restored after the Set operations from Process A. Only a Lock or a $Increment operation can ensure proper ordering of competing Set and Kill operations from two different processes during cluster failover or cluster recovery.
Dirty Data Reads When Cluster Slave Crashes
A cluster slave Member A completes updates in Transaction T1, and that system commits that transaction, but in non-synchronous transaction commit mode. Transaction T2 on a different cluster Member B acquires the locks once owned by Transaction T1. Cluster slave Member A crashes before all the information from Transaction T1 is written to disk.
Transaction T1 is rolled back as part of cluster failover. However, Transaction T2 on Member B could have seen data from Transaction T1 that later was rolled back as part of cluster failover, despite following the rules of the locking protocol. Additionally, if Transaction T2 has modified some of the same data items as Transaction T1, the rollback of Transaction T1 may fail because only some of the transaction data has rolled back.
A workaround is to use synchronous commit mode for transactions on the cluster slave Member A. When using synchronous commit mode, Transaction T1 is durable on disk before its locks are released, so Transaction T1 is not rolled back once the application sees that it is complete.
Dirty Data Reads in ECP Without Locking
If an incoming ECP transaction reads data without locking, it may see data that is not yet durable on disk and that may disappear if the data server crashes. It can see such data only when the data location is set by other ECP connections or by the local data server system itself; it can never see nondurable data set by this connection itself. There is no possibility of seeing nondurable data when locking is used both in the process reading the data and in the process writing the data. This is a violation of the In-order Updates Guarantee, and there is no easy workaround other than to use locking.
Asynchronous TCommit Converts to Rollback
If the data server side of a transaction receives an asynchronous error condition, such as a <FILEFULL>, while updating a database, and the application server does not see that error until the TCommit, the transaction is automatically rolled back on the data server. However, while TCommit operations are usually asynchronous, the rollback is synchronous, because it changes blocks that the application server must be notified about before the application server process surrenders any locks.
The data server and the database remain consistent, but on the application server, if the locks are passed to another process, that process may temporarily see data that is about to be rolled back. In practice, however, the application server does not usually do anything that causes asynchronous errors.