ECP Recovery Guarantees and Limitations

This appendix describes the guarantees Caché provides during the recovery of an ECP-configured system and the limitations associated with those guarantees. The semantics described are the same as those the global module provides in other contexts, such as in a single-server system or in a Caché cluster. It is divided into the following sections:

ECP Recovery Guarantees

During the recovery of an ECP-configured system, Caché guarantees the following recoverable semantics:

In the description of each guarantee the first paragraph describes a specific condition. Subsequent paragraphs describe the data guarantee applicable to that particular situation.

In these descriptions, Process A, Process B and so on refer to processes attempting update globals on an ECP data server. These processes may originate on the same or different application servers, or on the data server itself; in some cases the origins of processes are specified, in others they are not germane.

In-order Updates Guarantee

Process A updates two data elements sequentially, first global ^x and next global ^y, where ^x and ^y are located on the same data server.

If Process B sees the change to ^y, it also sees the change to ^x. This guarantee applies whether or not Process A and Process B are on the same application server as long as the two data items are on the same data server and the data server remains up.

Process B’s ability to view the data modified by Process A does not ensure that Set operations from Process B are restored after the Set operations from Process A. Only a Lock or a $Increment operation can ensure proper ordering of competing Set and Kill operations from two different processes during cluster failover or cluster recovery.

See the Loose Ordering in Cluster Failover or Restore limitation regarding the order in which competing Set and Kill operations from separate processes are applied during cluster dejournaling and cluster failover.

Important:

This guarantee does not apply if the data server crashes, even if ^x and ^y are journaled. See the Dirty Data Reads for ECP Without Locking limitation for a case in which processes that fit this description can see dirty data that never becomes durable before the data server crash.

ECP Lock Guarantee

Process B on DataDataServer S acquires a lock on global ^x, which was once locked by Process A.

Process B can see all updates on DataServer S done by Process A (while holding a lock on ^x). Also, if Process C sees the updates done by Process B on DataServer S (while holding a lock on ^x), Process C is guaranteed to also see the updates done by Process A on DataServer S (while holding a lock on ^x).

Serializability is guaranteed whether or not Process A, Process B, and Process C are located on the same application server or on DataServer S itself, as long as DataServer S stays up throughout.

Important:

The lock and the data it protects must reside on the same data server.

Clusters Lock Guarantee

Process B on a cluster member acquires a lock on global ^x in a clustered database; a lock once held by Process A.

Process B sees all updates to any clustered database done by Process A (while holding a lock on ^x).

Additionally, if Process C on a cluster member sees the updates on a clustered database made by Process B (while holding a lock on ^x), Process C also sees the updates made by Process A on any clustered database (while holding a lock on ^x).

Serializability is guaranteed whether or not Process A, Process B, and Process C are located on the same cluster member, and whether or not any cluster member crashes.

Important:

See the Dirty Data Reads When Cluster Slave Crashes limitation regarding transactions on one cluster member seeing dirty data from a transaction on a cluster member that crashes.

Rollback Guarantee

Process A executes a TStart command, followed by a series of updates, and either halts before issuing a TCommit, or executes a TRollback before executing a TCommit.

All the updates made by Process A as part of the transaction are rolled back in the reverse order in which they originally occurred.

Important:

See the rollback-related limitations: Conflicting, Non-Locked Change Breaks Rollback, Journal Discontinuity Breaks Rollback, and Asynchronous TCommit Converts to Rollback for more information.

Commit Guarantee

Process A makes a series of updates on DataServer S and halts after starting the execution of a TCommit.

On each DataServer S that is part of the transaction, the data modifications on DataServer S are either committed or rolled back. If the process that executes the TCommit has the Perform Synchronous Commit property turned on (SynchCommit=1, in the configuration file) and the TCommit operation returns without errors, the transaction is guaranteed to have durably committed on all the data servers that are part of the transaction.

Important:

If the transaction includes updates to more than one server (including the local server) and the TCommit cannot complete successfully, some servers that are part of the transaction may have committed the updates while others may have rolled them back.

Transactions and Locks Guarantee

Process A executes a TStart for Transaction T, locks global ^x on DataServer S, and unlocks ^x (unlock does not specify the “immediate unlock” lock type).

Caché guarantees that the lock on ^x is not released until the transaction has been either committed or rolled back. No other process can acquire a lock on ^x until Transaction T either commits or rolls back on DataServer S.

Once Transaction T commits on DataServer S, Process B that acquires a lock on ^x sees changes on DataServer S made by Process A during Transaction T. Any other process that sees changes on DataServer S made by Process B (while holding a lock on ^x) sees changes on DataServer S made by Process A (while executing Transaction T). Conversely, if Transaction T rolled back on DataServer S, a Process B that acquires a lock on ^x, sees none of the changes made by Process A on DataServer S.

Important:

See the Conflicting, Non-Locked Change Breaks Rollback limitation for more information.

ECP Rollback Only Guarantee

Process A on AppServer C makes changes on DataServer S that are part of a Transaction T, and DataServer S unilaterally rolls back those changes (which can happen in certain network outages or data server outages).

All subsequent network requests to DataServer S by Process A are rejected with <NETWORK> errors until Process A explicitly executes a TRollback command.

Additionally, if any process on AppServer C completes a network request to DataServer S between the rollback on DataServer S and the TCommit of Transaction T (AppServer C finds out about the rollback-only condition before the TCommit), Transaction T is guaranteed to roll back on all data servers that are part of Transaction T.

ECP Transaction Recovery Guarantee

An ECP data server crashes in the middle of an application server transaction, restarts, and completes recovery within the application server recovery timeout interval.

The transaction can be completed normally without violating any of the described guarantees. The data server does not perform any data operations that violate the ordering constraints defined by lock semantics. The only exception is the $Increment function (see the ECP and Clusters $Increment Limitation section for more information). Any transactions that cannot be recovered are rolled back in a way that preserves lock semantics.

Important:

Caché expects but does not guarantee that in the absence of continuing faults (whether in the network, the data server, or the application server hardware or software), all or most of the transactions pending into an ECP data server at the time of a data server outage are recovered.

ECP Lock Recovery Guarantee

DataServer S has an unplanned shutdown, restarts, and completes recovery within the recovery interval.

The ECP Lock Guarantee still applies as long as all the modified data is journaled. If data is not being journaled, updates made to the data server before the crash can disappear without notice to the application server. Caché no longer guarantees that a process that acquires the lock sees all the updates that were made earlier by other processes while holding the lock.

If DataServer S shuts down gracefully, restarts, and completes recovery within the recovery interval, the ECP Lock Guarantee still applies whether or not data is being journaled.

Updates that are part of a transaction are always journaled; the ECP Transaction Recovery Guarantee applies in a stronger form. Other updates may or may not be journaled, depending on whether or not the destination global in the destination database is marked for journaling.

$Increment Ordering Guarantee

The $Increment function induces a loose ordering on a series of Set and Kill operations from separate processes, even if those operations are not protected by a lock.

For example: Process A performs some Set and Kill operations on DataServer S and performs a $Increment operation to a global ^x on DataServer S. Process B performs a subsequent $Increment to the same global ^x. Any process, including Process B, that sees the result of Process B incrementing ^x, sees all changes on DataServer S that Process A made before incrementing ^x.

Important:

See the ECP and Clusters $Increment Limitation section for more information.

ECP Sync Method Guarantee

Process A updates a global located on Data Server S, and issues a $system.ECP.Sync() to S. Process B then issues a $system.ECP.Sync() to S. Process B can see all updates performed by Process A on Data Server S prior to its $system.ECP.Sync() call.

$system.ECP.Sync() is relevant only for processes running on an application server. If either process A or B are running on DataServer S itself, then that process does not need to issue a $system.ECP.Sync(). If both are running on DataServer S then neither needs $system.ECP.Sync, and this is simply the statement that global updates are immediately visible to processes running on the same server.

Important:

$system.ECP.Sync() does not guarantee durability; see the Dirty Data Reads in ECP without Locking limitation.

ECP Recovery Limitations

During the recovery of an ECP-configured system, there are the following limitations to the Caché guarantees:

ECP and Clusters $Increment Limitation

If an ECP data server crashes while the application server has a $Increment request outstanding to the data server and the global is journaled, Caché attempts to recover the $Increment results from the journal; it does not re-increment the reference.

ECP Cache Liveness Limitation

In the absence of continuing faults, application servers observe data that is no more than a few seconds out of date, but this is not guaranteed. Specifically, if an ECP connection to the data server becomes nonfunctional (network problems, data server shutdown, data server backup operation, and so on), the user process may observe data that is arbitrarily stale, up to an application server connection-timeout value. To ensure that data is not stale, use the Lock command around the data-fetch operation, or use $system.ECP.Sync. Any network request that makes a round trip to the data server updates the contents of the application server ECP network cache.

ECP Routine Revalidation Limitation

If an application server downloads routines from an ECP data server and the data server restarts (planned or unplanned), the routines downloaded from the data server are marked as if they had been edited.

Additionally, if the connection to the data server suffers a network outage (neither application server nor data server shuts down), the routines downloaded from the data server are marked as if they had been edited. In some cases, this behavior causes spurious <EDITED> errors as well as <ERRTRAP> errors.

Conflicting, Non-Locked Change Breaks Rollback

In Caché, the Lock command is only advisory. If Process A starts a transaction that is updating global ^x under protection of a lock on global ^y, and another process modifies ^x without the protection of a lock on ^y, the rollback of ^x does not work.

On the rollback of Set and Kill operations, if the current value of the data item is what the operation set it to, the value is reset to what it was before the operation. If the current value is different from what the specific Set or Kill operation set it to, the current value is left unchanged.

If a data item is sometimes modified inside a transaction, and sometimes modified outside of a transaction and outside the protection of a Lock command, rollback is not guaranteed to work. To be effective, locks must be used everywhere a data item is modified.

Journal Discontinuity Breaks Rollback

Rollback depends on the reliability and completeness of the journal. If something interrupts the continuity of the journal data, rollbacks do not succeed past the discontinuity. Caché silently ignores this type of transaction rollback.

A journal discontinuity can be caused by executing ^JRNSTOP while Caché is running, by deleting the Write Image Journal (WIJ) file after a Caché shutdown and before restart, or by an I/O error during journaling on a system that is not set to freeze the system on journal errors.

ECP Can Miss Error After Recovery

A Set or Kill operation completes on a data server, but receives an error. The data server crashes after completing that packet, but before delivering that packet to the application server system.

ECP recovery does not replay this packet, but the application server has not found out about the error; resulting in the application server missing Set or Kill operations on the data server.

Partial Set or Kill Leads to Journal Mismatch

There are certain cases where a Set or Kill operation can be journaled successfully, but receive an error before actually modifying the database. Given the particular way rollback of a data item is defined, this should not ever break transaction rollback; but the state of a database after a journal restore may not match the state of that database before the restore.

Loose Ordering in Cluster Failover or Restore

Cluster dejournaling is loosely ordered. The journal files from the separate cluster members are only synchronized wherever a lock, a $Increment, or a journal marker event occurs. This affects the database state after either a cluster failover or a cluster crash where the entire cluster must be brought down and restored. The database may be restored to a state that is different from the state just before the crash. The $Increment Ordering Guarantee places additional constraints on how different the restored database can be from its original form before the crash.

Dirty Data Reads When Cluster Slave Crashes

A cluster slave Member A completes updates in Transaction T1, and that system commits that transaction, but in non-synchronous transaction commit mode. Transaction T2 on a different cluster Member B acquires the locks once owned by Transaction T1. Cluster slave Member A crashes before all the information from Transaction T1 is written to disk.

Transaction T1 is rolled back as part of cluster failover. However, Transaction T2 on Member B could have seen data from Transaction T1 that later was rolled back as part of cluster failover, despite following the rules of the locking protocol. Additionally, if Transaction T2 has modified some of the same data items as Transaction T1, the rollback of Transaction T1 may fail because only some of the transaction data has rolled back.

A workaround is to use synchronous commit mode for transactions on the cluster slave Member A. When using synchronous commit mode, Transaction T1 is durable on disk before its locks are released, so Transaction T1 is not rolled back once the application sees that it is complete.

Dirty Data Reads in ECP Without Locking

If an incoming ECP transaction reads data without locking, it may see data that is not durable on disk which may disappear if the data server crashes. It can only see such data when the data location is set by other ECP connections or by the local data server system itself. It can never see nondurable data that is set by this connection itself. There is no possibility of seeing nondurable data when locking is used both in the process reading the data and the process writing the data. This is a violation of the In-order Updates Guarantee and there is no easy workaround other than to use locking.

Asynchronous TCommit Converts to Rollback

If the data server side of a transaction receives an asynchronous error condition, such as a <FILEFULL>, while updating a database, and the application server does not see that error until the TCommit, the transaction is automatically rolled back on the data server. However, rollbacks are synchronous while TCommit operations are usually asynchronous because the rollback will be changing blocks the application server should be notified of before the application server process surrenders any locks.

The data server and the database are fine, but on the application server if the locks get traded to another process he may see temporarily see data that is about to be rolled back. However, the application server does not usually do anything that causes asynchronous errors