Journal I/O Errors
This page explains how InterSystems IRIS® data platform responds when it encounters a journal file I/O error.
The response depends on the Freeze on error journal setting, which is on the Journal Settings page of the Management Portal. The Freeze on error setting works as follows:
- When the Freeze on error setting is No (the default), the journal daemon retries the failed operation until it succeeds or until one of several conditions is met, at which point all journaling is disabled. This approach keeps the system available, but disabling journaling compromises data integrity and recoverability.
- When Freeze on error is set to Yes, all journaled global updates are frozen. This protects data integrity at the expense of system availability.
The Freeze on error setting also affects application behavior when a local transaction rollback fails.
InterSystems recommends you review your business needs and determine the best approach for your environment, using the information on this page.
Journal Freeze on Error Setting is No
If you configure InterSystems IRIS not to freeze on a journal file I/O error, the journal daemon retries the failed operation periodically (typically at one-second intervals) until either it succeeds or one of the following conditions is met:
- The daemon has been retrying the operation for a predetermined time period (typically 150 seconds).
- The system cannot buffer any further journaled updates.
When one of these conditions is met, journaling is disabled and database updates are no longer journaled. As a result, the journal is no longer a reliable source from which to recover databases if the system crashes. The following conditions exist when journaling is disabled:
- Transaction rollback fails, generating <ROLLFAIL> errors and leaving transactions partly committed.
- Crash recovery of uncommitted data is not available.
- Full recovery is no longer possible. You can recover only to the last backup.
- ECP lock and transaction recoverability guarantees are compromised.
- If the system crashes, InterSystems IRIS startup recovery does not attempt to roll back incomplete transactions that started before journaling was disabled, because those transactions may have been committed but not journaled.
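The retry behavior described in this section can be pictured with a small model. The following Python sketch is illustrative only and is not InterSystems code: the function name, the callbacks, and the result type are all hypothetical, and the one-second interval and 150-second limit are simply the "typical" values quoted above. Time is simulated (the default sleep callback does nothing) so that the model runs instantly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryResult:
    journaling_enabled: bool  # False means journaling was disabled
    attempts: int             # number of write attempts made
    reason: str               # why the loop ended


def retry_journal_write(
    write: Callable[[], bool],
    buffer_has_room: Callable[[], bool],
    retry_interval: float = 1.0,   # "typically at one-second intervals"
    retry_limit: float = 150.0,    # "typically 150 seconds"
    sleep: Callable[[float], None] = lambda seconds: None,
) -> RetryResult:
    """Hypothetical model of the retry loop when Freeze on error is No."""
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1
        if write():
            return RetryResult(True, attempts, "write succeeded")
        # Stop condition: no room left to buffer journaled updates.
        if not buffer_has_room():
            return RetryResult(
                False, attempts, "cannot buffer further journaled updates"
            )
        # Stop condition: the retry period has been used up.
        if elapsed >= retry_limit:
            return RetryResult(False, attempts, "retry period exceeded")
        sleep(retry_interval)
        elapsed += retry_interval


if __name__ == "__main__":
    # A write that never succeeds, with buffer space always available.
    print(retry_journal_write(lambda: False, lambda: True))
```

In this model, either stop condition ends the loop with journaling_enabled set to False, which corresponds to the point at which the consequences listed above begin to apply.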
To summarize, if journaling is disabled, perform the following steps:
- Resolve the problem: As soon as possible, resolve the problem that disabled journaling.
- Switch the journal file: The journal daemon retries the failed I/O operation periodically in an attempt to preserve the journal data accumulated prior to the disabling. If necessary, you can switch the journal file to a new directory to resolve the error; however, InterSystems IRIS does not re-enable journaling automatically even if it succeeds with the failed I/O operation and switches journaling to a new file. It also does not re-enable journaling if you switch the journal file manually.
- Back up the databases: Back up the databases on the main server. The backup automatically re-enables journaling if you have not already done so.
InterSystems strongly recommends backing up your databases as soon as possible after the error to avoid potential data loss. In fact, performing an online backup when journaling is disabled due to an I/O error restarts journaling automatically, provided that the error condition that resulted in the disabling of journaling has been resolved and you have sufficient privileges to do so. You can also enable journaling by running ^JRNSTART.
When a successful backup operation restarts journaling, InterSystems IRIS discards any pending journal I/O, since any database updates covered by the pending journal I/O are included in the backup.
Important: Starting journaling requires higher privileges than running a backup.
Journal Freeze on Error Setting is Yes
If you configure InterSystems IRIS to freeze on a journal file I/O error, all journaled global updates are frozen immediately upon such an error. This prevents the loss of journal data at the expense of system availability. Global updates are also frozen if the journal daemon has been unable to complete a journal write for at least 30 seconds.
The journal daemon retries the failed I/O operation and unfreezes global updates after it succeeds. Meanwhile, the freezing of global updates causes other jobs to hang. The typical outcome is that InterSystems IRIS hangs until you resolve the journaling problem, with the system appearing to be down to operational end-users. While InterSystems IRIS is hung you can take corrective measures, such as freeing up disk space, switching the journal to a different disk, or correcting a hardware failure.
The advantage of this option is that once the problem is resolved and InterSystems IRIS resumes normal operation, no journal data has been lost. The disadvantage is that the system is less available or unavailable while the problem is being solved.
InterSystems IRIS posts alerts (severity 3) to the messages.log file periodically while the journal daemon is retrying the failed I/O operation.
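One way to watch for these alerts is to filter the log by severity. The following Python sketch is an illustration under a stated assumption, not a supported tool: it assumes each log line has the form "timestamp (pid) severity message", and the function and type names are hypothetical. Check the line format against your own messages.log before relying on it.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# ASSUMPTION: each line looks like "<timestamp> (<pid>) <severity> <message>".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) \((?P<pid>\d+)\) (?P<severity>\d+) (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    pid: int
    severity: int
    message: str


def alerts(lines: Iterable[str], min_severity: int = 3) -> Iterator[LogEntry]:
    """Yield entries whose severity is at least min_severity."""
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not fit the assumed format
        severity = int(match["severity"])
        if severity >= min_severity:
            yield LogEntry(
                match["timestamp"],
                int(match["pid"]),
                severity,
                match["message"],
            )
```

Because the function accepts any iterable of lines, it can be given an open file object for messages.log or a list of strings.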
Impact of Journal Freeze on Error Setting on Transaction Rollback with TROLLBACK
The Freeze on error setting you choose can also have significant implications for application behavior unrelated to journaling. When an application attempts to roll back an open transaction using the TROLLBACK command (see TROLLBACK) and the attempt fails, it faces the same tradeoff as when a journal I/O error is encountered: data integrity versus availability. Like journaling, TROLLBACK uses the Freeze on error setting to determine the appropriate behavior, as follows:
- When the Freeze on error setting is No (the default), the process initiating the transaction and the TROLLBACK receives an error, the transaction is closed, and the locks retained for the transaction are released. This approach keeps the application available, but compromises data integrity and recoverability.
- When Freeze on error is set to Yes, the initiating process halts and CLNDMN makes repeated attempts to roll back the open transaction. During the CLNDMN retry period, locks retained for the transaction remain intact, and as a result the application might hang. This protects data integrity at the expense of application availability.
If CLNDMN repeatedly tries and fails to roll back an open transaction for a dead job (as reported in the messages log), you can use the Manage^CLNDMN utility to manually close the transaction.
The Freeze on error setting affects local (non-ECP) transaction rollback only.
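The two outcomes of a failed local rollback can be summarized as a small lookup. The following Python sketch only restates the behavior described in this section; the function name, the type, and its fields are hypothetical and do not correspond to any InterSystems API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackFailureOutcome:
    transaction_closed: bool  # is the open transaction closed?
    locks_released: bool      # are the transaction's locks released?
    handling: str             # what happens to the initiating process


def on_local_rollback_failure(freeze_on_error: bool) -> RollbackFailureOutcome:
    """Hypothetical summary of a failed local (non-ECP) TROLLBACK."""
    if freeze_on_error:
        # Yes: integrity is favored; locks stay held while CLNDMN retries.
        return RollbackFailureOutcome(
            transaction_closed=False,
            locks_released=False,
            handling="initiating process halts; CLNDMN retries the rollback",
        )
    # No (the default): availability is favored.
    return RollbackFailureOutcome(
        transaction_closed=True,
        locks_released=True,
        handling="initiating process receives an error",
    )


if __name__ == "__main__":
    for setting in (False, True):
        print(setting, on_local_rollback_failure(setting))
```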