Caché Data Integrity Guide
Write Image Journaling and Recovery

Caché uses write image journaling to maintain the internal integrity of your Caché database. It is the foundation of the database recovery process.

This chapter discusses the following topics:
Write Image Journaling
Caché safeguards database updates by using a two-phase technique, write image journaling, in which updates are first written from memory to a transitional journal, CACHE.WIJ, and then to the database. If the system crashes during the second phase, the updates can be reapplied upon recovery. The following topics are covered in greater detail:
Write Image Journal (WIJ)
The Write daemon is activated at Caché startup and creates the write image journal (WIJ) file. The Write daemon records database updates in the WIJ before writing them to the Caché database.
By default, the WIJ file is named CACHE.WIJ and resides in the system manager directory, usually install-dir/Mgr, where install-dir is the installation directory. To specify a different location for this file, use the Management Portal:
  1. Navigate to the [System Administration] > [Configuration] > [System Configuration] > [Journal Settings] page.
  2. Enter the new location of the WIJ in the Write image journal directory box and click Save. The name must identify an existing directory on the system and may be up to 63 characters long. If you edit this setting for a clustered instance, restart Caché to apply the change; no restart is necessary for a standalone instance.
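The two constraints stated above (an existing directory, at most 63 characters) can be checked before entering a value in the Management Portal. The following Python sketch is only an illustration; the function name check_wij_directory is hypothetical and is not part of Caché, which performs its own validation.

```python
import os

MAX_WIJ_DIR_LENGTH = 63  # limit stated in the text above


def check_wij_directory(path):
    """Return a list of problems with a proposed WIJ directory (empty if none).

    Hypothetical helper: it only mirrors the two documented constraints.
    """
    problems = []
    if len(path) > MAX_WIJ_DIR_LENGTH:
        problems.append(
            f"path is {len(path)} characters; the limit is {MAX_WIJ_DIR_LENGTH}"
        )
    if not os.path.isdir(path):
        problems.append("path does not identify an existing directory")
    return problems
```

An empty result means the path satisfies both documented constraints; it does not guarantee that the instance can write to the directory.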
Two-Phase Write Protocol
Caché maintains application data in databases whose structure enables fast, efficient searches and updates. Generally, when an application updates data, Caché must modify a number of blocks in the database structure to reflect the change.
Because disk writes are sequential, any sudden, unexpected interruption of disk or computer operation can halt a multi-block update after the first block has been written but before the last block has been updated. Without protection, such an incomplete update could leave the database structure inconsistent; the consequences could be as severe as a totally unusable database whose data cannot be retrieved by normal means. The two-phase write protocol prevents this outcome.
The Caché write image journaling technology uses a two-phase process of writing to the database to protect against such events as follows:
When Caché starts, it automatically checks the WIJ and runs a recovery procedure if it detects that an abnormal shutdown occurred. When the procedure completes successfully, the internal integrity of the database is restored. Caché also runs WIJ recovery following a shutdown as a safety precaution to ensure that the databases can be safely backed up.
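The idea behind the two-phase protocol can be illustrated with a toy model in Python. This is a sketch under strong assumptions, not the Caché implementation: the WIJ and the database are plain in-memory dictionaries, the names ToyWIJ, two_phase_write, and recover are hypothetical, and the "active" and "deleted" status values are borrowed from the recovery description in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWIJ:
    """Toy stand-in for the WIJ: block images plus a status flag."""
    status: str = "deleted"  # "active" means a restore is needed
    blocks: dict = field(default_factory=dict)


class CrashDuringPhaseTwo(Exception):
    """Simulates an interruption while blocks are written to the database."""


def two_phase_write(wij, database, updates, crash_after=None):
    # Phase 1: record every modified block in the WIJ, then mark it active.
    wij.blocks = dict(updates)
    wij.status = "active"
    # Phase 2: write the same blocks to the database.
    for written, (block_no, image) in enumerate(updates.items()):
        if crash_after is not None and written == crash_after:
            raise CrashDuringPhaseTwo(f"stopped after {written} block(s)")
        database[block_no] = image
    # Every block reached the database, so the WIJ contents are not needed.
    wij.status = "deleted"


def recover(wij, database):
    """Startup check: reapply WIJ blocks if the last write did not finish."""
    if wij.status != "active":
        return False
    database.update(wij.blocks)
    wij.status = "deleted"
    return True


# Simulate a crash after one of three blocks has reached the database.
db = {1: "a0", 2: "b0", 3: "c0"}
wij = ToyWIJ()
try:
    two_phase_write(wij, db, {1: "a1", 2: "b1", 3: "c1"}, crash_after=1)
except CrashDuringPhaseTwo:
    pass  # db now mixes new and old blocks; the WIJ is still active

restored = recover(wij, db)  # reapplies all three block images
```

Because the complete set of block images was recorded before the first database write, the interrupted update can be completed from the WIJ no matter where the second phase stopped.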
Recovery
WIJ recovery is necessary if a system crash or other major system malfunction occurs. When Caché starts, it automatically checks the WIJ. If it detects that an abnormal shutdown occurred, it runs a recovery procedure. Depending on where the WIJ is in the two-phase write protocol process, recovery does the following:
WIJ Restore
If the WIJ is marked as “active,” the Write daemon completed writing modified disk blocks to the WIJ but had not completed writing the blocks back to their respective databases. This indicates that WIJ restoration is needed. The recovery program, cwdimj, does the following:
Typically, all recovery is performed in a single run of the cwdimj program.
Dataset Recovery
A dataset is a specific database directory on a specific Caché system. The cwdimj program restores all datasets configured in the Caché instance being restarted after an abnormal shutdown.
The cwdimj program can run interactively or non-interactively. The manner in which it runs depends on the platform, as follows:
Note:
When the ccontrol start quietly command is used on UNIX/OpenVMS systems, cwdimj always runs noninteractively.
When the recovery procedure is complete, cwdimj marks the contents of the WIJ as “deleted” and startup continues.
If an error occurs during writing, the WIJ remains active and Caché does not start; recovery is repeated the next time Caché starts unless you override this option (in interactive mode).
Caution:
If you override the option to restore the WIJ, databases become corrupted or lose data.
The following topics are discussed in more detail:
Interactive Dataset Recovery
The recovery procedure allows you to confirm the recovery on a dataset-by-dataset basis. Normally, you specify all datasets. After each dataset prompt, type either:
You can also specify a new location for the dataset if the path to it has been lost, but you can still access the dataset. Once a dataset has been recovered, it is removed from the list of datasets requiring recovery; furthermore, it is not recovered during subsequent runs of the cwdimj program should any be necessary.
Noninteractive Dataset Recovery
When the recovery procedure runs noninteractively, Caché attempts to restore all datasets and mark the WIJ as deleted. On Unix and Windows platforms, Caché first attempts a fast parallel restore of all datasets; in the event of one or more errors during the fast restore, datasets are restored one at a time so that the databases that were fully recovered can be identified. If at least one dataset cannot be restored:
WIJ Block Comparison
Typically, a running Caché instance is actively writing to databases only a small fraction of the time. In most crashes, therefore, the blocks last written to the WIJ were confirmed to have been durably written to the databases before the crash; the WIJ is not marked “active,” and there is no WIJ restore to be performed. When Caché starts up after such a crash, however, the blocks in the most recent WIJ updates are compared to the corresponding blocks in the affected databases as a form of rapid integrity check, to guard against starting the instance in an uncertain state after a crash that was accompanied by a storage subsystem failure. The comparison runs for only a short time to avoid impacting availability, and asynchronous I/O is used to maximize throughput. If all blocks match, or no mismatch is detected within 10 seconds, startup continues normally. If a mismatch is found within this time, the results are as follows:
This situation calls for immediate attention. Use the information that follows to determine the appropriate course of action. When your recovery procedures are complete, you must delete the MISMATCH.WIJ file, either using the STURECOV routine or externally, before Caché startup can continue; the file is persistent and prevents normal startup of the instance.
Run the indicated platform-dependent command (install-dir/bin/csession instancename -B on UNIX®/Linux or install-dir\bin\cache -sinstall-dir\mgr -B on Windows) to perform an emergency login as system administrator (see Connecting to a Caché Instance in the “Using Multiple Instances of Caché” chapter of the Caché System Administration Guide).
You are now in the manager’s namespace and can run the startup recovery routine with the command Do ^STURECOV. The following WIJ mismatch recovery message and menu appear on a UNIX®/Linux system:
The system crashed and some database blocks do not match what was
expected based on the contents of write image journal (the WIJ).
The WIJ blocks have been placed in the MISMATCH.WIJ file.  If any
database files, or the WIJ, were modified or replaced since the crash,
you should delete the MISMATCH.WIJ. Otherwise, MISMATCH.WIJ probably
contains blocks that were lost due to a disk problem.  You can view 
those blocks and apply them if necessary.  When finished, delete the 
MISMATCH.WIJ in order to continue startup.
 
1) List Affected Databases and View Blocks
2) Apply mismatched blocks from WIJ to databases
3) Delete MISMATCH.WIJ
4) Dismount a database
5) Mount a database
6) Database Repair Utility
7) Check Database Integrity
8) Bring up the system in multi-user mode
9) Display instructions on how to shut down the system

--------------------------------------------------------------
H) Display Help
E) Exit this utility
--------------------------------------------------------------
On a Windows system, options 8 and 9 are replaced by 8) Bring down the system prior to a normal startup.
The appropriate actions in the event of a WIJ mismatch differ based on the needs and policies of your enterprise, and are largely the same as your site's existing practices for responding to events that imply data integrity problems. Considerations include tolerance for risk, criticality of the affected databases, uptime requirements, and suspected root cause.
The following represent some considerations and recommendations specific to the WIJ block comparison process:
If you are uncertain about how to proceed when WIJ mismatches are detected, contact the InterSystems Worldwide Response Center (WRC).
Note:
The WIJ comparison does not take place on OpenVMS systems.
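The time-bounded comparison described in this section can be sketched in Python as follows. This is a simplified illustration, not the Caché implementation: compare_wij_blocks and its parameters are hypothetical names, the blocks are read one at a time rather than with asynchronous I/O, and collecting every mismatch found before the deadline is an assumption made for the example.

```python
import time


def compare_wij_blocks(wij_blocks, read_db_block, time_limit=10.0,
                       clock=time.monotonic):
    """Compare WIJ block images with database blocks until done or timed out.

    wij_blocks maps a block identifier to the image recorded in the WIJ;
    read_db_block returns the current database image for an identifier.
    Returns the identifiers that mismatched before the time limit expired.
    """
    deadline = clock() + time_limit
    mismatches = []
    for block_id, wij_image in wij_blocks.items():
        if clock() >= deadline:
            break  # stop checking so that startup is not delayed further
        if read_db_block(block_id) != wij_image:
            mismatches.append(block_id)
    return mismatches


# Hypothetical data: block 6 on disk differs from its image in the WIJ.
wij_images = {("db1", 5): b"x", ("db1", 6): b"y"}
on_disk = {("db1", 5): b"x", ("db1", 6): b"z"}
found = compare_wij_blocks(wij_images, on_disk.get)
```

An empty result corresponds to the case in which startup continues normally, either because all compared blocks matched or because the time limit expired before a mismatch was seen.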
Limitations of Write Image Journaling
While the two-phase write protocol safeguards structural database integrity, it does not prevent data loss. If the system failure occurs prior to a complete write of an update to the WIJ, Caché does not have all the information it needs to perform a complete update to disk and, therefore, that data is lost. However, data that has been written to a journal file is recovered as described in Recovery in this chapter.
In addition, write image journaling cannot eliminate database degradation in the following cases:
If you believe that one of these situations has occurred, contact the InterSystems Worldwide Response Center (WRC).