When mirroring or other mechanisms are used to maintain a copy of data on another system, you may want to check the consistency of that data between the two systems. DataCheck provides this checking and includes provisions to recheck transient discrepancies.
DataCheck Overview
DataCheck provides a mechanism to compare the state of data on two systems — the DataCheck source and the DataCheck destination — to determine whether or not they match. All configuration, operational controls and results of the check are provided on the destination system; the source system is essentially passive.
On the instance of InterSystems IRIS® that is to act as the DataCheck destination, you must create a DataCheck destination configuration. You can create multiple destination configurations on the same instance, which you can configure to check data against multiple source systems (or configure them to check different data against a single source). If you are using DataCheck to check the consistency of a mirror, see DataCheck for Mirror Configurations for more details.
The following subsections describe DataCheck topics in more detail:
DataCheck Queries
The destination system submits work units called DataCheck “queries” to the source system. Each query specifies a database, an initial global reference, a number of nodes, and a target global reference. Both systems calculate an answer by traversing the specified number of global nodes starting with the initial global reference, and hashing the global keys and values. If the answers match, the destination system records the results and resubmits the query with a larger number of nodes and the initial global reference advanced; if they don’t match, the query is resubmitted with a smaller number of nodes until the discrepancy is isolated down to the configured minimum query size.
You can display information about the queries submitted by the destination system using the View Queries option of the View Details submenu of the ^DATACHECK routine, including the globals that remain to be processed (or global ranges if subscript include/exclude ranges are used), and the active queries currently being worked on by DataCheck.
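The grow-and-shrink behavior of queries described above can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the DataCheck implementation: each system is modeled as a Python dictionary with non-negative integer keys standing in for global references, and the names `traverse`, `answer`, `check`, `start_size`, and `max_size` are hypothetical. The choice of SHA-256, the doubling and halving factors, and the upper bound on query size are also assumptions made for illustration.

```python
import bisect
import hashlib


def traverse(store, start, count):
    """Return up to `count` (key, value) pairs with key >= start, in key order."""
    keys = sorted(store)
    first = bisect.bisect_left(keys, start)
    return [(k, store[k]) for k in keys[first:first + count]]


def answer(pairs):
    """Hash the keys and values of the traversed nodes into a single digest."""
    digest = hashlib.sha256()
    for key, value in pairs:
        digest.update(repr((key, value)).encode("utf-8"))
    return digest.hexdigest()


def check(source, dest, min_size=2, start_size=4, max_size=64):
    """Compare two stores and return merged (begin, end, state) ranges.

    Ranges are begin-inclusive and end-exclusive.
    """
    results = []
    start, size = 0, start_size
    while True:
        src = traverse(source, start, size)
        dst = traverse(dest, start, size)
        if not src and not dst:
            break  # both systems have run out of nodes
        if answer(src) == answer(dst):
            state = "Matched"
        elif size > min_size:
            size = max(size // 2, min_size)  # shrink and retry from the same start
            continue
        else:
            state = "Unmatched"  # isolated down to the minimum query size
        end = max(p[-1][0] for p in (src, dst) if p) + 1
        if results and results[-1][2] == state:
            results[-1] = (results[-1][0], end, state)  # merge adjacent ranges
        else:
            results.append((start, end, state))
        start = end
        if state == "Matched":
            size = min(size * 2, max_size)  # grow the next query
    return results


source = {i: "v" for i in range(20)}
dest = dict(source)
dest[13] = "changed"  # one discrepancy
print(check(source, dest))
```

With a minimum query size of 2, the single differing node is reported as part of a two-node unmatched range, which mirrors how the minimum query size bounds the granularity of reported discrepancies.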
DataCheck Jobs
The answer to each query is calculated by DataCheck worker jobs running on both the source system and the destination system. The number of worker jobs is determined by the dynamically tunable performance settings of the destination system; for more information, see “Performance Considerations” in this chapter.
In addition to the worker jobs, there are other jobs on each system. The following additional jobs run on the destination system:
- Manager job: Loads and dispatches queries, compares query answers, and manages the progression through the workflow phases; this job is connected to the source system Manager job.
- Receiver job: Receives answers from the source system.
The following additional jobs run on the source system:
- Manager job: Receives requests from the destination system Manager job and sends them to worker jobs.
- Sender job: Receives query answers from the worker jobs and sends them to the destination system Receiver job; this job is connected to the destination system Receiver job.
DataCheck Results
The results of the check list global subscript ranges, each with one of the following states:
- Unknown: DataCheck has not yet checked this range.
- Matched: DataCheck has found that this range matches.
- Unmatched: DataCheck has found a discrepancy in this range.
- Collation Discrepancy: The global was found to have differing collation between the source system and the destination system.
- Excluded: This range is excluded from checking.
You can view the results from the current check and the final results from the last check on the destination system; for more information, see the SYS.DataCheck.RangeList class. For all subscript ranges within DataCheck, the beginning of a range is inclusive and the end exclusive. See Specifying Globals and Subscript Ranges to Check in this chapter for information about subscript ranges.
The following provides a sample check result:
c:\InterSystems\iris\mgr\mirror2 ^XYZ Unmatched
^XYZ --Matched--> ^XYZ(3001,4)
^XYZ(3001,4) --Unmatched--> ^XYZ(5000)
^XYZ(5000) --Matched--> [end]
This result indicates that the nodes in the range starting at ^XYZ up to but not including ^XYZ(3001,4) are matched, while there is at least one discrepancy in the range of nodes from ^XYZ(3001,4) up to but not including ^XYZ(5000). The nodes in the range from ^XYZ(5000) to the end are matched.
The minimum number and frequency of discrepancies in the unmatched range depends on the minimum query size (see Performance Considerations). For example, if the minimum query size is set to the default of 32 in this case, there is at least one discrepancy every 32 nodes from ^XYZ(3001,4) until ^XYZ(5000); if there were a sequence within this range of more than 32 nodes without a discrepancy, it would appear in the results as a separate matched range.
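If you copy results such as the sample above into a script for review, the range lines can be split into their parts with a regular expression. The following is a minimal sketch that assumes the text layout shown in the sample (a header line followed by `begin --State--> end` lines); the `parse_ranges` function and the `LINE` pattern are hypothetical helpers, not part of DataCheck, and the layout of other result states is assumed to follow the same form.

```python
import re

# Matches lines of the form: <begin> --<State>--> <end>
LINE = re.compile(r"^(?P<begin>\S.*?) --(?P<state>[A-Za-z ]+)--> (?P<end>.+)$")


def parse_ranges(text):
    """Return (begin, state, end) tuples; lines that are not range lines are skipped."""
    ranges = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            ranges.append((match["begin"], match["state"], match["end"]))
    return ranges


sample = r"""c:\InterSystems\iris\mgr\mirror2 ^XYZ Unmatched
^XYZ --Matched--> ^XYZ(3001,4)
^XYZ(3001,4) --Unmatched--> ^XYZ(5000)
^XYZ(5000) --Matched--> [end]"""

# Each range is begin-inclusive and end-exclusive.
unmatched = [r for r in parse_ranges(sample) if r[1] == "Unmatched"]
print(unmatched)
```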
DataCheck Workflow
During the check, data may be changing and transient discrepancies may be recorded. Rechecking may be required to eliminate these transient discrepancies. The destination system has a workflow that defines a strategy for how to check the globals.
A typical workflow begins with the “Check” phase as phase #1. (Phase #1 should always be defined as the logical starting point of the check cycle, since it is used by the workflow timeout and the Start dialog of the ^DATACHECK routine to indicate a "reset" from the beginning, as described in the next section.) At the beginning of this phase, the current set of results is saved as the last completed results and a new set of active results is established. DataCheck makes an initial pass through all globals specified for inclusion in the check.
Following the Check phase, the “Recheck Discrepancies” phase is typically specified with the desired number of iterations. Each iteration rechecks all unmatched ranges in an effort to eliminate transient discrepancies.
As each phase of the workflow is completed, DataCheck moves to the next phase. The workflow is implicitly restarted from phase #1 after the last phase is complete. The “Stop” phase shuts down all DataCheck jobs and the “Idle” phase causes DataCheck to wait for you to manually specify the next phase.
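The phase progression described above can be modeled as a simple loop. This is an illustrative sketch only: the `run_workflow` function, the `(name, iterations)` tuples, and the `max_steps` guard are hypothetical and do not correspond to DataCheck classes or settings. The phase names are taken from the Start dialog example in this chapter.

```python
def run_workflow(phases, max_steps=100):
    """Return the sequence of phase passes for a list of (name, iterations) tuples.

    Phases are numbered from 1. After the last phase the workflow restarts
    at phase #1, unless a Stop or Idle phase ends the run first.
    """
    trace = []
    current = 1
    for _ in range(max_steps):  # guard, since a workflow without Stop/Idle loops forever
        name, iterations = phases[current - 1]
        if name in ("Stop", "Idle"):
            trace.append(name)  # Stop shuts down; Idle waits for a manual phase change
            break
        trace.extend([name] * iterations)
        current = 1 if current == len(phases) else current + 1
    return trace


workflow = [("Check", 1), ("RecheckDiscrepancies", 3), ("Stop", 1)]
print(run_workflow(workflow))
```

Without a Stop or Idle phase, the same function shows the implicit restart: the trace cycles back to the Check phase after the last recheck iteration.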
Starting/Stopping/Reconnecting DataCheck
You can stop and start DataCheck at any time; when you start DataCheck, it resumes the workflow from where it left off. In addition, you can specify a different workflow phase to follow the current phase and/or abort the current phase at any time.
If, during a check, DataCheck is stopped, becomes disconnected, or pauses due to mirroring, the routine reports why the system was stopped, what phase it stopped in, and what it will do when it starts (for example, resume processing, move to the next phase, change phase due to user request or restart at phase #1 due to workflow timeout). If, upon starting, DataCheck is going to resume processing the current phase or make a transition to any phase other than phase #1, you are offered the option of restarting at phase #1, as in the following example:
Option? 4
Configuration Name: test
State: Stopped due to Stop Requested
Current Phase: 1 - Check
Workflow Phases:
1 - Check
2 - RecheckDiscrepancies, Iterations=10
3 - Stop
(restart)
Workflow Timeout: 432000
New Phase Requested: 2
Abort Current Phase Requested
DataCheck is set to abort the current phase and transition to phase #2.
You may enter RESTART to restart at phase #1
Start Datacheck configuration 'test'? (yes/no/restart)
In cases in which DataCheck becomes disconnected and reconnects only after an extended period, it may be more desirable to restart from phase #1 of the workflow instead. For example, if the systems were disconnected for several weeks in the middle of a check and then the check is resumed, the results are of questionable value, having been collected in part from several weeks prior and in part from the present time. The workflow has a Timeout property that specifies the time, in seconds, within which DataCheck may resume a partially completed workflow phase. If the timeout is exceeded, DataCheck restarts from phase #1 the next time it reaches the running state. The default value is five days (432000 seconds), based on the assumption that a large amount of data is checked by this DataCheck configuration and the check may take hours or days to complete normally; a smaller value may be preferable for configurations that complete a check in a shorter amount of time. A value of zero means no timeout.
Note:
As noted, you should define phase #1 to be the logical starting point of the check cycle, since it is used by the workflow timeout and the Start dialog of the ^DATACHECK routine to indicate a "reset" from the beginning, as shown in the previous example.
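The workflow timeout rule can be expressed as a small decision function. This is a sketch under assumptions: the function name `phase_on_start` and its parameters are hypothetical, and the point from which DataCheck measures elapsed time is not specified here, so the caller supplies the elapsed seconds directly.

```python
DEFAULT_TIMEOUT = 432000  # five days, in seconds


def phase_on_start(current_phase, elapsed_seconds, timeout=DEFAULT_TIMEOUT):
    """Return the phase number to run when DataCheck reaches the running state.

    A timeout of zero means no timeout, so the current phase is always resumed.
    """
    if timeout != 0 and elapsed_seconds > timeout:
        return 1  # timeout exceeded: restart the workflow at phase #1
    return current_phase


three_weeks = 3 * 7 * 24 * 60 * 60
print(phase_on_start(2, three_weeks))             # restarts at phase #1
print(phase_on_start(2, 3600))                    # resumes phase #2
print(phase_on_start(2, three_weeks, timeout=0))  # no timeout, resumes phase #2
```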
DataCheck for Mirror Configurations
Upon creating a DataCheck destination configuration, if the system is a member of a mirror (see the “Mirroring” chapter of the High Availability Guide), you are given the option to configure DataCheck to check the mirrored data. If you choose this option, you need only select the mirror member to act as the DataCheck source, and the rest of the configuration is automatic.
When a check begins, all mirrored databases are included in the check; you do not have to map databases individually. You can specify which globals are checked or exclude entire databases, as described in Specifying Globals and Subscript Ranges to Check. A mirror-based DataCheck configuration cannot be used to check nonmirrored databases, but a separate nonmirrored DataCheck configuration can be created for such purposes.
This section discusses the following topics:
Planning DataCheck within the Mirror
Each DataCheck destination configuration connects to one source mirror member. Although the source member should not be changed, additional DataCheck configurations can be created to check against more than one source mirror member (or to check different sets of data from the same source).
This section includes the following member-specific subsections:
Checking Data Between Failover Members
When checking between failover mirror members, the check is typically run with the backup failover member configured as the DataCheck destination for the following reasons:
- The DataCheck destination uses more resources than the source in order to maintain the results of the check and other state information (which is itself journaled).
- If the backup failover member is the DataCheck destination, the results are available for review on the backup if the primary failover member goes down.
Note:
In most configurations, it is assumed that the failover has already occurred and any review of the results probably happens after the failover decision point.
Whenever DataCheck loses its connection to the source, it retries the connection, waiting indefinitely for the source machine to become available again. If a mirror-based DataCheck is started on a destination that is not the primary failover member, and that member later becomes the primary, DataCheck stops rather than automatically trying to reconnect. This prevents DataCheck from unintentionally running on the primary. For more information about reconnecting, see Starting/Stopping/Reconnecting DataCheck in this chapter.
Checking Data on Async Members
When mirror-based DataCheck is checking between a failover member and an async member, the async member is typically the destination. The reasons are the same as those given for failover members (see Checking Data Between Failover Members), but the primary one is that the results of the check should be stored on the async member during disaster recovery.
When there are two failover members, it is often desirable to create one DataCheck destination configuration on an async member for each of the two failover members as sources. The ^DATACHECK routine offers to create both for you, and offers settings for how they behave with respect to which of the two is the primary failover member.
Each DataCheck configuration has a setting to govern how it behaves based on the source failover member’s status as the primary member. The settings are:
- No restriction: Checking both without restriction (the default) is desirable because it uses the async member as an agent to check both failover members without needing to run DataCheck between the failover members.
- Check primary only (pause until DataCheck source is primary): Checking against the primary only is desirable because the primary is the true source of the data for this async member.
- Do not check primary (pause when DataCheck source is primary): Checking against the backup is desirable because it does not consume resources on the production primary system.
For DataCheck configurations that are run manually (on demand) by a system administrator, these settings may not be of particular importance; they are more important for DataCheck configurations that are run continuously (or nearly so).
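The effect of the three settings on whether a configuration pauses can be summarized in a few lines. This is a minimal sketch: the `SourceRestriction` enum and the `should_pause` function are hypothetical names used for illustration and are not part of the DataCheck API.

```python
from enum import Enum


class SourceRestriction(Enum):
    NO_RESTRICTION = "No restriction"
    CHECK_PRIMARY_ONLY = "Check primary only"
    DO_NOT_CHECK_PRIMARY = "Do not check primary"


def should_pause(setting, source_is_primary):
    """Return True if the configuration should pause given the source's current role."""
    if setting is SourceRestriction.CHECK_PRIMARY_ONLY:
        return not source_is_primary  # pause until the source is primary
    if setting is SourceRestriction.DO_NOT_CHECK_PRIMARY:
        return source_is_primary      # pause while the source is primary
    return False                      # no restriction: never pause for this reason


for setting in SourceRestriction:
    print(setting.value, should_pause(setting, True), should_pause(setting, False))
```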
Any member may check another member without any particular relation. For example, if an async member is being used to check both failover members, it could also be used as the source of a check for other async members, thus avoiding the need to have any other async members check against the failover members.
Selecting Globals to Check
All mirrored databases that exist when DataCheck is run are checked automatically; for information about controlling which globals and databases are checked, see Specifying Globals and Subscript Ranges to Check in this chapter.
^DATACHECK Routine
You can use the ^DATACHECK routine (in the %SYS namespace) to configure and manage the data checking. To obtain Help at any prompt, enter ?.
To start the ^DATACHECK routine, do the following:
- Enter the following commands in the Terminal:
  set $namespace = "%SYS"
  %SYS>do ^DATACHECK
- The main menu is displayed. Enter the number of your choice or press Enter to exit the routine:
1) Create New Configuration
2) Edit Configuration
3) View Details
4) Start
5) Stop
6) Delete Configuration
7) Incoming Connections to this System as a DataCheck Source
Option?
Note:
For options 2 through 6, if you created multiple destination systems, a list is displayed so that you can select the destination system on which to perform the action.
The main menu lets you select DataCheck tasks to perform as described in the following table:
Option | Description
1) Create New Configuration | Prompts for the name of a new DataCheck destination system configuration via the Create New Configuration prompt.
2) Edit Configuration | Displays the Edit Configuration submenu.
3) View Details | Displays the View Details submenu.
4) Start | Starts/restarts the destination system. If you are restarting, it resumes from where you stopped it.
5) Stop | Stops the destination system. If you restart the destination system after stopping it, it resumes from where you stopped it.
6) Delete Configuration | Deletes the specified destination system configuration.
7) Incoming Connections to this System as a DataCheck Source | Displays the Incoming Connections to this System as a DataCheck Source submenu. Note: This option must be selected on a source system.
Create New Configuration
This submenu lets you configure the destination system. When you select this option, the following prompt is displayed:
Configuration Name:
If you are creating a DataCheck configuration on a system that is not a mirror member, the Edit Settings submenu is displayed, and you complete the configuration manually as described in Editing DataCheck Configurations on Non-mirror-based Systems.
If you are creating a DataCheck configuration on a system that is a mirror member, you are prompted for additional information that is dependent upon whether or not you want to base the data checking on mirroring. Choosing to configure DataCheck that is not based on mirroring displays the Edit Settings submenu, which you use to complete the configuration manually as described in Editing DataCheck Configurations on Non-mirror-based Systems. However, choosing to configure DataCheck based on mirroring restricts data checking to mirrored databases, and subsequent prompts are dependent on whether the destination system is a failover or async mirror member; for more information, see DataCheck for Mirror Configurations in this chapter.
Edit Configuration
This submenu lets you modify the destination system configurations. The options differ depending on whether you are editing a mirror-based or a non-mirror-based configuration. For more information, see the following subsections:
Editing DataCheck Configurations on Non-mirror-based Systems
On a non-mirror-based system, when you select this option, the following prompts are displayed:
Configuration Name: dc_test
1) Import Settings from a Shadow (static)
2) Connection Settings (static)
3) Database Mappings (static)
4) Globals to Check (dynamic)
5) Performance Settings (dynamic)
6) Manage Workflow (dynamic)
Option?
Note:
In edit mode, if you created multiple destination systems, a list is displayed so that you can select a destination system to edit. In addition, before you edit the settings for options 1 through 3, you must stop the system.
Enter the number of your choice or press ^ to return to the previous menu. The options in this submenu let you configure the destination system as described in the following table:
Option | Description
1) Import Settings from a Shadow | Deprecated; do not use.
2) Connection Settings | Information to connect to the source system.
3) Database Mappings | Lets you add, delete, or list database mappings on the source and destination systems.
4) Globals to Check | Globals to check or exclude from checking. For more information, see Specifying Globals and Subscript Ranges to Check in this chapter.
5) Performance Settings | Adjusts system resources (throttle) used and/or granularity with which DataCheck isolates discrepancies (minimum query size). For more information, see Performance Considerations in this chapter.
6) Manage Workflow | Manages the order of workflow phases. For more information, see DataCheck Workflow in this chapter.
Editing Mirror-based DataCheck Configurations
On a mirror-based system, the following submenu is displayed:
Configuration Name: MIRRORSYS2_MIRRORX201112A_1
1) Globals to Check
2) Performance Settings
3) Manage Workflow
4) Change Mirror Settings (Advanced)
Option?
Enter the number of your choice or press ^ to return to the previous menu. The options in this submenu let you configure the destination system as described in the following table:
Option | Description
1) Globals to Check | Globals to check or exclude from checking. For more information, see Specifying Globals and Subscript Ranges to Check in this chapter.
2) Performance Settings | Adjusts system resources (throttle) used and/or granularity with which DataCheck isolates discrepancies (minimum query size). For more information, see Performance Considerations in this chapter.
3) Manage Workflow | Manages the order of workflow phases. For more information, see DataCheck Workflow in this chapter.
4) Change Mirror Settings (Advanced) | See Planning DataCheck within the Mirror in the DataCheck for Mirror Configurations section of this chapter.
View Details
This submenu lets you monitor the status of the destination system, as well as view detailed information about the queries that are running and the results of data checking:
System Name: dc_test
1) View Status
2) View Results
3) View Queries
4) View Log
Option?
Enter the number of your choice or press ^ to return to the previous menu. The options in this submenu let you view information about the destination system as described in the following table:
Option | Description
1) View Status | Displays information about the selected destination system, including performance metrics for the DataCheck worker jobs, the source and state, current phase, workflow timeout, new phases requested, percentage of queries completed in the current phase, and the number of discrepancies recorded in this phase.
2) View Results | Displays the results for the selected destination system. For more information, see DataCheck Results in this chapter.
3) View Queries | Displays information about the queries submitted by the selected destination system (see DataCheck Queries). This includes the globals that remain to be processed (or global ranges if subscript include/exclude ranges are used), and indicates the active queries currently being worked on by DataCheck. A summary count is displayed at the end of the list.
4) View Log | Displays the selected destination system log file.
Note:
When ^DATACHECK is run against the two copies of a mirrored database on two mirror member instances, and a whole global in that database is being rapidly set and killed, the View Status option can display results that appear inconsistent with the View Results option. For example, View Status may report unmatched answers while View Results does not list the globals that caused them (because further passes resolved the discrepancies). In addition, displayed answer counts can be larger than the actual number of globals within the instance (as displayed in the Management Portal, and as actually reported in the results).
When View Status shows Answers Rcvd with a non-zero unmatched value but discrepancies with a zero value, this indicates transient globals, not a data issue.
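The interpretation rule in this note can be written as a small helper for scripts that record the two counts by hand. This is a sketch only: the function `interpret_status` and its parameter names are hypothetical, and it assumes you have already read the unmatched answer count and the discrepancy count from the View Status display.

```python
def interpret_status(unmatched_answers, discrepancies):
    """Classify a pair of counts taken from the View Status display."""
    if discrepancies > 0:
        return "discrepancies recorded: review View Results"
    if unmatched_answers > 0:
        return "transient: unmatched answers were resolved by later passes"
    return "no unmatched answers"


print(interpret_status(unmatched_answers=12, discrepancies=0))
```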
Incoming Connections to this System as a DataCheck Source
This submenu lets you view information about the source system:
1) List Source Systems
2) View Log
Option?
Enter the number of your choice or press ^ to return to the previous menu. The options in this submenu let you view information about the source system as described in the following table:
Option | Description
1) List Source Systems | Displays information about the DataCheck source system.
2) View Log | Displays the source system log file.
Special Considerations for Data Checking
Review the following special considerations when using DataCheck:
Security Considerations
The destination system stores subscript ranges for globals that it has checked and is checking (results and queries). (See Specifying Globals and Subscript Ranges to Check in this chapter.) This subscript data is stored in the ^SYS.DataCheck* globals in the %SYS namespace (in the IRISSYS database by default). Global values are not stored; only subscripts are stored. These global subscripts from other databases that are stored in the %SYS namespace may contain sensitive information that may not otherwise be visible to some users, depending on the security configuration. Therefore, some special care is needed in secured deployments.
Use of the ^DATACHECK routine, including the ability to configure, start, and stop, requires both %Admin_Operate:Use privilege and Read/Write privilege (Write for configuring a check, Read for all other tasks) on the database containing the ^SYS.DataCheck* globals which, by default, is IRISSYS. The configuration and results data stored in the ^SYS.DataCheck* globals can be viewed and manipulated outside of the routine by anyone with sufficient database privileges.
For any secure deployment in which %DB_IRISSYS:Read privilege is given to users that should not have access to DataCheck data, you can add a global mapping to the %SYS namespace to map ^SYS.DataCheck* globals to a separate database other than IRISSYS. This database can be assigned a new resource name; read permission for the resource can then be restricted to those roles authorized to use DataCheck.
The ability for another destination system to connect to this system as a source is governed by this system's %Service_DataCheck service. This service is disabled by default on new installations and can be configured with a list of allowed IP addresses. For more information, see Enabling the DataCheck Service in this chapter.
For encryption of the communication between the two systems, the destination system can be configured to use TLS to connect to the source. See Configuring the InterSystems IRIS Superserver to Use TLS for details.