
Using Health Monitor

Health Monitor monitors a running InterSystems IRIS instance by sampling the values of a broad set of key metrics during specific periods and comparing them to configured parameters for the metric and established normal values for those periods; if sampled values are too high, Health Monitor generates an alert (notification of severity 2) or warning (severity 1). For example, if CPU usage values sampled by Health Monitor at 10:15 AM on a Monday are too high based on the configured maximum value for CPU usage or normal CPU usage samples taken during the Monday 9:00 AM to 11:30 AM period, Health Monitor generates a notification.

Health Monitor is part of the System Monitor tools.

Health Monitor Overview

Health Monitor uses a fixed set of rules to evaluate sampled values and identify those that are abnormally high. This design is based on the approach to monitoring manufacturing processes described in the “Process or Product Monitoring and Control” section of the NIST/SEMATECH e-Handbook of Statistical Methods, with deviation from normal values determined using rules based on the WECO statistical probability rules (Western Electric Rules), both adapted specifically for InterSystems IRIS monitoring purposes.

Health Monitor alerts (severity 2) and warnings (severity 1) are written to the messages log (install-dir\mgr\messages.log). See Tracking System Monitor Notifications for information about ways to make sure you are aware of these notifications.

Health Monitor status messages (severity 0) are written to the System Monitor log (install-dir\mgr\SystemMonitor.log).

Note:

Unlike System Monitor and Application Monitor, Health Monitor runs only in the %SYS namespace.

The following subsections describe how Health Monitor works and contain information about configuring and extending it in various ways:

Health Monitor Process Description

By default, Health Monitor does not start automatically when the instance starts; for this to happen, you must enable Health Monitor within System Monitor using the ^%SYSMONMGR utility. (You can specify an interval to wait after InterSystems IRIS starts before starting Health Monitor when it is enabled, allowing the instance to reach normal operating conditions before sampling begins.) You can always use the utility to see the current status of Health Monitor. For more information, see Using ^%SYSMONMGR to Manage Health Monitor.

The basic elements of the Health Monitor process are described in the following:

  • Health Monitor monitors a number of system sensors, which are represented as sensor objects. Every sensor object has a base (minimum) value for sensor samples, and optionally includes two notification threshold values (one for alerts, and the other for warnings) which can be set as absolute values or multipliers. These values determine when Health Monitor sends notifications.

    Sensors and Sensor Objects lists all the sensor objects.

  • For the duration of a predefined period, each sensor is sampled every 30 seconds; samples below the base value are discarded. By default there are 63 weekly periods (nine per day), but you can configure your own weekly, monthly, quarterly, or yearly periods. Periods lists the default periods.

  • For a given sensor, unless the notification thresholds are set as absolute values, Health Monitor evaluates the sensor readings based on a chart. If the necessary chart for the current period does not exist, Health Monitor places the sensor in analysis mode to generate the chart.

    You can edit or create a chart to calibrate how Health Monitor evaluates sensor readings. For more information, see Charts.

  • If a sensor is not in analysis mode, it is in monitoring mode. In monitoring mode, sensor readings are evaluated by the appropriate subscriber class. To ensure that notifications are not triggered by transient abnormal samples, every six sample values are averaged together to generate one reading every three minutes, and it is these readings that are evaluated.

  • When a sequence of readings meets the criteria for a notification (as described in Notification Rules), the subscriber class generates an alert or a warning by passing a notification containing text and a severity code to the system notifier, SYS.Monitor.SystemNotify.

    Note:

    Because no chart is required to evaluate readings from sensors whose sensor objects have maximum and warning values specified, evaluation of these sensor readings and posting of any resulting notifications is handled by the SYS.Monitor.SystemSubscriber subscriber class, rather than the SYS.Monitor.Health.Control subscriber class (see Default System Monitor Components). As a result, notifications for these sensors are generated even when Health Monitor is not enabled, as long as System Monitor is running.

    If you want to generate notifications using absolute values for some sensors but using multipliers for others—for example, using absolute values for DBLatency sensors for some databases but multipliers for others—you can do so by setting multipliers in the sensor object and manually creating charts for those for which you want to use absolute values; see Editing a Chart for more information.
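The sampling and averaging steps described above can be sketched in a few lines of Python. This is an illustration only, not Health Monitor's implementation: the function name readings_from_samples is hypothetical, and the sketch assumes that samples discarded for falling below the base value do not count toward the six samples that make up a reading.

```python
from statistics import mean

# Six 30-second samples are averaged into one reading (one every three minutes).
SAMPLES_PER_READING = 6


def readings_from_samples(samples, base):
    """Average each full group of six valid samples into one reading.

    Assumption (not confirmed by the documentation): samples below the base
    value are dropped before grouping, so they do not count toward the six.
    """
    valid = [s for s in samples if s >= base]
    full_groups = len(valid) // SAMPLES_PER_READING
    return [
        mean(valid[i * SAMPLES_PER_READING:(i + 1) * SAMPLES_PER_READING])
        for i in range(full_groups)
    ]
```

For example, with a base of 50, thirteen CPUUsage samples of which one is below the base yield two readings; a single transient spike within a group of six is smoothed by the average rather than evaluated on its own.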

Sensors and Sensor Objects

A Health Monitor sensor object represents one of the sensors in SYS.Monitor.SystemSensors. Each sensor object must provide a base value, and can optionally provide a maximum (alert) threshold and a warning threshold (either as absolute values or multipliers); see Notification Rules for information about how these values are used in evaluating sensor readings. The Health Monitor sensor objects are shown with their default parameters in the following table.

Some sensors represent an overall metric for the InterSystems IRIS instance. These are the sensors which, in the following table, have no value listed in the Sensor Item column. For example, the LicensePercentUsed sensor samples the percentage of the instance’s authorized license units that are currently in use, while the JournalGrowthRate sensor samples the amount of data (in KB per minute) written to the instance’s journal files.

Other sensors collect information about a specific sensor item (either a CSP server, a database, or a mirror). For example, DBReads sensors sample the number of reads per minute from each mounted database. These sensors are specified as <sensor_object> <sensor_item>; for example, the DBLatency install-dir\IRIS\mgr\user sensor samples the time (in milliseconds) required to complete a random read on the USER database.

Sensor objects can be listed and edited (but not deleted) using the ^%SYSMONMGR utility (as described in Configure Health Monitor Classes). Editing a sensor object allows you to modify one or all of its values. You can enter a base value only; a base, maximum (alert), and warning value; or a base value, maximum (alert) multiplier, and warning multiplier.

Health Monitor Sensor Objects
Sensor Object Sensor Item Description Base Max Val. Max Mult. Warn Val. Warn Mult.
CPUUsage   System CPU usage (percent). 50 85 75
CSPSessions IP_address:port Number of active web sessions on the listed Web Gateway server. 100 2 1.6
CSPActivity IP_address:port Requests per minute to the listed Web Gateway server. 100 2 1.6
CSPActualConnections IP_address:port Number of connections created on the listed Web Gateway server. 100 2 1.6
CSPInUseConnections IP_address:port Number of currently active connections to the listed Web Gateway server. 100 2 1.6
CSPPrivateConnections IP_address:port Number of private connections to the listed Web Gateway server. 100 2 1.6
CSPUrlLatency IP_address:port Time (milliseconds) required to obtain a response from IP_address:port/csp/sys/UtilHome.csp. 1000 5000 3000
CSPGatewayLatency IP_address:port Time (milliseconds) required to obtain a response from the listed Web Gateway server when fetching the metrics represented by the CSP sensor objects. 1000 2000 1000
DBLatency database_directory Milliseconds to complete a random read from the listed mounted database. 1000 3000 1000
DBReads database_directory Reads per minute from the listed mounted database. 1024 2 1.6
DBWrites database_directory Writes per minute to the listed mounted database. 1024 2 1.6
DiskPercentFull database_directory Disk percentage used for the listed mounted database. 50 99 95
ECPAppServerKBPerMinute   KB per minute sent to the ECP data server. 1024 2 1.6
ECPConnections   Number of active ECP connections. 100 2 1.6
ECPDataServerKBPerMinute   KB per minute received as ECP data server. 1024 2 1.6
ECPLatency   Network latency (milliseconds) between the ECP data server and this ECP application server. 1000 3000 3000
ECPTransOpenCount   Number of open ECP transactions 100 2 1.6
ECPTransOpenSecsMax   Duration (seconds) of longest currently open ECP transaction 60 2 1.6
GlobalRefsPerMin   Global references per minute. 1024 2 1.6
GlobalSetKillPerMin   Global sets/kills per minute. 1024 2 1.6
JournalEntriesPerMin   Number of journal entries written per minute. 1024 2 1.6
JournalGrowthRate   Number of KB per minute written to journal files. 1024 2 1.6
LicensePercentUsed   Percentage of authorized license units currently in use. 50 1.5
LicenseUsedRate   License acquisitions per minute. 20 1.5
LockTablePercentFull   Percentage of the lock table in use. 50 99 85
LogicalBlockRequestsPerMin   Number of logical block requests per minute. 1024 2 1.6
MirrorDatabaseLatencyBytes mirror_name On the backup failover member of a mirror, number of bytes of journal data received from the primary but not yet applied to mirrored databases on the backup (measure of how far behind the backup’s databases are). 2*10^7 2 1.6
MirrorDatabaseLatencyFiles mirror_name On the backup failover member of a mirror, number of journal files received from the primary but not yet fully applied to mirrored databases on the backup (measure of how far behind the backup’s databases are). 3 2 1.6
MirrorDatabaseLatencyTime mirror_name On the backup failover member of a mirror, time (in milliseconds) between when the last journal file was received from the primary and when it was fully applied to the mirrored databases on the backup (measure of how far behind the backup’s databases are). 1000 4000 3000
MirrorJournalLatencyBytes mirror_name On the backup failover member of a mirror, number of bytes of journal data received from the primary but not yet written to the journal directory on the backup (measure of how far behind the backup is). 2*10^7 2 1.6
MirrorJournalLatencyFiles mirror_name On the backup failover member of a mirror, number of journal files received from the primary but not yet written to the journal directory on the backup (measure of how far behind the backup is). 3 2 1.6
MirrorJournalLatencyTime mirror_name On the backup failover member of a mirror, time (in milliseconds) between when the last journal file was received from the primary and when it was fully written to the journal directory on the backup (measure of how far behind the backup is). 1000 4000 3000
PhysicalBlockReadsPerMin   Number of physical block reads per minute. 1024 2 1.6
PhysicalBlockWritesPerMin   Number of physical block writes per minute. 1024 2 1.6
ProcessCount   Number of active processes for the InterSystems IRIS instance. 100 2 1.6
RoutineCommandsPerMin   Number of routine commands per minute. 1024 2 1.6
RoutineLoadsPerMin   Number of routine loads per minute. 1024 2 1.6
RoutineRefsPerMin   Number of routine references per minute. 1024 2 1.6
SMHPercentFull   Percentage of the shared memory heap (generic memory heap) in use. 50 98 85
TransOpenCount   Number of open local transactions (local and remote). 100 2 1.6
TransOpenSecondsMax   Duration (seconds) of longest currently open local transaction. 60 2 1.6
WDBuffers   Average number of database buffers updated per write daemon cycle. 1024 2 1.6
WDCycleTime   Average number of seconds required to complete a write daemon cycle. 60 2 1.6
WDWIJTime   Average number of seconds spent updating the write image journal (WIJ) per cycle. 60 2 1.6
WDWriteSize   Average number of KB written per write daemon cycle. 1024 2 1.6
Note:

Some sensors are not sampled for all InterSystems IRIS instances. For example, the ECP... sensors are sampled only on ECP data and application servers.

When you are monitoring a mirror member (see Mirroring), the following special conditions apply to Health Monitor:

  • No sensors are sampled while the mirror is restarting (for example, just after the backup failover member has taken over as primary) or if the member’s status in the mirror is indeterminate.

  • If a sensor is in analysis mode for a period and the member’s status in the mirror changes during the period, no chart is created and the sensor remains in analysis mode.

  • Only the MirrorDatabaseLatency* and MirrorJournalLatency* sensors are sampled on the backup failover mirror member.

  • All sensors except the MirrorDatabaseLatency* and MirrorJournalLatency* sensors are sampled on the primary failover mirror member.

Periods

By default there are 63 recurring weekly periods during which sensors are sampled. Each of these periods represents one of the following specified intervals during a particular day of the week:

Default Health Monitor Periods
00:15 a.m. – 02:45 a.m. 03:00 a.m. – 06:00 a.m. 06:15 a.m. – 08:45 a.m.
09:00 a.m. – 11:30 a.m. 11:45 a.m. – 01:15 p.m. 01:30 p.m. – 04:00 p.m.
04:15 p.m. – 06:00 p.m. 06:15 p.m. – 08:45 p.m. 09:00 p.m. – 11:59 p.m.

You can list, add and delete periods using the Configure Periods option in the ^%SYSMONMGR utility (see Configure Health Monitor Classes). You can add monthly, quarterly or yearly periods as well as weekly periods.
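To make the default schedule concrete, the following Python sketch maps a timestamp to the default period that contains it. It is an illustration under stated assumptions, not part of ^%SYSMONMGR: the names DEFAULT_INTERVALS and default_period are hypothetical, and the sketch assumes that both endpoints of each interval are inclusive and that times between intervals belong to no period.

```python
from datetime import datetime, time

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# The nine default daily intervals listed above, as (start, end) pairs.
DEFAULT_INTERVALS = [
    (time(0, 15), time(2, 45)),
    (time(3, 0), time(6, 0)),
    (time(6, 15), time(8, 45)),
    (time(9, 0), time(11, 30)),
    (time(11, 45), time(13, 15)),
    (time(13, 30), time(16, 0)),
    (time(16, 15), time(18, 0)),
    (time(18, 15), time(20, 45)),
    (time(21, 0), time(23, 59)),
]


def default_period(moment):
    """Return (weekday, start, end) for the default period containing
    `moment`, or None if it falls in a gap between periods.

    Assumption: interval endpoints are inclusive, compared at minute precision.
    """
    t = moment.time().replace(second=0, microsecond=0)
    for start, end in DEFAULT_INTERVALS:
        if start <= t <= end:
            return (DAYS[moment.weekday()], start, end)
    return None
```

Nine intervals on each of seven days give the 63 default weekly periods. A sample taken at 10:15 a.m. on a Monday falls in the Monday 09:00 a.m. to 11:30 a.m. period.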

Note:

Quarterly periods are listed in three-month increments beginning with the month specified as the start month; for example, if you specify 5 (May) as the starting month, the quarterly cycle repeats in August (8), November (11) and February (2).

Descriptions are optional for user-defined periods.
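The quarterly cycle described in this note is simple modular arithmetic on month numbers. A minimal Python sketch (the function name quarterly_months is hypothetical and used only for illustration):

```python
def quarterly_months(start_month):
    """Return the four months (1-12) of a quarterly cycle that begins
    at start_month, in three-month increments."""
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    # Shift to 0-based months, step by three, wrap at twelve, shift back.
    return [(start_month - 1 + 3 * k) % 12 + 1 for k in range(4)]
```

Calling quarterly_months(5) gives May, August, November, and February, matching the example in the note.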

Charts

If the notification threshold values for a sensor object are given as multipliers (or not specified), Health Monitor requires a chart to evaluate those sensor readings. Health Monitor generates the necessary charts by calculating the mean, standard deviation, and maximum value from sample sensor readings. This section describes how Health Monitor generates charts in analysis mode, and how to edit or create custom charts.

Analysis Mode

Before Health Monitor can evaluate sensor samples, it checks whether that sensor requires a chart. If a chart is required but does not yet exist, Health Monitor automatically puts the sensor in analysis mode.

In analysis mode, Health Monitor simply records the samples it collects, and at the end of the period generates the required chart for the sensor. To ensure that the chart is reliable, a minimum of 13 samples must be taken in analysis mode. Until 13 valid samples are taken within a single recurrence of a period, the sensor remains in analysis mode and no chart is generated for that period.

Note:

Charts should always be generated from samples taken during normal, stable operation of the InterSystems IRIS instance. For example, when a Monday 09:00 a.m. - 11:30 a.m. chart does not exist, it should not be generated on a Monday holiday or while a technical problem is affecting the operation of the InterSystems IRIS instance.
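As a rough illustration of what a generated chart contains, the following Python sketch computes the mean, sigma (standard deviation), and highest value from a list of values, and declines to build a chart from fewer than 13 of them. The function name build_chart and the dictionary layout are hypothetical. The use of the population standard deviation is an assumption, since the documentation does not say which form Health Monitor uses.

```python
from statistics import mean, pstdev

# Minimum number of values required before a chart is generated.
MIN_SAMPLES = 13


def build_chart(values):
    """Return chart statistics, or None if there are too few values
    (in which case the sensor would remain in analysis mode).

    Assumption: sigma is the population standard deviation.
    """
    if len(values) < MIN_SAMPLES:
        return None
    return {
        "mean": mean(values),
        "sigma": pstdev(values),
        "max": max(values),
    }
```

Because the mean and sigma feed directly into the notification thresholds, a chart built from atypical values (for example, on a holiday) would skew later evaluation, which is why the note above recommends generating charts only during normal, stable operation.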

When a period has recurred five times since a chart was generated for a sensor or sensor/item during that period, not including those during which an alert was generated, the readings from these five normal period recurrences are evaluated to detect a rising or shifted mean for the sensor. If the mean is rising or has shifted with 95% certainty, the chart is recalibrated—the existing chart for the sensor during that period is replaced with a chart generated from the samples taken during the most recent recurrence of the period. For example, if the number of users accessing a database is growing slowly but steadily, the mean DBReads value for that database is likely to also rise slowly but steadily, resulting in regular chart recalibration every five periods, which avoids unwarranted alerts.

Note that the absolute and multiplier values in a sensor object are never automatically recalibrated; automatic recalibration applies only to charts, so these values must be adjusted manually. For example, if the number of users accessing a database grows, the base, maximum (alert) value, and warning value for the DBLatency sensor object may require manual adjustment.

Editing a Chart

The ^%SYSMONMGR utility lets you display a list of all current charts, including the mean and sigma of each. You can also display the details of a particular chart, including the individual readings and highest reading. To access these options from the utility, select Configure Charts from the Configure Health Monitor Classes submenu.

The Configure Charts option also provides two ways to customize alerting by customizing charts:

  • You can change the mean and/or sigma to whatever values you wish by editing an existing chart. The standard notification rules apply, but using the values you have entered.

  • You can create a chart, specifying an alert value and a warning value. Creating a chart is similar to setting an absolute value for the notification threshold; alerts and warnings are generated based solely on the values you supply for the chart.

Note:

When listing, examining, editing, or creating charts, the Item heading or prompt refers to a database (specified by a directory path), a Web Gateway server (specified by an IP address), or a mirror (specified by the mirror name). See Sensors and Sensor Objects for more information.

You can also programmatically build chart statistics based on a list of values using class methods of SYS.Monitor.Health.Chart. For more information, see the SYS.Monitor.Health.Chart class documentation.

Note:

A chart generated by Health Monitor, including one you have edited, can be automatically recalibrated as described in Analysis Mode. In addition, all charts generated by Health Monitor, including those that have been edited, are deleted when an InterSystems IRIS instance is upgraded.

A chart created using the Configure Charts submenu or the CreateChart() class method, however, is never automatically recalibrated or deleted on upgrade. A user-created chart is therefore permanently associated with the selected sensor/period combination until you select the Reset Charts option within the Reset Defaults option of the Configure Health Monitor Classes submenu or select Recalibrate Charts within the Configure Charts option.

Notification Rules

Health Monitor generates an alert (notification of severity 2) if three consecutive readings of a sensor during a period are greater than the sensor maximum threshold value, and a warning (notification of severity 1) if five consecutive readings of a sensor during a period are greater than the sensor warning threshold value. The maximum and warning threshold values depend on the settings in the sensor object and whether the applicable chart was generated by Health Monitor or created by a user, as shown in the following table.

Note also that:

  • When a sensor object has maximum value and warning value set, no chart is required and therefore no chart is generated, and notifications are generated even when Health Monitor is disabled.

  • When a sensor object has maximum multiplier and warning multiplier set, or base only, a chart is required; until sufficient samples have been collected in analysis mode to generate the chart, no notifications are generated.

  • When a user-created chart exists, it does not matter what the sensor object settings are.

Sensor Object Settings Chart Type Sensor Maximum Value Sensor Warning Value Active When
base, maximum value, warning value none sensor object maximum value sensor object warning value System Monitor running
base, maximum multiplier, warning multiplier generated sensor object maximum multiplier times greater of:
  • chart mean plus three sigma

  • highest chart value plus one sigma

sensor object warning multiplier times greatest of:
  • base

  • chart mean plus two sigma

  • highest chart value

System Monitor running, Health Monitor enabled
base only generated greater of:
  • chart mean plus three sigma

  • highest chart value plus one sigma

greater of:
  • chart mean plus two sigma

  • highest chart value

System Monitor running, Health Monitor enabled
(n/a if user-created chart exists) user-created chart alert value chart warning value System Monitor running, Health Monitor enabled
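The consecutive-reading rule stated at the start of this section can be sketched as follows. This is an illustration only: the function name notification_severity is hypothetical, the thresholds are assumed to have been resolved already (from the sensor object or chart, as in the table above), and the sketch assumes that an alert takes precedence when both conditions are met.

```python
def notification_severity(readings, max_threshold, warn_threshold):
    """Return 2 (alert), 1 (warning), or 0 (no notification) for a
    sequence of readings taken during one period.

    Assumption: an alert outranks a warning when both rules are satisfied.
    """
    def has_run(threshold, length):
        # True if `length` consecutive readings exceed `threshold`.
        run = 0
        for reading in readings:
            run = run + 1 if reading > threshold else 0
            if run >= length:
                return True
        return False

    if has_run(max_threshold, 3):   # three consecutive readings above maximum
        return 2
    if has_run(warn_threshold, 5):  # five consecutive readings above warning
        return 1
    return 0
```

With the default CPUUsage values (maximum 85, warning 75), the readings 90, 91, 92 in a row produce severity 2, while five readings of 80 in a row produce severity 1. A run interrupted by a single lower reading starts over.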

Examples

In this example, the chart for the DBReads install-dir\IRIS\mgr\user sensor during the Monday 09:00 a.m. - 11:30 a.m. period indicates that the mean reads per minute from the USER database is 2145, with a sigma of 141 and maximum value of 2327. The default notification threshold multiplier for DBReads is 2. An alert is generated for this sensor when three consecutive readings exceed the greater of the following two values:

  • maximum multiplier * (chart mean + (3 * chart sigma))

    2 * (2145 + (3 * 141)) = 5136

  • maximum multiplier * (chart maximum value + chart sigma)

    2 * (2327 + 141) = 4936

So, for this sensor during this period, an alert is generated if three consecutive readings are greater than 5136.

A sensor with no multipliers or maximum values is evaluated with a multiplier of 1. As an example, if the DBReads sensor object were edited to remove the multipliers, leaving it with only a base, an alert is generated for DBReads install-dir\IRIS\mgr\user when three consecutive readings are greater than 2568, calculated as the greater of:

  • maximum multiplier * (the chart mean + three times the sigma)

    1 * (2145 + (3 * 141)) = 2568

  • maximum multiplier * (the highest value in the chart + one sigma)

    1 * (2327 + 141) = 2468
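The arithmetic in both examples can be reproduced with a short Python sketch. The function name alert_threshold is hypothetical; it implements only the alert calculation shown in the examples above for a chart generated by Health Monitor, with a multiplier of 1 standing in for a sensor object that has only a base.

```python
def alert_threshold(chart_mean, sigma, chart_max, max_multiplier=1):
    """Alert threshold for a sensor evaluated against a generated chart:
    the multiplier times the greater of (mean + 3 sigma) and (max + sigma)."""
    return max_multiplier * max(chart_mean + 3 * sigma, chart_max + sigma)


# DBReads chart values from the examples above.
with_multiplier = alert_threshold(2145, 141, 2327, max_multiplier=2)
base_only = alert_threshold(2145, 141, 2327)
```

Here with_multiplier evaluates to 5136 and base_only to 2568, matching the two worked examples.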

Using ^%SYSMONMGR to Manage Health Monitor

As described in Using the ^%SYSMONMGR Utility, the ^%SYSMONMGR utility lets you manage and configure System Monitor, including Health Monitor. To manage Health Monitor, change to the %SYS namespace in the Terminal, then enter the following command:

%SYS>do ^%SYSMONMGR

1) Start/Stop System Monitor
2) Set System Monitor Options
3) Configure System Monitor Classes
4) View System Monitor State
5) Manage Application Monitor
6) Manage Health Monitor
7) View System Data
8) Exit 

Option? 

Note:

Health Monitor runs only in the %SYS namespace. When you start ^%SYSMONMGR in another namespace, option 6 (Manage Health Monitor) does not appear.

Enter 6 for Manage Health Monitor. The following menu displays:

1) Enable/Disable Health Monitor 
2) View Alerts Records
3) Configure Health Monitor Classes 
4) Set Health Monitor Options
5) Exit 

Option? 

Enter the number of your choice or press Enter to exit the Health Monitor utility.

The options in the main menu let you perform Health Monitor tasks as described in the following table:

Option Description
1) Enable/Disable Health Monitor
  • Enable Health Monitor (if it is disabled, as by default), so that it starts when System Monitor starts. Health Monitor does not begin collecting sensor readings until after the configured startup wait time is complete.

  • Disable Health Monitor (if it is enabled), so that it does not start when System Monitor starts.

2) View Alerts Records
  • View alert records for one or all sensor objects over a specified date range.

3) Configure Health Monitor Classes
  • List notification rules.

  • List and delete existing periods and add new ones.

  • List, examine, edit, create and recalibrate charts.

  • List sensor objects and edit their settings.

  • Reset Health Monitor elements to their defaults.

4) Set Health Monitor Options
  • Set startup wait time.

  • Specify when alert records should be purged.

Note:

When the utility asks you to specify a single element such as a sensor, rule, period or chart, you can enter ? (question mark) at the prompt for a numbered list, then enter the number of the element you want.

All output from the utility can be displayed on the Terminal or sent to a specified device.

View Alerts Records

Choose this option to view recently generated alerts for a specific sensor, or for all sensors. You can examine the details of individual alerts and warnings, including the mean and sigma of the chart and the readings that triggered the notification. (Alert records are purged after a configurable number of days; see Set Health Monitor Options for more information.)

Configure Health Monitor Classes

The options in this submenu let you customize Health Monitor, as described in the following table.

Note:

You cannot use these options to customize Health Monitor while System Monitor is running; you must first stop System Monitor, and then restart it after you have made your changes.

Option Description
1) Activate/Deactivate Rules

(not in use in this release)

2) Configure Periods

List the currently configured periods and add and delete periods.

3) Configure Charts

Lets you

  • List the mean and sigma of all existing charts, organized by period.

  • Examine individual charts in detail, including the readings on which the mean and sigma are based, with the highest reading called out.

  • Change the mean and sigma of an existing chart using the Edit Charts option.

  • Create a chart, specifying alert and warning thresholds.

  • Manually recalibrate all charts (including user-created charts) or an individual chart from the most recent data.

4) Edit Sensor Objects

List the sensor objects representing the sensors in the SYS.Monitor.SystemSensors class and modify their base, maximum, warning, maximum multiplier, and warning multiplier values.

5) Reset Defaults

Lets you

  • Reset to the default period configuration and remove all existing charts, returning every period to analysis mode (see Health Monitor Process Description).

  • Remove all existing charts (including user-created charts), returning every period to analysis mode, without removing any user-defined period configuration.

  • Reset all sensor objects to their default values.

  • Reset the Health Monitor options (startup wait time and alert purge time) to their defaults.

Set Health Monitor Options

This submenu lets you set several Health Monitor options, as shown in the following table:

Option Description
1) Set Startup Wait Time

Configure the number of minutes System Monitor waits after starting, when Health Monitor is enabled, before passing sensor readings to the Health Monitor subscriber, SYS.Monitor.Health.Control. This allows InterSystems IRIS to reach normal operating conditions before Health Monitor begins creating charts or evaluating readings.

2) Set Alert Purge Time

Specify when an alert record should be purged (deleted); the default is five days after the alert is generated.
