
Tune and Troubleshoot SAM

Important:

System Alerting and Monitoring (SAM) has been deprecated; the following documentation is provided for existing users only. Customers interested in a comprehensive view of their operational platform can access the metrics API and structured logs of InterSystems products within another observability tool. Existing users who would like assistance identifying an alternative solution should contact the WRC.

This section describes common tasks for tuning the performance of System Alerting and Monitoring (SAM) version 1.0 or 1.1, and explains how to troubleshoot common problems.

Access the Management Portal for the SAM Manager

The SAM Manager is the InterSystems IRIS instance that powers the System Alerting and Monitoring application. Like other InterSystems IRIS instances, the SAM Manager provides a web application for maintenance and administration called the Management Portal. You can access the Management Portal for the SAM Manager at the following address:

http://<sam-domain-name>:<port>/csp/sys/UtilHome.csp

where <sam-domain-name> is the DNS name or IP address of the system SAM is running on, and <port> is the configured Nginx port (8080 by default).

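Actions you can perform using the Management Portal for the SAM Manager include adjusting startup settings, clearing the SAM database, and importing alert handlers, as described in the sections that follow.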

Important:

The SAM Manager should not be used to develop or run any application; it is strictly for use by SAM. The directions in this section describe appropriate uses and interactions with the SAM Manager.

For a general-purpose InterSystems IRIS instance, install InterSystems IRIS Community Edition.

Adjust Startup Settings

The SAM Manager initially allocates memory on startup as follows:

  • 2,000 MB of 8KB blocks for the database cache

  • 300 MB for the routines cache

This allocation should be sufficient when monitoring a modest number (30 or fewer) of InterSystems IRIS instances. If you are monitoring a large number of instances, or find that the SAM Manager is regularly using the full amount of allocated memory, you can increase these limits.

For details on adjusting these settings, see the Allocating Memory to the Database and Routine Caches topic in the System Administration Guide.

Clear the SAM Database

System Alerting and Monitoring Community Edition has a maximum database size limit of 10 GB. If this limit is met, SAM may exhibit unexpected behavior, and it becomes necessary to clear the database.

In the SQL page of the SAM Manager (System Explorer > SQL), enter the following command to delete all SAM metric data:

DELETE FROM %SAM.PrometheusSample

To prevent the SAM database from filling up again, consider using a different license, changing the storage location, or lowering the number of days that SAM stores metrics using the configuration settings dialog on the Monitor Clusters page.

Monitor the SAM Manager

It is possible to use System Alerting and Monitoring to monitor the SAM Manager, as the SAM Manager is itself an InterSystems IRIS instance. This allows you to keep track of whether the SAM database is at risk of filling up, and make sure the configured cache sizes are sufficient for SAM operations.

Adding the SAM Manager to a SAM cluster is the same as adding any other InterSystems IRIS instance, with the following difference:

For the IP and Port fields, specify the fully qualified DNS name and port (8080 by default) where SAM runs. You can see these values in the address bar of your browser when accessing SAM. For example, if the URL for SAM is:

http://<sam-domain-name>:<port>/api/sam/app/index.csp

Specify <sam-domain-name> in the IP field, and <port> in the Port field.

Note:

Specifying localhost in the IP field does not work; you must enter a fully qualified DNS name.
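As an illustration of the mapping described above, the following Python sketch splits a SAM URL into the values for the IP and Port fields. The function name sam_target_fields is hypothetical, and the fallback to the scheme's standard port when the URL shows no port is an assumption of this sketch, not SAM behavior.

```python
from urllib.parse import urlsplit

# Standard ports used only when the URL itself shows no port (assumption).
SCHEME_DEFAULT_PORTS = {"http": 80, "https": 443}


def sam_target_fields(sam_url: str) -> tuple[str, int]:
    """Return (host, port) to enter in the IP and Port fields (hypothetical helper)."""
    parts = urlsplit(sam_url)
    host = parts.hostname
    if not host:
        raise ValueError(f"no host found in {sam_url!r}")
    if host == "localhost":
        # The note above states that localhost does not work in the IP field.
        raise ValueError("localhost does not work; use a fully qualified DNS name")
    port = parts.port
    if port is None:
        port = SCHEME_DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"no port found in {sam_url!r}")
    return host, port
```

For example, passing the address shown in the browser, with sam.example.com standing in for your own host name, yields the host and port as a pair.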

Create Custom Alert Handlers

You can create custom alert handlers that specify additional actions for System Alerting and Monitoring to perform when an alert fires, such as sending a text or email. Setting up an alert handler is a two-step process:

  1. Write the Alert Handler

  2. Import the Alert Handler

Write the Alert Handler

To create an alert handler, you must create a class using an ObjectScript IDE. Connect this IDE to an InterSystems IRIS instance that is not part of SAM.

Important:

You cannot use the SAM Manager to create the alert handler, as SAM is not a development platform.

Instead, you must connect the IDE to a different InterSystems IRIS instance (such as the InterSystems IRIS Community Edition), and later import the alert handler into the SAM Manager.

After setting up the IDE, create a class with the following characteristics:

  • The class extends the %SAM.AbstractAlertsHandler class.

  • The class implements the HandleAlerts() class method. Within this method, specify the desired behavior when an alert fires.

When SAM detects a new alert (or multiple new alerts), SAM calls the HandleAlerts() method of all alert handlers. The HandleAlerts() method receives a %DynamicArray packet of alerts with the following format:

[
  {
    "labels":{
      "alertname":"High CPU Usage",
      "cluster":"1",
      "instance":"10.0.0.24:9092",
      "job":"SAM",
      "severity":"critical"
    },
    "annotations":{
      "description":"CPU usage exceeded the 95% threshold."
    },
    "ts": "2020-04-17 18:07:42.536"
  },
  {
    "labels":{
      "alertname":"iris_system_alert",
      "cluster":"1",
      "instance":"10.0.0.24:9092",
      "job":"SAM",
      "severity":"critical"
    },
    "annotations":{
      "description":"Previous system shutdown was abnormal, system forced down or crashed"
    },
    "ts":"2020-04-17 18:07:36.926"
  }
]

Note:

Alerts generated by an InterSystems IRIS instance are all named iris_system_alert.
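To get familiar with the packet layout before writing a handler, it can help to explore a sample outside SAM. The following Python sketch parses a packet in the format shown above and produces one summary line per alert. It is only an illustration of the data structure: the handler itself must be an ObjectScript class, and the names SAMPLE_PACKET and summarize_alerts are hypothetical.

```python
import json

# Sample packet copied from the format shown above.
SAMPLE_PACKET = """
[
  {
    "labels": {"alertname": "High CPU Usage", "cluster": "1",
               "instance": "10.0.0.24:9092", "job": "SAM", "severity": "critical"},
    "annotations": {"description": "CPU usage exceeded the 95% threshold."},
    "ts": "2020-04-17 18:07:42.536"
  },
  {
    "labels": {"alertname": "iris_system_alert", "cluster": "1",
               "instance": "10.0.0.24:9092", "job": "SAM", "severity": "critical"},
    "annotations": {"description": "Previous system shutdown was abnormal, system forced down or crashed"},
    "ts": "2020-04-17 18:07:36.926"
  }
]
"""


def summarize_alerts(packet_json: str) -> list[str]:
    """Return one human-readable line per alert in the packet."""
    summaries = []
    for alert in json.loads(packet_json):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        summaries.append(
            f'{alert.get("ts", "?")} [{labels.get("severity", "unknown")}] '
            f'{labels.get("alertname", "?")} on {labels.get("instance", "?")}: '
            f'{annotations.get("description", "")}'
        )
    return summaries
```

Each alert carries its name, severity, and originating instance under labels, its message under annotations, and its timestamp under ts.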

Below is an example of an alert handler class. This example writes a message to the messages log (or Console Log) whenever an alert fires:

/// An example Alert Handler class, which writes messages to the messages log.
Class User.AlertHandler Extends %SAM.AbstractAlertsHandler
{

/// Called by SAM with a packet of new alerts; logs each alert description.
ClassMethod HandleAlerts(packet As %DynamicArray) As %Status
{
    // Walk every alert object in the packet.
    set iter = packet.%GetIterator()
    while iter.%GetNext(.idx, .alert) {
        set msg = alert.annotations.description
        // Map critical alerts to severity 2; all others to severity 1.
        if alert.labels.severity = "critical" {set severity = 2} else {set severity = 1}
        do ##class(%SYS.System).WriteToConsoleLog(msg, 1, severity)
    }
    quit $$$OK
}

}

Import the Alert Handler into SAM

After creating the alert handler, the next step is to import it into SAM.

  1. First, export the alert handler in XML format. How to do this depends on the IDE you are using.

  2. Next, log in to the Management Portal for the SAM Manager from a web browser, using the following address:

    http://<sam-domain-name>:8080/csp/sys/UtilHome.csp
    

    where <sam-domain-name> is the fully qualified DNS name or IP address of the system SAM is running on.

  3. Navigate to the Classes page (System Explorer > Classes).

  4. Make sure the SAM namespace is selected, then click Import. This brings up the Import Classes dialog.

  5. In the Import Classes dialog:

    • For The import file resides on, select My Local Machine.

    • For Select the path and name of the import file, click the Choose File button and select the alert handler XML file from your file system.

  6. At the bottom of the dialog, click Next, then Import. A result dialog should appear to tell you the status of your import.

After the import is complete, you have successfully added the alert handler to SAM. From now on, any time SAM detects a new alert, it calls the HandleAlerts() method of your class.

If you ever need to update an alert handler, simply repeat the steps above with the newer version. This replaces the previous version with the new one.

Improve Performance When Querying Older Metrics

SAM includes two databases: the Prometheus database (used for short-term metrics storage) and an InterSystems IRIS database (used for longer-term storage). The Prometheus database retains data for two hours in a cache optimized for rapid querying, while the InterSystems IRIS database retains the data for long term analysis.

If you frequently run queries for data older than two hours, increasing the Prometheus retention time may improve performance. Adjust this setting by changing the “--storage.tsdb.retention.time” flag in the docker-compose.yml file. For more information, see “Operational aspects” in the Prometheus documentation (https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects).

Troubleshoot an Unreachable Instance

There are many reasons the state of an instance could become Unreachable. This section provides several potential causes and solutions.

If none of these procedures resolve the Unreachable status, contact the InterSystems Worldwide Response Center (WRC) for further troubleshooting help.

The target instance is not outputting metrics

The /api/monitor application for the instance you are monitoring with SAM may not be outputting metrics. To check, use a web browser or the curl command to access the following URL:

http://<instance-host>:<port>/api/monitor/metrics

If this does not return a list of metrics, ensure that the instance is on InterSystems IRIS version 2020.1 or higher and that the /api/monitor application is configured to allow unauthenticated access, as described in the section on preparing instances for monitoring.
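The same check can be scripted. The following Python sketch requests the metrics URL and extracts the metric names from the response; an empty result suggests the endpoint is not returning metrics. The function names are hypothetical, and the parsing is a simplification that assumes the Prometheus text format (a metric name, optional labels in braces, then a value).

```python
import re
import urllib.request

# Prometheus metric name, optional {labels}, whitespace, then a value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+\S+")


def fetch_metrics(host: str, port: int, timeout: float = 10.0) -> str:
    """Return the raw response body from /api/monitor/metrics."""
    url = f"http://{host}:{port}/api/monitor/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def metric_names(body: str) -> set[str]:
    """Return the metric names found in a text-format metrics response."""
    names = set()
    for line in body.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:  # comment lines, blank lines, and HTML tags do not match
            names.add(match.group(1))
    return names
```

Calling metric_names(fetch_metrics(host, port)) with your instance's host and port returns a set of names when the endpoint is working; a connection error or an empty set indicates one of the problems described above.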

The target instance has an IP address in the 172.17.x.x range

System Alerting and Monitoring may not be able to reach an instance with an IP address in the 172.17.x.x range (for example, 172.17.123.123). This is because Docker uses this IP range for its own networks.

You can resolve this issue by changing the Docker IP address range. To do this, specify a different range (for example, 10.10.x.x) in the Docker daemon configuration file using the default-address-pools option. Refer to the Docker documentation for further help editing this file.
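To check quickly whether a monitored instance falls in the affected range, a short Python sketch using the standard ipaddress module is enough. The 172.17.x.x range from the text is written here as 172.17.0.0/16; the function name conflicts_with_docker is hypothetical.

```python
import ipaddress

# The 172.17.x.x range described above, in CIDR notation.
DOCKER_RANGE = ipaddress.ip_network("172.17.0.0/16")


def conflicts_with_docker(address: str) -> bool:
    """Return True if the address lies inside the 172.17.x.x range."""
    return ipaddress.ip_address(address) in DOCKER_RANGE
```

An address such as 172.17.123.123 is reported as conflicting, while 10.0.0.24 is not.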

The target instance is not responding before timeout

The /api/monitor application may be outputting metrics, but failing to respond to the GET request from Prometheus before the connection timeout (ten seconds, by default). An analysis of traffic over the network can confirm that Prometheus is ending each unsuccessful attempt to connect to the instance with a TCP FIN packet.

Restarting the instance may be sufficient to render the /api/monitor application responsive again.

Alternatively, if you have defined custom application metrics for the instance, the time required to compute these metrics may be exceeding the default scraping interval for Prometheus. While it is possible to specify larger values for the scraping interval and timeout parameters in the isc_prometheus.yml file, in this case InterSystems recommends adjusting your organization’s monitoring strategy so that SAM can reliably receive updates to all the metrics it monitors with the default frequency.
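One way to see how close an instance is to the timeout is to time a request to its metrics endpoint. The following Python sketch times any callable and compares the result with the ten-second default cited above. The function name time_scrape is hypothetical, and the request shown in the comment is one possible way to supply the callable.

```python
import time
from typing import Callable

DEFAULT_TIMEOUT_SECONDS = 10.0  # default connection timeout cited in the text


def time_scrape(
    fetch: Callable[[], object],
    timeout: float = DEFAULT_TIMEOUT_SECONDS,
) -> tuple[float, bool]:
    """Run fetch() once; return (elapsed seconds, whether it beat the timeout)."""
    start = time.monotonic()
    fetch()
    elapsed = time.monotonic() - start
    return elapsed, elapsed < timeout


# Possible usage (URL placeholders as in the text; needs "import urllib.request"):
#   url = "http://<instance-host>:<port>/api/monitor/metrics"
#   elapsed, ok = time_scrape(lambda: urllib.request.urlopen(url, timeout=60).read())
```

Using a request timeout longer than ten seconds in the callable lets the measurement finish even when the instance is too slow for Prometheus.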

The SAM database is full

If the SAM database fills up, instances may show up as Unreachable and stop reporting metrics. To check whether this is the case:

  1. Open the SAM Manager from a web browser, using the following address:

    http://<sam-domain-name>:8080/csp/sys/UtilHome.csp
    
  2. Navigate to the Databases page (System Operation > Databases).

  3. Select Free Space View.

  4. Check the % Free column for the SAM database to see whether the value is 0.

If the database is full, free some space by deleting data, as described in Clear the SAM Database. Once you have done so, shut down System Alerting and Monitoring using the stop.sh script, and restart it using start.sh.

To prevent this from happening again, you can lower the number of days SAM stores data using the Configuration Settings menu.

Alternatively, you may prefer to change the location where Docker stores the persistent copy of the database to accommodate the volume of data collected. To do this:

  1. Shut down SAM using the stop.sh script.

  2. Change the location where Docker stores a persistent copy of the data from the SAM IRIS container, as described in our guide to setting up SAM.

  3. Move the contents of the former storage location to the new location. The default configuration specified by the docker-compose.yml file stores monitoring data in a named volume (irisdata) located in the /var/lib/docker/volumes/ directory.

  4. Restart SAM using the start.sh script.

Upgrade from SAM 1.0 to SAM 1.1

Overview

Version 1.1 of InterSystems System Alerting and Monitoring (SAM) provides performance improvements for the graphs in the Grafana dashboard and the underlying Prometheus queries, especially when displaying metrics over a longer period of time.

Upgrade notes and requirements:

  • The upgrade requires a new docker-compose.yml file.

  • Performance improvements to SAM use an index that may include up to 50% more data. You must have the space available to accommodate this index.
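As a rough planning aid, the space requirement can be estimated before upgrading. The following Python sketch assumes that "up to 50% more data" means up to half the size of the existing data; that reading, and the function names, are assumptions of this sketch rather than documented behavior.

```python
import shutil

INDEX_OVERHEAD = 0.5  # "up to 50% more data", read as a fraction of existing data


def index_headroom_bytes(data_bytes: int, overhead: float = INDEX_OVERHEAD) -> int:
    """Return the worst-case extra space the new index might need."""
    return int(data_bytes * overhead)


def has_room_for_index(
    data_bytes: int, free_bytes: int, overhead: float = INDEX_OVERHEAD
) -> bool:
    """Return True if free_bytes covers the worst-case index size."""
    return free_bytes >= index_headroom_bytes(data_bytes, overhead)


def free_bytes_at(path: str) -> int:
    """Return the free space on the file system that contains path."""
    return shutil.disk_usage(path).free
```

For example, 4 GB of existing data would call for up to 2 GB of free space under this assumption.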

Note:

The upgrade does not update or overwrite the files in the /config directory created by the SAM 1.0 installation.

Performing the Upgrade

To upgrade from SAM 1.0 to SAM 1.1:

  1. Download and extract the image for SAM 1.1.

  2. Shut down the existing SAM installation.

  3. Copy the docker-compose.yml file for version 1.1 into the SAM installation directory, replacing the docker-compose.yml file for version 1.0.

  4. Configure the new docker-compose.yml file, as needed.

  5. Restart SAM.

Restarting SAM pulls a SAM 1.1 image, and then uses that image to upgrade and run the SAM container.

The SAM Manager for SAM 1.1 is a version of InterSystems IRIS 2021.1.2. To accommodate this, the new docker-compose.yml includes a new ‘iris-init’ service, which runs briefly at startup and then exits.

About Rebuilding the Index

The upgrade creates a new index for any existing data at the initial startup of the new version. Depending on the amount of data, this may take several minutes or longer.

In the SAM messages.log file, there are entries marking the start and completion of creating the index:

[Utility.Event] SAM Manager starting Index rebuild for PrometheusSample class.
...
[Utility.Event] SAM Manager completed Index rebuild for PrometheusSample class.

Note that older data may not be available in the SAM dashboards until this process has completed.
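To check progress without reading the log by hand, you can scan it for the two entries shown above. The following Python sketch reports the rebuild state from an iterable of log lines; the function name index_rebuild_state is hypothetical, and the location of messages.log depends on your installation.

```python
START_MARKER = "SAM Manager starting Index rebuild for PrometheusSample class."
DONE_MARKER = "SAM Manager completed Index rebuild for PrometheusSample class."


def index_rebuild_state(log_lines) -> str:
    """Return 'not started', 'in progress', or 'completed' based on the log."""
    state = "not started"
    for line in log_lines:
        if START_MARKER in line:
            state = "in progress"
        elif DONE_MARKER in line:
            state = "completed"
    return state


# Possible usage (the path is a placeholder):
#   with open("messages.log", encoding="utf-8", errors="replace") as log:
#       print(index_rebuild_state(log))
```

A result of "in progress" means the start entry was found without a later completion entry, so older data may not yet appear in the dashboards.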
