Disk Usage and Drain of Events Health Monitor Alerts
The Disk Usage health module compares disk usage on a managed device's hard drive and malware storage pack against the limits configured for the module, and alerts when usage exceeds those percentages. The module also alerts when the system drains (deletes) files in the monitored disk usage categories too frequently, or when disk usage outside those categories reaches excessive levels, based on the module thresholds.
This topic describes the symptoms and troubleshooting guidelines for two health alerts generated by the Disk Usage health module:
- Frequent Drain of Events
- Drain of Unprocessed Events
The disk manager process manages the disk usage of a device. Each type of file monitored by the disk manager is assigned a silo. Based on the amount of disk space available on the system, the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.
To display detailed disk usage information for each part of the system, including silos, LWMs, and HWMs, use the show disk-manager command.
Examples
The following is an example of the disk manager information:
> show disk-manager
Silo Used Minimum Maximum
Temporary Files 0 KB 499.197 MB 1.950 GB
Action Queue Results 0 KB 499.197 MB 1.950 GB
User Identity Events 0 KB 499.197 MB 1.950 GB
UI Caches 4 KB 1.462 GB 2.925 GB
Backups 0 KB 3.900 GB 9.750 GB
Updates 0 KB 5.850 GB 14.625 GB
Other Detection Engine 0 KB 2.925 GB 5.850 GB
Performance Statistics 33 KB 998.395 MB 11.700 GB
Other Events 0 KB 1.950 GB 3.900 GB
IP Reputation & URL Filtering 0 KB 2.437 GB 4.875 GB
Archives & Cores & File Logs 0 KB 3.900 GB 19.500 GB
Unified Low Priority Events 1.329 MB 4.875 GB 24.375 GB
RNA Events 0 KB 3.900 GB 15.600 GB
File Capture 0 KB 9.750 GB 19.500 GB
Unified High Priority Events 0 KB 14.625 GB 34.125 GB
IPS Events 0 KB 11.700 GB 29.250 GB
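In this sample output, the Minimum and Maximum columns correspond to the LWM and HWM of each silo. For example, the Unified Low Priority Events silo can grow from its LWM of 4.875 GB to its HWM of 24.375 GB, roughly 19.5 GB of headroom, before the disk manager drains it back down. A silo that receives input faster than it can be processed or forwarded consumes this headroom quickly and is drained more often.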
Health Alert Format
When the Health Monitor process on the management center runs (once every 5 minutes or when a manual run is triggered), the Disk Usage module looks into the diskmanager.log file and, if the correct conditions are met, the health alert is triggered.
The structures of these health alerts are as follows:
- Frequent drain of <SILO NAME>
- Drain of unprocessed events from <SILO NAME>
For example:
- Frequent drain of Low Priority Events
- Drain of unprocessed events from Low Priority Events
It is possible for any silo to generate a Frequent drain of <SILO NAME> health alert. However, the most commonly seen are the alerts related to events. Among the event silos, Low Priority Events alerts are the most common because the device generates this type of event frequently.
A Frequent drain of <SILO NAME> event has a Warning severity level when it is seen for an event-related silo, because the events are queued and sent to the management center. For a non-event-related silo, such as the Backups silo, the alert has a Critical severity level because the drained information is lost.
Important: Only event silos generate a Drain of unprocessed events from <SILO NAME> health alert. This alert always has a Critical severity level.
Additional symptoms besides the alerts can include:
- Slowness on the management center user interface
- Loss of events
Common Troubleshooting Scenarios
A Frequent drain of <SILO NAME> event is caused by too much input into the silo for its size. In this case, the disk manager drains (purges) that silo at least twice in the last 5-minute interval. In an event type silo, this is typically caused by excessive logging of that event type.
A Drain of unprocessed events of <SILO NAME> health alert is caused by a bottleneck in the event processing path.
There are three potential bottlenecks with respect to these Disk Usage alerts:
- Excessive logging ― The EventHandler process on threat defense is oversubscribed (it reads slower than what Snort writes).
- Sftunnel bottleneck ― The Eventing interface is unstable or oversubscribed.
- SFDataCorrelator bottleneck ― The data transmission channel between the management center and the managed device is oversubscribed.
Excessive Logging
One of the most common causes for health alerts of this type is excessive input. The difference between the Low Water Mark (LWM) and the High Water Mark (HWM) shown by the show disk-manager command indicates how much space is available in that silo to go from a freshly drained state (LWM) to the point at which the next drain is triggered (HWM). If there are frequent drains of events (with or without unprocessed events), review the logging configuration.
- Check for double logging ― Double logging scenarios can be identified by looking at the correlator perfstats on the management center:
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
- Check logging settings for the ACP ― Review the logging settings of the Access Control Policy (ACP). If the logging setting includes both "Beginning" and "End" of connection, every connection produces two events; modify the setting to log only the end of connection to reduce the event volume.
Communications Bottleneck ― Sftunnel
Sftunnel is responsible for encrypted communications between the management center and the managed device. Events are sent over the tunnel to the management center. Connectivity issues and/or instability in the communication channel (sftunnel) between the managed device and the management center can be due to:
- Sftunnel is down or is unstable (flaps).
Ensure that the management center and the managed device have reachability between their management interfaces on TCP port 8305.
The sftunnel process should be stable and should not restart unexpectedly. Verify this by checking the /var/log/messages file and searching for messages that contain the sftunneld string (see the example after this list).
- Sftunnel is oversubscribed.
Review trend data from the Health Monitor and look for signs of oversubscription of the management center's management interface, such as a spike in management traffic or a constant oversubscription.
Consider using a secondary management interface for eventing. To use this interface, you must configure its IP address and other parameters at the threat defense CLI using the configure network management-interface command (see the sketch after this list).
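The following is a minimal sketch of these checks from the expert mode shell; it uses only standard Linux tools, and the prompts and log paths shown are illustrative and can vary by platform and version:
admin@FMC:~$ sudo grep -i sftunneld /var/log/messages | tail
admin@FMC:~$ netstat -an | grep 8305
If a dedicated eventing interface is used, it is enabled with the configure network management-interface command family at the threat defense CLI. The exact subcommands and the interface name (management1 in this sketch) are assumptions that depend on the platform and software version, so verify them against the command reference:
> configure network management-interface enable management1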
Communications Bottleneck ― SFDataCorrelator
The SFDataCorrelator process manages data transmission between the management center and the managed device; on the management center, it analyzes binary files created by the system to generate events, connection data, and network maps. The first step is to review the diskmanager.log file for important information, such as:
- The frequency of the drain.
- The number of files with Unprocessed Events drained.
- The occurrence of the drain with Unprocessed Events.
Each time the disk manager process runs, it generates an entry for each of the different silos in its own log file, which is located under [/ngfw]/var/log/diskmanager.log. Information gathered from the diskmanager.log (in CSV format) can be used to help narrow the search for a cause.
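As an illustration, assuming that each CSV entry contains the silo name as plain text (the exact field layout is not documented here), standard text tools can be used from the expert mode shell of the managed device to view the most recent entries and see how often a given silo appears in the log; the prompt shown is illustrative:
admin@FTD:~$ sudo tail /ngfw/var/log/diskmanager.log
admin@FTD:~$ sudo grep -c "Unified Low Priority Events" /ngfw/var/log/diskmanager.log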
Additional troubleshooting steps:
- The stats_unified.pl command can help you determine whether the managed device has data that still must be sent to the management center. This condition can happen when the managed device and the management center experience a connectivity issue; the managed device stores the log data on its hard drive until it can be sent.
admin@FTD:~$ sudo stats_unified.pl
- The manage_procs.pl command can reconfigure the correlator on the management center side.
root@FMC:~# manage_procs.pl
Before You Contact Cisco TAC
It is highly recommended to collect these items before you contact Cisco TAC:
- Screenshots of the health alert seen.
- Troubleshoot file generated from the management center.
- Troubleshoot file generated from the affected managed device.
- Date and time when the problem was first seen.
- Information about any recent changes done to the policies (if applicable).
- The output of the stats_unified.pl command as described in Communications Bottleneck ― SFDataCorrelator.