Disk Usage and Drain of Events Health Monitor Alerts
The Disk Usage health module compares disk usage on a managed device's hard drive and malware storage pack against the limits configured for the module, and alerts when usage exceeds those percentages. The module also alerts when the system drains (deletes) files in the monitored disk usage categories too frequently, or when disk usage outside those categories reaches excessive levels, based on the module thresholds.
This topic describes the symptoms and troubleshooting guidelines for two health alerts generated by the Disk Usage health module:
- Frequent Drain of Events
- Drain of Unprocessed Events
The disk manager process manages the disk usage of a device. Each type of file monitored by the disk manager is assigned a silo. Based on the amount of disk space available on the system, the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.
To display detailed disk usage information for each part of the system, including silos, LWMs, and HWMs, use the show disk-manager command.
Examples
The following is an example of the disk manager information:
> show disk-manager
Silo Used Minimum Maximum
Temporary Files 0 KB 499.197 MB 1.950 GB
Action Queue Results 0 KB 499.197 MB 1.950 GB
User Identity Events 0 KB 499.197 MB 1.950 GB
UI Caches 4 KB 1.462 GB 2.925 GB
Backups 0 KB 3.900 GB 9.750 GB
Updates 0 KB 5.850 GB 14.625 GB
Other Detection Engine 0 KB 2.925 GB 5.850 GB
Performance Statistics 33 KB 998.395 MB 11.700 GB
Other Events 0 KB 1.950 GB 3.900 GB
IP Reputation & URL Filtering 0 KB 2.437 GB 4.875 GB
Archives & Cores & File Logs 0 KB 3.900 GB 19.500 GB
Unified Low Priority Events 1.329 MB 4.875 GB 24.375 GB
RNA Events 0 KB 3.900 GB 15.600 GB
File Capture 0 KB 9.750 GB 19.500 GB
Unified High Priority Events 0 KB 14.625 GB 34.125 GB
IPS Events 0 KB 11.700 GB 29.250 GB
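In this sample output, the Minimum and Maximum columns correspond to the LWM and HWM of each silo. For example, the Unified Low Priority Events silo can grow from its LWM of 4.875 GB to its HWM of 24.375 GB, roughly 19.5 GB of headroom, before the disk manager drains it back down. A silo that receives input faster than it can be processed or forwarded consumes this headroom quickly and is drained more often.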
Health Alert Format
When the Health Monitor process on the management center runs (once every 5 minutes or when a manual run is triggered), the Disk Usage module looks into the diskmanager.log file and, if the correct conditions are met, the health alert is triggered.
The structures of these health alerts are as follows:
- Frequent drain of <SILO NAME>
- Drain of unprocessed events from <SILO NAME>
For example:
- Frequent drain of Low Priority Events
- Drain of unprocessed events from Low Priority Events
It is possible for any silo to generate a Frequent drain of <SILO NAME> health alert. However, the most commonly seen are the alerts related to events. Among the event silos, Low Priority Events alerts are the most common because the device generates this type of event frequently.
A Frequent drain of <SILO NAME> event has a Warning severity level when it is seen for an event-related silo, because the events are queued and sent to the management center. For a non-event-related silo, such as the Backups silo, the alert has a Critical severity level because the drained information is lost.
Important: Only event silos generate a Drain of unprocessed events from <SILO NAME> health alert. This alert always has a Critical severity level.
Additional symptoms besides the alerts can include:
- Slowness on the management center user interface
- Loss of events
Common Troubleshooting Scenarios
A Frequent drain of <SILO NAME> event is caused by too much input into the silo for its size. In this case, the disk manager drains (purges) that silo at least twice in the last 5-minute interval. In an event type silo, this is typically caused by excessive logging of that event type.
A Drain of unprocessed events of <SILO NAME> health alert is caused by a bottleneck in the event processing path.
There are three potential bottlenecks with respect to these Disk Usage alerts:
- Excessive logging ― The EventHandler process on threat defense is oversubscribed (it reads slower than what Snort writes).
- Sftunnel bottleneck ― The Eventing interface is unstable or oversubscribed.
- SFDataCorrelator bottleneck ― The data transmission channel between the management center and the managed device is oversubscribed.
Excessive Logging
One of the most common causes for health alerts of this type is excessive input. The difference between the Low Water Mark (LWM) and the High Water Mark (HWM) shown by the show disk-manager command indicates how much space is available in that silo to go from a freshly drained state (LWM) to the point at which the next drain is triggered (HWM). If there are frequent drains of events (with or without unprocessed events), review the logging configuration.
- Check for double logging ― Double logging scenarios can be identified by looking at the correlator perfstats on the management center:
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
- Check logging settings for the ACP ― Review the logging settings of the Access Control Policy (ACP). If the logging setting includes both "Beginning" and "End" of connection, every connection produces two events; modify the setting to log only the end of connection to reduce the event volume.
Communications Bottleneck ― Sftunnel
Sftunnel is responsible for encrypted communications between the management center and the managed device. Events are sent over the tunnel to the management center. Connectivity issues and/or instability in the communication channel (sftunnel) between the managed device and the management center can be due to:
- Sftunnel is down or is unstable (flaps).
Ensure that the management center and the managed device have reachability between their management interfaces on TCP port 8305.
The sftunnel process should be stable and should not restart unexpectedly. Verify this by checking the /var/log/messages file and searching for messages that contain the sftunneld string (see the example after this list).
- Sftunnel is oversubscribed.
Review trend data from the Health Monitor and look for signs of oversubscription of the management center's management interface, such as a spike in management traffic or a constant oversubscription.
Consider using a secondary management interface for eventing. To use this interface, you must configure its IP address and other parameters at the threat defense CLI using the configure network management-interface command (see the sketch after this list).
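The following is a minimal sketch of these checks from the expert mode shell; it uses only standard Linux tools, and the prompts and log paths shown are illustrative and can vary by platform and version:
admin@FMC:~$ sudo grep -i sftunneld /var/log/messages | tail
admin@FMC:~$ netstat -an | grep 8305
If a dedicated eventing interface is used, it is enabled with the configure network management-interface command family at the threat defense CLI. The exact subcommands and the interface name (management1 in this sketch) are assumptions that depend on the platform and software version, so verify them against the command reference:
> configure network management-interface enable management1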
Communications Bottleneck ― SFDataCorrelator
The SFDataCorrelator process manages data transmission between the management center and the managed device; on the management center, it analyzes binary files created by the system to generate events, connection data, and network maps. The first step is to review the diskmanager.log file for important information, such as:
- The frequency of the drain.
- The number of files with Unprocessed Events drained.
- The occurrence of the drain with Unprocessed Events.
Each time the disk manager process runs, it generates an entry for each of the different silos in its own log file, which is located under [/ngfw]/var/log/diskmanager.log. Information gathered from the diskmanager.log (in CSV format) can be used to help narrow the search for a cause.
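As an illustration, assuming that each CSV entry contains the silo name as plain text (the exact field layout is not documented here), standard text tools can be used from the expert mode shell of the managed device to view the most recent entries and see how often a given silo appears in the log; the prompt shown is illustrative:
admin@FTD:~$ sudo tail /ngfw/var/log/diskmanager.log
admin@FTD:~$ sudo grep -c "Unified Low Priority Events" /ngfw/var/log/diskmanager.log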
Additional troubleshooting steps:
- The stats_unified.pl command can help you determine whether the managed device has data that still must be sent to the management center. This condition can happen when the managed device and the management center experience a connectivity issue; the managed device stores the log data on its hard drive until it can be sent.
admin@FTD:~$ sudo stats_unified.pl
- The manage_procs.pl command can reconfigure the correlator on the management center side.
root@FMC:~# manage_procs.pl
Before You Contact Cisco TAC
It is highly recommended to collect these items before you contact Cisco TAC:
- Screenshots of the health alert seen.
- Troubleshoot file generated from the management center.
- Troubleshoot file generated from the affected managed device.
- Date and time when the problem was first seen.
- Information about any recent changes done to the policies (if applicable).
- The output of the stats_unified.pl command as described in Communications Bottleneck ― SFDataCorrelator.