Features of mon 0.99.2

mon was developed under Linux, but it is known to work under Solaris 2.5 and 2.6. Because the clients and server are written entirely in Perl, portability should not be much of an issue.

The following is a list of some of the features of mon:

Monitors
"Monitors" are programs that check for a particular condition, and report success or failure to the server, along with any output. They are independent of mon, so to add a test for a new service, you can just write your monitor in any language, put it in the monitor directory, and it just works.

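As a rough illustration of the "Monitors" feature, a monitor could look like the Python sketch below. It assumes a calling convention that is not spelled out on this page: the hosts to check arrive as command-line arguments, a non-zero exit status reports failure, and the first output line lists the failed hosts. The names port_open, run_monitor, and main are hypothetical, as is the choice of TCP port 80.

```python
import socket


def port_open(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_monitor(hosts, probe=port_open):
    """Check every host; return (exit_status, output_lines)."""
    failed = [h for h in hosts if not probe(h)]
    if not failed:
        return 0, []
    # Assumed convention: the first line summarizes the failed hosts,
    # and any following lines carry detail for the alert message.
    lines = [" ".join(failed)]
    lines += [f"{h}: connection failed" for h in failed]
    return 1, lines


def main(argv):
    """Run the checks for the hosts in argv and print any failure report."""
    status, lines = run_monitor(argv)
    if lines:
        print("\n".join(lines))
    return status
```

To use this as a standalone script, finish the file with sys.exit(main(sys.argv[1:])) after importing sys. The probe parameter exists only so the logic can be exercised without touching the network.
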
Asynchronous Events
Support for asynchronous events communicated to the mon server. This is an open-ended protocol, like the monitor and alert scripts, so that you can trigger on anything. One obvious use is acting on SNMP traps. Traps generated by remote entities can be programmed to behave in the same manner as failures noticed by local polling monitors, so it is possible to build a distributed monitoring architecture. For example, remote monitoring domains (such as sites separated by slow WAN lines) can collect their own data locally and report significant events to a centralized location, such as a NOC.

Alerts
"Alert" scripts send a message or otherwise act on a failure that mon detects. These alerts, like the monitors, are not part of mon, and are easy to add. "Upalerts" are also supported; they trigger an alert when a service comes back up after being down for a long time.

Alert Management and Failure Handling
Failure of any monitor can trigger one or more alerts, sent to different people at different times. You can effectively construct "on call" schedules using this feature. For example, you can page all system administrators if a resource goes down before 8PM, but after 8PM page only Joe and send email to everyone else.

Many alert throttling controls are implemented.
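The on-call example above can be sketched in a few lines of Python. This is only an illustration of the routing decision, not mon's configuration syntax or implementation; the function alerts_for, the staff list, and the "page"/"email" labels are all made up for the example, and times are compared within a single day.

```python
from datetime import time

ADMINS = ["joe", "ann", "raj"]  # hypothetical staff list


def alerts_for(now, admins=ADMINS, on_call="joe", cutoff=time(20, 0)):
    """Return (method, recipient) pairs for a failure detected at `now`."""
    if now < cutoff:
        # Before the cutoff (8PM here): page every administrator.
        return [("page", a) for a in admins]
    # At or after the cutoff: page only the on-call person, email the rest.
    return [("page", on_call)] + [("email", a) for a in admins if a != on_call]
```

For instance, alerts_for(time(21, 30)) pages only "joe" and emails the other two administrators.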

Parallelization
Parallelizes the checking of services on different hosts or groups of hosts. For example, mon can ping your routers while it is also pinging your WWW servers. No queue postpones the scheduled testing of other services.
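The idea of running independent checks side by side can be illustrated with a small Python sketch. It uses a thread pool purely for demonstration; this page does not describe how mon itself runs monitors concurrently, and run_checks is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(checks):
    """Run each named check concurrently; return {name: result}."""
    if not checks:
        return {}
    # One worker per check, so no check waits in a queue behind another.
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}
```

Each value in checks is a zero-argument callable, for example a function that pings one group of hosts and returns True on success.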

Repetitive Alert Suppression
Repetitive alerts can be suppressed. For example, only send email once an hour if a service continues to fail. As an option, small, transient failures of a service may be ignored.
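A minimal sketch of this kind of throttling, in Python, might look like the following. It is an assumption-laden illustration rather than mon's actual logic: the class AlertThrottle and its parameters min_interval (seconds between repeated alerts) and min_failures (consecutive failures required before the first alert) are invented for the example.

```python
class AlertThrottle:
    """Decide whether a failed check should produce an alert."""

    def __init__(self, min_interval=3600, min_failures=1):
        self.min_interval = min_interval  # seconds between repeated alerts
        self.min_failures = min_failures  # failures needed before alerting
        self.consecutive = 0
        self.last_alert = None

    def record(self, ok, now):
        """Record one check result at time `now` (seconds); return True to alert."""
        if ok:
            # Service recovered: forget the failure streak and the last alert.
            self.consecutive = 0
            self.last_alert = None
            return False
        self.consecutive += 1
        if self.consecutive < self.min_failures:
            return False  # still a transient failure, ignore it
        if self.last_alert is not None and now - self.last_alert < self.min_interval:
            return False  # alerted recently, suppress the repeat
        self.last_alert = now
        return True
```

With min_interval=3600 and min_failures=2, a single failed check is ignored, the second consecutive failure alerts, and further failures stay quiet until an hour has passed.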

Dependencies
Inter-service dependencies and even correlation. For example, if the router between the monitoring host and your WWW server is down, HTTP won't work, so only an alert that the router is down is sent. This prevents the cascade of zillions of alerts that happens when some critical resource is not accessible. Dependencies can be understood as a hierarchy (a tree): when a failure occurs, the tree is traversed towards the node which has no unresolved dependencies. However, complex dependencies can be described using a generic graph, since the actual implementation does not require a hierarchical layout.
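The router example can be sketched as a small graph lookup in Python. This is a simplified illustration, not mon's dependency engine: root_causes and the depends_on mapping are hypothetical, and the sketch only looks at direct dependencies that are themselves currently failing.

```python
def root_causes(failed, depends_on):
    """Return the failed services that have no failed dependency of their own.

    `failed` is an iterable of service names that are currently down.
    `depends_on` maps a service name to the services it depends on; it may
    describe any graph, not just a tree.
    """
    failed = set(failed)
    roots = {
        service
        for service in failed
        if not any(dep in failed for dep in depends_on.get(service, ()))
    }
    return sorted(roots)
```

Given depends_on = {"http": ["router"], "smtp": ["router"]}, a simultaneous failure of http, smtp, and router yields only "router" as the service worth alerting on. If every failed service sits in a dependency cycle, this sketch reports nothing, so a real implementation would need a rule for that case.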

Flexible Configuration
The configuration file is very flexible and extensible. Hosts can be grouped together, and each host or group can have multiple services. An example configuration file and an m4-based example are available.

Client/Server Model
Interactive command-line, WWW-based, and SkyTel 2-Way alphanumeric pager-based clients query the server for status and history. The protocol is simple, and it is very easy to write clients of your own. Multiple authentication methods are supported (including PAM), along with per-user access control. A Perl module API can be used to query the server, so writing alternate interfaces is simple (such as one which takes advantage of WAP, the Wireless Application Protocol). At this point several WWW interfaces are actively maintained by different parties, each with its own reports and goals.

A demonstration of the mon.cgi web interface is also available.

View-based Status Reports
To help with large configurations, "views" can be generated to simplify reports for customers who do not need to know the status of all services being monitored. For example, a "network" view can be generated which includes the status of all networking gear, just as a "servers" view can show all info pertaining to servers. Views can be configured on a per-customer basis if needed, and customers have control over their own views.

Run-time Alert Acknowledgement and Disabling
A service failure can be acknowledged so that alerts are suppressed until the problem is fixed. This "ack" state is retrievable from the client interface so that users can see that support staff are working on the problem. Also, alerts for particular hosts, groups, or services can be temporarily disabled and re-enabled by the client, without stopping and restarting the server. If you're upgrading a particular server, you can disable the alert while you're doing the work, and re-enable it when you're done.

History
Keeps a historical list (queried by the clients) of both failures that were detected and alerts that were triggered.

Portability
The server and clients are written in 100% Perl 5, so there is nothing to compile. This should help portability.
trockij@linux.kernel.org