Polling Graphite with Nagios

2012-05-31 20:37:00 by jdixon

I'm a big proponent of using Graphite as the source of truth for monitoring systems where polling host and service checks have traditionally been the norm. Realistically, this will require a long and gradual shift in philosophy by the larger IT community. Until then, we can still use Nagios and Graphite in tandem to power more insightful checks of our application metrics.

There are actually a few different "check_graphite" scripts out there. The first one I saw announced publicly was Pierre-Yves Ritschard's check-graphite project. Shortly afterwards I published my own check_graphite script. Pierre's version is smaller but doesn't appear to automatically invert the thresholds (e.g. if critical is lower than warning). Otherwise you should be fine using either script; the remaining differences are mostly isolated to implementation details and default values. Since this is my blog, I'm going to use my script for this example. ;-)

Before we look at a sample check, let's add our custom command in commands.cfg. Note that the -m (metric) option is enclosed in quotes. This allows us to do some neat stuff with complex targets and wildcards.

define command {
  command_name    check_graphite
  command_line    $USER32$/check_graphite -u https://graphite.example.com -m "$ARG1$" -w $ARG2$ -c $ARG3$ 2>&1
}

OK, let's take a look at a basic service check that we use to monitor the number of metrics being received/written by Carbon. Because the critical threshold is lower than the warning threshold, the script understands that "less is bad".

define service {
    service_description   Graphite Carbon Health Check
    hostgroup             graphite
    check_command         check_graphite!carbon.agents.*.committedPoints!350000!300000
}
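
If you want to sanity-check the plugin outside of Nagios first, you can run it by hand with the same arguments. The path below is just an example of wherever your $USER32$ macro points on your system:

/usr/local/nagios/libexec/check_graphite -u https://graphite.example.com \
  -m "carbon.agents.*.committedPoints" -w 350000 -c 300000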

Now here's a slightly more advanced example. This is a check we use at $DAYJOB to monitor for day-to-day spikes in one of our internal services. As you can see, any valid Graphite target definition will simply be passed straight through to Graphite's render API.

define service {
    service_description   XYZ Service - Percent Change over Previous Day
    host_name             heroku
    check_command         check_graphite!offset(scale(divideSeries(custom.xyz.foobar,timeShift(custom.xyz.foobar,'1d')),100),-100)!15!25
}
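
To unpack that target: divideSeries compares the metric against a copy of itself shifted back one day with timeShift, scale multiplies the resulting ratio by 100, and offset subtracts 100, leaving the day-over-day percent change. If you'd rather eyeball the numbers before alerting on them, you can also query the render API directly for JSON output; the host and time window here are just placeholders:

curl -sG "https://graphite.example.com/render" \
  --data-urlencode "target=offset(scale(divideSeries(custom.xyz.foobar,timeShift(custom.xyz.foobar,'1d')),100),-100)" \
  --data-urlencode "format=json" \
  --data-urlencode "from=-15min"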

There's a wide range of possibilities for this sort of check. In particular this can be a really powerful way to monitor your business processes and projections.

Comments

at 2012-06-01 10:26:47, Matt Simmons wrote in to say...

I like this.

There's a lot of value in having a single unified source of information that something like Nagios checks, but I do worry about how dynamic this is. It's very much a multi-step process whenever new services are added to the network. You've got to add them to Graphite, then you've got to add them to Nagios.

The other issue with things like this is alert resolution. If you're drawing your alerts from Graphite, then things in Nagios like retry_interval become useless without excessive values in check_interval (or very aggressive reporting in Graphite), which leads to a long lead time to alerts.

You could do up/down checking with Nagios directly against the servers, but then you're back to not having a central source of information and to duplicating checks, since Nagios is returning performance data as well.

It's just not an "easy" problem.

at 2012-06-01 10:33:44, Jason Dixon wrote in to say...

@Matt - Agreed wrt adding stuff twice. This example is intended to get people thinking about the problem and initiating the shift towards metrics-driven monitoring. Ideally there would be a way to associate thresholds with their metrics (inside the field) so you wouldn't need an active service like Nagios with separate configuration data. This is a problem we're working on actively within $DAYJOB.

at 2012-07-02 21:12:22, xkilian wrote in to say...

Great post. In my exchanges with Jelle Smet, he has also been aiming to push some of that dynamic processing out to the edge so that data can go to both the datastore and the threshold processing in parallel. Having a Shinken/Nagios instance query the datastore is not necessarily bad, as combined metrics are available that would not be otherwise.

The wildcards also make adding and removing metrics a more dynamic affair. Shinken/Nagios does not need to know that the cluster is composed of 5 servers instead of 6, only that the service is being provided to user/admin satisfaction.

Shinken is one avenue that can possibly interact with Graphite quickly enough to reduce the delay between reality and notification.

at 2013-07-17 13:00:45, Jim Sander wrote in to say...

This may be off topic, but in trying to gather the list of available/collected metrics in Graphite, it would be nice to have a programmatic/CLI method. I've been told it's not possible to list Graphite metrics outside of the UI.

at 2013-07-17 13:28:36, Jason Dixon wrote in to say...

@Jim - You can always curl against /metrics/index.json for a dump of the current metrics. This is what https://github.com/obfuscurity/therry uses to keep a cache of metrics in memory for searches via API.
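
For example, something along these lines will dump the index and filter it from the command line (substitute your own Graphite host, of course):

curl -s https://graphite.example.com/metrics/index.json | python -m json.tool | grep carbon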

at 2014-02-10 05:10:31, Nir Dothan wrote in to say...

I've had a long struggle getting it to work on RHEL 6.4. The problem was that the system Ruby version was 1.8.7, and rest-client required something newer. So I tried rvm, then installed Ruby 2.1.0 from source, but Nagios kept ignoring it and falling back to the system Ruby, failing to load rest-client. Eventually I discovered that Nagios was sourcing /etc/init.d/functions, which overrides PATH, so I prepended /usr/local/bin to $PATH in /etc/init.d/nagios and it now works.

Can any of you guys recommend a cleaner solution? I'm sure that something better can be accomplished with rvm, but I know nothing about rvm.
