I had been wanting to try Sensu out for quite a while. If I’m not mistaken, I think the first time I heard about someone using it in production was back at the 2014 OpenStack summit in Paris where fellow OpenStack Operator Joe Topjian told me he had been using Sensu.
A few weeks back, the RDO community formalized that we needed to improve the monitoring for our build, packaging, mirror and CI infrastructure. The timing was such that the TripleO community had recently been working towards integrating Sensu to monitor their OpenStack deployments.
I decided to pick Sensu to monitor the RDO infrastructure. This gave me the opportunity to leverage their work to help me get started quickly but also provide feedback and contributions back to help strengthen the integration.
Before we start talking about notifications in Sensu, let’s make sure we are on the same page about some of the basics.
In a nutshell
I’m not new to the world of monitoring servers and their services. I’ve been managing and monitoring servers for a good part of the last decade with a mix of Nagios, Shinken, Icinga with some Graphite, Munin and Cacti on the side.
Sensu, for better or for worse, is nothing like the traditional monitoring software so it deserves some explaining.
The Sensu architecture flow in a nutshell:
Client -> Subscription <- Check -> Event -> Filter -> Mutator -> Handler
- Client: A client is a server running a check. Usually this is either a server you are monitoring or a server monitoring another server/service.
- Subscription: Clients and checks are matched together through a subscription. This can be fairly analogous to a template in other monitoring software since you decide which check to run in a subscription and then you subscribe clients to it.
- Check: A check, for checking things like resource usage or anything you can think of.
- Event: An event is generated every time a check is done. It’s the result of the check, positive or negative, there’s always going to be one.
- Filter: A filter is a way to decide which event is going to be triggering notifications (handler execution).
- Mutator: A mutator is run after a filter and modifies the event data. This can be useful if your handler expects additional information not usually found in the event.
- Handler: A handler is actually anything that receives an event data and does something with it. This could be an e-mail notification or a script that restarts things. Anything.
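To make the flow above a bit more concrete, here is a minimal Python sketch of what happens to a single event once a check has run. This is not Sensu code: the function names (only_failures, add_summary, process_event) and the event layout are assumptions made up for illustration.

```python
def only_failures(event):
    """Filter: let the event through only when the check status is non-zero."""
    return event["check"]["status"] != 0


def add_summary(event):
    """Mutator: return a copy of the event with an extra field for the handler."""
    mutated = dict(event)
    mutated["summary"] = f'{event["client"]["name"]}/{event["check"]["name"]}'
    return mutated


def process_event(event, filters, mutator, handler):
    """Run one event through filter -> mutator -> handler.

    Returns the handler's result, or None when a filter stopped the event.
    """
    for keep in filters:
        if not keep(event):
            return None
    return handler(mutator(event))
```

Every check result becomes an event, but only the events that survive all filters ever reach a handler.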
I encourage going through the official documentation to get a better understanding of each component.
Starting with something simple
In Nagios land, you might have something very straightforward that looks like this:
# Check every 5 minutes
# If there is a failure, re-check every minute
# Consider this a legitimate problem after 5 check attempts
# Notify contact_groups about a problem every 30 minutes
define host {
    host_name               bogus-host
    address                 192.168.1.254
    check_command           check-host-alive
    check_interval          5
    retry_interval          1
    max_check_attempts      5
    contact_groups          admins
    notification_interval   30
}
Let’s try to translate this very simple Nagios configuration into something consumable by Sensu.
client.json
Nothing special here:
{
  "client": {
    "name": "bogus-host",
    "address": "192.168.1.254",
    "subscriptions": [
      "default"
    ]
  }
}
check-host-alive.json
{
  "checks": {
    "check-host-alive": {
      "command": "/usr/local/bin/check-host-alive",
      "interval": 60,
      "occurrences": 5,
      "subscribers": [
        "default"
      ],
      "handlers": [
        "notification_script"
      ]
    }
  }
}
Let’s dissect the check configuration above.
The command field is fairly self-explanatory. A command of your choice will run and the status is decided based on the exit code of the script, not unlike Nagios plugins:
0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN
This effectively means you can re-use your Nagios scripts if you want.
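As a rough illustration of that exit code convention, here is a small Python sketch of a threshold check. The script name (check-threshold), its arguments and the helper names are hypothetical, they are not part of Sensu or of any plugin package.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]


def evaluate(value, warning, critical):
    """Map a measurement to a Nagios-style exit code (a higher value is worse)."""
    if value is None:
        return UNKNOWN
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def main(argv):
    # Hypothetical usage: check-threshold VALUE WARNING CRITICAL
    try:
        value, warning, critical = (float(arg) for arg in argv[1:4])
    except ValueError:
        print("UNKNOWN: usage: check-threshold VALUE WARNING CRITICAL")
        return UNKNOWN
    code = evaluate(value, warning, critical)
    # The first line of output is what ends up as the check output.
    print(f"{LABELS[code]}: value is {value}")
    return code
```

To use this as an actual check command, the script would end with sys.exit(main(sys.argv)) so that the return value becomes the process exit code.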
interval is another built-in check attribute and it allows you to configure the frequency at which the check is run.
There is nothing similar to the retry_interval logic from Nagios-like configuration, so, for the sake of simplicity, let’s forget about that and just run the check every 60 seconds.
Then, there is the occurrences field in the definition of the check.
Remember that a check always generates an event. In this event data, there is another occurrences field that contains the amount of times the check returned the same status.
By default, Sensu will filter events and prevent them from reaching handlers unless the event occurrences are equal or greater than the check occurrences. This is our way of doing the same thing as max_check_attempts from Nagios.
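The comparison described above boils down to very little. Here is a Python sketch of it, where reaches_handlers is a hypothetical helper (not a Sensu function) and the default of 1 for the check occurrences is my assumption:

```python
def reaches_handlers(event_occurrences, check_occurrences=1):
    """True when the event has repeated often enough to be passed to handlers."""
    return event_occurrences >= check_occurrences


# A check that keeps failing, with "occurrences": 5 in its definition.
handled = [n for n in range(1, 9) if reaches_handlers(n, check_occurrences=5)]
print(handled)  # [5, 6, 7, 8]
```

Note that nothing stops the 6th, 7th and 8th occurrences from being handled as well.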
After that we have the subscribers, in this case default. With the matching client.json subscription, this means this specific check will be configured for that client.
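The matching itself can be pictured as a simple set intersection. The following Python sketch only mirrors the two configuration files above, checks_for_client is a hypothetical helper and not how Sensu implements subscriptions internally:

```python
def checks_for_client(client, checks):
    """Return the names of the checks whose subscribers overlap the client's subscriptions."""
    subscriptions = set(client["subscriptions"])
    return sorted(
        name
        for name, check in checks.items()
        if subscriptions & set(check.get("subscribers", []))
    )


client = {"name": "bogus-host", "subscriptions": ["default"]}
checks = {
    "check-host-alive": {"subscribers": ["default"]},
    "check-database": {"subscribers": ["db-servers"]},  # made-up second check
}
print(checks_for_client(client, checks))  # ['check-host-alive']
```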
The handlers are scripts that receive event data on STDIN and then do something with that data.
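For illustration, a handler along the lines of notification_script could look like the Python sketch below. The field names it reads (client.name, check.name, check.status, check.output and occurrences) are my assumption of what the event JSON contains, so verify them against real event data before relying on this.

```python
import json
import sys


def format_notification(event):
    """Build a one-line summary from the (assumed) event fields."""
    client = event["client"]["name"]
    check = event["check"]
    return (
        f'{client}/{check["name"]} status={check["status"]} '
        f'occurrences={event["occurrences"]}: {check["output"].strip()}'
    )


def main(stream=sys.stdin):
    # The event arrives as a single JSON document on standard input.
    event = json.load(stream)
    # A real handler would send an e-mail, post to a chat room, etc.
    print(format_notification(event))
```

Calling main() at the bottom of the script is all that is left to turn this into something a handler definition can execute.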
The hard(er) part: Notification interval
With the above Sensu configuration, we have something fairly similar to our Nagios config.
The check will run every minute on the client and it will trigger the handler once it has returned a bad status 5 consecutive times. After that, though? It will trigger the handler for each and every subsequent occurrence. If your handler sends an e-mail, you’ll receive an e-mail every minute from the fifth bad status onward. Not exactly awesome.
The closest thing Sensu has to Nagios’ notification_interval parameter is mathematics. Really.
There are two ways to approach this with what Sensu provides out of the box:
- The refresh parameter
- Built-in filters capability
Math: refresh
Sensu checks have a refresh parameter that would seem to help in doing this but I have not been able to get things working the way I wanted to.
From the documentation:
Time in seconds until the event occurrence count is considered reset for the purpose of counting occurrences, to allow an event for the check to be handled again. For example, a check with a refresh of 1800 will have its events (recurrences) handled every 30 minutes, to remind users of the issue.
In theory, you could do something that looks like this:
# Same config as above, just with an additional "refresh" parameter.
{
  "checks": {
    "check-host-alive": {
      "command": "/usr/local/bin/check-host-alive",
      "interval": 60,
      "occurrences": 5,
      "refresh": 1800,
      "subscribers": [
        "default"
      ],
      "handlers": [
        "notification_script"
      ]
    }
  }
}
So, in this configuration we still have a check that will run every minute and won’t trigger your handler unless it has 5 consecutive bad statuses. Then, according to the documentation, the handler should only be triggered once every 30 minutes with the refresh parameter.
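Spelled out, the arithmetic implied by that description looks roughly like the Python sketch below. It is only one possible reading of the documentation, not Sensu's implementation, and should_handle is a hypothetical helper.

```python
def should_handle(event_occurrences, check_occurrences, interval, refresh):
    """One reading of 'occurrences' + 'refresh': first alert, then periodic reminders."""
    if event_occurrences < check_occurrences:
        return False  # not enough consecutive bad statuses yet
    if event_occurrences == check_occurrences:
        return True  # first notification
    every = refresh // interval  # 1800 // 60 == 30 check runs between reminders
    if every == 0:
        return True  # refresh shorter than the interval: handle every event
    return (event_occurrences - check_occurrences) % every == 0


handled = [n for n in range(1, 70) if should_handle(n, 5, 60, 1800)]
print(handled)  # [5, 35, 65]
```

With a 60 second interval that would mean a notification on the 5th bad status and a reminder every 30 minutes after that.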
It didn’t really work for me. Maybe it was just me, but I kept getting spammed by notifications regardless. Maybe this is only used by specific plugins? I don’t know.
I gave up on using this parameter early on anyway because it meant doing different math for different checks and it was just awkward. I say different math for different checks because not all of my checks were as simple as “interval: 60, occurrences: 5”. Some were set to longer intervals or lower and higher occurrences count – and that meant having different refresh parameters for every check.
Math: filters
Another way that initially seemed very promising was the built-in filtering capability.
You can do something like this:
{
  "filters": {
    "recurrences": {
      "attributes": {
        "occurrences": "eval: value == 1 || value % 60 == 0"
      }
    }
  }
}
So, in that example, we check that the event occurrences value is either 1 (i.e., the first bad status event) or divisible by 60 (i.e., 60, 120, etc.).
This means doing more math even beyond the actual filter because you need to build your filters based on your occurrences and intervals to get the notification interval just right.
For example, I could have a check interval of 60 and an occurrences of 5. So, to get notifications every 30 minutes and on the 5th occurrence, I actually need:
# We check every 60 seconds, filter on 5th occurrence and on every 30th event (30 minutes)
{
  "filters": {
    "recurrences-5-30": {
      "attributes": {
        "occurrences": "eval: value == 5 || value % 30 == 0"
      }
    }
  }
}
Okay, not so bad, I guess, but then you have to write one filter per specific check interval and occurrences combination to get your 30-minute windows.
If I have a check with an interval of 120 but an occurrences of 3, I need another filter to get notifications every 30 minutes:
# We check every 120 seconds, filter on 3rd occurrence and on every 15th event (30 minutes)
{
  "filters": {
    "recurrences-3-15": {
      "attributes": {
        "occurrences": "eval: value == 3 || value % 15 == 0"
      }
    }
  }
}
I don’t need to tell you that this becomes a pain to manage.
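To show where the numbers in those filters come from, here is a small Python sketch that derives them from the check interval, the check occurrences and the desired notification interval. The recurrence_filter helper is hypothetical, it just reproduces the filter definitions above.

```python
import json


def recurrence_filter(interval, occurrences, notify_every=1800):
    """Build a filter definition for one interval/occurrences combination."""
    if notify_every % interval != 0:
        raise ValueError("notify_every must be a multiple of the check interval")
    every = notify_every // interval  # number of check runs per notification window
    name = f"recurrences-{occurrences}-{every}"
    expression = f"eval: value == {occurrences} || value % {every} == 0"
    return {"filters": {name: {"attributes": {"occurrences": expression}}}}


# interval of 120 seconds, occurrences of 3, notifications every 30 minutes
print(json.dumps(recurrence_filter(interval=120, occurrences=3), indent=2))
```

Generating the definitions saves the manual math, but you still end up with one filter per combination.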
Another limitation of filters is that you can only compare a field against itself. So for example you can only filter the event occurrences against the actual event’s occurrence and that’s it. You couldn’t, for example, filter the event occurrences against the check occurrences.
Something like this could be convenient:
{
  "filters": {
    "recurrences": {
      "attributes": {
        "occurrences": "eval: value == event[:check][:occurrences] || value % 15 == 0"
      }
    }
  }
}
But alas, it is not possible right now. It looks like bringing these limitations to the attention of the Sensu dev team led them to start working on something to make this kind of comparison possible.
Another solution: Ruby
I shared my pains and my thought process with the Sensu community on #sensu on freenode, github and the mailing list.
I eventually talked with folks from Yelp who essentially took the entire logic of notification outside of Sensu and into custom handlers. I was pointed to a custom filter extension they did and it was very enlightening to see how flexible Sensu can be – if you can write Ruby to save your life.
You can write completely custom filter extensions that will be fed the entire event data and then, with that data, your filter can decide to continue the process (send this to the next filter or to the handler) or stop the event. Such extensions can be dropped directly in /etc/sensu/extensions and they will be loaded by sensu-server when it is reloaded. There is no need to declare a filter configuration in a json file.
There are apparently some quirks associated with that and I haven’t figured them out exactly yet. For example, in my experience, adding such filters seems to take out the default basic filters, such as the ones that stop processing silenced (stashed) events or the ones based on the check occurrences.
With some work, I eventually came up with two filters:
- filter_disabled_or_silenced re-creates the same default behavior of inhibiting notifications on silenced events or disabled checks.
- filter_check_retry_occurrences is about controlling the frequency of the notifications and deserves some explanation.
On my checks, I have a custom field, retry_occurrences.
So, basically, I have the check occurrences telling me when the first notification should occur and then retry_occurrences telling me to trigger another notification every N occurrences. This is similar to what I tried (and failed) to achieve with the built-in filtering I wrote about above, which did not work because basic filters are unable to compare two different fields.
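The decision logic behind that kind of filter can be sketched in a few lines. The real filter is a Ruby extension, so the Python below is only an illustration of the idea, with assumptions of my own: should_notify is a hypothetical name, repeats are counted from the first notification, and a check without retry_occurrences only notifies once.

```python
def should_notify(event):
    """Decide whether an event should reach the handler, based on two check fields."""
    check = event["check"]
    first = int(check.get("occurrences", 1))        # when to notify first
    retry = int(check.get("retry_occurrences", 0))  # custom field: how often to repeat
    seen = int(event["occurrences"])                # consecutive identical results so far

    if seen < first:
        return False
    if seen == first:
        return True
    if retry <= 0:
        return False  # assumption: no repeat notifications without retry_occurrences
    return (seen - first) % retry == 0


event = {"check": {"occurrences": 5, "retry_occurrences": 30}, "occurrences": 35}
print(should_notify(event))  # True
```

Because both values live on the check itself, a single filter covers every interval and occurrences combination.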
This is harder than it should be
The built-in mechanisms to control handler execution are awkward, limited and fairly opaque.
However, Sensu is extremely flexible and allows you to insert yourself directly inside the execution pipeline. From another perspective, this took way too long to figure out and I have flagged the topic of notification control to the devs as an important missing piece of documentation.
I’m not happy that I had to write Ruby to get things to work the way I wanted to – I hate Ruby. Monitoring software should appeal and be accessible to system administrators and not every one of them can write Ruby.
I really hope this post is useful to people wanting to get started with Sensu and that the time I spent figuring this out is worth it to you. Keep in mind that I am by no means a Sensu expert. If you have any comments, suggestions or feedback, please let me know in the comments below.