Nagios vs Sensu vs Icinga2
Choosing a suitable monitoring framework for your system is important. If you get it wrong, you will most likely end up rewriting your checks and setting up something different, at great cost. Recently I looked into a few monitoring frameworks for a system and came to a few conclusions, which I'll share below.
Background
The System
At the time of this investigation, the system (which has a microservices architecture) was in the process of being "productionised". It had no monitoring in place and had never been supported in production. The plan was to introduce monitoring so that it could be supported and monitored 24x7 with the hope of achieving minimal downtime.
Warning - I am biased
Before we get started, I have to acknowledge a few biases I have. I have worked with nagios in the past and found it to be a bit of a pain. However, this was probably because we created our checks in puppet, which added an extra layer of complexity to an already steep learning curve. I decided to re-evaluate nagios because (a) we'd be creating our monitoring checks directly and (b) nagios has moved on since.
I think I might also be biased to favour newer technologies over older ones, for no better reason than that I'm currently at a startup which is working with a lot of new technologies.
Requirements
As follows:
- Highly scalable in terms of:
  - handling complexity (presenting a large number of checks in a way that's easy to understand).
  - handling load (can support lots of hosts with lots of checks).
- Secure.
- Good UI with:
  - Access to historical alerts.
  - Able to switch off alerts temporarily, possibly with comments e.g. "Ignoring unused failing web node - not causing an issue".
- Easy to extend/change.
  - Ability to define custom checks.
  - Ability to add descriptive text to an alert e.g. "If this check fails it means users of our site won't be able to..."
  - Easy to adjust alarm thresholds.
- Good support for check dependencies.
  - This is related to the "handling complexity" requirement above. Ideally the monitoring system will help the user separate cause from effect. When you have 100s of alerts firing it becomes hard to establish the underlying cause (see earlier post on this). Without alert dependencies, the more alerts you add the more you increase the risk of confusion during an incident. This is hugely important for your users when it comes to fixing a problem in 5 minutes instead of 30!
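To make the dependency requirement a little more concrete, here is a minimal sketch in Python of what "separating cause from effect" could look like. It is not taken from any of the frameworks below. The check names and the `depends_on` mapping are hypothetical, and a real monitoring system will model dependencies in its own way.

```python
from typing import Dict, List, Set


def root_causes(failing: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failing checks that have no failing dependency, direct or indirect."""

    def has_failing_ancestor(check: str, seen: Set[str]) -> bool:
        for parent in depends_on.get(check, []):
            if parent in seen:
                continue  # already visited; also guards against dependency cycles
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return {check for check in failing if not has_failing_ancestor(check, {check})}


# Hypothetical checks: each key depends on the checks in its list.
depends_on = {
    "web-login": ["api-health"],
    "api-health": ["db-connectivity"],
    "report-job": ["db-connectivity"],
}
failing = {"web-login", "api-health", "report-job", "db-connectivity"}

# Four alerts are firing, but only one of them is worth looking at first.
print(sorted(root_causes(failing, depends_on)))
```

With dependencies declared, the four firing alerts collapse to the single database check. Without them, all four would compete for attention during an incident.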
Nagios
Nagios is very popular, can do everything, but comes with several drawbacks. For my proof of concept I extended Brian Goff's docker-nagios image.
Pros:
- Very popular so lots of support.
- Huge number of features.
- Good documentation.
Cons:
- Steep learning curve due to its number of features. This applies to both navigating the UI and writing checks.
- Creating check dependencies is cumbersome as you have to reference checks via their service_description field. This means you either use the description like an ID (i.e. not a description) or you duplicate your description in all the places you reference (depend on) your checks.
- Running checks more often than once every 60 seconds comes with a "proceed at your own risk" disclaimer in the documentation: "I have not really tested other values for this variable, so proceed at your own risk if you decide to do so!" See here.
- UI feels old (at least it does to me). I remember being very frustrated with the use of frames for the dashboard, which means it's hard to send people links, and if you hit F5 to refresh it takes you back to the homepage.
Sensu
Sensu is a lightweight framework that's simple to extend and use. I used Hiroaki Sano's sensu docker image to get my proof of concept up and running.
Pros:
- Has a fantastic UI. Feels nice and simple.
- Has support for dashing https://github.com/mrichar1/dashing-sensu
- Has been dockerized: https://github.com/hiroakis/docker-sensu-server
- Big list of community plugins: https://github.com/sensu/sensu-community-plugins/tree/master/plugins
- Can run any monitoring plugin (formerly known as nagios plugin)
- Checks can be defined on server and run on client OR defined on client and run on client (standalone).
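As a rough illustration of what "monitoring plugin" means in the list above: a plugin is just a program that prints a line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is a minimal custom check in Python that follows that convention. The disk-usage metric, the label and the 80/90 thresholds are assumptions chosen for the example, not something from my proof of concept.

```python
import shutil

# Exit codes used by the monitoring plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warn, crit, label="disk_used_percent"):
    """Map a measured value onto a plugin exit code and a one-line message."""
    if value is None:
        return UNKNOWN, f"UNKNOWN - could not read {label}"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label}={value} (>= {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {label}={value} (>= {warn})"
    return OK, f"OK - {label}={value}"


def run_check(path="/", warn=80, crit=90):
    """Measure disk usage for `path`, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        percent = round(100 * usage.used / usage.total, 1)
    except OSError:
        percent = None  # reported as UNKNOWN rather than crashing the check
    code, message = evaluate(percent, warn, crit)
    print(message)
    return code


# A real plugin script would end with: sys.exit(run_check())
```

Because the thresholds are plain arguments, adjusting an alarm threshold is a one-line change in the check definition rather than a change to the plugin itself.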
Cons:
- UI has a feature called stash which I don't understand and doesn't seem to be documented.
- Dependencies can be configured, however they seem to have little effect on the dashboard... see issue I raised on github here.
- Documentation is great as a beginner's guide and walkthrough. I learnt the basics very quickly. However, I quickly found the need for a specification page which detailed exactly what the json could/could not contain. This will be fixed soon hopefully: https://github.com/sensu/sensu-docs/issues/192
- Could not associate descriptive text with my checks e.g. "This checks for connectivity to the database which is required for..."
Icinga2
Originally a fork of the nagios project (and now a complete re-write), this framework has a huge number of features and a good-looking dashboard. I used Jordan Jethwa's icinga2 docker image.
Pros:
- Has good support for alert dependencies, and they are reflected in the dashboard.
- Objects (checks, dependencies etc etc) can be created using expressions with conditionals which reduces the need for boilerplate copy+paste config.
- Has two good-looking UIs (I have only used icinga-web).
- Can run any monitoring plugin (formerly known as nagios plugin)
- Has native support for graphite (not used graphite yet but it might be on the cards)
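The point above about creating objects with expressions and conditionals is easier to see with an example. The sketch below is a Python analogy only, not icinga2 configuration syntax. The `Host`, `Rule` and `expand` names are hypothetical and exist purely to show the idea of one conditional rule replacing a copy of the same check for every host.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Host:
    name: str
    vars: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    check_name: str
    command: str
    applies_to: Callable[[Host], bool]  # the "conditional" part of the rule


def expand(rules: List[Rule], hosts: List[Host]) -> List[Dict[str, str]]:
    """Generate one check per (host, rule) pair where the rule's condition holds."""
    return [
        {"host": host.name, "check": rule.check_name, "command": rule.command}
        for host in hosts
        for rule in rules
        if rule.applies_to(host)
    ]


# Hypothetical inventory: two web nodes and one database node.
hosts = [
    Host("web-1", {"role": "web"}),
    Host("web-2", {"role": "web"}),
    Host("db-1", {"role": "db"}),
]
rules = [
    Rule("http", "check_http", lambda h: h.vars.get("role") == "web"),
    Rule("ping", "check_ping", lambda h: True),
]

for check in expand(rules, hosts):
    print(check)
```

Two rules produce five checks here, and adding a third web node would need no new check definitions at all. That is the kind of boilerplate saving I was hoping for.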
Cons:
- Found the documentation hard to understand at first. This is related to the steep learning curve.
- Can assign notes / free text to alerts, but the dashboard seems to only present them at quite a low level. I couldn't find a way of customising the dashboard to display my "notes". It may be possible to customise this at the dashboard level (via some feature I missed or that is perhaps undocumented).
Chosen Framework: icinga2
Sensu was sadly discarded because we felt there was a risk it would not scale in terms of handling the future complexity of the system. This was mainly due to the apparent lack of support for dependencies. The gaps in the documentation also made it feel like it wasn't quite ready to be adopted. I'm definitely hopeful for sensu's future and look forward to seeing how it develops... it's definitely one to watch.
Nagios vs Icinga2.
They both have:
- the same number of compatible plugins.
- lots of features at the cost of a high learning curve.
- support for dependencies (so both should scale well in terms of complexity).
Icinga2 won out because:
- it has a nicer UI, which feels more responsive.
- dynamically creating objects and their relationships with conditionals should (I think) result in less boilerplate and copy-pasted config, which I have seen with nagios in the past.
Excellent post. I've now worked with all three, and agree with you on almost all points. I'm going with Icinga2 as well, although I don't like icingaweb2 at all, and am replacing it with Thruk which is very obviously inspired by the old Nagios interface, but fixes all the issues, such as the lack of direct links with the frames.
Completely agreed that Sensu is one to watch, but the lack of complete functionality in any one of the three UIs that I worked with (uchiwa and sensuadmin being the other two) was really frustrating. The API access was pretty cool, though.
Greetings from Seattle. We actually went along a similar path, and ended up migrating away from sensu to icinga2. We are also using ansible for all the things!
I did the same thing. Sensu is cool, but it just lacks the maturity that we needed.
ReplyDeleteRan across this post the other day: http://thehackernews.com/2015/12/how-to-hack-instagram.html
Had to laugh when I saw it was a flawed Sensu Admin which allowed him in and gave him all that access. Even gladder I went with Icinga2, although any software needs to be patched regularly. Also wonder why they had it publicly accessible.
"UI has a feature called stash which I don't understand and doesn't seem to be documented."
Same stash as in Git. It temporarily hides an event from handling. Like "Ok, I know about that alarm but I will fix this later. Don't bother me now."
Icinga/Nagios is not suited for cloud infrastructures with servers changing at every moment. Whenever the infrastructure changes, Icinga/Nagios has to be reconfigured and restarted. Although this can be automatically done with configuration management tools, it is still not a clean solution, since provisioning runs are performed only in certain intervals. The monitoring tool should be able to handle a changing environment on its own in real-time.
Thank you for sharing your views on the monitoring options