September 1st, 2015 by Steen Gundersborg

I’ve been experimenting with how we can beef up our monitoring of everything Java. In this post I will only discuss the collection, storage and visualisation of various metrics. Alerting, trending and whatever else falls under the definition of monitoring I will blatantly ignore.

Setting the stage

I have a list of criteria that is going to drive my selection of components. The more off-the-shelf components I can configure and use, the better. Moreover, since we’re trying out Docker, I’d like to wire everything up using Docker Compose. So if there’s already a Docker image for a specific component, terrific!

More or less everything we develop or use runs on Java. We develop a lot of micro-services using DropWizard and use lots of open source components (Solr, HBase, WildFly etc.), so being able to pull metrics from JMX is a must. Support for pulling or accepting pushed OS metrics in some way is also a must. If the components we create ourselves can push metrics directly, it’s an added bonus, but otherwise not a dealbreaker, since JMX will be the fallback.

An important criterion for me is that metrics should not need to be predefined before the storage backend will accept and store them. Some of our existing solutions use Zabbix, where this is a requirement. That might be fine for a single, isolated system, but becomes a nuisance if the monitoring solution is shared across systems. More importantly, adding new metrics should be a breeze. Adding metrics for all the REST endpoints of a new service alone is going to be annoying and error-prone if you need to define them up front.

Since we are on the subject of Zabbix, it would be nice if the frontend supported multiple data sources, including Zabbix, so we can visualise metrics from existing systems in the same frontend.

Finally, it is imperative that we be able to import and export metric data.

With the above in mind, let’s sum up the criteria:

  1. Off-the-shelf components, where possible
  2. Everything running as Docker containers, orchestrated with Docker Compose
  3. Must support pulling and storing JMX metrics
    1. Supporting metrics pushed from components is an added bonus
  4. Predefining metrics must not be a requirement for accepting/storing them
  5. Bonus if the GUI supports multiple data sources, including Zabbix
  6. Support for import/export of data

Selecting the pieces

Conceptually, we need three components:

  1. Timeseries database
  2. JMX metrics collector
  3. Frontend

First of all, we need a timeseries database. Second, we need something generic to collect metrics from JMX and push them to the database. Third and finally, we need a frontend whose only responsibility is visualising the collected data.

Now, let’s have a look at the chosen components.

Frontend: Grafana

Grafana is a beautiful, multi-datasource graphing frontend for visualising data. There is full support for Graphite, OpenTSDB and InfluxDB, as well as initial support for KairosDB as datasources. There’s even a query editor for each data source. With a plugin, Zabbix is also supported. It has rich graphing, templated dashboards and queries as well as annotation support. There’s also a docker image for it.

Timeseries database: KairosDB + Cassandra

I have previously tried InfluxDB in the 0.8.8 version (also using a Docker image), but I ended up having some issues with it. In case of an unclean process exit, I couldn’t start it up again unless I wiped the data. The frequency of the unclean exits was a side effect of the image I used and not really InfluxDB per se, but the issue would be the same in a process, OS or host crash. Moreover, as our data volume grew, the database would sometimes stop responding. The only fix seemed to be to restart it (and wipe the data, as per the first issue).

I am aware that InfluxDB is now out in version 0.9.x, which is quite a change from 0.8.x. There is a new storage backend (BoltDB) and support for importing/exporting data. Clustering support has gotten a major rework, but is still in alpha. I don’t think we really need clustering performance-wise, but it would be nice for high availability.

On the flip side, there are a lot of breaking changes from version 0.8.x to 0.9.x, so sticking to InfluxDB still means all 0.8.x dashboards would need to be redone. Based on that, I thought I’d look around to see what else is available.

There’s a fairly new kid on the block called KairosDB, which uses Cassandra as the backend storage. Some people are running quite large clusters with terabytes of data, as this example shows, so it’s more than sufficient for our needs. It supports the OpenTSDB telnet format, REST, Graphite and Carbon (using a plugin), so it covers most of the protocols currently used for sending metrics. On top of that, it supports import/export.
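To make the two most relevant ingestion formats concrete, here is a minimal Python sketch that formats a single data point both as an OpenTSDB-style telnet `put` line and as a JSON body for the REST interface. The metric name, tag and timestamp are made up for illustration, and the JSON layout reflects my understanding of the KairosDB REST API (a POST to `/api/v1/datapoints`), so check it against the documentation for the version you run.

```python
import json


def telnet_put_line(metric, timestamp, value, tags):
    """Format one data point as an OpenTSDB-style telnet 'put' command."""
    # Tags are sent as space-separated key=value pairs after the value.
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_part}\n"


def rest_payload(metric, timestamp, value, tags):
    """Build a JSON body carrying the same data point for the REST interface."""
    return json.dumps([{
        "name": metric,
        "datapoints": [[timestamp, value]],
        "tags": tags,
    }])


# Hypothetical metric and tag, for illustration only.
line = telnet_put_line("jvm.memory.heap.used", 1441094400000, 42, {"host": "web01"})
body = rest_payload("jvm.memory.heap.used", 1441094400000, 42, {"host": "web01"})
```

Note that neither format requires the metric to be declared beforehand, which is what criterion 4 asks for.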

JMX metrics collector

There is a tool called jmxtrans, which is multithreaded and can poll any number of Java processes for JMX data. It has output writers for almost anything (OpenTSDB, Graphite, Ganglia, Elasticsearch, Kafka, RRDTool, StatsD etc.). In short, it’s very flexible and fits this use case well.

Gluing it all together

I found most of the above components as docker images on docker hub. Without further ado, let’s have a look at docker-compose.yml:

Just a comment: Since I fixed the ports, to get going as fast as possible, it’s quite obvious that the scale command won’t work. But for now, it doesn’t need to.

I chose to run a single-node Cassandra, using an image tuned for quick container startup by the guys at Spotify. I mapped the JMX port and exposed the telnet protocol port 4242. That’s the one both Grafana and KairosDB are going to use. By the way, at least on a Mac, using a volume doesn’t seem to work, so I just commented that out for now. In a production setting, that would be a no-go.

KairosDB is not on Docker Hub, but there are several git repos. I chose the one from Mesosphere and built it locally. I did change the base image to my liking, since I had to build it anyway. I also fixed another issue related to using the image with Docker Compose, but I’ll get to that later.

Jmxtrans is off-the-shelf, and so is Grafana.

Gathering metrics

I created a JSON file for Jmxtrans, specifying how to pull metrics from ActiveMQ:

I won’t go into details about the config. You can read the docs about which options are available on the OpenTSDBWriter, which is responsible for writing to KairosDB. How to do queries is documented here. I will say that using wildcards in the object name combined with the typeNames parameter greatly reduces the number of queries you need to specify, and makes the config more generic.
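To illustrate the wildcard point, here is a small Python sketch that generates a jmxtrans-style config with a single query covering every queue on every broker. This is not the config I actually used. The field names (`obj`, `attr`, `resultAlias`, `outputWriters`, `typeNames`) follow the jmxtrans JSON layout as I understand it, and the ActiveMQ object name pattern, attribute list, alias, hosts and ports are assumptions for illustration.

```python
import json


def activemq_queue_query(writer_host, writer_port):
    """Build one jmxtrans query that matches all queues on all brokers."""
    return {
        # The wildcards replace one hand-written query per broker and queue.
        "obj": ("org.apache.activemq:type=Broker,brokerName=*,"
                "destinationType=Queue,destinationName=*"),
        "attr": ["QueueSize", "EnqueueCount", "DequeueCount"],
        "resultAlias": "activemq.queue",  # hypothetical metric prefix
        "outputWriters": [{
            "@class": "com.googlecode.jmxtrans.model.output.OpenTSDBWriter",
            "settings": {
                "host": writer_host,
                "port": writer_port,
                # Object name keys used to tell the matched MBeans apart.
                "typeNames": ["brokerName", "destinationName"],
            },
        }],
    }


def jmxtrans_config(jmx_host, jmx_port, queries):
    """Wrap the queries in the top-level 'servers' structure and serialise."""
    server = {"host": jmx_host, "port": jmx_port, "queries": queries}
    return json.dumps({"servers": [server]}, indent=2)


# Hypothetical host names; 4242 is the telnet port mentioned earlier.
config_text = jmxtrans_config("activemq", "1099",
                              [activemq_queue_query("kairosdb", 4242)])
```

Without the wildcards, every new queue would need its own entry in the `queries` list.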

One thing I did do when experimenting is hardcode the ActiveMQ host and port. For production, I would need to change the image to run something like envsubst on the configuration files, to get environment variables replaced with whatever is specified in the docker-compose.yml file.
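As a rough sketch of that substitution step, the Python function below replaces `${VAR}` placeholders in a config template with values from the environment. It is a stand-in for envsubst, not the tool itself, and unlike envsubst it fails loudly on an unset variable instead of silently inserting an empty string. The variable names `ACTIVEMQ_HOST` and `ACTIVEMQ_PORT` are hypothetical.

```python
import os
from string import Template


def render_config(template_text, env=None):
    """Replace ${VAR} placeholders with values from the environment."""
    values = os.environ if env is None else env
    # substitute() raises KeyError when a referenced variable is missing.
    return Template(template_text).substitute(values)


# Hypothetical template fragment and variables, for illustration only.
template = '{"host": "${ACTIVEMQ_HOST}", "port": "${ACTIVEMQ_PORT}"}'
rendered = render_config(template, {"ACTIVEMQ_HOST": "mq", "ACTIVEMQ_PORT": "1099"})
```

An entrypoint script would run this over the config files before starting jmxtrans, with the variables set in the `environment` section of docker-compose.yml.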

With regards to our DropWizard services, I used the Metrics KairosDB plugin and added this to our DropWizard service startup:

The host and port of the KairosDB server should be taken from configuration, but you get the idea.

Graphing

To be honest, I have been and still am playing around with how best to create graphs. Ideally, I’d like to use the otherwise awesome templating feature with KairosDB, but I can’t quite get it to work like I want it to – probably because support for KairosDB is still at an early stage. That, or I’m missing something. Regular graphs do work as intended though. I have been trying to recreate this homegrown HBase dashboard, based on InfluxDB 0.8.8:

Hbase dashboard

So far, the only thing I am not able to do in the same way is the GC graphs. The InfluxDB dashboard uses regular expressions to match the GC algorithm part of the metric name, which KairosDB doesn’t support. I can hardcode the GC algorithm names, but that would require us to change our dashboard if we later choose something other than CMS. One possible solution would be to exclude the GC algorithm name from the metric name and add it as a tag instead. It’s just somewhat more work compared to what comes out of the box with DropWizard.
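The renaming idea can be sketched in a few lines of Python. The function below takes a metric name that embeds the collector name and returns a collector-independent name plus a tag. The `jvm.gc.` prefix and the `gc` tag key are assumptions for illustration; the actual names depend on how the GC metrics are registered in your service.

```python
def split_gc_metric(name, prefix="jvm.gc."):
    """Move the collector name out of a metric name and into a tag."""
    if not name.startswith(prefix):
        return name, {}
    rest = name[len(prefix):]
    # Everything before the last dot is the collector, the remainder the measurement.
    collector, _, measurement = rest.rpartition(".")
    if not collector:
        return name, {}
    return prefix + measurement, {"gc": collector}


# Hypothetical metric name, for illustration only.
metric, tags = split_gc_metric("jvm.gc.ConcurrentMarkSweep.count")
```

With names like these, a graph can select the one metric and group by the `gc` tag, so switching collectors would not require touching the dashboard.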

Closing comments

For now, I think I’ve reached a point where creating a complete graphing solution by using existing components and supplying configuration seems doable. However, using and composing the docker images from docker hub and github turned out to be more of a nuisance than expected. So was trying to get templating to work in the new KairosDB query editor in Grafana, while the editor is still kind of beta. When something doesn’t work, it’s not always obvious if it is a bug or a usage error.

I have a few pointers on how to structure metrics in order to get the most out of Grafana and KairosDB, as well as some thoughts on the Docker part. The latter is not specific to my use case, so feel free to skip it at your leisure.

Thoughts about how to structure metrics

To be honest, I still haven’t fully decided how I’d like to structure the naming of the metrics we have, in order to support the most likely graphing requirements, both while running multiple instances of the same components as well as different components. Experimenting with the current KairosDB support in Grafana has led me to a single mantra: If you want more than one series per graph, manually enter each one or use tags and group by them.

Templating only works if you select one value per variable. Although this might be good enough for some use cases, it really does cripple the feature, in my opinion. Otherwise, if you just want to add series with fixed metric names, you’re good.

So, basically I’m thinking about adding tags for things like hostname, component type (service, database etc.), component name and whatever else I can think of. There is some talk about KairosDB getting support for selecting multiple series by regular expression, but I’m not sure if and when it’ll be released. If it is, you could achieve the same by having the tags as part of the metric name.
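For reference, this is roughly what a "group by tag" query looks like when built by hand in Python. The body follows the KairosDB query format as I understand it (sent as a POST to `/api/v1/datapoints/query`); the metric name and tag keys are hypothetical.

```python
import json


def grouped_query(metric, group_tags, filters=None, hours=1):
    """Build a query body that yields one series per tag value combination."""
    return json.dumps({
        "start_relative": {"value": hours, "unit": "hours"},
        "metrics": [{
            "name": metric,
            # Optional filter: tag key mapped to the list of accepted values.
            "tags": filters or {},
            # The grouping splits the result into one series per tag value.
            "group_by": [{"name": "tag", "tags": list(group_tags)}],
        }],
    })


# Hypothetical metric and tags, for illustration only.
body = grouped_query("jvm.memory.heap.used", ["host"],
                     filters={"component_type": ["service"]})
```

A single graph backed by such a query shows one line per host without listing the hosts by hand.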

One thing to consider, if you are thinking about using tags like crazy, is that having a lot of tag-value combinations might hurt performance. Depending on how many metrics you plan on storing, you may want to think about it.
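A quick back-of-the-envelope calculation shows how fast the combinations add up. The Python sketch below computes the upper bound on distinct tag combinations for one metric, given the possible values per tag. The tag names and counts are invented for illustration, and the real number is usually lower, since not every combination occurs in practice.

```python
from math import prod


def max_tag_combinations(values_per_tag):
    """Upper bound on distinct tag combinations for a single metric."""
    return prod(len(values) for values in values_per_tag.values())


# Hypothetical setup: 20 hosts, 3 component types, 15 component names.
combinations = max_tag_combinations({
    "host": [f"host{i:02d}" for i in range(20)],
    "component_type": ["service", "database", "queue"],
    "component": [f"component{i:02d}" for i in range(15)],
})
# 20 * 3 * 15 = 900 possible combinations for every metric name.
```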

Annoyances with the docker images used

My gripes fall into two categories, which are not really related to docker itself as much as the images shared on docker hub:

  1. The container doesn’t handle dependent resources not being available at startup
  2. The container doesn’t shut down properly

Now, before I discuss my gripes, let me be clear in saying that I am not complaining about people sharing their hard work for free. The issues above might be completely irrelevant to the use cases of the original creators. It’s just to say that before a lot of the images can be used in a production setting, they would need to be modified.

Now, the first issue affected both the Jmxtrans and the KairosDB images I used. If Cassandra wasn’t alive and kicking when KairosDB needed it, KairosDB would fail and exit. Jmxtrans would fail if KairosDB wasn’t available when it was starting.

Normally, if containers are spun up one at a time, the problem doesn’t appear. But when using Docker Compose, all containers get fired up at the same time. Really, this is about the individual containers not being sufficiently robust, since things like network errors should always be anticipated (also on startup!). It’s surprising that this seems to be such a common issue. Most Docker users I know who use Docker Compose have seen this issue with one image or another.
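One common workaround is to have the container wait for its dependency before starting the main process. This is not what the images I used do; it is a minimal Python sketch of the idea, polling a TCP port until it accepts connections or a timeout expires. The host name `cassandra` and port 9160 in the usage comment are assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Block until host:port accepts TCP connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example use in an entrypoint (hypothetical host and port):
#   if not wait_for_port("cassandra", 9160):
#       raise SystemExit("Cassandra did not come up in time")
```

An open port does not guarantee that the service behind it is ready to serve requests, so retrying inside the application itself is still the more robust fix.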

The second issue is the one about stopping containers gracefully. Only PID 1 in the running container receives signals, so you need to make sure that either the main process is running as PID 1 or the actual PID 1 (typically sh or bash) forwards signals to the main process. This is not a new issue, but there are still lots of images on docker hub that don’t allow stopping the container gracefully (thereby killing everything in the container after a timeout).
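To show what forwarding signals means in practice, here is a minimal Python sketch of a wrapper that starts the main process as a child and relays SIGTERM and SIGINT to it. It only illustrates the mechanism, and is not taken from any of the images discussed here.

```python
import signal
import subprocess
import sys


def run_and_forward(argv):
    """Run argv as a child process and relay SIGTERM/SIGINT to it."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Pass the signal on so the child can shut down gracefully.
        child.send_signal(signum)

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)
    # Returns the child's exit code (negative if it was killed by a signal).
    return child.wait()


# Example use as a container entrypoint:
#   sys.exit(run_and_forward(sys.argv[1:]))
```

The other common fix is to start the main process with `exec` in a shell entrypoint, so that it replaces the shell and becomes PID 1 itself.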
