One of the first things we need to detect when using Wavefront is whether one of our endpoints has gone silent. To do that, anyone can use the Query Wizard to create a basic alerting query. In this example I’m using the “system.uptime” metric as the base for the Alert Wizard’s magic sauce, with a 1 minute time window. The Query Wizard shows me quite nicely that I’ve had a couple of outages in the past. But if you look closely, the 2nd through 4th indicated outages all appear to have the same duration, even though the actual outages were quite different in length…
Ok, let’s do a bit of playing around to figure out what’s going on. My testing strategy is as follows:
- 1 minute agent outage – 10:55-10:56
- 2 minute agent outage – 11:00-11:02
- 4 minute agent outage – 11:10-11:14
- 5 minute agent outage – 11:25-11:30
- 10 minute agent outage – 11:40-11:50
If you look at the chart below, you can see that each outage is outlined (manually) in grey. The 5 minute outage is highlighted so that it stands out against the other background colors. The Wizard’s generated query gets pretty close… but it clears prematurely because of the second component in the AND statement.
mcount(1m, (ts("system.uptime",source="*FQDN.com"))) = 0 and mcount(1m,lag(1m,(ts("system.uptime",source="*FQDN.com")))) != 0
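To see why the second half of the AND makes the alert clear early, here is a minimal Python sketch of the idea. It is a toy model and not how Wavefront evaluates queries: I'm assuming one slot per minute, where 1 means a point arrived and 0 means nothing arrived, and the sample series is made up. The function names are mine, chosen only to mirror the query.

```python
# Toy model only: one slot per minute, 1 = a point arrived, 0 = nothing arrived.
# This is an illustration of the query's logic, not Wavefront's implementation.

def mcount_1m(points):
    """Count points in a 1 minute window (in this model, just the slot itself)."""
    return list(points)

def lag_1m(series, fill=1):
    """Shift a series one minute later; the first slot is assumed healthy."""
    return [fill] + list(series[:-1])

def wizard_alert(points):
    """Fires when the current minute is empty AND the previous minute was not."""
    now = mcount_1m(points)
    prev = mcount_1m(lag_1m(points))
    return [n == 0 and p != 0 for n, p in zip(now, prev)]

def simple_alert(points):
    """Fires whenever the current minute is empty (no second condition)."""
    return [n == 0 for n in mcount_1m(points)]

# Hypothetical series: 2 healthy minutes, a 4 minute outage, 2 healthy minutes.
points = [1, 1, 0, 0, 0, 0, 1, 1]

print("wizard:", wizard_alert(points))
print("simple:", simple_alert(points))
```

In this model the wizard-style condition is true only for the first minute of the outage, because from the second minute on the "previous minute" is empty too. Dropping the second condition keeps the alert true for all four minutes.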
Simplifying the query to just the basic mcount function gives a better result, but not without a side effect.
mcount(1m, ts("system.uptime", source="*.FQDN.com")) = 0
The undesirable side effect of the simpler query is that agents that are slower to report in are always in an alarm state, as shown below…
A workaround is to simply add the lag statement so that it ignores the current minute. However, I’m curious and want to push it a bit to explore how this works…
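Here is a rough Python sketch of why ignoring the current minute helps with slow reporters. Again, this is a toy model with made-up numbers: I'm assuming the agent is healthy, that the point for a given minute only becomes visible some delay after that minute, and that the alert looks at a single minute. The function and parameter names are hypothetical.

```python
# Toy model of late-arriving data; not Wavefront's implementation.
# Times are in minutes. `delay` is a hypothetical reporting delay for a healthy agent.

def alert_fires(now, delay, lag=0):
    """True if the minute being inspected has no visible point yet."""
    inspected = now - lag                     # which minute the alert looks at
    point_visible = inspected + delay <= now  # has that minute's point arrived?
    return not point_visible

# A slow agent (30 second delay) always looks silent in the current minute...
print(alert_fires(now=10, delay=0.5, lag=0))
# ...but looks fine once the alert inspects the previous minute instead.
print(alert_fires(now=10, delay=0.5, lag=1))
```

The trade-off in this model is that a real outage is also noticed one minute later, since the alert is always looking one minute into the past.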
To explore that, I’ve created a bit of a hybrid alert that takes the wizard-generated query, modified with the 1 minute lag, and adds a second component that uses the built-in “~metric.counter” as an additional indicator of a loss of incoming metrics. Take note: the “~metric.counter” value is internal to Wavefront, so even if your device is completely offline, it can still increment slightly…
lag(1m,(mcount(1m, (ts("system.uptime",source="*.sub.domain.com")))) = 0 and lag(1m,mcount(1m,lag(1m,(ts("system.uptime",source="*.sub.domain.com"))))) != 0) or (sum(mavg(1m,ts("~metric.counter", source="*.sub.domain.com")), sources) - sum(lag(1m,mavg(1m,ts("~metric.counter", source="*.sub.domain.com"))), sources) < 100)
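The second component of that query boils down to "did the counter grow by less than 100 compared to a minute ago?". A small Python sketch of that comparison is below. It assumes the counter behaves like a steadily growing number sampled once per minute. The sample values are invented, and the threshold of 100 is simply taken from the query above.

```python
# Toy model of the "~metric.counter" delta check; values are hypothetical.

def counter_stalled(counter_by_minute, threshold=100):
    """Flag minutes where the counter grew by less than `threshold` since the previous minute."""
    flags = [False]  # the first minute has nothing to compare against
    for prev, cur in zip(counter_by_minute, counter_by_minute[1:]):
        flags.append(cur - prev < threshold)
    return flags

# Healthy growth of 500 per minute, then two minutes where it only creeps up by 10.
counter = [1000, 1500, 2000, 2010, 2020, 2520]
print(counter_stalled(counter))
```

Using a threshold rather than checking for a delta of exactly zero is what lets the check tolerate the small increments that still happen while the device is offline.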
The results of this experiment are shown below.
Compared to the basic mcount query with a 1 minute lag, the results are almost identical. However, the hybrid alert provides a slightly earlier warning of an outage.
In the end, we might just want the simpler query to detect an outage because of its reduced query overhead. However, in an effort to learn the Wavefront platform, I’m going to keep going deeper until I’ve wrapped my head around the concepts of using math in monitoring. Thankfully, that’s greatly helped by a very friendly UI for people like me who haven’t used advanced math in years.