When does “a little code” become “a lot of trouble”

October 8, 2019

During my career I’ve moved from a technical background into a more strategic position and picked up several memorable lessons along the way.  One of those is especially relevant in my current role, where I have to decide whether to pursue a short-term fix (while also pursuing the long-term solution) or have the discipline to go the long-term route without a workaround.  I’m personally biased towards the short-term fix, so it’s been a good lesson that has saved me trouble as my role becomes more strategic.  Many people with a technical background are familiar with one of these two scenarios when supporting internal operational tasks.

  • You have a deadline to deliver a solution, but the amount of time to have “the right” team deliver it is far beyond your project’s deadline.
  • You have an open source tool that handles most of your needs. However, great business value can be added with a few (today) tweaks to the tool’s code.

These two scenarios represent what I have experienced as the two biggest drivers of this dilemma: customized code can provide critical business value and separate our organization from the competition.  However, on the other side of this double-edged sword, customized code solutions carry a high risk of falling into mismanagement, presenting security risks, or locking you into your own forked version and losing the benefits of updates provided by the larger, original code stream.  As an engineer at heart, I want to explore and identify several data points to help me (and hopefully you) better answer the question: “Should I invest in custom code, or should I justify a longer-term solution to my management and get deadlines pushed?”  Let’s dive into each of these two categories to investigate and see if we can identify clear decision indicators and maybe some best practices.

NFSv3 Usage and Audit Logging

November 9, 2018

It seems counter-intuitive that something that is as old and widespread as NFS would be severely lacking in the access logging area.  However, that does indeed appear to be the case.  Recently I’ve been able to take a break from my project management related activities to spend some time in something deeply technical to help prevent myself from getting stale.  The scenario started out very simply: take advantage of a hardware lifecycle refresh to move a legacy service that is small, ill-defined and running on bare metal and mature it to be a properly architected service running on virtual infrastructure.  In the process however, I discovered an NFSv3 share that is being served (and relied on for years) that has no defined and documented owner, users or requirements; just bits of associated tribal knowledge.  I needed to figure out who is using this NFS share to clear up the ambiguity and consequently find out if it’s even in use anymore…  That should be easy, right?  Well, turns out, not so much…


EXEC useradd in Docker fills hard drive on host

August 14, 2018

I recently discovered the hard way, while building a Docker image, that adding a user with an abnormally large UID will eat an exorbitant amount of disk space on your Docker host.  The bug that led me to this conclusion is here, and the fix is easy: just use the “-l” flag with the useradd command.  But my original run left me in a pickle.  In this one case, I couldn’t add enough disk space to my LVM logical volume to recover; I needed to delete the culprit.  I tried docker system prune to no avail and tried deleting all unnecessary images, but nothing worked.  Only MBs were freed up, and I still had a giant black hole that had swallowed up almost 50 GB.

This might not be the best approach and IT WILL DELETE YOUR DOCKER DATA, but for this host I don’t care about keeping Docker data.  All images are posted upstream in Artifactory once built; this host just builds them once and sends them upstream.  In my case the culprit was a sub-folder in /var/lib/docker/overlay2 that took almost 50 GB.  If you delete only that specific sub-folder in overlay2, you’ll get a “no such file or directory” error for a path inside the overlay2 directory when trying to build the image again.  Here’s how I recovered, but be warned again, it’s a bit like killing a fly with a sledgehammer: you’ve accomplished your goal in the end, but there’s a hole in the table when you’re done.

systemctl stop docker
rm -rf /var/lib/docker/
systemctl start docker

This immediately freed up the space necessary, and Docker rebuilds the contents of /var/lib/docker when the daemon starts.  From here I was able to run the build again without issues from the useradd command.
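For anyone who wants to confirm which sub-folder is the culprit before reaching for the sledgehammer, here is a minimal Python sketch that ranks the immediate sub-folders of a directory by size. The function names are my own for illustration, and it sums apparent file sizes (st_size), which is an assumption: apparent size can differ from the blocks actually allocated on disk.

```python
import os


def dir_size_bytes(path):
    """Sum the apparent size (st_size) of every file under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # lstat so symlinks are measured themselves, not followed
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Files can vanish or be unreadable mid-scan; skip them
                continue
    return total


def largest_subdirs(parent, top=5):
    """Return up to `top` (size_in_bytes, name) pairs, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size_bytes(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]
```

Calling largest_subdirs("/var/lib/docker/overlay2") as root should put a 50 GB offender at the top of the list.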

Python Exception inside Try/Except Statement

July 15, 2018

Recently I was fighting with a scenario where I was throwing an exception but couldn’t figure out why the exception was actually happening.  I’ve created a much simpler example below to demonstrate what I learned during the process of solving this.

The simplified scenario is as follows:  I have a Python Module that calls another module to do some work that is wrapped in a try/except.   Code is shown below.



Here’s the problem: when I run the code, I throw the catch-all exception, but no details are available as to the reason why it is failing.  In my real world situation, it caused a lot of head scratching as during my troubleshooting there was no obvious cause.
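To illustrate the kind of situation described here, the following is a hypothetical stand-in, not the code from the original post. The function names and the KeyError are assumptions chosen only to show how a catch-all handler can hide the real cause, and one common way to surface it with the standard traceback module.

```python
import traceback


def do_work(settings):
    # Hypothetical worker: fails with a KeyError on missing configuration
    return settings["missing_key"]


def run_silently():
    try:
        return do_work({})
    except:  # catch-all: the original exception and its details are lost
        return "Something went wrong"


def run_with_details():
    try:
        return do_work({})
    except Exception as exc:
        # Keep the exception type, message, and full traceback for troubleshooting
        return f"{type(exc).__name__}: {exc}\n{traceback.format_exc()}"
```

The first variant reports only the generic message, while the second names the exception type and the function where it was raised.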


Playing with Wavefront – Network Packet Loss

December 21, 2017

Now that we know when an agent goes offline, let’s create a query to detect when our devices experience an increased rate of dropped packets. To do that we’ll create two queries. The first is our data on all dropped packets per source.

sum(mavg(5m,ts("net.drop.*", source="FQDN,sub.domain.com" )),sources)

This value is represented with the blue line in the below chart.

That’s great, but we want to detect a change in trends, not just alert on a static threshold. To do that we’re going to create a query that uses moving averages. This query is reflected in the above chart as the orange line.

sum(mavg(2m,ts("net.drop.*", source="FQDN,sub.domain.com")), sources) - sum(lag(5m,mavg(2m,ts("net.drop.*", source="FQDN,sub.domain.com"))),sources)

As you can see, it handles the upticks rather nicely so we’re going to create an alarm off of it using the value of 10 as our threshold.

sum(mavg(2m,ts("net.drop.*", source="sg01-0-jnks1")), sources) - sum(lag(5m,mavg(2m,ts("net.drop.*", source="sg01-0-jnks1"))),sources) > 10

You can see that the alert conditions triggered in the Alert Backtesting match what we expected from our research above: every time the orange value was over 10, we received an alert.
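The idea behind the alert query can be sketched outside of Wavefront in a few lines of Python. This is only an approximation under stated assumptions: one sample per minute (so a 2m moving average is two samples and a 5m lag is five samples), a single source, and helper names that are mine rather than Wavefront’s.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` samples ending at each index."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def lag(values, steps):
    """Shift a series `steps` samples later; None where no history exists yet."""
    padded = [None] * steps + list(values)
    return padded[:len(values)]


def trend_alerts(drops_per_minute, avg_window=2, lag_steps=5, threshold=10):
    """True wherever the smoothed value exceeds its lagged self by more than `threshold`."""
    smoothed = moving_average(drops_per_minute, avg_window)
    lagged = lag(smoothed, lag_steps)
    return [
        prev is not None and (now - prev) > threshold
        for now, prev in zip(smoothed, lagged)
    ]


# A quiet series with a two-minute burst of dropped packets
drops = [1, 1, 1, 1, 1, 1, 1, 30, 30, 1, 1, 1]
alerts = trend_alerts(drops)
```

With this sample data the alert fires only around the burst, not during the steady baseline, which is the behavior we want from a trend-based alert rather than a static threshold.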

There we go…


Playing with Wavefront – Missing Agents

December 19, 2017

One of the first things that we need to detect when using Wavefront is whether one of our endpoints has gone silent. To do that, anyone can use the Query Wizard to create a basic alerting query. In this example I’m going to use the “system.uptime” metric as my base in the Alert Wizard’s magic sauce with a 1 minute time window. The Query Wizard shows me quite nicely that I’ve had a couple of outages in the past.  But, if you look closely, the 2nd through 4th indicated outages are all shown with the same duration, even though the actual outage durations were quite different…


Ok, let’s do a bit of playing around to figure out what’s going on. My testing strategy is as follows:

  • 1 minute agent outage – 10:55-10:56
  • 2 minute agent outage – 11:00-11:02
  • 4 minute agent outage – 11:10-11:14
  • 5 minute agent outage – 11:25-11:30
  • 10 minute agent outage – 11:40-11:50


Getting Started – Wavefront by VMware – Queries

December 6, 2017

Ok, let’s chat about Wavefront’s UI and getting value from our data! This is the most user-friendly product that I’ve used for time series data! Let’s explore a quick example using disk space to showcase some of that functionality.

Telegraf only sends raw values for:

  • Total Space
  • Free Space
  • Used Space

This is seen below where I have intentionally limited the results to a single host and single disk object. We have ~55 GB Total and ~30 GB Free.

What if I want to know the percentage of used space?

Wavefront is actually amazingly intelligent in allowing you to do math-based operations on objects dynamically. I’m even going to do it a bit backwards, intentionally, to showcase this. Ideally, the mathematical formula for the percentage of used space would be:

100 * (disk.used / disk.total)

To find the inverse (percent available), we can simply subtract that value from 100(%):

100 - ( 100 * (disk.used / disk.total))
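As a quick sanity check of the two formulas, here is the same arithmetic in Python using the approximate figures from above (~55 GB total, ~30 GB free). The function names are mine, and deriving used space as total minus free is an assumption that ignores any reserved blocks.

```python
def percent_used(used, total):
    """100 * (disk.used / disk.total)"""
    return 100 * (used / total)


def percent_available(used, total):
    """100 - (100 * (disk.used / disk.total))"""
    return 100 - percent_used(used, total)


total_gb = 55.0
free_gb = 30.0
used_gb = total_gb - free_gb  # assumption: used = total - free

used_pct = percent_used(used_gb, total_gb)            # roughly 45.5
available_pct = percent_available(used_gb, total_gb)  # roughly 54.5
```

The two percentages always add up to 100, so the "backwards" form gives the same answer as dividing free space by total space directly.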

Let’s see if that actually works:

Wow, just like that we can take two metrics, multiply, divide and subtract to show the percent of disk space available. Wavefront automatically handles the correlation of the devices and properly applies the math. Ok, that’s nice, but how can I tell if I’m rapidly running out of disk space WITHOUT setting a static threshold?

First, we’re going to clone our original query using the little copy icon to the right of the query. Then, select the Query Wizard.

The Query Wizard makes people like me look smart. I can select the general category that I want…

and then I can select the method to use in that category. The wizard automatically applies the requested query syntax and previews the results for you to easily verify it is behaving as desired.

That’s great, now let’s create an alert on this standard deviation. In this case I can see that my disk space deviation on an hourly basis is generally under 1. For this example, I’m going to configure our alarm to trigger when the SD is greater than 2.

First, for reference sake, this is what our values look like:

Now, remember that an alert will be triggered whenever your query returns a non-zero value. This means that we need to modify our query slightly to include a threshold of sorts so that only breaches above our threshold (2) trigger an alarm (value above zero). You can see how the below spikes in our alert definition correlate to the above raw data.
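The thresholding idea can be sketched in Python as follows. This is an illustration only: the function names are mine, the window is counted in samples rather than time, and I use the population standard deviation, which may not match how Wavefront computes its moving standard deviation.

```python
import statistics


def moving_stddev(values, window):
    """Population standard deviation over up to `window` trailing samples."""
    deviations = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        deviations.append(statistics.pstdev(chunk))
    return deviations


def alert_signal(values, window=60, threshold=2.0):
    """1 where the moving deviation breaches the threshold, otherwise 0."""
    return [1 if sd > threshold else 0 for sd in moving_stddev(values, window)]


# Percent-available readings: steady, then a sudden drop
readings = [54, 54, 54, 54, 40, 40]
signal = alert_signal(readings, window=4, threshold=2.0)
```

Because the signal is zero everywhere except where the deviation exceeds 2, only the sudden drop would produce a non-zero value and therefore an alert.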

There you have it: the basics, a bit of advanced data manipulation, and some alerting based on statistical analysis!



Wavefront by VMware – Missing Metrics – Point outside of reasonable timeframe

December 6, 2017

I ran into an issue where metrics were not showing up in Wavefront and wanted to share the solution. Bottom line, if you send any metrics to the Wavefront Proxy through Telegraf, their timestamps absolutely must be in nanoseconds. Here’s why:

I have an application that sends telemetry via Telegraf’s Socket Listener Input Plugin, and the telemetry timestamp was in milliseconds. As a result, none of the data was recorded in Wavefront. Instead, /var/log/wavefront/wavefront.log contained entries stating “[WF-402: Point outside of reasonable timeframe”. There doesn’t appear to be any published mention of this error number. Later on, the error message states that the timestamp value was “1506000”. However, my original timestamp was 1506628301128, so there is a huge discrepancy of about 47 years…

The root cause of this can be seen in the Wavefront Output Plugin’s code in the plugins/outputs/wavefront/wavefront.go file on line 214:

It appears that the Wavefront Output Plugin does not verify that the timestamps it receives are in nanoseconds; it just divides the value down (attempting to get to a seconds value for legacy system support). Hence, when a timestamp is passed to the output plugin in a coarser unit than nanoseconds (in our case: milliseconds), the division makes the timestamp too small to be valid and the proxy rejects it.
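The arithmetic can be reproduced with a short Python sketch. It assumes the plugin’s conversion amounts to an integer division by 10^9 (nanoseconds to seconds); that divisor is my reading of the behavior described above, not a quote of the plugin’s code, and the function name is hypothetical.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def to_seconds_assuming_nanos(timestamp):
    """Convert to seconds the way a nanosecond-only consumer would (assumed)."""
    return timestamp // NANOS_PER_SECOND


millis_timestamp = 1506628301128  # the original millisecond timestamp

# Millisecond value mistaken for nanoseconds: lands just after the epoch
wrong_seconds = to_seconds_assuming_nanos(millis_timestamp)

# Same instant converted to nanoseconds before sending
right_seconds = to_seconds_assuming_nanos(millis_timestamp * 1_000_000)

wrong_year = datetime.fromtimestamp(wrong_seconds, tz=timezone.utc).year
right_year = datetime.fromtimestamp(right_seconds, tz=timezone.utc).year
```

Under that assumption the millisecond value collapses to 1506 seconds after the epoch, which matches the “1506000” seen in the log if the proxy reports the value in milliseconds, and explains the gap of roughly 47 years.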

Getting Started – Wavefront by VMware – Telegraf Agents

December 6, 2017

Wavefront, as a TSDB, utilizes a wide range of Collectors to gather time series data from various devices. Most of these collectors utilize a (currently forked) version of the popular Telegraf Agent. A preview of this out-of-box functionality is shown below.

This forked agent includes an Output Plugin for the Wavefront Protocol, and the changes can be seen in the GitHub Pull Request, which is currently in version 1.5 RC1 of the native Telegraf Agent! Once Telegraf 1.5 releases, there will no longer be a need to use the custom fork.

Ok, that said, the installation is straightforward:


Getting Started – Wavefront by VMware – Automated Proxy Installation

December 6, 2017

As I mentioned in a previous post, we are beginning to use Wavefront, which is a Time Series Database (TSDB) that has a great user experience. Here’s a brief Getting Started guide that covers a bit of reverse engineering I did on the out-of-box installation process so that we can use automation to deploy the Wavefront Proxy.

The first step towards beginning to use Wavefront is to deploy local Wavefront Proxies inside of your environment that will ingest time series data and forward it to Wavefront since it is a SaaS based product. These proxies are easy to deploy as I’ll show, but I’ll also show how we automated the installation via Puppet. Get started by logging in to the Wavefront UI and select Browse > Proxies.

The next screen offers several convenient options for adding the proxy with a simple cut-and-paste command using:

  • Linux
  • Windows
  • Docker
  • Mac
