It seems counter-intuitive that something as old and widespread as NFS would be severely lacking in the area of access logging. However, that does indeed appear to be the case. Recently I’ve been able to take a break from my project-management activities and spend some time on something deeply technical, to keep myself from getting stale. The scenario started out very simply: take advantage of a hardware lifecycle refresh to move a small, ill-defined legacy service off bare metal and mature it into a properly architected service running on virtual infrastructure. In the process, however, I discovered an NFSv3 share that has been served (and relied on) for years with no defined, documented owner, users or requirements; just bits of associated tribal knowledge. I needed to figure out who is using this NFS share to clear up the ambiguity and, consequently, find out whether it’s even in use anymore… That should be easy, right? Well, it turns out, not so much…
Recently I was fighting with a scenario where my code was raising an exception, but I couldn’t figure out why the exception was actually happening. I’ve created a much simpler example below to demonstrate what I learned in the process of solving it.
The simplified scenario is as follows: I have a Python module that calls another module to do some work, wrapped in a try/except. Code is shown below.
Here’s the problem: when I run the code, I land in the catch-all except clause, but no details are available as to why it is failing. In my real-world situation this caused a lot of head scratching, as during my troubleshooting there was no obvious cause.
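Since the original modules aren’t reproduced here, the following is a minimal reconstruction of the pattern (function names and the error message are hypothetical). The first handler swallows every detail; the second captures the exception object and traceback so the real cause is visible:

```python
import traceback

def do_work():
    """Stand-in for the called module; fails with a meaningful message."""
    raise ValueError("config file is missing the 'host' key")

def run_silent():
    # The original pattern: a catch-all that discards every detail.
    try:
        do_work()
    except Exception:
        return "something went wrong"

def run_verbose():
    # Bind the exception and grab the formatted traceback, so the
    # actual type, message and call path are all available.
    try:
        do_work()
    except Exception as exc:
        detail = traceback.format_exc()
        return f"{type(exc).__name__}: {exc}", detail

print(run_silent())    # no clue what failed
msg, tb = run_verbose()
print(msg)             # ValueError: config file is missing the 'host' key
```

Alternatively, a bare `raise` inside the except block re-raises the original exception with its traceback intact, which is often the right move when the handler can’t actually recover.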
One of the first things we need to detect when using Wavefront is whether one of our endpoints has gone silent. To do that, anyone can use the Query Wizard to create a basic alerting query. In this example I’m going to use the “system.uptime” metric as my base in the Alert Wizard’s magic sauce, with a 1 minute time window. The Query Wizard shows me quite nicely that I’ve had a couple of outages in the past. But if you look closely, the 2nd through 4th indicated outages are all displayed with the same duration, even though the actual outage lengths were quite different…
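For reference, a “gone silent” check of this shape is typically expressed with a moving count of reported points; a minimal sketch using the metric and window from the example (the exact query the Wizard generates may differ):

```
default(0, mcount(1m, ts(system.uptime))) = 0
```

The `mcount()` wraps the series in a 1-minute moving point count, and `default(0, …)` keeps the series reporting a zero (rather than nothing at all) once the agent stops sending, so the condition can actually fire.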
Ok, let’s do a bit of playing around to figure out what’s going on. My testing strategy is as follows:
- 1 minute agent outage – 10:55-10:56
- 2 minute agent outage – 11:00-11:02
- 4 minute agent outage – 11:10-11:14
- 5 minute agent outage – 11:25-11:30
- 10 minute agent outage – 11:40-11:50
I was recently experimenting with Zenoss Core and noticed some odd behavior occurring with my Linux devices. I’m using SNMP v3, and the devices would model correctly and Components would show up; however, graphing and performance data never appeared.
There’s an article at https://support.zenoss.com/hc/en-us/articles/204643769?input_string=how+to+recover that shows how to nuke HBase and OpenTSDB in the case of data corruption. That is normally the nuclear option, but this is a new build. There are several other articles on the subject, but none had the solution I needed.
It came to my attention during a talk with a customer today that there is some ambiguity around what is needed to use the VMware vRealize Log Insight Agent, and when it’s required. Since I’m writing this up for them, I figured it’s best to just publish it for anyone else who might have the same questions.
vRealize Log Insight can ingest logs from native syslog sources as well as via the vRLI Agent. The vRealize Log Insight Agent is a robust log-collection mechanism that can read from log files in various formats, as well as from channels in the Windows Event Log. It encrypts events over the wire and is very resource friendly, but its primary benefit is endpoint management. Gone are the days of configuring endpoints individually; with the vRLI Agent you can manage which files to read, per device class, all from your vRLI server’s web interface. Additionally, native syslog, especially in applications, doesn’t always forward all the events you want to display. A perfect example of this scenario: you are looking at dashboards inside of vRLI, and even though you have syslog configured in Horizon View, your widgets are still blank. The reason is that many Content Packs require logs that won’t natively be sent over a generic syslog method; they rely on additional logs stored on the file system. To make this information easy to collect, most Content Packs provide an Agent Group with these files predefined. Which raises the question: what is an Agent Group?
Simply put: an Agent Group is a set of instructions on which logs to gather, limited by user-defined criteria to a subset of your devices. Let’s take a look at a practical example, Horizon View…
VMware’s Photon OS is a minimal container host that is also used as the host OS of VMware appliances such as the vCenter Server. When using an appliance such as vCenter you can use the API to configure SNMP; however, if you’re running base Photon, it’s not that simple.
The Ansible playbook below is idempotent (it can run multiple times without negative consequences), installs and configures net-snmp, and creates an SNMP v3 user. Enjoy!
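The full playbook isn’t reproduced here, but a minimal sketch of the approach looks roughly like this. The user name and passphrases are placeholders, and note one subtlety: snmpd consumes and rewrites the `createUser` line on startup, so a fully idempotent run would also want a guard checking whether the user already exists.

```yaml
---
# Sketch only: install net-snmp on Photon OS and define an SNMPv3 user.
# "monv3user" and both passphrases below are placeholders.
- hosts: photon
  become: true
  tasks:
    - name: Install net-snmp with tdnf (skipped if snmpd already present)
      command: tdnf install -y net-snmp
      args:
        creates: /usr/sbin/snmpd

    - name: Stop snmpd before touching its persistent config
      systemd:
        name: snmpd
        state: stopped

    - name: Create the SNMPv3 user (snmpd consumes this line on startup)
      lineinfile:
        path: /var/lib/net-snmp/snmpd.conf
        create: true
        line: 'createUser monv3user SHA "AuthPassphrase" AES "PrivPassphrase"'

    - name: Grant the user read-only access
      lineinfile:
        path: /etc/snmp/snmpd.conf
        create: true
        line: 'rouser monv3user'

    - name: Start and enable snmpd
      systemd:
        name: snmpd
        state: started
        enabled: true
```

The `creates:` guard on the install task and `lineinfile` for the config entries are what keep repeated runs from piling up duplicate work.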
We have been investigating how to mature our Time Series Database (TSDB) architecture and options. Toward this end, I completed an assessment of a couple of the most popular TSDB options, as well as exploring Wavefront. Our team depends heavily on open source, but Wavefront is very interesting since it was recently acquired by VMware. Here’s a quick rundown of the assessment.
Every comparison has some assumptions; here are the major ones I made during this effort.
1. There are currently a few options that warranted investigation:
a. InfluxDB
b. Prometheus
c. Wavefront by VMware (much of the below does not apply to Wavefront since it is a SaaS offering)
Although popular, OpenTSDB was not investigated, since initial research appears to show a general dislike of it compared to InfluxDB and Prometheus.
2. We will be using Telegraf as our agent of choice on remote systems for the collection and transmission of events.
3. All comparisons are under identical load. All graphs are shown with both servers receiving the same load via identical queries and ingestion. The load mimics 900 Telegraf agents, each sending/posting metrics every 7 seconds; this happens via a Telegraf imitator that I wrote in Go.
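My imitator is written in Go, but the core of it is just rendering fake agent metrics in InfluxDB line protocol (the wire format Telegraf’s HTTP/TCP outputs write) and posting them on a timer. A simplified Python sketch of the formatting step, with hypothetical measurement/tag names:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one metric in InfluxDB line protocol:
    measurement,tag=val field=val <nanosecond timestamp>
    Integer fields get an 'i' suffix; floats are emitted as-is."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    rendered = []
    for key, value in sorted(fields.items()):
        if isinstance(value, bool):
            rendered.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            rendered.append(f"{key}={value}i")
        else:
            rendered.append(f"{key}={value}")
    return f"{measurement},{tag_str} {','.join(rendered)} {timestamp_ns}"

# One fake agent's uptime sample:
line = to_line_protocol(
    "system", {"host": "fake-agent-001"}, {"uptime": 864000}, 1500000000000000000
)
print(line)  # system,host=fake-agent-001 uptime=864000i 1500000000000000000
```

A real imitator would batch lines like these for its 900 simulated hosts and POST the batch to the ingestion endpoint every 7 seconds.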
This research has brought up a whole list of questions, not just about the use of time series data, but the more general question of its role in monitoring. See https://www.usenix.org/conference/srecon17americas/program/presentation/wilkinson for a very informative talk on the subject.
Every comparison needs a set of pre-defined criteria on which to base our decisions and testing. Our list will be:
Our Zenoss instance is integrated with ServiceNow so that our support organization can open an incident with the appropriate event details at the click of a button from the Zenoss Events Console. The workflow for this looks something like the below flowchart that I just threw together.
One of the projects I am working on is enabling the forwarding of debug logs on all of our VMware vCloud Director Cells to our global Log Insight instance. To do this, however, we need a fairly accurate appraisal of what the increased overhead is going to look like. As part of this process I’m creating a Python program that will let me quickly find the current Events per Second (EPS) and log size in KBps.
As you can see, the script can be run locally or pointed at a remote host, and it looks for the latest fully committed debug log. If you don’t want to use that one, no worries; you can easily specify a different log file. If the target is a remote server, the script copies the appropriate log file to the machine running the script and does the analytics locally, to remove any possibility of unnecessary overhead on the cell server.
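The core calculation is straightforward once the log is local: pull a timestamp off each line, divide the event count and byte count by the span between the first and last stamp. A minimal sketch (the timestamp pattern is an assumption; the real vCD debug-log format may differ, and my actual script handles more than this):

```python
import re
from datetime import datetime

# Assumed timestamp shape at the start of each debug-log line, e.g.
# "2021-01-01 12:00:00,000 | DEBUG | ..." — adjust per log type.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def log_rates(lines):
    """Return (events per second, KB per second) for an iterable of log lines."""
    first = last = None
    events = 0
    size_bytes = 0
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            first = first or stamp
            last = stamp
            events += 1
        size_bytes += len(line.encode("utf-8"))
    # Guard against a zero-length span (e.g. a single-event log).
    duration = max((last - first).total_seconds(), 1.0) if first else 1.0
    return events / duration, size_bytes / 1024 / duration
```

Continuation lines (stack traces and the like) carry no timestamp, so they add to the byte count without inflating the event count, which matches how a syslog forwarder would see them.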
The script is still in active development as a side project, but I hope to add the ability to query vCenter Servers as well in the near future. If you’re curious, the code is hosted in my GitHub repo and, as always, is not supported by or affiliated with VMware in any way…
I will be presenting at VMworld 2015 in session MGT4579 on “Data In-Sight!! Experiences Running VMware’s Private Cloud with Log Insight”. If you are at VMworld feel free to attend and say hi as I’d love to get to meet you in person!