Zenoss Core 5 – Graphs not working

November 14, 2017

I was recently experimenting with Zenoss Core and noticed some odd behavior occurring with my Linux devices. I’m using SNMP v3, and the devices would model correctly and Components would show up; however, graphing and performance data never appeared.

There’s an article at https://support.zenoss.com/hc/en-us/articles/204643769?input_string=how+to+recover that shows how to nuke HBase and OpenTSDB after data corruption, which is normally the nuclear option in a case like this; however, this is a new build. There are several other articles around this subject, but none had the solution that I needed.

more “Zenoss Core 5 – Graphs not working”

Log Insight (vRLI) Agent Configuration – A basic primer

October 25, 2017

It came to my attention during a talk with a customer today that there is some ambiguity around what is needed to use the VMware vRealize Log Insight Agent, and when it’s required. Since I’m writing this up for them, I figured it’s best to just publish it for anyone else who might have the same questions.

vRealize Log Insight can ingest logs from native syslog sources, as well as via the vRLI Agent. The vRealize Log Insight Agent is a robust log collection mechanism that can read from log files in various formats, as well as channels in the Windows Event Log. It provides encryption of events over the wire and is very resource friendly, but the primary benefit is in endpoint management. Gone are the days of having to configure your endpoints individually, with the vRLI Agent you can manage what files to read, per device class, all from your vRLI Server’s web interface. Additionally, native syslog, especially in applications doesn’t forward all the events that you sometimes want to display. A perfect example for this scenario is when you are looking at Dashboards inside of vRLI, and even though you have syslog configured in Horizon View, your widgets are still blank. The reason for this is that a lot of the Content Packs require logs that won’t natively be sent over a generic syslog method, they rely on additional logs that are stored on the file system. To make this information easy to collect, most Content Packs provide an Agent Group with these files predefined. This begs the question, what is an Agent Group?

Simply put: An Agent Group is a set of instructions on what logs to gather, that is limited by a user-defined criteria to a subset of your devices. Let’s take a look at a practical example, Horizon View…

more “Log Insight (vRLI) Agent Configuration – A basic primer”

Configure SNMP on VMware’s PhotonOS

October 13, 2017

VMware’s Photon OS is a minimal container host that is also used as the host OS of VMware appliances such as the vCenter Server. When using an appliance such as vCenter, you can use the API to configure SNMP; however, if you just use base Photon, it’s not that simple.

The below Ansible playbook is idempotent (it can run multiple times without negative consequences); it installs and configures net-snmp and creates an SNMP v3 user. Enjoy!
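Once the playbook has run, a quick way to verify the new SNMP v3 user is to walk the system subtree with snmpwalk. The helper below just builds that command line; the hostname, username, and passphrases are placeholders, and the SHA/AES protocols are assumptions that should match whatever the playbook configured.

```python
import subprocess

def build_snmpwalk_cmd(host, user, auth_pass, priv_pass,
                       auth_proto="SHA", priv_proto="AES",
                       oid="1.3.6.1.2.1.1"):
    """Build an snmpwalk command line for an authPriv SNMP v3 user."""
    return [
        "snmpwalk", "-v3",
        "-l", "authPriv",           # require both authentication and privacy
        "-u", user,
        "-a", auth_proto, "-A", auth_pass,
        "-x", priv_proto, "-X", priv_pass,
        host, oid,                  # walk the system subtree by default
    ]

cmd = build_snmpwalk_cmd("photon01.example.com", "monitor",
                         "authPassphrase", "privPassphrase")
print(" ".join(cmd))
# Uncomment to actually run it against the host:
# print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```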


Comparison of Time Series Database Options

September 27, 2017

We have been investigating how to mature our Time Series Database architecture and options. Toward this end, I have completed an assessment of a couple of the most popular TSDB options, as well as exploring Wavefront. Our team depends heavily on Open Source, but Wavefront is very interesting since it was recently acquired by VMware. Here’s a quick rundown of the assessment.

Every comparison has some assumptions; here are the major ones that I made during this effort.


1. There are currently a couple options that warranted investigation:

a. Wavefront by VMware (Much of the below does not apply to Wavefront since it is a SaaS offering.)

b. Prometheus

c. InfluxDB

Although popular, OpenTSDB was not investigated since initial research appears to show a general dislike of it compared to InfluxDB and Prometheus.

2. We will be using Telegraf as our agent of choice on remote systems for the collection and transmission of events.

3. All comparisons are under identical load: all graphs are shown with both servers receiving the same load via identical queries and ingestion. Load is based on mimicking 900 Telegraf agents that send metrics every 7 seconds. This happens via a Telegraf imitator that I wrote in Go.
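My imitator is written in Go, but the core idea — emitting a batch of InfluxDB line-protocol metrics for hundreds of fake hosts on a 7-second tick — can be sketched in a few lines of Python. The host count, measurement name, and field names below are illustrative, and the POST target depends on which backend is being load-tested.

```python
import random
import time

def make_batch(host_count, timestamp_ns):
    """Build one InfluxDB line-protocol batch: one cpu line per fake host."""
    lines = []
    for i in range(host_count):
        usage = round(random.uniform(0.0, 100.0), 2)
        lines.append(
            f"cpu,host=fake{i:03d} "
            f"usage_user={usage},usage_idle={round(100 - usage, 2)} "
            f"{timestamp_ns}"
        )
    return "\n".join(lines)

batch = make_batch(900, time.time_ns())
print(batch.splitlines()[0])
# In the real imitator, each batch is POSTed to the ingestion endpoint
# (e.g. InfluxDB's /write path) every 7 seconds on a timer.
```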

This whole bit of research has brought up a whole list of questions regarding not just simply the use of time series data, but a more general question of its role in monitoring. Please see https://www.usenix.org/conference/srecon17americas/program/presentation/wilkinson for a very informative talk on the subject.

Review Criteria

Every comparison needs to have a set of pre-defined criteria to base our decisions and testing on. Our list will be:

more “Comparison of Time Series Database Options”

Automated Build of a WordPress Site

September 26, 2017

With the plethora of readily available and economical hosting solutions, it’s super easy to spin up a WordPress site. Deciding that my blog needed to be moved from Blogger, I’ve built this one using Google Cloud Platform for the underlying OS (running CentOS 7) and created a simple Ansible Playbook w/template to build a WordPress server with minimal manual intervention.


If you are interested in basic automation, checkout Ansible and feel free to use the playbook to spin up your own WordPress Server.

Deploying vRealize Log Insight (vRLI) via API

June 8, 2017

I’ve finally gotten around to upgrading the vRLI Configuration Management and Audit Tool to handle the full deployment process as well as clustering! Let’s take it for a spin to see what the new features allow us to do!

1. First we need to deploy the vRLI VMs from the OVA that can be downloaded from my.vmware.com. Once they have fully booted and you see them serving the following webpage, we can start. You can close your browser at this point; nothing is required here other than checking that they are fully booted.

2. The tool uses a JSON configuration file that you can see a sample of by running the program with a “-d” flag or browsing the first part of the Python (my recommended approach). You can also generate a simplified version by calling the wizard using a “-b” flag. For now, I’m going to create my configuration file based on the sample in the documentation with a single Master Node under the “fqdn” key and 2 Secondary Nodes under the “nodes” key in my JSON file. This means that when the script is done I will have a new, 3 node vRLI Cluster.
Let’s kick off the program and tell it to use my configuration file by running:
python li-json-api.py -f  ctest.json -r
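For reference, a ctest.json matching the setup described above (one Master Node under “fqdn”, two Secondary Nodes under “nodes”) can be generated with a few lines. The hostnames are placeholders, and the real sample in the script may include additional keys.

```python
import json

# "fqdn" is the Master Node; "nodes" lists the Secondary Nodes that will be
# joined to it, giving a 3-node cluster when the script finishes.
config = {
    "fqdn": "vrli-master.example.com",
    "nodes": [
        "vrli-node2.example.com",
        "vrli-node3.example.com",
    ],
}

with open("ctest.json", "w") as f:
    json.dump(config, f, indent=2)
```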

more “Deploying vRealize Log Insight (vRLI) via API”

Getting Fancy with Log Insight Alerting (aka. Monitoring DHCP pools via logs)

October 5, 2016

Recently, I was asked about monitoring Microsoft DHCP IP Address Pools using Log Insight to alert when the pool was exhausted and DHCP requests were failing. There are a couple ways to do this, but I’d like to cover two as a demonstration of getting a bit fancy with your alert queries and it paying off big time!

First off, Microsoft DHCP Servers write their events to a log file, but only at the end of the day, so we can parse that file for an Event ID of 14 to see when we ran out. This is easy to do, as shown below using Event ID 11 (DHCP Renew) as an example. The regex is simple, but unfortunately we get the information far too late!

Enter the Log Insight Agent’s ability to read Windows Event Logs! As your DHCP Server starts running low on available addresses in a certain pool it starts to throw warnings in the System Event Log with an Event ID of 1376 that state what percent is currently used and how many addresses are still available.

It would be really cool if we could have Log Insight fire off an alert if these messages showed that we were above 90% used, right? But it’s text… how do we do math on text in log messages? The good news is that not only can you accomplish this, it’s easy to do!

First off, we need to create an Extracted Field that allows us to treat the value of percentage used as an integer. Simply highlight the number and select “Extract Field”
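Under the hood, an Extracted Field is essentially a regex capture over the message text. As a rough illustration, here is the same extraction and the over-90% comparison in Python; the wording of the event 1376 message below is approximate, so the pattern would need to be adjusted to match your real messages.

```python
import re

# Approximate text of a System log event 1376 warning (wording illustrative).
message = ("Scope, 10.0.10.0, is 92 percent full with only 20 IP addresses "
           "remaining.")

# Capture the percent-used value so it can be treated as an integer.
match = re.search(r"is (\d+) percent full", message)
percent_used = int(match.group(1))

# The same comparison the vRLI alert query performs on the Extracted Field.
if percent_used > 90:
    print(f"ALERT: DHCP scope is {percent_used}% used")
```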

more “Getting Fancy with Log Insight Alerting (aka. Monitoring DHCP pools via logs)”

Corrupt Microsoft SQL Database Log in AlwaysOn High Availability Group (AAG)

September 22, 2016

We recently ran into an issue with one of our environments where the Microsoft SQL Server experienced corruption in the database log. This issue is usually discovered when you attempt to create a new backup and it fails with the message “BACKUP detected corruption in the database log”.

Resolving this issue is normally fairly easy (set the database from the Full Recovery Model to Simple and then back again), but it gets a bit more complex when your database is replicated via an AlwaysOn High Availability Group. Here are the steps to fix it (assuming no other databases are in the AAG).

1. Remove Secondary Replica – First we need to stop replication to the secondary replica. To do this we are going to connect to the primary node in our cluster and right click on the SECONDARY replica. Then we select “Remove from Availability Group” and follow the wizard.

2. Remove Database from AAG – Next we need to remove the database from the AAG by right clicking on it under the Availability Databases folder and selecting “Remove Database from Availability Group”
At this point you should have your primary node as the only member of the AAG, with no databases associated. Next, delete the database from the SECONDARY node. Your secondary server should now have no replicas, no availability databases, and no database.
3. Next we need to change the remaining copy of the database on our primary node from Full to a Simple Recovery Model by right clicking on the database and selecting properties > Options.
4. Next we need to do a full backup of the database.
5. Repeat the steps in #3 but in this case change it from simple back to the original Full Recovery Model.
6. Backup the database again.
Now we are ready to re-add the secondary replica.
7. On the primary server, right click on the Availability Replicas folder and select “Add Replica…”
Next you will need to select the “Add Replica” button and will be prompted to connect to your secondary server.
After this you will want to configure your replica. In our case we have chosen to make the secondary copy of the database readable, as well as enabling automatic failover.
On the next screen you will need to configure your sync preferences. We are using a Full sync, which requires a file share accessible by both SQL Servers. SQL will run a backup, place it on the remote share, and the secondary node will restore the database from that initial backup.
Follow the wizard and verify that everything passes.
After this you can track the progress of the backup/restore/sync.
With that you should have a working AlwaysOn Availability Group again!
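The dialog-driven steps 3 through 6 above (flip to the Simple Recovery Model, back up, flip back to Full, back up again) map to four T-SQL statements. As a sketch, the helper below just builds those statements; the database name and backup share are placeholders, and each statement can then be run via sqlcmd or pymssql (with autocommit enabled, since BACKUP DATABASE cannot run inside a transaction).

```python
def recovery_flip_statements(db, backup_path):
    """T-SQL for steps 3-6: flip to Simple, back up, flip to Full, back up."""
    return [
        f"ALTER DATABASE [{db}] SET RECOVERY SIMPLE",                       # step 3
        f"BACKUP DATABASE [{db}] TO DISK = N'{backup_path}\\{db}_simple.bak'",  # step 4
        f"ALTER DATABASE [{db}] SET RECOVERY FULL",                         # step 5
        f"BACKUP DATABASE [{db}] TO DISK = N'{backup_path}\\{db}_full.bak'",    # step 6
    ]

stmts = recovery_flip_statements("MyAppDB", r"\\backupshare\sql")
for s in stmts:
    print(s)
```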

FreeTDS and Microsoft SQL Server Windows Authentication – Part 1

September 16, 2016

I’ve been trying to get the Zenoss SQL Transaction Zenpack working so that we can use Zenoss to run SQL queries for specific monitoring purposes and ran into a few things that might be worth sharing.

Using tsql for troubleshooting

Zenoss, among many other tools, uses pymssql to connect to your SQL Servers, and pymssql uses FreeTDS behind the scenes. If you can’t get pymssql to work, you can go a layer deeper to see if you can find the issue. In my case I have the following configuration:

Fedora Server 23

First off, FreeTDS uses a config file at /etc/freetds.conf that has a [Global] section and examples for configuring individual server types. This is important because you need to use TDS version 7.0+ for Windows Authentication to work.

If we try to connect using the diagnostic tool tsql (not to be confused with the language T-SQL) without changing the default TDS version or adding a server record in the config file, our attempts will fail.

To fix this you can either:
Change the Global value for “tds version” to be 7+ (sounds like a good idea to me if you only have MSSQL):

or you can add a server record for each Microsoft SQL Server and leave the global version less than 7.

The catch to the second method is that when you run your queries you will have to use the name as shown in the config file (in this case us01-0-srs1); you cannot use the FQDN, or the connection will fail because it falls back to the Global setting. This method also creates overhead in managing the list of MSSQL Servers in the freetds.conf file.
Either way, at this point you should have tsql able to query your MSSQL Servers using Windows Authentication.
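To make the two options concrete, a freetds.conf along these lines covers both; the hostname is a placeholder, and the exact TDS version (7.0 through 7.4) should match your SQL Server.

```ini
[global]
        # Option 1: raise the default TDS version for every connection
        tds version = 7.3

[us01-0-srs1]
        # Option 2: per-server record; connect with "tsql -S us01-0-srs1"
        host = us01-0-srs1.domain.com
        port = 1433
        tds version = 7.3
```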
Getting started with pymssql
To make sure that pymssql is working, I threw together a quick bit of Python that allows you to connect using Windows Authentication.

It’s basically a simplified version of the example on the pymssql web page, but it will prove whether pymssql and MSSQL Windows Authentication are working or not.

————-BEGIN Code
import pymssql

print('Connecting to SQL')
conn = pymssql.connect(server='server.domain.com', user='DOMAIN\\username',
                       password='Super Secret P@ssW0rds', database='master')

print('Creating cursor')
cursor = conn.cursor()

print('Executing query')
cursor.execute("""
    SELECT MAX(req.total_elapsed_time) AS [total_time_ms]
    FROM sys.dm_exec_requests AS req
    WHERE req.sql_handle IS NOT NULL
""")

print('Fetching results')
row = cursor.fetchone()
while row:
    print(row)
    row = cursor.fetchone()

print('Closing connection')
conn.close()

————-END Code
After filling in the details of your MSSQL Server, you can simply run it and see the results.
Part 2 will cover the Zenoss-specific aspects of this…