We have been investigating how to mature our Time Series Database (TSDB) architecture and options. To this end, I have completed an assessment of a couple of the most popular TSDB options and have also explored Wavefront. Our team depends heavily on Open Source, but Wavefront is very interesting since it was recently acquired by VMware. Here’s a quick rundown of the assessment.
Every comparison has some assumptions; here are the major ones that I made during this comparison effort.
1. There are currently a couple options that warranted investigation:
a. Wavefront by VMware (Much of the below does not apply to Wavefront since it is a SaaS offering.)
Although popular, OpenTSDB was not investigated since initial research appears to show a general dislike of it compared to InfluxDB and Prometheus.
2. We will be using Telegraf as our agent of choice on remote systems for the collection and transmission of events.
3. All comparisons are under identical load. All graphs are shown with both servers receiving the same load via identical queries and ingestion. Load is based on mimicking 900 Telegraf agents that send metrics every 7 seconds. This happens via a Telegraf imitator that I wrote in Go.
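As a rough back-of-the-envelope check on that load, the arrival rate can be sketched as below. The function names and the points-per-post parameter are illustrative assumptions; the number of metrics each simulated agent actually sent is not stated here.

```python
def posts_per_second(agents: int, interval_s: float) -> float:
    """Average number of agent posts reaching the server each second."""
    return agents / interval_s


def points_per_second(agents: int, interval_s: float, points_per_post: int) -> float:
    """Average ingested data points per second.

    points_per_post is a hypothetical figure; it depends on which Telegraf
    input plugins are enabled on the agents.
    """
    return posts_per_second(agents, interval_s) * points_per_post


# 900 simulated agents posting every 7 seconds (the load described above).
print(f"{posts_per_second(900, 7):.1f} posts/sec")
```

Multiplying by a measured points-per-post value gives a figure that can be compared against the ingestion rates vendors publish.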
This whole bit of research has brought up a whole list of questions regarding not just simply the use of time series data, but a more general question of its role in monitoring. Please see https://www.usenix.org/conference/srecon17americas/program/presentation/wilkinson for a very informative talk on the subject.
Every comparison needs to have a set of pre-defined criteria to base our decisions and testing on. Our list will be:
Wavefront – While Wavefront is a SaaS offering, it does require the installation and configuration of a proxy. The deployment process is quick and painless with many available options.
InfluxDB – Deploying an InfluxDB Server is easily done using the provided repos and RPMs. Native integration with Systemd and journald makes management easy.
Prometheus – There really isn’t an “install” process. You simply download the compressed folder, create a configuration file and execute the Prometheus executable. No native integration with OS management.
ACLs for data access
Prometheus – No security/ACL measures. Prohibits any multi-tenancy possibilities
InfluxDB – Yes, three levels of ACLs (Cluster Admin, Database Admin, Database User)
Wavefront – Yes, based on vIDM.
Data transmission via Telegraf
Prometheus does not have a mechanism to receive data; it will only pull data. This approach brings some rewards as well as drawbacks that will be evident throughout the following review.
Prometheus – Using the Telegraf output plugin for Prometheus, you are simply publishing a web page of specifically formatted output for Prometheus to scrape. You would have to restrict that port to exclusive access by Prometheus to prevent data leakage, as well as configure HTTPS on the Prometheus server.
InfluxDB – Secure transmission using Telegraf over HTTPS is supported
Wavefront – Fully supports a custom RPM of Telegraf. Output plugin is not published as part of the Telegraf project, but is planned to be included in Telegraf 1.5 once released.
InfluxDB – During the bake-off an outage was experienced where the server rapidly went from ~69% disk usage to 100% in a matter of 15 minutes. This was very unusual considering the overall trend, but it appears that there have been several bugs in the past with Influx where disk tidying went awry and filled the disk. More information on this below in the Compression/Storage section.
No data corruption appears to have happened; more disk space was added and the service was restarted to restore service.
Prometheus – During a reboot, all metric data appeared to be lost. Root cause is unknown as it was very early in my evaluation. Disk storage was configured at the time.
Wavefront – As a SaaS offering by VMware, support is enterprise grade
InfluxDB – Support is available via the Enterprise version of Influx
Prometheus – Support is available via 3rd party vendors
InfluxDB – 2.2 bytes per data point
Prometheus – 1.3 bytes per data point
These data points are roughly verified by the overall consumption of disk over time, showing the rate of growth on the Influx server (left) and Prometheus (right). The slope appears to be steeper for Prometheus in the screenshots, but this is an illusion since the scales are different.
An interesting note is that InfluxDB tends to need a significant amount of temporary disk space over time (see the warning below), as shown in the above left graph.
During the outage experienced in the testing period, this temporary space was not available, which caused a crash.
Several data points suggest that you must have between 1.3 and 1.7 times the amount of used storage for this “bursting”. This might be configurable, but it warrants further investigation if the product is to be used.
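The storage figures above can be turned into a rough sizing estimate. The sketch below assumes the 1.3 to 1.7 multiplier applies to total provisioned disk relative to steady-state usage, which is my reading of those data points and not a documented InfluxDB rule; the one-billion-point figure is purely hypothetical.

```python
def steady_state_bytes(points: int, bytes_per_point: float) -> float:
    """Disk used by stored points at a given compression ratio."""
    return points * bytes_per_point


def provisioning_range(used_bytes: float, low_factor: float = 1.3,
                       high_factor: float = 1.7) -> tuple:
    """Lower and upper disk sizes that leave room for temporary bursts.

    The default factors come from the 1.3x to 1.7x observation above and are
    an assumption, not a vendor recommendation.
    """
    return used_bytes * low_factor, used_bytes * high_factor


# Hypothetical example: one billion points at InfluxDB's 2.2 bytes per point.
used = steady_state_bytes(1_000_000_000, 2.2)
low, high = provisioning_range(used)
print(f"used: {used / 1e9:.2f} GB, provision {low / 1e9:.2f} to {high / 1e9:.2f} GB")
```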
Prometheus – Part of the reason for the superior performance of Prometheus queries is the amount of data that is cached in memory. By default, around 80% memory consumption is used but the value can be modified.
InfluxDB – Memory consumption is not static like Prometheus, but goes in cycles depending on ingestion and query load.
CPU usage seems relatively low and stable on Prometheus; during normal operations InfluxDB spikes to around 60%.
(Both datapoints from non-verified 3rd party)
Prometheus – 800k/sec
InfluxDB – 470k/sec
Prometheus – Because Prometheus uses a pull instead of a listen method, all endpoints must be added to an array of devices in the configuration file. This file can be updated and then reloaded via the API or by sending the process a HUP signal. One significant downside is that no path can be specified except the default “/metrics”
InfluxDB – Because any device can send metrics if it is using the correct credentials there is no overhead for device management.
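Because the Prometheus target list is plain text, it can be generated from an inventory instead of being edited by hand. The following is a minimal sketch that renders a `scrape_configs` fragment with a `static_configs` target array; the job name, host names and port 9273 are assumptions for illustration and should be replaced with whatever your Telegraf agents actually expose.

```python
def render_scrape_config(job_name: str, hosts: list, port: int = 9273) -> str:
    """Render a YAML scrape_configs fragment listing every host as a target."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {job_name}",
        "    static_configs:",
        "      - targets:",
    ]
    # One quoted 'host:port' entry per endpoint to be scraped.
    lines.extend(f"          - '{host}:{port}'" for host in hosts)
    return "\n".join(lines) + "\n"


# Hypothetical inventory of three hosts.
hosts = [f"host{n:03d}.example.com" for n in range(1, 4)]
print(render_scrape_config("telegraf", hosts), end="")
```

The generated fragment still has to be merged into the server configuration and reloaded, so this reduces the typing but not the per-device bookkeeping.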
InfluxDB – Can be configured per database
Prometheus – Only supported method is double writes which is essentially having 2 servers query and store mirrored results
InfluxDB – Available for Enterprise Customers only
Query Language and Functional Capabilities
InfluxDB – While providing lots of well-documented functions and being similar enough to SQL that anyone with a SQL background can use it, it is nowhere near as capable as normal T-SQL. A perfect example of this is that it has some support for a “join” concept, but unlike T-SQL it grabs all fields of the same name and you cannot identify which series (think SQL table) a field comes from. If you want to show the “available_percent” of the cpu series but not of the mem series, while also gathering other data from the mem series, there is no way to select just the one. Simply put, I think that the approach to querying TS data is different enough that prior SQL syntax knowledge is not a critical deciding factor; I actually think that it might be a detriment.
Wavefront – This is honestly a thing of beauty. For beginners like myself there is a wizard that helps guide you, plus auto-complete to make creation easier. I’ll cover more of this in a future post.
Prometheus – Out of the box, it’s ugly, clumsy and has no options for security. Nothing here is attractive. Use Grafana…
InfluxDB – InfluxDB is part of a suite of tools and as such it doesn’t offer a UI without additional components. That was out of scope for this effort so has not been investigated. However, Grafana was used and performed well.
Prometheus – In my mind, the biggest downfall for Prometheus is that it is a pull-only model. I really wish that they also had the ability to push to the Prometheus Server. Granted, you can push to a middle ground and have it scraped; but in my opinion it is still a serious downside. More on this in the conclusion.
InfluxDB – Nothing here is unique and there is beauty in simplicity. Ingestion is native if you are using Telegraf and if not, using HTTP Post does the job quite nicely. During my testing I was mimicking first 450 hosts and then 900 with Telegraf metrics and sending the data every 7 seconds. Influx never blinked at the amount of sent metrics. One disadvantage when using the API is that data types matter and non-intuitively, integers must be followed with an “i”, for example “3i”, or it will be misidentified as a float and your API command will fail.
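The integer suffix requirement is easy to get wrong when building write payloads by hand, so here is a minimal sketch of a formatter that applies it. The function names and the sample measurement are illustrative assumptions, and escaping of spaces or commas in tag values is deliberately left out.

```python
def format_field(value) -> str:
    """Format one field value, adding the trailing 'i' that marks an integer."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def format_line(measurement: str, tags: dict, fields: dict, timestamp_ns=None) -> str:
    """Build a single write line: measurement,tags fields [timestamp]."""
    tag_text = "".join(f",{key}={val}" for key, val in sorted(tags.items()))
    field_text = ",".join(f"{key}={format_field(val)}" for key, val in fields.items())
    line = f"{measurement}{tag_text} {field_text}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical point: one float field and one integer field.
print(format_line("mem", {"host": "web01"}, {"available_percent": 41.5, "buffered": 3}))
```

Keeping the type decision in one place avoids a field being written as a float on one host and as an integer on another.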
Prometheus – This feature is available but was not tested
InfluxDB – This feature is available but was not tested
Prometheus – Doesn’t expose any configuration over API
InfluxDB – Offers the ability to create databases over API, nothing more.
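For InfluxDB, creating a database over the API amounts to posting a `CREATE DATABASE` statement to the `/query` endpoint. The sketch below only builds the request and does not send it; the helper name, the base URL and the database name are assumptions for illustration, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_database_request(base_url: str, name: str) -> Request:
    """Build (but do not send) a POST to /query that creates a database."""
    if not name or '"' in name:
        raise ValueError("database name must be non-empty and contain no double quotes")
    body = urlencode({"q": f'CREATE DATABASE "{name}"'}).encode("ascii")
    return Request(f"{base_url}/query", data=body, method="POST")


# Hypothetical local server and database name.
req = create_database_request("http://localhost:8086", "telegraf")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```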
Prometheus – While Prometheus provides rudimentary alerting natively, it is recommended to use the separate Alertmanager tool. Alternatively, as with InfluxDB, alerting can also be handled by Grafana.
InfluxDB – No native alerting, dependent on Grafana or other components of the TICK Stack.
Wavefront – Robust alerting with a very friendly UI
All 3 products work well with Telegraf
Grafana supports both InfluxDB and Prometheus out of the box
3rd Party Comparative Analysis
Attempted Comparison Only
The problem with comparing apples to oranges to pineapple juice is that they are all substantially different. All investigated solutions appear to be very capable of handling Time Series Data for our environments as long as they are properly architected, in some cases, with their limitations and weaknesses in mind. Wavefront is certainly a very attractive option among the three. Besides the reduced management overhead and very good performance, it is also the very clear winner in terms of a friendly UI. If SaaS is an option, I highly recommend that you investigate this one. If not…
InfluxDB has its strength in product maturity (even though it is version 1.3), a more familiar query syntax and expanded data types. The use of int64, float64, bool, and string data types provides an advantage in my mind over Prometheus’ limitation to float64 data types exclusively. That said, it does appear to be outperformed by Prometheus on most fronts (partly due to a less restrictive data model). Additionally, there are published use cases of it working at enormous scale in production environments. HA is also an option here, which is a large potential benefit. To expand on that maturity a bit more, you can apply RBAC to your stored data, which allows for multi-tenancy.
Prometheus is obviously a less mature tool, at times to a somewhat frustrating level, but shows great feature strengths. Some of the immaturity (such as lack of systemd integration and RPM) can be easily overcome by contributing to the project. I am very impressed with its stable resource usage, speed of query returns and efficient data storage. On the other hand, I very much dislike the pull-only methodology. It seems to invite unnecessary complexity, widens the attack surface of your environment and makes data leakage easy. While I readily admit that there are advantages, such as the ability to have multiple devices query endpoints without endpoint configuration, I feel this advantage is a bit overstated due to the security requirements that should be implemented so “not everybody” can read the published metrics. I’m not fully against a pull architecture; I just feel that the lack of an accompanying listener is its Achilles heel. Also, to add a device you have to modify a text array in the configuration file of your Prometheus server. After a dozen, it gets tedious; after a thousand, it’s a nightmare… If it were not for these two issues, Prometheus would be a clear winner in my mind against InfluxDB.
While I do admit that I have a bias towards Wavefront, it is not without merit. I’m a huge fan of usability, and Wavefront has the best, by far, out of the options I’ve investigated. In the end, we will be going forward using Wavefront for our TSDB, not simply because the tool is owned by VMware, but on its own merits, especially around usability. I’m looking forward to publishing many more posts around our journey implementing it in our environment.