Prometheus, the open source monitoring system for Docker-style containers running in cloud architectures, has formally released version 2.0 with major architectural changes to improve its performance.
Among the changes that have landed since the release of version 1.6 earlier this year:
- An entirely new storage format for the data accumulated by Prometheus.
- A new way for Prometheus to handle “staleness,” i.e., problems that arise when the data Prometheus reports no longer matches the actual state of the cluster.
- A method for taking efficient snapshot backups of the entire database.
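The efficiency of those snapshots comes from the fact that the on-disk blocks in the new storage engine are immutable once written, so a backup can hard-link them instead of copying their bytes. Below is a minimal Python sketch of that hard-link idea; the function name and directory layout are illustrative, not Prometheus's actual on-disk format:

```python
import os

def snapshot(data_dir: str, snapshot_dir: str) -> None:
    """Create a cheap point-in-time snapshot by hard-linking files.

    Hard links share the underlying data blocks with the originals,
    so the snapshot costs almost no extra disk space or I/O. This is
    the general technique that makes snapshots of immutable storage
    blocks efficient; the layout here is purely illustrative.
    """
    for root, _dirs, files in os.walk(data_dir):
        rel = os.path.relpath(root, data_dir)
        target = os.path.join(snapshot_dir, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name),
                    os.path.join(target, name))
```

In Prometheus itself, a snapshot is requested over the HTTP admin API (`POST /api/v1/admin/tsdb/snapshot`), which must first be enabled with the `--web.enable-admin-api` flag.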
Most of the changes shouldn’t force experienced Prometheus users to retool their environments. The new features are meant to work under the hood, without significantly altering workflow, although there are a few breaking changes.
New in Prometheus 2.0: More efficient time-series database storage format
Under the hood, Prometheus is a time-series database—a system for gathering statistics about running containers and storing them in a way that’s indexed by timestamps. Because time-series data arrives at high speed and from many sources, it’s hard to aggregate properly. Writing the data to disk becomes a major bottleneck.
Prometheus 2.0 introduces an entirely new storage engine designed around this write-heavy workload. The result is far less CPU and disk usage, more manageable latency for queries, and a better mechanism for mopping up data that isn’t needed anymore.
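The core trick behind engines like this is to buffer incoming samples per series in memory and write them out in sealed batches, turning many tiny random writes into a few large sequential ones. Here is a simplified Python sketch of that write path; it assumes nothing about Prometheus's actual chunk encoding, which adds compression, an inverted label index, and a write-ahead log:

```python
from collections import defaultdict

class TinyTSDB:
    """Toy time-series store illustrating a batched write path.

    Samples are appended to an in-memory buffer per series, then
    sealed into immutable chunks once the buffer fills. Names and
    sizes here are illustrative, not Prometheus internals.
    """
    CHUNK_SIZE = 120  # samples per chunk before sealing

    def __init__(self):
        self.head = defaultdict(list)    # series -> open buffer
        self.chunks = defaultdict(list)  # series -> sealed chunks

    def append(self, series: str, ts: float, value: float) -> None:
        buf = self.head[series]
        buf.append((ts, value))
        if len(buf) >= self.CHUNK_SIZE:
            # Seal the chunk; a real engine would compress it and
            # write it to disk sequentially at this point.
            self.chunks[series].append(tuple(buf))
            buf.clear()

    def query(self, series: str, start: float, end: float):
        samples = [s for c in self.chunks[series] for s in c]
        samples += self.head[series]
        return [(t, v) for t, v in samples if start <= t <= end]
```

A query simply scans the sealed chunks plus the open buffer, so recent data is visible immediately even before it is flushed.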
Again, the vast majority of Prometheus deployments won’t need to do anything to leverage these improvements, other than deploy Prometheus 2.0.
New in Prometheus 2.0: Better handling of stale data from containers
Another problem Prometheus users have observed is how the system has trouble handling stale data. For instance, users sometimes get bombarded with alerts about a service being down, even after that service has already come back up. Another problem arises when a resource disappears from monitoring and then reappears within a certain timeframe: it can end up being counted twice and produce misleading statistics.
Prometheus 2.0 deals with this by having more explicit rules for handling events from sources that have gone stale. The logic for handling this is surprisingly complex, but the end user doesn’t have to deal with the vast majority of the details.
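The key change is that when a series stops being reported, Prometheus 2.0 writes an explicit staleness marker (a special NaN sample) instead of letting queries keep returning the last value for a fixed lookback window. The following Python sketch shows the general marker idea under simplified assumptions; the class and method names are hypothetical, and Prometheus uses a specific NaN bit pattern rather than a plain NaN:

```python
import math

STALE = float("nan")  # sentinel appended when a series disappears

class SeriesView:
    """Sketch of explicit staleness markers.

    Without markers, a query keeps resolving a vanished series to
    its last sample until the lookback window expires. A marker
    written at disappearance time drops the series from results
    immediately.
    """
    def __init__(self):
        self.samples = []  # (timestamp, value), append-only

    def append(self, ts: float, value: float) -> None:
        self.samples.append((ts, value))

    def mark_stale(self, ts: float) -> None:
        self.samples.append((ts, STALE))

    def value_at(self, ts: float, lookback: float = 300.0):
        # Most recent sample at or before `ts` within the lookback
        # window -- unless it is a staleness marker, in which case
        # the series counts as gone.
        for t, v in reversed(self.samples):
            if t <= ts:
                if ts - t > lookback or math.isnan(v):
                    return None
                return v
        return None
```

If the source reappears, a fresh sample lands after the marker and the series becomes visible again, which avoids both the phantom-alert and double-counting problems described above.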
Source code for the project, and all its related subprojects, is available on GitHub.