Running a Minimal Prometheus

clux September 07, 2024 [software] #kubernetes #observability #prometheus #homelab

Prometheus does not need to be hugely complicated, nor a massive resource hog, provided you follow some principles.

Background

My last #prometheus posts have been exclusively about large scale production setups, and the difficulties this pulls in.

I would like to argue that these difficulties are often self-imposed, and a combination result of inadequate cardinality control and induced demand.

👺: You should be able to run a prometheus on your handful-of-machine-sized homelab with <10k time series active, using less than 512Mi memory, and 10m CPU.

Disclaimer: Who am I?

I am a platform engineer working maintenance of observability infrastructure, and a maintainer of kube-rs working on rust integration of observability with Kubernetes platforms. My knowledge of prometheus is superficial and mostly based on practical observations around operating it for years. Suggestions or corrections are welcome (links at bottom).

Signals, Symptoms & Causes

To illustrate this, let's try to answer the (perhaps obvious question): why do you install a metrics system at all?

Primarily; we want to be able to track and get notified on changes to key signals. At the very basic level you want to tell if your "service is down", because you want to be able to use it. That's the main, user-facing signal. Setup something that test if the service responds with a 2XX, and alert on deviations. You don't even need prometheus for this.

However, while you can do basic sanity outside the cluster, you need utilisation and saturation, to tell you about less obvious / upcoming failures:

message queues full :: rejected work
high cpu utilisation :: degraded latency
memory utilisation high :: oom kill imminent
disk space nearing full :: failures imminent

You can argue idealistically about whether you should only be "aware of something going wrong", or be "aware that something is in a risky state" (e.g. symptoms not causes), but utilisation/saturation/errors are ultimately very good indications/predictors for degraded performance, so we will focus on these herein.

How Many Signals

MAIN POINT: You should be able to enumerate the basic signals that you want to have visibility of.

Let's do some simplified enumeration maths on how many signals you actually want to properly identify failures quickly.

Compute Utilisation/Saturation

Consider an example cluster with 200 pods, and 5 nodes.

Want {utilization,saturation} of {cpu,memory} at cluster level :: 2 * 2 = 4 signals
Want to break these down per Pod :: 200 * 4 = 800 signals
Want to break these down per Node :: 5 * 4 = 20 signals

So, in theory, we should be able to visualise cluster, node, and pod level utilisation for cpu and memory with only 820 metrics (but likely more if you want to break down node metrics).

NB: Pods are only found on one node, so Pod cardinality does not multiply with Node cardinality.

These will come from a combination of cadvisor and kube-state-metrics; huge beasts with lots of functionality and way more signals than we need. Depending on how much of the kubernetes mixin you want, you may want more signals.

Node State Breakdown

If you want to break down things within a node on a more physical level, then you you can also grab node-exporter.

Assuming, for simplicity, 10 cores per node, 10 disk devices per node, and 10 network interfaces:

Want {utilization,saturation,errors} of cpu :: 3*10 * 5 = 60 signals
Want {utilization,saturation,errors} for memory :: 3*5 = 15 signals
Want {utilisation,saturation,errors} of disks :: 3*10 * 5 = 100 signals
Want {utilisation,saturation,errors} of network interfaces :: 10*3 * 5 = 150 signals

In theory, you should be able to get decent node monitoring with less than 400 metrics.

In Practice

Doing this type of enumeration is helpful as a way to tell how close you are to your theoretical 100% optimised system, but how does the number hold up in practice?

Metric Inefficiencies

The problems with expecting this type of perfection in practice is that many metric producers are very inefficient / overly lenient with their output. You can click on the addendum below for some examples, but without extreme tuning you can minimally expect a 5x inefficiency factor to apply to the above in practice.

Addendum: Inefficiency Examples

Take for instance, job="node-exporter" putting tons of labels on metrics such as node_cpu_seconds_total:

{cpu="15", mode="idle"} 1
{cpu="15", mode="iowait"} 1
{cpu="15", mode="irq"} 1
{cpu="15", mode="nice"} 1
{cpu="15", mode="softirq"} 1
{cpu="15", mode="steal"} 1
{cpu="15", mode="system"} 1
{cpu="15", mode="user"}

Maybe you dont care about either of these splits, you just want to have a cpu total and data for one mode.

Well, firstly, it's hard to get rid of mode because the standard recording rules depend on mode. Secondly, while you could try to rid of cpu, there's no great way of doing this without something like stream aggregation as labeldrop: cpu leads to collisions.

Similarly, if you want to support a bunch of the mixin dashboards with container level breakdown, you are also forced to grab a bunch of container level info for e.g. k8s.rules.container_resource unless if you want to rewrite the world.

And on top of this, many of these exporters also export things in very suboptimal ways. E.g. job="kube-state-metrics" denormalising disjoint phases (never letting old/zero phases go out of scope):

kube_pod_status_phase{phase="Pending", pod="forgejo-975b98575-fbjz8"} 0
kube_pod_status_phase{phase="Succeeded", pod="forgejo-975b98575-fbjz8"} 0
kube_pod_status_phase{phase="Failed", pod="forgejo-975b98575-fbjz8"} 0
kube_pod_status_phase{phase="Unknown", pod="forgejo-975b98575-fbjz8"} 0
kube_pod_status_phase{phase="Running", pod="forgejo-975b98575-fbjz8"} 1

This particular one may get a fix though.

Visualising Cardinality

I have a dashboard for this with this panel:

Shows metrics produced, how many we decided to keep, and a breakdown link leading to topk(50, count({job="JOB-IN-ROW"}) by (__name__)). It's probably the panel that has helped the most in slimming down prometheus.

Outing Out

The most unfortunate reality is that this stuff is all opt-out. The defaults are crazy inclusive, and new versions of exporters introduce new metrics forcing you to watch your usage graph upgrades.

Thankfully, in a gitops repo this is easy to do (if you actually generate your helm templates into a deploy folder) because all your dashboards and recording rules live there.

Procedure: Unsure if you need this container_sockets metric? rg container_sockets deploy/ and see if anything comes up:

No hits? action: drop.
Only used in a dashboard? Do you care about this dashboard? No? action: drop.
Used in a recording rule? Does the recording rule go towards an alert/dashboard you care about? No? action: drop.

Grafana cloud has its own opt-out ML based adaptive metrics thing to do this, and it's probably helpful if you are locked into their cloud. The solution definitely has a big engineering solution feel to it, but can't really blame it as it's not like you can easily remove polluting labels from needed metrics these days without causing collisions in OSS prom.

Things You Don't Need

Sub Minute Data Fidelity

You are probably checking your homelab once every few days at most, so why do you expect you would need 15s/30s data fidelity? You can set 60s scrape/eval intervals and be fine:

scrapeInterval: 60s
evaluationInterval: 60s

You could go even higher, but above 1m grafana does starts to look less clean.

To compensate, you can hack your mixin dashboards to increase the time window, and set a less aggressive refresh:

# change default refreshes on grafana dashboards to be 1m rather than 10s
fd -g '*.yaml' deploy/promstack/promstack/charts/kube-prometheus-stack/templates/grafana/dashboards-1.14/ \
   -x sd '"refresh":"10s"' '"refresh":"1m"' {}
# change default time range to be now-12h rather than now-1h on boards that cannot handle 2 days...
fd -g '*.yaml' deploy/promstack/promstack/charts/kube-prometheus-stack/templates/grafana/dashboards-1.14/ \
   -x sd '"from":"now-1h"' '"from":"now-12h"' {}
# change default time range to be now-2d rather than now-1h on solid dashboards...
fd -g 'k8s*.yaml' deploy/promstack/promstack/charts/kube-prometheus-stack/templates/grafana/dashboards-1.14/ \
   -x sd '"from":"now-12h"' '"from":"now-2d"' {}

..which is actually practical if you use helm template rather than helm upgrade.

👺: Somewhere out there, jsonnet users are gouging their eyes out reading this.

Most Kubelet Metrics

97% of kubelet metrics are junk. In a small cluster it's the biggest waste producer in a small cluster, often producing 10x more metrics than anything else. Look at the top 10 metrics they produce with just 1 node and 30 pods:

None of these are useful / contribute towards my above goal. Number 4 and 5 on that list together produce more signals than I consume in TOTAL in my actual setup. The kube-prometheus-stack chart does attempt to reduce this somewhat, but it by far too little.

Histograms

Histograms generally account for a huge percentage of your metric utilisation. On a default kube-prometheus-stack install with one node, the distribution is >70% histograms:

Of course, this is largely due to kubelet metric inefficiencies, but it's a trend you will see repeated elsewhere (thankfully usually with less eye-watering percentages).

Unless you have a need to see/debug detailed breakdown of latency distributions, do not ingest or produce these metrics (see addendum for more info). It's the easiest win you'll get in prometheus land; wildcard drop them from your ServiceMonitor (et al) objects:

      metricRelabelings:
      - sourceLabels: [__name__]
        action: drop
        regex: '^.*bucket.*'

Addendum: Why does histograms suck?

Multiplicative Cardinality: histogram buckets multiply metric cardinality by 5-30x (the number of buckets).

This number is multiplicative with regular cardinality:

pod / node labels added by prometheus operator :: if 2000 metric per pod, and scaling to 20 pods, now you have 40000 metrics
request information added by users (endpoint, route, status, error) :: 5 eps * 10 routes * 8 statuses * 4 errors = 1600 metrics, but with 30 buckets = 48000
See the Containing Your Cardinality talk for more maths details

Inefficiency 30x multiplied information is a bad way to store a few additional signals

If you want P50s or P99s you can compute these in the app with things like rolling averages or rolling quantiles. Some of these are more annoying than others, but there's a lot you can do by just tracking a mean.

Strive for one signal per question where possible.

That said, if you do actually need them, try to decouple them from your other labels (drop pod labels, drop peripheral information) so that you can get the answer you need cheaply. A histogram should answer one question, if your histogram has extra parameters, you can break them down to smaller histograms (addition beats multiplication for cardinality)

👺: Histograms will get better with the native-histograms feature (see e.g. this talk). However, this requires the ecosystem to move on to protobufs and it being propagated into client libraries (and is at the moment still experimental at the time of writing), it's only been 2 years though.

A Solution

Because I keep needing an efficient, low-cost homelab setup for prometheus (that still has the signals I care about), so now here is a chart.

It's mostly a wrapper chart over kube-prometheus-stack with aggressive tunings / dropping (of what can't be tuned), and it provides the following default values.yaml.

You could use it directly with:

helm repo add clux https://clux.github.io/homelab
helm install [RELEASE_NAME] clux/promstack

but more sensibly, you can take/dissect the values.yaml file and run with it in your own similar subchart.

👺: You shouldn't trust me for maintenance of this, and you don't want to be any more steps abstracted away from kube-prometheus-stack.

Architecture Diagram

The chart is effectively a slimmed down 2022 thanos setup; no HA, no thanos, no prometheus-adapter, but a including a lightweight tempo for exemplars:

Features

A low-compromises prometheus, all the signals you need at near minimum cost.

Cardinality Control

Drops 97% of kubelet metrics, configures minimal kube-state-metrics, node-exporter (with some cli arg configurations and some relabelling based drops) and a few other apps. In the end we have a cluster running with 7000 metrics.

It's higher than my idealistic estimate - particularly considering this is a one node cluster atm. It can definitely be tuned lower with extra effort, but this is a good checkpoint. This is low enough that grafana cloud considers it free. So is it?

Low Utilization

The end result is a prometheus averaging <0.01 cores and with <512Mi memory use over a week with 30d retention:

cpu / memory utilisation 7d means

Auxilary Features

Tempo for exemplars, so we can cross-link from metric panels to grafana's trace viewer.
Metrics Server for basic HPA scaling and kubectl top.

👺: You technically don't need metrics-server if you are in an unscalable homelab, but having access to kubectl top is nice. Another avenue for homelab scaling is keda with its scale to zero ability.

CX Dashboards

The screenshots here are from my own homebrew cx-dashboards released separately. See the future post for these.

Links / Comments

Posted on mastodon. Feel free to comment / raise issues / suggest an edit.