shipcat retrospective
clux · December 15, 2018 · Updated: September 28, 2023 · [software] #rust #kubernetes
The now defunct unicorn startup babylon health needed to migrate about 50 microservices to Kubernetes in early 2018. At Kubernetes 1.8, the supporting tooling was weak, but the company pace was fast.
This is a historically updated post about shipcat, a standardisation tool written to control the declarative format and lifecycle of every microservice, and to get safety quickly.
This article was updated after babylon's demise in 2023. It now serves as a mini-retrospective rather than the mostly broken original announcement (the original repo being down). It adds some better showcases and examples, plus historical context, which together should help avoid some common misconceptions about why this weird tool was written.
First, a bit about the problem:
Migrating to Kubernetes was a non-trivial task for a DevOps team, when the requirement was basically that we would do it for the engineers. We had to standardise, and we had to decide what a microservice ought to look like based on what was already there.
We didn't want engineers to all have to learn every underlying Kubernetes object at once.
We needed validation. Admission control was new, didn't work well with gitops for fast client-side validation, and we just needed CI checks to prevent master from being broken.
The most successful abstraction attempt Kubernetes had seen in this space was helm: a client side templating system (ignoring the bad server side part) that lets you abstract away much of the above into charts (a collection of yaml go templates), ready to be filled in with helm values; the more concise yaml that developers write directly.
Simplistic usage of helm would involve having a chart and calling it with your substitute values, which would garbage collect older kube resources with the myapp label, and start any necessary rolling upgrades in kubernetes.
Even though you could avoid a lot of the common errors by re-using charts across apps, there was still very little sanity on what helm values could contain. Here are some values you could pass through a helm chart to kubernetes and still have accepted:
- misspelled optional values (silently ignored)
- resource requests exceeding largest node (cannot schedule nor vertically auto scale)
- resource requests > resource limits (illogical)
- out of date secrets (generally causing crashes)
- missing health checks / readinessProbe (broken services can roll out)
- images and versions that do not exist (fails to install/upgrade)
And that's once you've gotten over how frustrating it can be to write helm templates in the first place.
While validation is a fixable annoyance, a bigger observation is that these helm values files become a really interesting, but entirely accidental abstraction. These files become the canonical representation of your services, but you have no useful logic around them. You have very little validation, almost no definition of what's allowed in there (helm lint is lackluster), no process of standardisation, it's hard to test the sprawling automation scripts around the values files, and you have no sane way of evolving these charts.
What if we could take the general idea that developers just write simplified yaml manifests for their app, but actually define that API instead? By actually defining the structs we can provide a bunch of sanity checking and validation on top, and we get a well-defined boundary for automation / ci / dev tools.
By defining all our syntax in a library, we could have cli tools for automation and executables running as kubernetes operators using the same definitions. It effectively provided a way to version the platform.
This also allowed us to solve a secrets problem. We extended the manifests with syntax that allows synchronising secrets from Vault at both deploy and validation time. There are better solutions for this now, but we needed something quickly.
This style of tool was not a revolutionary (nor clean) idea. At KubeCon EU 2018 pretty much everyone had their own wrappers around yaml to help with these problems. Tools like helmfile all tried to help out in this space, but they were all missing most of the sanity we required when we started experimenting.
so, how to homebrew Kubernetes validation in an early stage gitops world
The result, perhaps unsurprisingly, was babylon-dependent, fast moving, and not fit for general purpose. But it was still very helpful for the company.
The user interface we settled on was service-level manifests:
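A sketch of what such a manifest could look like (the field names here are hypothetical, inferred from the description below rather than taken from the real shipcat schema):

```yaml
name: webapp
regions:
- dev-uk
maintainers:
- name: "Eirik"
secrets:
  DATABASE_URL: IN_VAULT
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    cpu: 500m
    memory: 500Mi
health:
  uri: /health
```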
This encapsulated the usual kubernetes apis that developers needed to configure themselves: who's responsible for it, what regions it's deployed in, what secrets are needed (notice the IN_VAULT marker), and how resource intensive it is.
It's obviously quite limiting in terms of what you can actually do on Kubernetes, but this simple "one deployment per microservice" with some optional extras was generally sufficient for years.
Because these manifests were going to be the entry point for CI pipelines and handle platform specific validation (for medical software), we wanted maximum strictness everywhere, and that includes the ability to catch errors before manifests are committed to master.
We leant heavily on serde's customisable code generation to encapsulate awkward k8s apis, and to auto-generate the boilerplate validation around types and spelling errors.
The Kubernetes structs were handrolled for the most part, but later incorporated parts of k8s-openapi structs - however these were too Option-heavy to catch most missed-out fields on their own.
Here are some structs we used to ensure limits had the right format:
serde enforces the "schema" validation. It catches spelling errors as extraneous keys due to the #[serde(deny_unknown_fields)] instruction, and it enforces the correct types. On the flip side, having this in code also required us to update the spec ourselves (to, say, support ephemeral storage requirements).
Still, this provided cheap schema validation (before helm got it), and there was also a verify method that every struct could implement. This generally encapsulated common mistakes that were clearly errors and should be caught before they were sent out to the clusters:
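As an illustration, here is a minimal pure-std sketch of what such a verify method could check - the quantity parsing and thresholds here are hypothetical, not shipcat's actual implementation:

```rust
/// Parse a kubernetes cpu quantity ("500m" or "2") into millicores.
/// Hypothetical helper; real quantity parsing handles more suffixes.
fn parse_cpu_millis(q: &str) -> Result<u64, String> {
    if let Some(m) = q.strip_suffix('m') {
        m.parse::<u64>().map_err(|e| e.to_string())
    } else {
        q.parse::<u64>().map(|c| c * 1000).map_err(|e| e.to_string())
    }
}

struct Resources {
    requests_cpu: String,
    limits_cpu: String,
}

impl Resources {
    /// Catch requests > limits (illogical) and requests larger than
    /// the biggest node (unschedulable) before anything hits a cluster.
    fn verify(&self, largest_node_millis: u64) -> Result<(), String> {
        let req = parse_cpu_millis(&self.requests_cpu)?;
        let lim = parse_cpu_millis(&self.limits_cpu)?;
        if req > lim {
            return Err(format!("cpu request {}m exceeds limit {}m", req, lim));
        }
        if req > largest_node_millis {
            return Err("cpu request exceeds largest node".to_string());
        }
        Ok(())
    }
}

fn main() {
    let ok = Resources { requests_cpu: "500m".into(), limits_cpu: "1".into() };
    assert!(ok.verify(4000).is_ok());

    // request of 2 cores against a 500m limit fails verification
    let bad = Resources { requests_cpu: "2".into(), limits_cpu: "500m".into() };
    assert!(bad.verify(4000).is_err());
    println!("verify checks behaved as expected");
}
```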
The Resources struct above was attached straight onto the core Manifest struct (representing the microservice definition above). Devs would write standard resources and be generally unaware of the constraints until they were violated.
In this case the syntax matched the Kubernetes API directly - and this was preferred - but with extra validation.
We did plan on moving validation to a more declarative format (like OPA) down the line, but there was no rush; this worked.
All of the syntax ended up in shipcat/structs - and required developer code-review to modify since it could affect the whole platform.
Once a new version of shipcat was released, we bumped a pin for it in a configuration management monorepo containing all the manifests, and the new syntax + features became available for the whole company.
Developers could check that their manifests passed validation rules locally, or wait for pre-merge validation on CI.
We did always lean on helm charts for templating yaml, but this was always an implementation detail that only a handful of engineers needed to touch, as we followed the one-chart-to-rule-them-all approach. Templates were also linted heavily with kubeval against all services in all regions during chart upgrades.
We had wrappers around the normal shipcat template myapp | kubectl X pipeline.
We didn't really apply locally except when doing local testing, but we could. There was a glorified kubernetes context switcher that ensured we were pointing at the correct vault for the cluster, so it was pretty easy to test on and get accurate diffs.
The upgrade flow was much nicer than any other CLI that existed at the time: it tracked upgrades with deployment-replica progress bars, bubbled up errors, captured error logs from crashing pods, provided inline diffs pre-upgrade, gated on validation, and sent successful rollout notifications to maintainers on slack.
CI actually used this apply setup and reconciled the whole cluster in parallel using async rust. This helped avoid the numerous tiller bugs, and actually let us define a sensible amount of time to wait for a deployment to complete (there's an algorithm in there for it).
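For illustration only, a toy heuristic for the kind of wait-time calculation such an algorithm could use - this is a guess at the general shape (scale with replica count, cap the total), not shipcat's actual algorithm:

```rust
/// Estimate how long to wait for a rolling deploy to complete.
/// Hypothetical heuristic: a fixed grace period for scheduling and
/// image pulls, plus per-replica readiness time, capped so one slow
/// service can never stall a whole-cluster reconcile.
fn estimate_wait_secs(replicas: u32, readiness_delay_secs: u32) -> u32 {
    let base = 60; // grace for image pulls and scheduling
    let est = base + replicas * readiness_delay_secs;
    est.min(600) // hard cap at 10 minutes
}

fn main() {
    // small service: short, proportional wait
    assert_eq!(estimate_wait_secs(2, 30), 120);
    // huge service: the cap kicks in
    assert_eq!(estimate_wait_secs(100, 30), 600);
    println!("wait estimates ok");
}
```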
In the end, we almost turned it into a CD controller, but in an awkward clash of new and old tech, we just ran the above reconcile command on jenkins every 5m lol. At the time, helm 3 was planning to architect away tiller entirely.
The dev ergonomics were one of its biggest selling points (and possibly prevented a revolt against an ops-led + mandated tool). In my later jobs, achieving a similar level of dev ergonomics would take multiple microservices talking to flux.
Perhaps all this does not seem that impressive now, but it helps if you have visited that precise layer of hell that helm 2 dominated. It had such a painful and broken CD flow.
Looking back at this, it's kind of a wild everything-CLI. It accomplished the goal though. It moved fast, but did so safely. It was not universally well-received, but most of the people who complained about it early on later came to me to say "i don't know how else we could have done this".
It also let us build a quick and simple service-registry on top of the service spec (there's a controller called raftcat that cross-linked to all the tools we used for each service).
Ultimately, it's not a tool most people know about, or at the very least not one that's well understood, and this makes sense. It was ultimately tied to babylon's platform; why would you tell people about it, except out of interest? The more surprising nail in the coffin came in late 2022, when the repo was made private without much ceremony. Now only my safety fork remains. Similar unravellings later happened to the company, but unfortunately you cannot safety-fork your share value.
For anyone super interested, there is also our original talk, Babylon Health - Leveraging Kubernetes for global scale, from DoxLon2018 that provides some context.
Don't make me watch it again though.