Talk log from KubeCon LA
clux November 06, 2021 [software] #rust #kubernetes

First KubeCon in a while that I haven't done anything for (didn't even buy a ticket). This post is largely for myself, but I thought I'd put some thoughts up publicly. All talks referenced were recently published on the CNCF youtube channel, and the posts here are really just my notes (make of them what you will).
My interest areas this kubecon fall broadly into these categories:
- observability related :: maintain a lot of metrics related tooling + do a lot of dev advocacy
- community related :: am trying to donate kube-rs to cncf and grow that community
- misc tech :: engineer likes shiny things
sorted in order of interest (grouped by category):
Observability
Using SLOs for Continuous Performance Optimizations
keptn and its evented automation system seems really good. treats SLOs as first class things. higher level abstraction than other CD systems. no need to write automation systems yourself. pretty new (cncf sandbox). I should try it.
Keptn Office Hours also goes into a lot of details here for this.
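For flavour, a Keptn SLO is a declarative file along these lines - a hedged sketch from the docs as I remember them, so field names and values here are illustrative, not authoritative:

```yaml
# slo.yaml - illustrative sketch of a Keptn SLO definition
spec_version: "1.0"
comparison:
  compare_with: "single_result"     # evaluate against the previous evaluation
objectives:
  - sli: response_time_p95          # the SLI itself is defined separately (sli.yaml)
    pass:
      - criteria:
          - "<=+10%"                # pass if p95 regressed by at most 10%
    warning:
      - criteria:
          - "<=800"                 # warn past an absolute threshold (ms)
total_score:
  pass: "90%"
  warning: "75%"
```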
Evolving Prometheus for More Use Cases
Bartek on latest news:
- New config limits: `sample_limit` (body limit), `label_limit` (number of labels), `label_name_length_limit` (label length), `target_limit` (per-scrape config limit).
- Configure scraping by labels, e.g. `prometheus.io/scrape`.
- Exemplars with OpenMetrics format. Supported in java/golang/python. (NB: I closed my rust pr due to time constraints / lack of support)
- Thanos remote-read to help federated setups (via G-Research). But `remote_write` is more popular. Can set prometheus to only `remote_write` recording rule results!
- Prometheus Agent based on Grafana Agent (contributed by them) (better disk usage, DS mode presumably).
- Grafana Operator; dashboards as CRDs (can split configmap monorepo that normally uses sidecars)
- prom-label-proxy: isolation. each team only sees their own metrics + resources.
Upcoming: ingestion scaling automation; HPA scaling of scraping via dynamically assigned scrape targets. High density histograms.
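Roughly what the new limits look like in a scrape config, plus the rule-results-only `remote_write` via relabelling - a hedged sketch; the `write_relabel_configs` filter keying on the `:` naming convention for recording rules is my assumption of the mechanism, not something spelled out in the talk:

```yaml
scrape_configs:
  - job_name: apps
    sample_limit: 10000             # cap samples accepted per scrape
    label_limit: 30                 # cap number of labels per series
    label_name_length_limit: 100    # cap label name length
    target_limit: 500               # cap targets for this scrape config

remote_write:
  - url: https://receiver.example.com/api/v1/write   # hypothetical endpoint
    write_relabel_configs:
      # recording rule results conventionally contain ':' in their names,
      # so keeping only those forwards rule outputs and drops raw series
      - source_labels: [__name__]
        regex: ".*:.*"
        action: keep
```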
What You Need to Know About OpenMetrics
prometheus + its exposition format is a global standard. Now big collaboration on new standard.
largely the same, but with some cleanups and new features.
- counters require a `_total` suffix; timestamp unit is in seconds (used to be ms)
- added metadata (units in scrapes), exemplar support
- (minor breaking changes, opt in with header)
- push/pull considerations (cannot emulate all of pull with push though)
- text format mandatory, protobuf optional
- python client is the reference impl (also go/java)
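For a feel of the format, a counter under OpenMetrics looks roughly like this (illustrative values; note the mandatory `_total` suffix, second-based timestamps, and the trailing exemplar):

```
# TYPE http_requests counter
# HELP http_requests Total HTTP requests handled.
http_requests_total 1027 # {trace_id="abc123"} 1 1635789000.0
http_requests_created 1635700000.0
# EOF
```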
prometheus conformance program (vendors need to do things to get "Prometheus Compliant" logo) separate talk:
- to use the mark (for a period of time) have to sign LF paperwork
- includes: good faith testing clauses, submit tests to prom team
- monetary incentives - because they plan on iterating on test suite quickly
eBPF Superpowers
- cilium hubble works as a CNI and can help visualise traffic
- falco can detect syscalls
- pixie can show flamegraphs within containers
"observability / networking sidecars needs yaml, but ebpf is kernel level."
linkerd people go into limitations of ebpf as a "mesh" in this thread (link dead, rip twitter):
twitter Oct 27, 2021 @wm: Was a little bummed to see this article earlier this week from some people I respect, which promotes things that I believe are not the future of cloud native security.
similar overview to rakyll's eBPF in Microservices Observability, which additionally notes the distribution problem with ebpf at the end.
Understanding Service Mesh Metric Merging
How scraping works with istio (to ensure you get app + proxy) from meshday. Awkward, but ok.
Effortless Profiling on Kubernetes
`kubectl flame` - creates a container on the same node as the target container with profiler binaries (sharing process ids + namespaces and filesystem)
=> can use capturing tools like `py-spy`/`async-profiler` to capture flamegraphs without touching running containers
it then `kubectl cp`'s the result out to disk and cleans up after itself (no rust support though)
might be obsolete / rewritten with `ephemeralContainers` (no need to find the node and grab ps/ns/fs stuff)
prodfiler does something similar as a service
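With `ephemeralContainers` the same trick roughly becomes a spec patch like the sketch below (image name hypothetical; ephemeral containers are injected into a running pod via the `ephemeralcontainers` subresource, e.g. by `kubectl debug`, not set at pod creation):

```yaml
# sketch: an ephemeral profiler container targeting a running app container
spec:
  ephemeralContainers:
    - name: profiler
      image: registry.example.com/py-spy-debug:latest  # hypothetical image
      targetContainerName: app    # shares the target container's process namespace
      stdin: true
      tty: true
```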
Misc Tech
Leveraging WebAssembly to Write Kubernetes Admission Policies
Kubewarden! Rust dynamic admission controller using kube-rs with WASM.
No DSL. OCI registry to publish policies. Runs all of them through the policy server.
- Tracing support into policy wasms!
- Tracing support into policy wasms!
- CRD now for policies: `module` (oci path) + rbac + constraints
- wasmify via `opa build -t wasm`
- testing: `kwctl run -e gatekeeper --settings-json '{...}' --request-path some.json gatekeeper/policy.wasm`
Should test this out properly. Looks like less of a hassle than OPA/gatekeeper.
Edge Computing using K3s on Raspberry Pi
nice up to date tutorial to look into in case of apocalypse.
Allocation Optimizer for Minimizing Power Consumption
using science on cpu power usage based on cpu utilization %.
Shifting Spotify from Spreadsheets to Backstage
great service catalog. tons of plugins. costs. trigger incidents.
probably better than opslevel? but backstage needs to be in-cluster.
also wants to do things that `keptn` wants to do.
Building Catalogs of Operators for OLM the Declarative Way
OLM craziness on top of controllers. `opm` serves a registry of controllers in a catalog...
Faster Container Image Distribution
tarred image distribution is problematic because you have to download all of it. so two new systems:

`eStargz`: extension to OCI (backwards compat)
- subproject of containerd
- pull speed looks like 20-40% of the original
- can enable with `k3s server --snapshotter=stargz` (but need lazy-pull enabled images)
- can buildkit build using `buildx build -o type=registry,name=org/repo:tag,oci-mediatypes=true,compression=estargz`
- also ways to convert images: nerdctl or ctr-remote
- opencontainers/image-spec#815

`nydus`:
- future looking (incubator dragonfly sub-project)
- next OCI image spec proposal
- improved lazy pulling, better ecosystem integration
- benchmarks look better than estargz?
- harbor with auto-conversion
What We Learned from Reading 100+ Kubernetes Post-Mortems
nice quick failure stories
- cronjob `concurrencyPolicy: Forbid`, otherwise crashing causes pod duplication "fork bombs"
- incorrectly placed yaml keys get silently discarded without good CI validation
- ingress: no `*` in `rules[].host`
- pods: no limits on 3rd party image -> took down cluster when it memory leaked
TL;DR: use good validation and good CD.
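Two of those lessons fit in one CronJob manifest - a hedged sketch with hypothetical names:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-job               # hypothetical job
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid       # never start a new run alongside a stuck one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: registry.example.com/nightly:latest  # hypothetical image
              resources:
                limits:           # cap the blast radius of a memory leak
                  memory: 256Mi
                  cpu: 500m
```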
Community Related
From Storming to Performing: Growing Your Project's Contributor Experience
matt butcher. the 4 stages of group development and how they apply to open source:
- FORM: deal with prs positively / identity / website / branding / communications / twitter (think early) / maintainer guide docs
- STORM: conflicts (dispute resolution / CoC / governance / coding standards / contributors != employees (ask + thank))
- NORM: sharing responsibilities (issue mgmt / triage / delegate (find volunteers) / standardising communication channels)
- PERFORM: optimising for long haul (retaining maintainers / burnout / turnover / acquire new maintainers)
at all stages: people are still volunteers, be kind, thank them, give them something (responsibility / status) if possible. sometimes people need to step down. steps are not hard-delineated
- adjourning could be the last step (nothing more to really do?)
triage maintainer could be a good idea.
Kubernetes SIG CLI: Intro and Updates
scope: standardisation of cli framework / posix compliance / conventions. owns kubectl, kui, cli-runtime, cli-experimental, cli-utils, krew, kustomize
- they are conceding that `apply --prune` is awful and has drawbacks (alpha and probably won't ever graduate). cli-utils has experiments for improvements.
- everything uses cobra (want to remove that)
- want to pull apply out into something people can use (so others can use their stuff as a library)
- `kubectl` has many imperative things (like kubectl create - hard to maintain)
- `kubectl` is bad on performance - too much serialization (json -> yaml -> json -> go structs ...). go is strictly typed without generics. memory usage balloons.
- "kubectl is a very difficult codebase to work on" -_-
Measuring the Health of Your CNCF Project
Via CNCF project-health and devstats cncf dashboards. Project health metrics:
- Responsiveness (more likely to retain contributors)
- First Response time on PRs (1 hour good, 3 days bad)
- Resolution (time to close - dislike this - autoclose bot)
- Contributor Activity (community toxic? clear contribution policies makes it easier for new/episodic contribs)
- Contributor activity
- Contributors new and episodic (shows growth of contributors)
- Contributor Risk (low risk; many contributors, org diversity)
- Project Velocity (decrease => maturity or health issues)
- Release Activity (regular cadence improves trust, quick security response)
- Inclusivity (inclusive / welcoming projects attract + retain diverse contributors)
- mentoring programs? Timeframe? Can run sensibly if you have a regular release cadence, otherwise have to pick a time frame. They have dashboards.
Turn Contributors Into Maintainers with TAG Contributor Strategy
produces templates, guide for governance (already used it!)
- descriptive helps. goals need to align.
- clarify what to do when making a PR - minimize manual steps
- thank people, recognition programs (in releases), create a welcoming community
- get people on the contribution ladder. linkerd has a linkerd hero. define the ladder (gamifies the task).
- maintainers value code and are biased towards that. need people that have other skills. need someone to help with docs?
- they have a contributor ladder
- governance == membership. people want to belong to something. proves to them that they are treated equally, and they have ownership.
- corporate contributors are shown they won't be railroaded. investment ~~ influence.
Design Up Front: Socializing Ideas with Enhancement Proposals
On enhancement proposals / RFCs. key takeaways were good:
- taking time to communicate your ideas clearly and getting feedback / responding to that feedback makes your ideas better and makes you grow as an engineer.
- helps improve stability, but can be intimidating.
- need to invest in it, and follow up on reviewers and contributors.
- the system dies if you don't.
CNCF Technical Oversight at Scale
creates TAGs (technical advisory groups). they help cncf projects incubate/graduate.
- we might be in the runtime tag; https://github.com/cncf/tag-runtime
- cncf project updates talk: crossplane/keda/cilium/flux/opentelemetry incubating
- flux uses server-side apply, drift detection, stable apis (although their GA Roadmap talk had just docs/test/standardisation stuff)
- prometheus high res histograms
- keda: event driven autoscaler: listens to eventing systems -> translates to metrics -> turns it into cpu/memory metrics, "tricking the system"
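That translation is expressed with a `ScaledObject` - a hedged sketch with hypothetical deployment and queue details:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer
spec:
  scaleTargetRef:
    name: consumer                # hypothetical deployment to scale
  minReplicaCount: 0              # scale to zero on an empty queue
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq              # KEDA watches the event source...
      metadata:
        queueName: tasks          # hypothetical queue
        queueLength: "50"         # ...and surfaces depth as a scaling metric
        host: amqp://user:pass@rabbitmq.default:5672/   # hypothetical connection
```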
Technical Oversight Committee
a public meeting. interesting just to get an overview of its goals. good links and reasonable goals (discussion was ok):
- https://github.com/cncf/toc/blob/main/PRINCIPLES.md
- https://github.com/cncf/toc/blob/main/process/sandbox.md
- https://github.com/cncf/tag-runtime (our target TAG)
CNCF Tag-Runtime
Useful because it's the TAG that seems likely for kube-rs donation. dims is a liaison!
- Scope areas limited so far, but "open to expanding".
- Contains: `krustlet` + `Akri`
Kubernetes SIG Docs
....is apparently mostly hugo + netlify. they have a contributor role of a PR wrangler (and rotate that).
Miscellaneous Notes
- PSPs are going away
- "webassembly; neither web nor assembly"
- `kustomize` still a thing.. now with generators + transformer pipelines via crds..
- sieve-project (from talk on kubernetes controller testing) is interesting, but kind of insane sounding - hope we can make this nicer in kube..
- people using linkerd to solve the grpc load balancing problem