That's not what Go does though. Go looks at the population of the CPU mask at startup. It never looks again, which is problematic in K8s where the visible CPUs may change while your process runs.
which is problematic in K8s where the visible CPUs may change while your process runs
This is new to me. What is this… behavior? What keywords should I use to find any details about it?
The only thing that rings a bell is requests/limit parameters of a pod but you can't change them on an existing pod AFAIK.
If you have one pod that has Burstable QoS, perhaps because it has a request and not a limit, its CPU mask will be populated by every CPU on the box, less one for the Kubelet and other node services, less all the CPUs requested by pods with Guaranteed QoS. Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less, and consequently their GOMAXPROCS is consistent. Everyone else will see fewer or more CPUs as Guaranteed pods arrive and depart from the node.
If by "CPU mask" you refer to the `sched_getaffinity` syscall, I can't reproduce this behavior.
What I tried: I created a "Burstable" Pod and ran `nproc` [0] on it. It returned N CPUs (N > 1).
Then I created a "Guaranteed QoS" Pod with both requests and limits set to 1 CPU. `nproc` returned N CPUs on it.
I went back to the "Burstable" Pod. It returned N.
I created a fresh "Burstable" Pod and ran `nproc` on it, got N again. Please note that the "Guaranteed QoS" Pod is still running.
Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less
Well, in my case I asked for 1 CPU and got more, i.e. N CPUs.
Also, please note that Pods might ask for fractional CPUs.
[0]: coreutils `nproc` program uses `sched_getaffinity` syscall under the hood, at least on my system. I've just checked it with `strace` to be sure.
I don't know what nproc does. Consider `taskset`
I re-did the experiment again with `taskset` and got the same results, i.e. the mask is independent of creation of the "Guaranteed QoS" Pod.
FWIW, `taskset` uses the same syscall as `nproc` (according to `strace`).
Perhaps it is an artifact of your and my various container runtimes. For me, taskset shows just 1 visible CPU in a Guaranteed QoS pod with limit=request=1.
# taskset -c -p 1
pid 1's current affinity list: 1
# nproc
1
I honestly do not see how it can work otherwise. After reading https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana..., I think we have different policies set for the CPU Manager.
In my case it's `"cpuManagerPolicy": "none"` and I suppose you're using `"static"` policy.
Well, TIL. Thanks!
TIL also. The difference between guaranteed and burstable seems meaningless without this setting.
Even way back in the day (1996) it was possible to hot-swap a CPU. Used to have this Sequent box, 96 Pentiums in there, 6 on a card. Could do some magic, pull the card and swap a new one in. Wild. And no processes died. Not sure if a process could lose a CPU then discover the new set.
What is the population of the CPU mask at startup? Is this a kernel call? A /proc file? Some register?
On Linux, it likely calls sched_getaffinity().
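For illustration, here's a rough Go sketch of what that startup check amounts to on Linux - not the runtime's own code, just the same syscall via golang.org/x/sys/unix:

  package main

  import (
      "fmt"
      "runtime"

      "golang.org/x/sys/unix"
  )

  func main() {
      // Ask the kernel for this process's CPU affinity mask (pid 0 = self).
      var set unix.CPUSet
      if err := unix.SchedGetaffinity(0, &set); err != nil {
          panic(err)
      }
      fmt.Println("CPUs in affinity mask:", set.Count())
      // The runtime caches its own count once at startup.
      fmt.Println("runtime.NumCPU():", runtime.NumCPU())
  }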
hmm, I can see that as being useful but I also don't see that as the way to determine "how many worker threads I should start"
It's not a bad way to guess, up to maybe 16 or so. Most Go server programs aren't going to just scale up forever, so having 188 threads might be a waste.
Just setting it to 16 will satisfy 99% of users.
There's going to be a bunch of missing info, though, in some cases I can think of. For example, more and more systems have asymmetric cores. /proc/cpuinfo can expose that information in detail, including (current) clock speed, processor type, etc, while cpu_set is literally just a bitmask (if I read the man pages right) of system cores your process is allowed to schedule on.
Fundamentally, intelligent apps need to interrogate their environment to make concurrency decisions. But I agree- Go would probably work best if it just picked a standard parallelism constant like 16 and just let users know that can be tuned if they have additional context.
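As a sketch of the default described above (16 is just the constant from the comment; the env check is there so an explicit GOMAXPROCS override still wins):

  package main

  import (
      "fmt"
      "os"
      "runtime"
  )

  func main() {
      if os.Getenv("GOMAXPROCS") == "" { // respect an explicit user override
          procs := runtime.NumCPU()
          if procs > 16 {
              procs = 16 // arbitrary "standard parallelism" cap
          }
          runtime.GOMAXPROCS(procs)
      }
      fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
  }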
Yes, running on a set of heterogeneous CPUs presents further challenges, for the program and the thread scheduler. Happily there are no such systems in the cloud, yet.
Most people are running on systems where the CPU capacity varies and they haven't even noticed. For example in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others. Practically nobody even notices this effect.
in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others
Fascinating. Could you share any (all) more detail on this that you know? Is it a specific instance type, only ones that use nitro? (or only ones without?) This might be related to a problem I've seen in the wild but never tracked down...
I've only observed it on Nitro, but I have also rarely used pre-Nitro instances.
We use https://github.com/uber-go/automaxprocs after we joyfully discovered that Go assumed we had the node's entire CPU count available on any particular pod. Made for some very strange performance characteristics in scheduling goroutines.
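For anyone curious, wiring it in is just a blank import - a minimal sketch; the library adjusts GOMAXPROCS from the container's CPU quota at init time:

  package main

  import (
      "fmt"
      "runtime"

      _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota
  )

  func main() {
      fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
  }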
My opinion is that setting GOMAXPROCS that way is a quite poor idea. It tends to strand resources that could have been used to handle a stochastic burst of requests, which with a capped GOMAXPROCS will be converted directly into latency. I can think of no good reason why GOMAXPROCS needs to be 2 just because you expect the long-term CPU rate to be 2. That long-term quota is an artifact of capacity planning, while GOMAXPROCS is an artifact of process architecture.
How do you suggest handling that?
Point of clarification: Containers, when using quota based limits, can use all of the CPU cores on the host. They're limited in how much time they can spend using them.
(There are exceptions, such as documented here: https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...)
Maybe I should be clearer: Let's say I have a 16 core host and I start a flask container with cpu=0.5 that forks and has a heavy post-fork initializer.
flask/gunicorn will fork 16 processes (by reading /proc/cpuinfo and counting cores) all of which will try to share 0.5 cores worth of CPU power (maybe spread over many physical CPUs; I don't really care about that).
I can solve this by passing a flag to my application; my complaint is more that apps shouldn't consult /proc/cpuinfo, but have another standard interface to ask "what should I set my max parallelism to (NOT CONCURRENCY, ROB) so my worker threads get adequate CPU time and the framework doesn't time out on startup?"
You generally shouldn't set CPU limits. You might want to configure CPU requests, which is a guaranteed chunk of CPU time that the container will always receive. With CPU limits you'll encounter situations where the host CPU is not loaded but your container workload is throttled at the same time, which is just a waste of CPU resources.
According to the article, this is not true. The limits become active only when the host cpu is under pressure.
I don't think that's correct. --cpus is the same as setting --cpu-quota/--cpu-period, which is a CPU limit. You can easily check it yourself: just run a docker container with --cpus set, run a multi-core load there, and check your activity monitor.
CFS quotas only become active under contention and even then are relative: if you’re the only thing running on the box and want all the cores but only set one cpu, you get all of them anyway.
If you set cpus to 2 and another process sets to 1 and you both try to use all CPUs all out, you’ll get 66% and they’ll get 33%.
This isn’t the same as cpusets, which work differently.
CFS quotas only become active under contention
That's not true at all. Take a look at `cpu.cfs_quota_us` in https://kernel.googlesource.com/pub/scm/linux/kernel/git/glo...
It's a hard time limit. It doesn't care about contention at all.
`cpu.shares` is relative, for choosing which process gets scheduled, and how often, but the CFS quota is a hard limit on runtime.
Yes, there are hard limits in the CFS. I have used them for thermal reasons in the past, such that the system remained mostly idle although some threads would have had more work to do.
Not at my work environment right now, don't remember the parameters I used.
It's complicated. I've worked on every kind of application in a container environment: ones that ran at ultra-low priority while declaring zero CPU request and infinite CPU limit. I ran one or a few of these on nearly every machine in Google production for over a year, and could deliver over 1M xeon cores worth of throughput for embarrassingly parallel jobs. At other times, I ran jobs that asked for and used precisely all the cores on a machine (a TPU host), specifically setting limits and requests to get the most predictable behavior.
The true objective function I'm trying to optimize isn't just "save money" or "don't waste CPU resources", but rather "get a million different workloads to run smoothly on a large collection of resources, ensuring that revenue-critical jobs always can run, while any spare capacity is available for experimenters, up to some predefined limits determined by power capacity, staying within the overall budget, and not pissing off any really powerful users." (well, that's really just a simplified approximation)
The problem is your experience involves a hacked up Linux that was far more suitable for doing this than is the upstream. Upstream scheduler can't really deal with running a box hot with mixed batch and latency-sensitive workloads and intentionally abusive ones like yours ;-) That is partly why kubernetes doesn't even really try.
This. Some googlers forget there is a whole team of kernel devs in TI that maintain a patched kernel (including a patched CFS) specifically for Borg.
I used Linux for mixed workloads (as in, my desktop that was being used for dev work was also running multi-core molecular dynamics jobs in the background). Not sure I agree completely that the Google linux kernel is significantly better at this.
At my new job we run mixed workloads in k8s and I don't really see a problem, but we also don't instrument well enough that I could say for sure. In our case it usually just makes sense to not oversubscribe machines and instead get more of them (Google oversubscribed and then paid a cost due to preemptions and random job failures that got masked over by retries).
It's not clear to me what the max parallelism should actually be on a container with a CPU limit of .5. To my understanding that limits CPU time the container can use within a certain time interval, but doesn't actually limit the parallel processes an application can run. In other words that container with .5 on the CPU limit can indeed use all 16 physical cores of that machine. It'll just burn through its budget 16x faster. If that's desirable vs limiting itself to one process is going to be highly application dependent and not something kubernetes and docker can just tell you.
It won’t burn through the budget faster by having more cores. You’re given a fixed time-slice of the whole CPU (in K8s, caveats below); whether you use all the cores or just one doesn’t particularly matter. On one hand, it would be nice to be able to limit workloads on K8s to a subset of cores too; on the other, I can only imagine how catastrophically complex that would make scheduling and optimisation.
Caveats: up to the number of cores exposed to your VM. I also believe the later versions of K8s let you do some degree of workload-core pinning, and I don’t yet know how that interacts with core availability.
`gunicorn --workers $(nproc)`, see my comment on the parent
https://stackoverflow.com/questions/65551215/get-docker-cpu-...
Been a bit, but I do believe that dotnet has this exact behavior. Sounds like gunicorn needs a PR to mimic it, if they want to replicate this.
That interface partly exists. It's /sys/fs/cgroup/(cgroup here)/cpu.max
I know the JVM automatically uses it, and there's a popular library for Go that sets GOMAXPROCS using it.
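A rough sketch of reading that interface by hand, assuming a cgroup v2 container where the process's controllers appear directly at /sys/fs/cgroup (the common single-container case; error handling kept minimal):

  package main

  import (
      "fmt"
      "math"
      "os"
      "runtime"
      "strconv"
      "strings"
  )

  func main() {
      data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
      if err != nil {
          return // no cgroup v2 cpu controller visible; leave GOMAXPROCS alone
      }
      fields := strings.Fields(string(data)) // format: "<quota|max> <period>"
      if len(fields) != 2 || fields[0] == "max" {
          return // "max" means no quota is set
      }
      quota, _ := strconv.ParseFloat(fields[0], 64)
      period, _ := strconv.ParseFloat(fields[1], 64)
      procs := int(math.Ceil(quota / period)) // round up to whole CPUs
      if procs < 1 {
          procs = 1
      }
      runtime.GOMAXPROCS(procs)
      fmt.Println("GOMAXPROCS set to", procs)
  }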
I only use `nproc` and see it used in other containers as well, i.e. `bundle install -j $(nproc)`. This honors cpu assignment and provides the functionality you're seeking. Whether or not random application software uses nproc if available, idk
Print the number of processing units available to the current process, which may be less than the number of online processors. If this information is not accessible, then print the number of processors installed
https://www.gnu.org/software/coreutils/manual/html_node/npro...
https://www.flamingspork.com/blog/2020/11/25/why-you-should-...
This is not very robust. You probably should use the cgroup cpu limits where present, since `docker --cpus` uses a different way to set quota:
if [[ -e /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]] && [[ -e /sys/fs/cgroup/cpu/cpu.cfs_period_us ]] && [[ "$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)" -gt 0 ]]; then
  # Round the quota/period ratio up to a whole number of CPUs.
  GOMAXPROCS=$(perl -e 'use POSIX; printf "%d\n", ceil($ARGV[0] / $ARGV[1])' "$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)" "$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)")
else
  # No cgroup v1 quota (cfs_quota_us is -1 when unlimited): fall back to the CPU count.
  GOMAXPROCS=$(nproc)
fi
export GOMAXPROCS
This follows from how `docker --cpus` works (https://docs.docker.com/config/containers/resource_constrain...), as well as https://stackoverflow.com/a/65554131/207384 to get the /sys paths to read from. Or use https://github.com/uber-go/automaxprocs, which is very comprehensive, but is a bunch of code for what should be a simple task.
A shell script that invokes perl to set an environment variable used by Go. Some days I feel like there is a lot of duct tape involved in these applications.
Containers are a crappy abstraction and VMware fumbled the bag, is my takeaway from this comment…
VMware fumbled the bag
Oh they did, they're a modern day IBM.
Containers are a crappy abstraction
They're one of the best abstractions we have (so far) because they contain only the application and what it needs.
I still don't get the benefit of running Go binaries in containers. Totally get it for Rails, Python, etc, where the program needs a consistent set of supporting libraries. But Go doesn't need that. Especially now we can embed whole file systems into the actual binary.
I've been looking at going in the other direction and using a Go binary with a unikernel; the machine just runs the binary and nothing else. I haven't got this working to my satisfaction yet - it works but there's still too much infrastructure, and deployment is still "fun". But I think this is way more interesting for Go deployments than using containers.
Here's what a container gives you:
- Isolation of networking
- Isolation of process namespace
- Isolation of filesystem namespace
- CPU and Memory limit enforcement
- A near-universal format for packaging, distributing, and running applications, with metadata, and support for multiple architectures
- A simple method for declaring a multi-step build process
- Other stuff I don't remember
You're certainly welcome to go without them, but over time, everybody ends up needing at least one of the features containers bring. I get the aversion to adding more complexity, but complexity has this annoying habit of being useful.

A VM provides the first 4 anyway - if you're deploying to a cloud instance then having these in the container is redundant. If you're deploying to bare metal then it's possibly useful, but only if you're deploying multiple containers to the same machine.
Go doesn't need a format for packaging - it's one file. It's becoming common practice to embed everything else into the binary. (side note: I haven't done this with env files yet, and tend to deploy them separately, but I don't see any reason why we don't do this and produce binaries targeted at the specific deployment environments. I might give it a go).
I kinda prefer makefiles for the build stuff, or even just a script. The whole process of creating a Docker instance, pushing source files to it, triggering go build and then pulling back the binary seems redundant; there's no advantage of doing this in a container over doing it on the local machine. And it's a lot faster on the local machine.
Talking to people, it appears to be as mholt said: everyone just does everything in containers so apparently we do this too.
Are you suggesting to set up a separate VM for each process that may only require like 0.25 CPU? Another thing you can’t do with VMs is oversubscribe (at least not with cloud ones).
You can't have both oversubscription and isolation, almost by definition. If you want isolation, VMs are great. If you want oversubscription, OSes are still better than container runtimes at managing competing processes on the same host.
OK but I can tho - oversub cpu, isolate memory, systems resources (like ports) etc.
OSes are still better than container runtimes at managing competing processes on the same host.
OSes and container runtimes are the same thing
OSes and container runtimes are the same thing
For a subset of OSes
Go binaries tend not to be that lightweight, because we have goroutines for that.
And yes, setting up a separate VM for each instance of a process is perfectly feasible. That's what all this cloud business was about in the first place.
This goes against everything we've learned about effectively deploying and managing software at runtime. Using the golang binary as a packaging format for your app has the same energy as crafting it exclusively from impenetrable one-liners.
Containers can give you better isolation between the application and the host, as well as making horizontal scaling easier. It's also a must if you are using a system like Kubernetes. If you are running multiple applications in the same host, you also have control over resource limits.
Kubernetes requires a platform runtime that answers its requests, but there's no law enforcement agency that will prevent you from using a custom runtime that ignores the very existence of Linux control groups.
Yes, that is true. Though if you are using Google's or Amazon's managed Kubernetes services, I think you need to use Docker.
Certainly EKS is less managed than it appears, I’m quite confident that a node running something implementing the kubelet API convincingly would work. They changed recently (1.23 maybe) from Docker to containerd.
As the author of Caddy, I have often wondered why people run it in containers. The feedback I keep hearing is basically workflow/ecosystem lock-in. Everything else uses containers, so the stuff that doesn't need containers needs them now, too.
For my personal site I run Caddy in a Docker container along with other containers using a compose file. By doing it this way, getting things up when moving to a new instance is as simple as running `docker compose up`. Also making changes to the config or upgrading Caddy version on deployments is the same as a other services since they're all containers. So it's easy to add CI/CD and have it re-deploy Caddy whenever the config changes and there's no need for extra GitHub Actions .yaml's. Setup as code like this also documents all the dependencies and I think it might be helpful in the future.
Having said that, for serious business, this setup doesn't make sense. It possibly takes more work to operate as a container when the gateway runs on a dedicated instance.
I find putting http proxies in containers to be a very effective method of building interesting dynamic L7 dataplanes on orchestrators like k8s. Packaging applications (particularly modern static SPAs), with a webserver embedded, is also a very intuitive way of plugging them into the leaves of this topology while abstracting away a lot of the connectivity policy from the app.
Of course there's also the well-known correlation between the quality of your k8s deployment and the number of proxies it hosts. /s
Love your work :) Good to know I'm not totally out there on this :)
Go programs still make use of, e.g., /etc/resolv.conf and /etc/ssl/certs/ca-certificates.crt. These aren't strictly necessary, but using them makes it easier--more consistent, more transparent--to configure these resources across different languages and frameworks.
I use the semi-official Acme package [0], which handles all that really well. I haven't touched SSL or TLS config for years. I mean, I might be an outlier, but this seems pretty standard for Go deployments these days.
A lot of companies are Kubernetes based now so containers are the default delivery mechanism for all software.
This is subtly incorrect - as far as Docker is concerned, the CFS cgroup extension has several knobs to tune - cfs_quota_us, cfs_period_us (typical default is 100ms, not a second), and shares. When you set shares you get weighted proportional scheduling (but only when there's contention). The former two enforce a strict quota. Don't use Docker's --cpu flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.
From Linux docs:
- cpu.shares: The weight of each group living in the same hierarchy, which translates into the amount of CPU it is expected to get. Upon cgroup creation, each group gets assigned a default of 1024. The percentage of CPU assigned to the cgroup is the value of shares divided by the sum of all shares in all cgroups in the same level.

- cpu.cfs_period_us: The duration in microseconds of each scheduler period, for bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will improve throughput at the expense of latency, since the scheduler will be able to sustain a cpu-bound workload for longer. The opposite is true for smaller periods. Note that this only affects non-RT tasks that are scheduled by the CFS scheduler.

- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us for which the current group will be allowed to run. For instance, if it is set to half of cfs_period_us, the cgroup will only be able to peak run for 50% of the time. One should note that this represents aggregate time over all CPUs in the system. Therefore, in order to allow full usage of two CPUs, for instance, one should set this value to twice the value of cfs_period_us.
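A worked example, from my reading of the above: with cfs_period_us left at the default 100000 and cfs_quota_us set to 200000 (which is what `docker run --cpus=2` configures), the group may consume 200ms of CPU time per 100ms wall-clock period - roughly two CPUs' worth, spread across however many cores it happens to run on.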
People using Kubernetes don't tune or change those settings, it's up to the app to behave properly.
False. Kubernetes cpu request sets the shares, cpu limit sets the cfs quota
You said to change docker flags. Anyway, your post is irrelevant; the goal is to let the runtime know how many POSIX threads it should use.
If you set request/limit to 1 core but you run on a 64-core node, then your runtime will see all 64 cores, which will bring performance down.
The original article is about Docker. That’s the point of my comment - don’t set a CPU limit.
I intended it to be applicable to all containerised environments. Docker is just easiest on my local machine.
I still believe it's best to set these variables regardless of cpu limits and/or cpu shares
All you did is kneecap your app to have lower performance so it fits under your arbitrary limit. Hardly what most people describe as “best” - only useful in a small percentage of use cases (like reselling compute).
I've seen significant performance gains from this in production.
Other people have encountered it too hence libraries like Automaxprocs existing and issues being open with Go for it.
Gains by what metric? Are you sure you didn't trade in better latency for worse overall throughput? Also, sure you didn't hit one of many CFS overaccounting bugs which we've seen a few? Have you compared performance without the limit at all?
Previously we had no limit. We observed gains in both latency and throughput by implementing Automaxprocs and decided to roll it out widely.
This aligns with what others have reported on the Go runtime issue open for this.
"When go.uber.org/automaxprocs rolled out at Uber, the effect on containerized Go services was universally positive. At least at the time, CFS imposed such heavy penalties on Go binaries exceeding their CPU allotment that properly tuning GOMAXPROCS was a significant latency and throughput improvement."
https://github.com/golang/go/issues/33803#issuecomment-14308...
"Don't use Docker's --cpu flag and instead use"
This is rather strong language without any real qualifiers. It is definitely not "mostly useless". Shares and quotas are for different use-cases, that's all. Understand your use-case and choose accordingly.
It doesn’t make any sense to me why the --cpu flag tweaks the quota and not shares, since quota is useful in a tiny minority of use cases. A lot of people waste a ton of time debugging weird latency issues as a result of this decision.
With shares you're going to experience worse latency if all the containers on the system size their thread pool to the maximum that's available during idle periods and then constantly context-switch due to oversubscription under load. With quotas you can do fixed resource allocation and the runtimes (not Go apparently) can fit themselves into that and not try to service more requests than they can currently execute given those resources.
And how is that different from worse latency due to cpu throttling from your users’ perspective?
Fixed queue, so it'll only take as many as it can process and reject the rest, which can be used to do scaling, if you have a cluster. With shares it would think it has all the CPU cores available and oversize the queue.
These two options are not mutually exclusive.
When you want to limit the max CPU time available to a container use quotas (--cpus). When you want to set relative priorities (compared to other containers/processes), use shares.
These two options can be combined, it all depends on what you need.
Don't use Docker's --cpu flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.
One caveat is that an application can detect when --cpu is used as I think it's using cpuset. When quota are used it cannot detect and more threads than necessary will likely be spawned
--cpu sets the quota; there is a --cpuset-cpus flag for cpusets, and you can detect both by looking at /sys/fs/cgroup
It is not using cpuset (there is a separate flag for this). --cpus tweaks the cfs quota based on the number of cpus on the system and the requested amount.
Hi I'm the blog author, thanks for the feedback
I'll try and clarify this. I think this is how the symptom presents but I should be clearer.
This sort of tuning isn't necessary if you use CPU reservations instead of limits, as you should: https://home.robusta.dev/blog/stop-using-cpu-limits
CPU reservations are limits, just implicit ones and declared as guarantees.
So let the Go runtime use all the CPUs available, and let the Linux scheduler throttle according to your declared reservations if the CPU is contended for.
I don't set limits because I'm afraid of how a pod is going to affect other pods. I set limits because I don't want to get used to being able to tap into the excess CPU available, because that's not guaranteed to be available.
As the node fills up with more and more other pods, it's possible that a pod that was running just fine a moment ago is crawling to a halt.
Limits allow me to simulate the same behavior and plan for it by doing the right capacity planning.
They are not the only way to approach it! But they are the simplest way to do it.
Limiting CPU to the amount guaranteed to be available also guarantees very significant wasted resource utilization unless all your pods spin 100% CPU continuously.
The best way to utilize resources is to overcommit, and the smart way to overcommit is to, say, allow 4x overcommit with each allocation limited to 1/4th of the available resources so no individual peak can choke the system. Given varied allocations, things average out with a reasonable amount of performance variability.
Idle CPUs are wasted CPUs and money out the window.
with each allocation limited to 1/4th of the available resources so no individual peak can choke the system.
This assumes that the scheduled workloads are created equal which isn't the case. The app owners do not have control over what else gets scheduled on the node which introduces uncontrollable variability in the performance of what should be identical replicas and environments. What helps here is .. limits. The requests-to-limits ratio allows application owners to reason about the variability risk they are willing to take in relation to the needs of the application (e.g. imagine a latency-sensitive workload on a critical path vs a BAU service vs a background job which just cares about throughput -- for each of these classes, the ratio would probably be very different). This way, you can still overcommit but not by a rule-of-thumb that is created centrally by the cluster ops team (e.g. aim for 1/4) but it's distributed across each workload owner (ie application ops) where this can be done a lot more accurately and with better results. This is what the parent post is also talking about.
1/4th was merely an example for one resource type, and a suitable limit may be much lower depending on the cluster and workloads. The point is that a limit set to 1/workloads guarantees wasted resources, and should be set significantly higher based on realistic workloads, while still ensuring that it takes N workloads to consume all resource to average out the risk of peak demand collisions.
This assumes that the scheduled workloads are created equal which isn't the case.
This particular allocation technique benefits from scheduled workloads not being equal as equality would increase likelihood of peak demand collisions.
That's why you use monitoring and alerting, so you notice degraded performance before the pod crawls to a halt.
You need to do it anyway because a service might progressively need more resources as it's getting more traffic, even if you're not adding any other pod.
Sure you need monitoring and alerting and sure there are other reasons why you need to update your requests.
But having _neighbours_ affecting the behaviour of your workload is precisely what creates the kind of fatigue that then results in people claiming that it's hard to run k8s workloads. K8s is highly dynamic: pods can get scheduled on a node by chance, sometimes, on some clusters; pagers will ring, incidents will be created for conditions that may solve themselves because of another deployment (possibly of another team) happening.
Overcommit/bursting is an advanced cost saving feature.
Let me say it again: splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature!
The problem is that the request/limits feature is presented in the configuration spec and in the documentation in a deceptively simple way and we're tricked to think it's a basic feature.
Not all companies have ops teams that are well equipped to do more sophisticated things. My advice for teams who cannot set up full automation around capacity management is to just not use these advanced features.
An alternative is to just use smaller dedicated nodes and (anti)affinity rules, so you always understand which pods go with which other pods. It's clunky but it's actually easier to reason about what's going to happen.
EDIT: typos
Interesting. This is not true for Memory, correct? The OOMKiller might get you.
You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.
Memory is different because it is non-compressible - once you give memory you can't take it away without killing the process
Swap (Disk, RDMA, Compression)? Page migration (NUMA, CXL)?
Correct regarding memory - not true for memory because it's non-fungible unlike CPU shares
You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.
Evicted due to node pressure - yes (but if all other pods also don't have limits it doesn't matter). For preemption QoS is not factored in the decision [0]
[0] - https://kubernetes.io/docs/concepts/scheduling-eviction/pod-...
I run a few things on 128-core setups and I set CPU limits much higher than requests, but still set them to make sure nothing runs amok.
I would be curious to see this discussed, but your article only states that people think you need limits to ensure CPU for all pods.
As someone not that familiar with Docker or Go, is this behavior intentional? Could the Go team make it aware of the CGroups limit? Do other runtimes behave similarly?
I'm fairly certain that that .net had to deal with it and Java had or still has a problem, I forget which. (Or did you mean runtimes like containerd?)
Supported in Java 10 (and backported to Java 8) since 2018. Not sure about .NET.
- "The JVM has been modified to be aware that it is running in a Docker container and will extract container specific configuration information instead of querying the operating system. The information being extracted is the number of CPUs and total memory that have been allocated to the container." https://www.oracle.com/java/technologies/javase/8u191-relnot...
- Here's a more detailed explanation and even a shared library that can be used to patch container unaware versions of Java. I wonder if the same could be done for Go?
"LD_PRELOAD=/path/to/libproccount.so java <args>"
https://stackoverflow.com/a/64271429
https://gist.github.com/apangin/78d7e6f7402b1a5da0fa3abd9381...
- There are more recent changes to Java container awareness as well: https://developers.redhat.com/articles/2022/04/19/java-17-wh...
Then in Java, if you don't set the limits, it gets the CPU count from the VM via Runtime.getRuntime().availableProcessors()... this method returns the number of CPUs of the VM or the value set as the CPU quota. Starting from Java 11, -XX:+PreferContainerQuotaForCPUCount is true by default. For Java <= 10 the CPU count is equal to the CPU shares. That method is then used to calculate the GC threads, fork-join pool size, compiler threads, etc. The solution would be to set -XX:ActiveProcessorCount=X where X is ideally the CPU shares value, but as we know shares can change over time, so you would change this value over time...
Edit: or set -XX:-PreferContainerQuotaForCPUCount
Yes, I've experienced the same problem with the JVM (in Scala).
I've been bitten many times by the CFS scheduler while using containers and cgroups. What's the new scheduler? Has anyone here tried it in a production cluster? We're now going on two decades of wasted cores: https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf.
The problem here isn't the scheduler. It's resource restrictions imposed by the container but the containerized process (Go) not checking the OS features used to do that when calculating the available amount of parallelism.
Discovered this sometime last year in my previous role as a platform engineer managing our on-prem kubernetes cluster as well as the CI/CD pipeline infrastructure.
Although I saw this dissonance between actual and assigned CPU causing issues, particularly CPU throttling, I struggled to find a scalable solution that would affect all Go deployments on the cluster.
Getting all devs to include that automaxprocs dependency was not exactly an option for hundreds of projects. Alternatively, setting all CPU requests/limits to a whole number and then assigning that to a GOMAXPROCS environment variable in a k8s manifest was also clunky and infeasible.
I ended up just using this GOMAXPROCS variable for some of our more highly multithreaded applications which yielded some improvements but I’ve yet to find a solution that is applicable to all deployments in a microservices architecture with a high variability of CPU requirements for each project.
You could define a mutating webhook to inject GOMAXPROCS into all pod containers.
There isn't one answer for this. Capping GOMAXPROCS may cause severe latency problems if your process gets a burst of traffic and has naive queueing. It's best really to set GOMAXPROCS to whatever the hardware offers regardless of your ideas about how much time the process will use on average.
I know that the .NET CLR team adjusted its behavior to address this scenario, fwiw!
So did OpenJDK and the Rust standard library.
Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go releases. You can use https://github.com/KimMachineGun/automemlimit to automatically set this limit, kinda like https://github.com/uber-go/automaxprocs.
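A quick sketch of using the two together, assuming the blank-import style both projects' READMEs describe (each inspects the container's cgroup limits at init time):

  package main

  import (
      "fmt"
      "runtime"
      "runtime/debug"

      _ "github.com/KimMachineGun/automemlimit" // sets GOMEMLIMIT from the memory limit
      _ "go.uber.org/automaxprocs"              // sets GOMAXPROCS from the CPU quota
  )

  func main() {
      fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
      fmt.Println("GOMEMLIMIT:", debug.SetMemoryLimit(-1)) // negative arg just reads the current limit
  }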
Isn't this a bug in the Go runtime and shouldn't they fix it? It looks like they are using the wrong metric to tune the internal scheduler.
There are also GC techniques to make the pause shorter, for example, doing the work for the pause concurrently and then repeating it in the safepoint. The hope is that the concurrent work will turn the safepoint work into a simpler check that no work is necessary. Doubling the work may hurt GC throughput.
How about GKE and containerd?
I feel like this isn't the first time I've read about issues with schedulers-in-schedulers, but I also can't find any immediate references on hand for other examples. Anyone know of any?
I think this is a great article talking about a thorny point in Golang but boy do I wish I never read this article. I wish this article was never useful to anyone.
Thanks for sharing this!
And as a maintainer of ko[1], it was a pleasant surprise to see ko mentioned briefly, so thanks for that too :)
The common problem I see across many languages is: applications detect machine cores by looking at /proc/cpuinfo. However, in a docker container (or other container technology), that file looks the same as the container host (listing all cores, regardless of how few have been assigned to the container).
I wondered for a while if docker could make a fake /proc/cpuinfo that apps could parse that just listed "docker cpus" allocated to the job, but upon further reflection, that probably wouldn't work for many reasons.