
Go, Containers, and the Linux Scheduler

dekhn
43 replies
20h12m

The common problem I see across many languages is: applications detect machine cores by looking at /proc/cpuinfo. However, in a docker container (or other container technology), that file looks the same as it does on the container host (listing all cores, regardless of how few have been assigned to the container).

I wondered for a while if docker could make a fake /proc/cpuinfo that apps could parse that just listed "docker cpus" allocated to the job, but upon further reflection, that probably wouldn't work for many reasons.

jeffbee
20 replies
19h52m

That's not what Go does though. Go looks at the population of the CPU mask at startup. It never looks again, which is problematic in K8s where the visible CPUs may change while your process runs.
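
For concreteness, a minimal sketch of what "the population of the CPU mask" means here, using golang.org/x/sys/unix rather than the runtime's internals - it wraps the same sched_getaffinity syscall that nproc and taskset use:

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // pid 0 means "the calling process".
        var set unix.CPUSet
        if err := unix.SchedGetaffinity(0, &set); err != nil {
            panic(err)
        }
        fmt.Println("CPUs in affinity mask:", set.Count())
        fmt.Println("runtime.NumCPU():", runtime.NumCPU())
        fmt.Println("default GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }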

n3t
8 replies
17h8m

which is problematic in K8s where the visible CPUs may change while your process runs

This is new to me. What is this… behavior? What keywords should I use to find any details about it?

The only thing that rings a bell is requests/limit parameters of a pod but you can't change them on an existing pod AFAIK.

jeffbee
6 replies
16h14m

If you have one pod that has Burstable QoS, perhaps because it has a request and not a limit, its CPU mask will be populated by every CPU on the box, less one for the Kubelet and other node services, less all the CPUs requested by pods with Guaranteed QoS. Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less, and consequently their GOMAXPROCS is consistent. Everyone else will see fewer or more CPUs as Guaranteed pods arrive and depart from the node.

n3t
5 replies
13h20m

If by "CPU mask" you refer to the `sched_getaffinity` syscall, I can't reproduce this behavior.

What I tried: I created a "Burstable" Pod and ran `nproc` [0] on it. It returned N CPUs (N > 1).

Then I created a "Guaranteed QoS" Pod with both requests and limit set to 1 CPU. `nproc` returned N CPUs on it.

I went back to the "Burstable" Pod. It returned N.

I created a fresh "Burstable" Pod and ran `nproc` on it, and got N again. Please note that the "Guaranteed QoS" Pod is still running.

Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less

Well, in my case I asked for 1 CPU and got more, i.e. N CPUs.

Also, please note that Pods might ask for fractional CPUs.

[0]: coreutils `nproc` program uses `sched_getaffinity` syscall under the hood, at least on my system. I've just checked it with `strace` to be sure.

jeffbee
4 replies
13h12m

I don't know what nproc does. Consider `taskset`

n3t
3 replies
13h5m

I re-did the experiment again with `taskset` and got the same results, i.e. the mask is independent of creation of the "Guaranteed QoS" Pod.

FWIW, `taskset` uses the same syscall as `nproc` (according to `strace`).

jeffbee
2 replies
13h2m

Perhaps it is an artifact of your and my various container runtimes. For me, taskset shows just 1 visible CPU in a Guaranteed QoS pod with limit=request=1.

  # taskset -c -p 1
  pid 1's current affinity list: 1

  # nproc
  1

I honestly do not see how it can work otherwise.

n3t
1 replies
12h45m

After reading https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana..., I think we have different policies set for the CPU Manager.

In my case it's `"cpuManagerPolicy": "none"` and I suppose you're using `"static"` policy.

Well, TIL. Thanks!

jeffbee
0 replies
2h29m

TIL also. The difference between guaranteed and burstable seems meaningless without this setting.

djbusby
0 replies
15h59m

Even way back in the day (1996) it was possible to hot-swap a CPU. Used to have this Sequent box, 96 Pentiums in there, 6 on a card. Could do some magic, pull the card and swap a new one in. Wild. And no processes died. Not sure if a process could lose a CPU then discover the new set.

dekhn
7 replies
19h46m

What is the population of the CPU mask at startup? Is this a kernel call? A /proc file? Some register?

EdSchouten
6 replies
19h44m

On Linux, it likely calls sched_getaffinity().

dekhn
5 replies
19h43m

hmm, I can see that as being useful but I also don't see that as the way to determine "how many worker threads I should start"

jeffbee
4 replies
19h38m

It's not a bad way to guess, up to maybe 16 or so. Most Go server programs aren't going to just scale up forever, so having 188 threads might be a waste.

Just setting it to 16 will satisfy 99% of users.

dekhn
3 replies
19h33m

There's going to be a bunch of missing info, though, in some cases I can think of. For example, more and more systems have asymmetric cores. /proc/cpuinfo can expose that information in detail, including (current) clock speed, processor type, etc, while cpu_set is literally just a bitmask (if I read the man pages right) of system cores your process is allowed to schedule on.

Fundamentally, intelligent apps need to interrogate their environment to make concurrency decisions. But I agree - Go would probably work best if it just picked a standard parallelism constant like 16 and just let users know that it can be tuned if they have additional context.
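
As a rough sketch of that idea (the cap of 16 is just the number floated above, not anything the Go runtime actually does):

    package main

    import (
        "os"
        "runtime"
    )

    // Illustrative only: cap default parallelism at an arbitrary constant
    // unless the user set GOMAXPROCS explicitly in the environment.
    func init() {
        if os.Getenv("GOMAXPROCS") != "" {
            return // honor the explicit setting
        }
        if runtime.NumCPU() > 16 {
            runtime.GOMAXPROCS(16)
        }
    }

    func main() {}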

jeffbee
2 replies
19h15m

Yes, running on a set of heterogeneous CPUs presents further challenges, for the program and the thread scheduler. Happily there are no such systems in the cloud, yet.

Most people are running on systems where the CPU capacity varies and they haven't even noticed. For example in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others. Practically nobody even notices this effect.

loxias
1 replies
17h34m

in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others

Fascinating. Could you share any (all) more detail on this that you know? Is it a specific instance type, only ones that use nitro? (or only ones without?) This might be related to a problem I've seen in the wild but never tracked down...

jeffbee
0 replies
16h0m

I've only observed it on Nitro, but I have also rarely used pre-Nitro instances.

sethammons
2 replies
17h32m

We use https://github.com/uber-go/automaxprocs after we joyfully discovered that Go assumed we had the entire cluster's cpu count on any particular pod. Made for some very strange performance characteristics in scheduling goroutines.
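
(For anyone who hasn't used it, wiring it in is roughly a one-line blank import:)

    package main

    import (
        // Adjusts GOMAXPROCS at init time to match the container's CPU quota.
        _ "go.uber.org/automaxprocs"
    )

    func main() {
        // GOMAXPROCS now reflects the cgroup CPU limit, not the host's core count.
    }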

jeffbee
1 replies
16h16m

My opinion is that setting GOMAXPROCS that way is a quite poor idea. It tends to strand resources that could have been used to handle a stochastic burst of requests, which with a capped GOMAXPROCS will be converted directly into latency. I can think of no good reason why GOMAXPROCS needs to be 2 just because you expect the long-term CPU rate to be 2. That long-term quota is an artifact of capacity planning, while GOMAXPROCS is an artifact of process architecture.

sethammons
0 replies
7h0m

How do you suggest handling that?

dharmab
16 replies
19h59m

Point of clarification: Containers, when using quota based limits, can use all of the CPU cores on the host. They're limited in how much time they can spend using them.

(There are exceptions, such as documented here: https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...)

dekhn
15 replies
19h44m

Maybe I should be clearer: Let's say I have a 16 core host and I start a flask container with cpu=0.5 that forks and has a heavy post-fork initializer.

flask/gunicorn will fork 16 processes (by reading /proc/cpuinfo and counting cores) all of which will try to share 0.5 cores worth of CPU power (maybe spread over many physical CPUs; I don't really care about that).

I can solve this by passing a flag to my application; my complaint is more that apps shouldn't consult /proc/cpuinfo, but should have another standard interface to ask "what should I set my max parallelism (NOT CONCURRENCY, ROB) to, so my worker threads get adequate CPU time and the framework doesn't time out on startup?"

vbezhenar
9 replies
18h1m

You generally shouldn't set CPU limits. You might want to configure CPU requests, which are a guaranteed chunk of CPU time that the container will always receive. With CPU limits you'll encounter a situation where the host CPU is not loaded but your container's workload is throttled at the same time, which is just a waste of CPU resources.

lovasoa
4 replies
17h22m

According to the article, this is not true. The limits become active only when the host cpu is under pressure.

vbezhenar
3 replies
16h55m

I don't think that's correct. --cpus is the same as --cpu-period which is cpu limit. You can easily check it yourself, just run docker container with --cpus set, run multi-core load there and check your activity monitor.

pbh101
2 replies
15h14m

CFS quotas only become active under contention and even then are relative: if you’re the only thing running on the box and want all the cores but only set one cpu, you get all of them anyway.

If you set cpus to 2 and another process sets to 1 and you both try to use all CPUs all out, you’ll get 66% and they’ll get 33%.

This isn’t the same as cpusets, which work differently.

ecnahc515
1 replies
13h46m

CFS quotas only become active under contention

That's not true at all. Take a look at `cpu.cfs_quota_us` in https://kernel.googlesource.com/pub/scm/linux/kernel/git/glo...

It's a hard time limit. It doesn't care about contention at all.

`cpu.shares` is relative, for choosing which process gets scheduled, and how often, but the CFS quota is a hard limit on runtime.

usr1106
0 replies
12h2m

Yes, there are hard limits in the CFS. I have used them for thermal reasons in the past, such that the system remained mostly idle although some threads would have had more work to do.

Not at my work environment right now, don't remember the parameters I used.

dekhn
3 replies
15h47m

It's complicated. I've worked on every kind of application in a container environment: ones that ran at ultra-low priority while declaring zero CPU request and infinite CPU limit. I ran one or a few of these on nearly every machine in Google production for over a year, and could deliver over 1M Xeon cores worth of throughput for embarrassingly parallel jobs. At other times, I ran jobs that asked for and used precisely all the cores on a machine (a TPU host), specifically setting limits and requests to get the most predictable behavior.

The true objective function I'm trying to optimize isn't just "save money" or "don't waste CPU resources", but rather "get a million different workloads to run smoothly on a large collection of resources, ensuring that revenue-critical jobs always can run, while any spare capacity is available for experimenters, up to some predefined limits determined by power capacity, staying within the overall budget, and not pissing off any really powerful users." (well, that's really just a simplified approximation)

jeffbee
2 replies
15h11m

The problem is your experience involves a hacked-up Linux that was far more suitable for doing this than upstream is. The upstream scheduler can't really deal with running a box hot with mixed batch and latency-sensitive workloads and intentionally abusive ones like yours ;-) That is partly why kubernetes doesn't even really try.

dilyevsky
0 replies
12h21m

This. Some Googlers forget there is a whole team of kernel devs in TI maintaining a patched kernel (including a patched CFS) specifically for Borg.

dekhn
0 replies
1h41m

I used Linux for mixed workloads (as in, my desktop that was being used for dev work was also running multi-core molecular dynamics jobs in the background). Not sure I agree completely that the Google linux kernel is significantly better at this.

At work at my new job we run mixed workloads in k8s and I don't really see a problem, but we also don't instrument well enough that I could say for sure. In our case it usually just makes sense to not oversubscribe machines (Google oversubscribed and then paid a cost due to preemptions and random job failures that got masked over by retries) by getting more machines.

Volundr
1 replies
18h24m

It's not clear to me what the max parallelism should actually be on a container with a CPU limit of .5. To my understanding that limits CPU time the container can use within a certain time interval, but doesn't actually limit the parallel processes an application can run. In other words that container with .5 on the CPU limit can indeed use all 16 physical cores of that machine. It'll just burn through its budget 16x faster. If that's desirable vs limiting itself to one process is going to be highly application dependent and not something kubernetes and docker can just tell you.

FridgeSeal
0 replies
8h7m

It won’t burn through the budget faster by having more cores. You’re given a fixed time-slice of the whole CPU (in K8s, caveats below); whether you use all the cores or just one doesn’t particularly matter. On one hand, it would be nice to be able to limit workloads on K8s to a subset of cores too; on the other, I can only imagine how catastrophically complex that would make scheduling and optimisation.

Caveats: up to the number of cores exposed to your VM. I also believe later versions of K8s let you do some degree of workload-core pinning, and I don’t yet know how that interacts with core availability.

wutwutwat
0 replies
16h50m

`gunicorn --workers $(nproc)`, see my comment on the parent

status_quo69
0 replies
19h20m

https://stackoverflow.com/questions/65551215/get-docker-cpu-...

Been a bit, but I do believe dotnet does exactly this. Sounds like gunicorn needs a PR to mimic it, if they want to replicate this behavior.

https://github.com/dotnet/runtime/issues/8485

dharmab
0 replies
15h12m

That interface partly exists. It's /sys/fs/cgroup/(cgroup here)/cpu.max

I know the JVM automatically uses it, and there's a popular library for Go that sets GOMAXPROCS using it.
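
A rough sketch of reading that file in Go (this assumes cgroup v2 mounted at /sys/fs/cgroup with the process at the root of its cgroup namespace, which is common in containers - automaxprocs handles the messier v1/nested cases):

    package main

    import (
        "fmt"
        "math"
        "os"
        "strconv"
        "strings"
    )

    // cpuQuotaV2 returns the container's CPU limit as a fraction of a core,
    // e.g. 0.5 for "50000 100000", or an error if no limit is set.
    func cpuQuotaV2() (float64, error) {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // "<quota> <period>"
        if err != nil {
            return 0, err
        }
        fields := strings.Fields(string(data))
        if len(fields) != 2 || fields[0] == "max" {
            return 0, fmt.Errorf("no CPU limit set")
        }
        quota, err := strconv.ParseFloat(fields[0], 64)
        if err != nil {
            return 0, err
        }
        period, err := strconv.ParseFloat(fields[1], 64)
        if err != nil {
            return 0, err
        }
        return quota / period, nil
    }

    func main() {
        if cpus, err := cpuQuotaV2(); err == nil {
            fmt.Println("suggested GOMAXPROCS:", int(math.Ceil(cpus)))
        }
    }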

wutwutwat
2 replies
16h54m

I only use `nproc` and see it used in other containers as well, e.g. `bundle install -j $(nproc)`. This honors cpu assignment and provides the functionality you're seeking. Whether or not random application software uses nproc if available, idk

Print the number of processing units available to the current process, which may be less than the number of online processors. If this information is not accessible, then print the number of processors installed

https://www.gnu.org/software/coreutils/manual/html_node/npro...

https://www.flamingspork.com/blog/2020/11/25/why-you-should-...

telotortium
1 replies
16h30m

This is not very robust. You probably should use the cgroup cpu limits where present, since `docker --cpus` uses a different way to set quota:

    if [[ -e /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]] && [[ -e /sys/fs/cgroup/cpu/cpu.cfs_period_us ]]; then
        GOMAXPROCS=$(perl -e 'use POSIX; printf "%d\n", ceil($ARGV[0] / $ARGV[1])' "$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)" "$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)")
    else
        GOMAXPROCS=$(nproc)
    fi
    export GOMAXPROCS

This follows from how `docker --cpus` works (https://docs.docker.com/config/containers/resource_constrain...), as well as https://stackoverflow.com/a/65554131/207384 to get the /sys paths to read from.

Or use https://github.com/uber-go/automaxprocs, which is very comprehensive, but is a bunch of code for what should be a simple task.

fn-mote
0 replies
13h15m

A shell script that invokes perl to set an environment variable used by Go. Some days I feel like there is a lot of duct tape involved in these applications.

rrdharan
1 replies
16h40m

Containers are a crappy abstraction and VMware fumbled the bag, is my takeaway from this comment…

sofixa
0 replies
9h32m

VMware fumbled the bag

Oh they did, they're a modern day IBM.

Containers are a crappy abstraction

They're one of the best abstractions we have (so far) because they contain only the application and what it needs.

marcus_holmes
19 replies
15h18m

I still don't get the benefit of running Go binaries in containers. Totally get it for Rails, Python, etc, where the program needs a consistent set of supporting libraries. But Go doesn't need that. Especially now we can embed whole file systems into the actual binary.

I've been looking at going in the other direction and using a Go binary with a unikernel; the machine just runs the binary and nothing else. I haven't got this working to my satisfaction yet - it works but there's still too much infrastructure, and deployment is still "fun". But I think this is way more interesting for Go deployments than using containers.

throwaway892238
7 replies
14h50m

Here's what a container gives you:

  - Isolation of networking
  - Isolation of process namespace
  - Isolation of filesystem namespace
  - CPU and Memory limit enforcement
  - A near-universal format for packaging, distributing, and running applications, with metadata, and support for multiple architectures
  - A simple method for declaring a multi-step build process
  - Other stuff I don't remember

You're certainly welcome to go without them, but over time, everybody ends up needing at least one of the features containers bring. I get the aversion to adding more complexity, but complexity has this annoying habit of being useful.

marcus_holmes
6 replies
13h54m

A VM provides the first 4 anyway - if you're deploying to a cloud instance then having these in the container is redundant. If you're deploying to bare metal then it's possibly useful, but only if you're deploying multiple containers to the same machine.

Go doesn't need a format for packaging - it's one file. It's becoming common practice to embed everything else into the binary. (side note: I haven't done this with env files yet, and tend to deploy them separately, but I don't see any reason why we don't do this and produce binaries targeted at the specific deployment environments. I might give it a go).

I kinda prefer makefiles for the build stuff, or even just a script. The whole process of creating a Docker instance, pushing source files to it, triggering go build and then pulling back the binary seems redundant; there's no advantage of doing this in a container over doing it on the local machine. And it's a lot faster on the local machine.

Talking to people, it appears to be as mholt said: everyone just does everything in containers so apparently we do this too.

dilyevsky
4 replies
12h54m

Are you suggesting setting up a separate VM for each process that may only require like 0.25 CPU? Another thing you can’t do with VMs is oversubscribe (at least not with cloud ones).

lmm
2 replies
11h51m

You can't have both oversubscription and isolation, almost by definition. If you want isolation, VMs are great. If you want oversubscription, OSes are still better than container runtimes at managing competing processes on the same host.

dilyevsky
1 replies
11h23m

OK but I can tho - oversub cpu, isolate memory, system resources (like ports) etc.

OSes are still better than container runtimes at managing competing processes on the same host.

OSes and container runtimes are the same thing

marcus_holmes
0 replies
9h49m

OSes and container runtimes are the same thing

For a subset of OSes

marcus_holmes
0 replies
9h52m

Go binaries tend not to be that lightweight, because we have goroutines for that.

And yes, setting up a separate VM for each instance of a process is perfectly feasible. That's what all this cloud business was about in the first place.

nurple
0 replies
3h33m

This goes against everything we've learned about effectively deploying and managing software at runtime. Using the golang binary as a packaging format for your app has the same energy as crafting it exclusively from impenetrable one-liners.

tail_exchange
3 replies
15h2m

Containers can give you better isolation between the application and the host, as well as making horizontal scaling easier. It's also a must if you are using a system like Kubernetes. If you are running multiple applications in the same host, you also have control over resource limits.

jeffbee
2 replies
14h52m

Kubernetes requires a platform runtime that answers its requests, but there's no law enforcement agency that will prevent you from using a custom runtime that ignores the very existence of Linux control groups.

tail_exchange
1 replies
14h11m

Yes, that is true. Though if you are using Google's or Amazon's managed Kubernetes services, I think you need to use Docker.

jen20
0 replies
4h18m

Certainly EKS is less managed than it appears; I’m quite confident that a node running something implementing the kubelet API convincingly would work. They changed recently (1.23 maybe) from Docker to containerd.

mholt
3 replies
15h0m

As the author of Caddy, I have often wondered why people run it in containers. The feedback I keep hearing is basically workflow/ecosystem lock-in. Everything else uses containers, so the stuff that doesn't need containers needs them now, too.

wonrax
0 replies
12h38m

For my personal site I run Caddy in a Docker container along with other containers using a compose file. By doing it this way, getting things up when moving to a new instance is as simple as running `docker compose up`. Also making changes to the config or upgrading the Caddy version on deployments is the same as for other services since they're all containers. So it's easy to add CI/CD and have it re-deploy Caddy whenever the config changes and there's no need for extra GitHub Actions .yaml's. Setup as code like this also documents all the dependencies and I think it might be helpful in the future.

Having said that, for serious business, this setup doesn't make sense. It possibly takes more work to operate as a container when the gateway runs on a dedicated instance.

nurple
0 replies
3h51m

I find putting http proxies in containers to be a very effective method of building interesting dynamic L7 dataplanes on orchestrators like k8s. Packaging applications (particularly modern static SPAs), with a webserver embedded, is also a very intuitive way of plugging them into the leaves of this topology while abstracting away a lot of the connectivity policy from the app.

Of course there's also the well-known correlation between the quality of your k8s deployment and the number of proxies it hosts. /s

marcus_holmes
0 replies
13h52m

Love your work :) Good to know I'm not totally out there on this :)

wahern
1 replies
14h50m

Go programs still make use of, e.g., /etc/resolv.conf and /etc/ssl/certs/ca-certificates.crt. These aren't strictly necessary, but using them makes it easier--more consistent, more transparent--to configure these resources across different languages and frameworks.

marcus_holmes
0 replies
13h44m

I use the semi-official Acme package [0], which handles all that really well. I haven't touched SSL or TLS config for years. I mean, I might be an outlier, but this seems pretty standard for Go deployments these days.

[0] https://pkg.go.dev/golang.org/x/crypto/acme

cedws
0 replies
14h42m

A lot of companies are Kubernetes based now so containers are the default delivery mechanism for all software.

dilyevsky
19 replies
21h6m

This is subtly incorrect - as far as Docker is concerned, the CFS cgroup extension has several knobs to tune: cfs_quota_us, cfs_period_us (typical default is 100ms, not a second), and shares. When you set shares you get weighted proportional scheduling (but only when there's contention); the former two enforce a strict quota. Don't use Docker's --cpus flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.

From Linux docs:

  - cpu.shares: The weight of each group living in the same hierarchy, that
    translates into the amount of CPU it is expected to get. Upon cgroup creation,
    each group gets assigned a default of 1024. The percentage of CPU assigned to
    the cgroup is the value of shares divided by the sum of all shares in all
    cgroups in the same level.
  - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
    bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
    improve throughput at the expense of latency, since the scheduler will be able
    to sustain a cpu-bound workload for longer. The opposite is true for smaller
    periods. Note that this only affects non-RT tasks that are scheduled by the
    CFS scheduler.
  - cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
    for which the current group will be allowed to run. For instance, if it is set to
    half of cpu_period_us, the cgroup will only be able to peak run for 50 % of
    the time. One should note that this represents aggregate time over all CPUs
    in the system. Therefore, in order to allow full usage of two CPUs, for
    instance, one should set this value to twice the value of cfs_period_us.

Thaxll
8 replies
20h48m

People using Kubernetes don't tune or change those settings; it's up to the app to behave properly.

dilyevsky
7 replies
20h41m

False. Kubernetes cpu request sets the shares, cpu limit sets the cfs quota

Thaxll
6 replies
20h38m

You said to change docker flags. Anyway your post is irrelevant; the goal is to let the runtime know how many POSIX threads it should use.

If you set request/limit to 1 core but you run on a 64-core node, then your runtime will see all 64 cores, which will bring performance down.

dilyevsky
5 replies
20h11m

The original article is about Docker. That’s the point of my comment - don't set a CPU limit.

riv991
4 replies
20h6m

I intended it to be applicable to all containerised environments. Docker is just easiest on my local machine.

I still believe it's best to set these variables regardless of cpu limits and/or cpu shares

dilyevsky
3 replies
19h19m

All you did is kneecap your app to have lower performance so it fits under your arbitrary limit. Hardly what most people describe as “best” - only useful in a small percentage of use cases (like reselling compute).

riv991
2 replies
18h54m

I've seen significant performance gains from this in production.

Other people have encountered it too hence libraries like Automaxprocs existing and issues being open with Go for it.

dilyevsky
1 replies
10h11m

Gains by what metric? Are you sure you didn't trade better latency for worse overall throughput? Also, are you sure you didn't hit one of the many CFS overaccounting bugs, of which we've seen a few? Have you compared performance without the limit at all?

riv991
0 replies
6h7m

Previously we had no limit. We observed gains in both latency and throughput by implementing Automaxprocs and decided to roll it out widely.

This aligns with what others have reported on the Go runtime issue open for this.

"When go.uber.org/automaxprocs rolled out at Uber, the effect on containerized Go services was universally positive. At least at the time, CFS imposed such heavy penalties on Go binaries exceeding their CPU allotment that properly tuning GOMAXPROCS was a significant latency and throughput improvement."

https://github.com/golang/go/issues/33803#issuecomment-14308...

cpuguy83
5 replies
20h12m

"Don't use Docker's --cpu flag and instead use"

This is rather strong language without any real qualifiers. It is definitely not "mostly useless". Shares and quotas are for different use-cases, that's all. Understand your use-case and choose accordingly.

dilyevsky
4 replies
19h4m

It doesn’t make any sense to me why the --cpus flag tweaks the quota and not shares, since quota is useful in a tiny minority of use cases. A lot of people waste a ton of time debugging weird latency issues as a result of this decision.

the8472
2 replies
18h26m

With shares you're going to experience worse latency if all the containers on the system size their thread pool to the maximum that's available during idle periods and then constantly context-switch due to oversubscription under load. With quotas you can do fixed resource allocation and the runtimes (not Go apparently) can fit themselves into that and not try to service more requests than they can currently execute given those resources.

dilyevsky
1 replies
16h4m

And how is that different from worse latency due to cpu throttling from your users’ perspective?

the8472
0 replies
7h57m

Fixed queue, so it'll only take as many as it can process and reject the rest, which can be used to do scaling, if you have a cluster. With shares it would think it has all the CPU cores available and oversize the queue.
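
A minimal sketch of that "fixed queue, reject the rest" pattern (the pool size and handler here are placeholders, not anyone's production setup):

    package main

    import (
        "errors"
        "net/http"
    )

    var errOverloaded = errors.New("server at capacity")

    // boundedPool admits at most cap(slots) concurrent jobs and rejects the
    // rest immediately instead of queueing them unboundedly.
    type boundedPool struct {
        slots chan struct{}
    }

    func (p *boundedPool) do(fn func()) error {
        select {
        case p.slots <- struct{}{}:
            defer func() { <-p.slots }()
            fn()
            return nil
        default:
            return errOverloaded // shed load; let the load balancer retry elsewhere
        }
    }

    func main() {
        pool := &boundedPool{slots: make(chan struct{}, 4)} // e.g. sized from the CPU quota
        http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
            if err := pool.do(func() { /* handle the request */ }); err != nil {
                http.Error(w, err.Error(), http.StatusServiceUnavailable)
            }
        })
        _ = http.ListenAndServe(":8080", nil)
    }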

cpuguy83
0 replies
15h51m

These two options are not mutually exclusive.

When you want to limit the max CPU time available to a container use quotas (--cpus). When you want to set relative priorities (compared to other containers/processes), use shares.

These two options can be combined, it all depends on what you need.

mratsim
2 replies
20h30m

Don't use Docker's --cpus flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.

One caveat is that an application can detect when --cpus is used, as I think it's using cpuset. When quotas are used it cannot detect this, and more threads than necessary will likely be spawned.

dilyevsky
0 replies
20h9m

--cpus sets the quota; there is a --cpuset-cpus flag for cpuset, and you can detect both by looking at /sys/fs/cgroup

cpuguy83
0 replies
20h19m

It is not using cpuset (there is a separate flag for this). --cpus tweaks the cfs quota based on the number of cpus on the system and the requested amount.

riv991
0 replies
20h35m

Hi I'm the blog author, thanks for the feedback

I'll try and clarify this. I think this is how the symptom presents, but I should be clearer.

otterley
11 replies
11h31m

This sort of tuning isn't necessary if you use CPU reservations instead of limits, as you should: https://home.robusta.dev/blog/stop-using-cpu-limits

CPU reservations are limits, just implicit ones and declared as guarantees.

So let the Go runtime use all the CPUs available, and let the Linux scheduler throttle according to your declared reservations if the CPU is contended for.

ithkuil
5 replies
9h34m

I don't set limits because I'm afraid of how a pod is going to affect other pods. I set limits because I don't want to get used to being able to tap into the excess CPU available, because that's not guaranteed to be available.

As the node fills up with more and more other pods, it's possible that a pod that was running just fine a moment ago is crawling to a halt.

Limits allow me to simulate the same behavior and plan for it by doing the right capacity planning.

They are not the only way to approach it! But they are the simplest way to do it.

arghwhat
2 replies
6h59m

Limiting CPU to the amount guaranteed to be available also guarantees very significant wasted resource utilization unless all your pods spin 100% CPU continuously.

The best way to utilize resources is to overcommit, and the smart way to overcommit is to, say, allow 4x overcommit with each allocation limited to 1/4th of the available resources so no individual peak can choke the system. Given varied allocations, things average out with a reasonable amount of performance variability.

Idle CPUs are wasted CPUs and money out the window.

midko
1 replies
5h35m

with each allocation limited to 1/4th of the available resources so no individual peak can choke the system.

This assumes that the scheduled workloads are created equal which isn't the case. The app owners do not have control over what else gets scheduled on the node which introduces uncontrollable variability in the performance of what should be identical replicas and environments. What helps here is .. limits. The requests-to-limits ratio allows application owners to reason about the variability risk they are willing to take in relation to the needs of the application (e.g. imagine a latency-sensitive workload on a critical path vs a BAU service vs a background job which just cares about throughput -- for each of these classes, the ratio would probably be very different). This way, you can still overcommit but not by a rule-of-thumb that is created centrally by the cluster ops team (e.g. aim for 1/4) but it's distributed across each workload owner (ie application ops) where this can be done a lot more accurately and with better results. This is what the parent post is also talking about.

arghwhat
0 replies
5h21m

1/4th was merely an example for one resource type, and a suitable limit may be much lower depending on the cluster and workloads. The point is that a limit set to 1/(number of workloads) guarantees wasted resources, and should be set significantly higher based on realistic workloads, while still ensuring that it takes N workloads to consume all resources, to average out the risk of peak demand collisions.

This assumes that the scheduled workloads are created equal which isn't the case.

This particular allocation technique benefits from scheduled workloads not being equal as equality would increase likelihood of peak demand collisions.

eloisant
1 replies
8h50m

That's why you use monitoring and alerting, so you notice degraded performance before the pod is crawling to a halt.

You need to do it anyway because a service might progressively need more resources as it's getting more traffic, even if you're not adding any other pod.

ithkuil
0 replies
7h29m

Sure you need monitoring and alerting and sure there are other reasons why you need to update your requests.

But having _neighbours_ affecting the behaviour of your workload is precisely what creates the kind of fatigue that then results in people claiming that it's hard to run k8s workloads. K8s is highly dynamical, pods can get scheduled on a node by chance sometimes and on some clusters; pagers will ring, incidents will be created for conditions that may solve themselves because of another deployment (possibly of another team) happening.

Overcommit/bursting is an advanced cost saving feature.

Let me say it again: splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature!

The problem is that the request/limits feature is presented in the configuration spec and in the documentation in a deceptively simple way and we're tricked to think it's a basic feature.

Not all companies have ops teams that are well equipped to do more sophisticated things. My advice for those teams who cannot set up full automation around capacity management is to just not use these advanced features.

An alternative is to just use smaller dedicated nodes and (anti)affinity rules, so you always understand which pods go with which other pods. It's clunky but it's actually easier to reason about what's going to happen.

EDIT: typos

yipbub
3 replies
11h0m

Interesting. This is not true for Memory, correct? The OOMKiller might get you.

You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.

iTokio
1 replies
10h5m

Memory is different because it is non-compressible - once you give memory, you can't take it away without killing the process.

afr0ck
0 replies
6h38m

Swap (Disk, RDMA, Compression)? Page migration (NUMA, CXL)?

dilyevsky
0 replies
10h13m

Correct regarding memory - the same doesn't hold for memory because it's non-fungible, unlike CPU shares.

You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.

Evicted due to node pressure - yes (but if all other pods also don't have limits it doesn't matter). For preemption QoS is not factored in the decision [0]

[0] - https://kubernetes.io/docs/concepts/scheduling-eviction/pod-...

kubiton
0 replies
10h1m

I run a few things on 128-core setups and I set CPU limits much higher than requests, but still set them to make sure nothing runs amok.

I would be curious to see this discussed, but your article only states that people think you need limits to ensure CPU for all pods.

bruh2
4 replies
19h42m

As someone not that familiar with Docker or Go, is this behavior intentional? Could the Go team make it aware of the CGroups limit? Do other runtimes behave similarly?

yjftsjthsd-h
2 replies
18h58m

I'm fairly certain that .net had to deal with it, and Java had or still has a problem, I forget which. (Or did you mean runtimes like containerd?)

richdougherty
1 replies
12h24m

Supported in Java 10 (and backported to Java 8) since 2018. Not sure about .NET.

- "The JVM has been modified to be aware that it is running in a Docker container and will extract container specific configuration information instead of querying the operating system. The information being extracted is the number of CPUs and total memory that have been allocated to the container." https://www.oracle.com/java/technologies/javase/8u191-relnot...

- Here's a more detailed explanation and even a shared library that can be used to patch container unaware versions of Java. I wonder if the same could be done for Go?

"LD_PRELOAD=/path/to/libproccount.so java <args>"

https://stackoverflow.com/a/64271429

https://gist.github.com/apangin/78d7e6f7402b1a5da0fa3abd9381...

-

There are more recent changes to Java container awareness as well:

https://developers.redhat.com/articles/2022/04/19/java-17-wh...

jpolidor
0 replies
7h17m

Then in Java, if you don't set the limits, it gets the CPU from the VM via Runtime.getRuntime().availableProcessors()... this method returns the number of CPUs of the VM or the value set as CPU Quota. Starting from Java 11 the -XX:+PreferContainerQuotaForCPUCount is by default true. For Java <= 10 the CPU count is equal to the CPU shares. That method then is used to calculate the GC threads, fork join pool size, compiler threads etc. The solution would be to set -XX:ActiveProcessorCount=X where X is ideally the CPU shares value but as we know shares can change over time, so you would change this value over time...

Edit: or set -XX:-PreferContainerQuotaForCPUCount

eloisant
0 replies
8h43m

Yes, I've experienced the same problem with the JVM (in Scala).

ntonozzi
2 replies
21h27m

I've been bitten many times by the CFS scheduler while using containers and cgroups. What's the new scheduler? Has anyone here tried it in a production cluster? We're now going on two decades of wasted cores: https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf.

the8472
0 replies
18h38m

The problem here isn't the scheduler. It's resource restrictions imposed by the container but the containerized process (Go) not checking the OS features used to do that when calculating the available amount of parallelism.

donaldihunter
0 replies
20h54m

gregfurman
2 replies
20h53m

Discovered this sometime last year in my previous role as a platform engineer managing our on-prem kubernetes cluster as well as the CI/CD pipeline infrastructure.

Although I saw this dissonance between actual and assigned CPU causing issues, particularly CPU throttling, I struggled to find a scalable solution that would affect all Go deployments on the cluster.

Getting all devs to include that automaxprocs dependency was not exactly an option for hundreds of projects. Alternatively, setting all CPU request/limit to a whole number and then assigning that to a GOMAXPROCS environment variable in a k8s manifest was also clunky and infeasible.

I ended up just using this GOMAXPROCS variable for some of our more highly multithreaded applications which yielded some improvements but I’ve yet to find a solution that is applicable to all deployments in a microservices architecture with a high variability of CPU requirements for each project.

linuxftw
0 replies
19h43m

You could define a mutating webhook to inject GOMAXPROCS into all pod containers.

jeffbee
0 replies
19h49m

There isn't one answer for this. Capping GOMAXPROCS may cause severe latency problems if your process gets a burst of traffic and has naive queueing. It's best really to set GOMAXPROCS to whatever the hardware offers regardless of your ideas about how much time the process will use on average.

evntdrvn
1 replies
19h40m

I know that the .NET CLR team adjusted its behavior to address this scenario, fwiw!

the8472
0 replies
18h40m

So did OpenJDK and the Rust standard library.

rickette
0 replies
20h49m

Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go releases. You can use https://github.com/KimMachineGun/automemlimit to automatically set this limit, kinda like https://github.com/uber-go/automaxprocs.
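
(Both are wired in the same way, roughly a pair of blank imports:)

    package main

    import (
        _ "github.com/KimMachineGun/automemlimit" // sets GOMEMLIMIT from the cgroup memory limit
        _ "go.uber.org/automaxprocs"              // sets GOMAXPROCS from the cgroup CPU quota
    )

    func main() {
        // Both knobs now track the container's cgroup settings instead of host resources.
    }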

perryizgr8
0 replies
7h31m

Isn't this a bug in the Go runtime and shouldn't they fix it? It looks like they are using the wrong metric to tune the internal scheduler.

irogers
0 replies
11h29m

There are also GC techniques to make the pause shorter, for example, doing the work for the pause concurrently and then repeating it in the safepoint. The hope is that the concurrent work will turn the safepoint work into a simpler check that no work is necessary. Doubling the work may hurt GC throughput.

hiroshi3110
0 replies
20h49m

How about GKE and containerd?

alilleybrinker
0 replies
15h9m

I feel like this isn't the first time I've read about issues with schedulers-in-schedulers, but I also can't find any immediate references on hand for other examples. Anyone know of any?

WesternStar
0 replies
14h51m

I think this is a great article talking about a thorny point in Golang but boy do I wish I never read this article. I wish this article was never useful to anyone.

ImJasonH
0 replies
20h43m

Thanks for sharing this!

And as a maintainer of ko[1], it was a pleasant surprise to see ko mentioned briefly, so thanks for that too :)

1: https://ko.build