The very first 16 KB enabled Android system will be made available on select devices as a developer option. This is so you can use the developer option to test and fix your apps.
Once an application is fixed to be page-size agnostic, the same application binary can run on both 4 KB and 16 KB devices.
I am curious about this. When could an app NOT be agnostic to this? What would an app have to be doing for this to be noticeable?
The fundamental problem is that system headers don't provide enough information. In particular, many programs need both "min runtime page size" and "max runtime page size" (and by this I mean non-huge pages).
If you call `mmap` without constraint, you need to assume the result will be aligned to at least the "min runtime page size". In practice it is probably safe to assume 4K for "normal" systems, but I've seen it as low as 128 bytes on some embedded systems, and I don't have much breadth there (that will break many programs, though, since there are more errno values than that). I don't know enough about SPARC binary compatibility to know whether it's safe to push this up to 8K for certain targets.
But if you want to call `mmap` (etc.) with full constraint, you must work in terms of "max runtime page size". This is known to be up to at least 64K in the wild (aarch64), but some architectures have "huge" pages not much beyond that so I'm not sure (256K, 512K, and 1M; beyond that is almost certainly going to be considered huge pages).
Besides a C macro, these values also need to be baked into the object file, and the linker needs to prevent incompatible assumptions (just in case a new microarchitecture changes them).
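For the simple case, a minimal sketch of the page-size-agnostic approach in C, querying the size at runtime instead of hardcoding 4 KB (this only covers the "min runtime page size" half of the problem, and it assumes a power-of-two page size):

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    // Ask the OS for the page size instead of assuming 4096.
    long page = sysconf(_SC_PAGESIZE);

    // Round an arbitrary request up to a whole number of pages
    // (assumes the page size is a power of two).
    size_t want = 100000;
    size_t len = (want + (size_t)page - 1) & ~((size_t)page - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("page size %ld, mapped %zu bytes at %p\n", page, len, p);
    munmap(p, len);
    return 0;
}
```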
you can also do 2M and 1G huge pages on x86, it gets kind of silly fast.
What? Any pointers on how 1G speeds things up? I'd have expected a bigger page size to wreak havoc on process scheduling and the filesystem.
Because it speeds up virtual address translation [1]. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, contention on this cache is very high, especially if the workload has a very large working set, causing frequent cache evictions and slow page-table walks. With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB. For example, a TLB with 1024 entries can cover at most 4 MiB of working set with 4 KiB pages; with 2 MiB pages, it can cover up to 2 GiB. Often, the CPU has a different number of entries for each page size.
However, larger page sizes are known to have higher internal fragmentation and thus lead to memory wastage. It's a trade-off. Generally speaking, though, for modern systems the overhead of managing memory in 4 KiB units is very high, and we are at a point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, but transparent 2 MiB pages for heap memory are enabled by default on most major Linux distributions, aka THP [2].
Source: my PhD is on memory management and address translation on large memory systems, having worked both on hardware architecture of address translation and TLBs as well as the Linux kernel. I'm happy to talk about this all day!
[1] https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memo... [2] https://docs.kernel.org/admin-guide/mm/transhuge.html
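To make the TLB-coverage arithmetic above concrete, a toy "TLB reach" calculation (reach = entries × page size), assuming a hypothetical TLB with 1024 entries for every page size:

```c
#include <stdio.h>

int main(void) {
    long long entries = 1024; // hypothetical per-size entry count
    long long sizes[] = {4LL << 10, 2LL << 20, 1LL << 30};
    const char *names[] = {"4 KiB", "2 MiB", "1 GiB"};
    for (int i = 0; i < 3; i++)
        printf("%s pages: reach = %lld MiB\n",
               names[i], (entries * sizes[i]) >> 20);
    // Prints 4 MiB, 2048 MiB, and 1048576 MiB respectively.
    return 0;
}
```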
Oh really :)
I'd like to ask how applications should change their memory allocation or usage patterns to maximise the benefit of THP. Do memory allocators (glibc mainly) need config tweaking to coalesce tiny mallocs into 2MB+ mmaps, will they just always do that automatically, do you need to use a custom pool allocator so you're doing large allocations, or are you never going to get the full benefit of huge pages without madvise/libhugetlbfs? And does this apply to Mac/Windows/*BSD at all?
[Edit: ouch, I see /sys/kernel/mm/transparent_hugepage/enabled is default set to 'madvise' on my system (Slackware) and as a result doing nearly nothing. But I saw it enabled in the past. Well that answers a lot of my questions: got to use madvise/libhugetlbfs.]
I read you also need to ensure ELF segments are properly aligned to get transparent huge pages for code/data.
Another question. From your link [2]:
Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been faulted in or even initialised? Or is that a possibility that's unlikely to happen in practice?
In current Linux systems, there are two main ways to benefit from huge pages. 1) There is the explicit, user-managed approach via hugetlbfs. That's not very common. 2) Transparently managed by the kernel via THP (userspace is completely unaware, and any application using mmap() and malloc() can benefit from it).
As I mentioned before, most major Linux distributions ship with THP enabled by default. THP automatically allocates huge pages for mmap memory whenever possible (that is, when the region is at least 2 MiB in size and 2 MiB aligned). There is also a separate kernel thread, khugepaged, that opportunistically tries to coalesce/promote base 4K pages into 2 MiB huge pages whenever possible.
Library support is not really required for THP, but an unaware library can be detrimental to THP's performance and availability in the long run. A library that is not aware of kernel huge pages may employ suboptimal memory management strategies, resulting in inefficient utilization, for example by unintentionally breaking those huge pages (e.g. via unaligned unmapping), or by failing to properly release them to the OS as one full unit, undermining their availability in the long run. AFAIK, TCMalloc from Google is the only library with extensive huge page awareness [1].
Linux allocates huge pages on first touch. khugepaged only coalesces the pages if all the base pages covering the 2 MiB virtual region exist in some form (not necessarily faulted in; for example, some of those base pages could be in swap space, and Linux will first fault them in and then migrate them).
[1] https://www.usenix.org/system/files/osdi21-hunter.pdf
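For illustration, a minimal sketch of the madvise() opt-in path (relevant when /sys/kernel/mm/transparent_hugepage/enabled is set to 'madvise', as in the Slackware comment above); note that mmap with a NULL hint is not guaranteed to return a 2 MiB-aligned address, so real allocators over-allocate and trim to force alignment:

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    // Map a 2 MiB anonymous region; such memory is a THP candidate
    // when the alignment/size conditions described above are met.
    size_t len = 2u << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Under the 'madvise' policy, explicitly opt this range in to THP.
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise");

    // First touch: the kernel may now back this with a 2 MiB page.
    ((volatile char *)p)[0] = 1;
    return 0;
}
```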
Mimalloc has support for huge pages. It also has an option to reserve 1 GB pages at program start, and I've had very good performance results using that setting and replacing Factorio's allocator, on both Windows and Linux.
Thanks!
With noobs, too? ;)
- Does it mean userspace is free to allocate up to a maximum of 1G? I took pages to have a fixed size.
- Or, you mean CPUs reserve TLB sizes depending on the requested page size?
- Would memory allocators / GCs need to be changed to deal with blocks of 1G? Would you say the current ones found in popular runtimes/implementations are adept at doing so?
- Does it not adversely affect databases accustomed to smaller page sizes now finding themselves paging in 1G at once?
If the dissertation is public, please do link it, if you're comfortable doing so.
The TLB is a hardware cache with a limited number of entries that cannot change dynamically. Your CPU ships with a fixed number of entries dedicated to each page size. Translations of base 4 KiB pages could, for example, have 1024 entries; translations of 2 MiB pages could have 512, and those of 1 GiB usually have a very limited number, only 8 or 16. Nowadays, most CPU vendors have increased their 2 MiB TLBs to the same number of entries as for 4 KiB pages.
If you're wondering why they have to be separate caches, it's because, for any page in memory, you can have both mappings at the same time from different processes or different parts of the same process, with possibly different protections.
Runtimes and databases have full control, and Linux allows per-process policies via the madvise() system call. If a program is not happy with huge pages, it can ask the kernel to leave it alone, just as it can choose to be cooperative and opt in.
I'm still in the PhD process, so no cookies atm :D
I think modern Intel/AMD CPUs have the same number of dTLB entries for all page sizes. For example, with a modern CPU with 3k TLB entries, you can access at most:
- 12 MB with 4 KB pages
- 6 GB with 2 MB pages
- 3 TB with 1 GB pages
If the working set per core is bigger than above numbers you get 10-20% slower memory accesses due to TLB miss penalty.
Are huge pages expected to share code (X) and data (RW)?
There's probably no good reason to put code and data on the same page, it's just one extra TLB entry to use two pages instead so the data page can be marked non-executable.
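A sketch of what that separation looks like for a JIT-style code buffer; protections apply per page, so the page size sets the granularity of the code/data split (error handling trimmed, and the emitted code is elided):

```c
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);

    // The code buffer starts RW so we can emit into it ...
    void *code = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) return 1;

    // ... emit machine code into `code` here ...

    // ... then flip it to R+X (W^X). A data page mapped separately
    // stays RW and never becomes executable.
    if (mprotect(code, (size_t)page, PROT_READ | PROT_EXEC) != 0)
        return 1;
    return 0;
}
```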
Quoting Windows Internals, 7th Edition, Part 1:
It's nice for type 1 hypervisors when carving up memory for guests. When page walks for guest-virtual-to-host-physical translations end up taking sixteen levels, a 1G page cuts that roughly in half, to eight.
That's what most hypervisors (e.g. QEMU) do on Linux when THP is enabled and allowed for the process.
Search for huge pages in the documentation of a DBMS that implements its own caching in shared memory: Oracle [1], PostgreSQL [2], MySQL [3], etc. When you're caching hundreds of gigabytes, it makes a difference. Here's a benchmark comparing PostgreSQL performance with regular, large, and huge pages [4].
There was a really bad performance regression in Linux a couple of years ago that killed performance with large memory regions like this (can't find a useful link at the moment), and the short-term mitigation was to increase the huge page size from 2MB to 1GB.
[1] https://blogs.oracle.com/exadata/post/huge-pages-in-the-cont...
[2] https://www.postgresql.org/docs/current/kernel-resources.htm...
[3] https://dev.mysql.com/doc/refman/8.4/en/large-page-support.h...
[4] https://www.percona.com/blog/benchmark-postgresql-with-linux...
1G huge pages had (have?) performance benefits on managed runtimes in certain scenarios (both the JIT code cache and the GC space saw uplift on the SpecJ benchmarks, if I recall correctly).
If you're using relatively large quantities of memory, 2M should enable much higher TLB hit rates, assuming the CPU doesn't do something silly like only having 4 slots for pages larger than 4k ¬.¬
Yes, but the context here is Java or Kotlin running on Android, not embedded C.
Or do some Android applications run embedded C with only a Java UI? I'm not an Android dev.
Android apps can call into native code via JNI, which the platform supports.
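For illustration, a minimal JNI binding in C; the Java class and method names here are hypothetical, chosen to fit the thread's topic (letting Java code ask for the runtime page size instead of assuming 4 KB):

```c
#include <jni.h>
#include <unistd.h>

// Hypothetical binding for a Java class declaring:
//   package com.example;
//   class PageInfo { static native long pageSize(); }
JNIEXPORT jlong JNICALL
Java_com_example_PageInfo_pageSize(JNIEnv *env, jclass clazz) {
    (void)env; (void)clazz; // unused here
    return (jlong)sysconf(_SC_PAGESIZE); // 4096 or 16384, decided at runtime
}
```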
Wonder if Android apps can also be fully native (C++)?
It is possible to have a project set up with a manifest that contains only a single activity, with android.app.NativeActivity pointing to a .so, and zero lines of Java/Kotlin/Flutter/whatever else, though your app initialization will still go through the usual hoops of spawning a Java-based instance.
A minimal example would be https://github.com/android/ndk-samples/blob/master/native-ac..., though there are well-established Qt-based apps as well.
I saw a project posted on here a while back about writing Android apps with no Java, only C.
There is no good reason to do it, but it is apparently possible.
https://github.com/cnlohr/rawdrawandroid
Apps written in Flutter/Dart and React Native/Javascript both compile to native code with only shims to interface with the Java UI framework.
Flutter/Dart, yes; React Native/JavaScript, no. With RN, the app's code runs via an embedded JavaScript engine, and even when, say, Hermes is being used, it's still executing bytecode, not native machine code.
Also important to note that any code that runs on Android's ART runtime (i.e. Kotlin and/or Java) can get some or all of its code AOT-compiled to machine code by the OS, either upon app install (if the app ships with baseline profiles) or in the background while the device is idle and charging.
Chrome browser on Android uses the same code base as Chrome on desktop, including the multi-process architecture. But its UI is in Java, communicating with C++ using JNI.
I had no idea, thank you!
The Android Native Development Kit (NDK) allows building native code libraries for Android (typically C/C++, but this can include Rust). These can then be loaded and accessed via JNI on the Java/Kotlin side.
* Brief overview of the NDK: https://developer.android.com/ndk/guides
* Guide to supporting 16KB page sizes with the NDK https://developer.android.com/guide/practices/page-sizes
Good to know, thank you!
Yes, Android apps can and do have native libraries. Sometimes this can be part of a SDK, or otherwise out of the developers control.
Wine doesn't work on a 16 KB page size, among other things.
This seems especially peculiar given Windows has a 64K mapping granularity.
Windows uses 4KB pages.
Right (on x86-32 and -64, because you can’t have 64KB pages there, though larger page sizes do exist and get used). You still cannot (e.g.) MapViewOfFile() on an address not divisible by 64KB, because Alpha[1]. As far as I understand, Windows is mostly why the docs for the Blink emulator[2] (a companion project of Cosmopolitan libc) tell you any programs under it need to use sysconf(_SC_PAGESIZE) [aka getpagesize() aka getauxval(AT_PAGESZ)] instead of assuming 4KB.
[1] https://devblogs.microsoft.com/oldnewthing/20031008-00/?p=42...
[2] https://github.com/jart/blink/blob/master/README.md#compilin...
This is no longer true with MapViewOfFile3: third time's the charm, now you can map at page boundaries.
TIL about MapViewOfFile3 and NtMapViewOfSectionEx, thanks! Still, the Microsoft docs say[1]:
The peculiar part is where base address and offset must be divisible by 64K (also referred to as the “allocation granularity”) but the size only needs to be divisible by the page size. Maybe you’re right and the docs are wrong?..
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryap...
The new behavior works under the MEM_REPLACE_PLACEHOLDER flag; you can create those regions with VirtualAlloc2.
4K, 2M ("large page"), or 1G ("huge page") on x86-64. A single allocation request can consist of multiple page sizes. From Windows Internals, 7th Edition, Part 1:
If you use a database library that uses mmap to create a db file with SC_PAGE_SIZE (4 KB) pages, and then upgrade your device to a 16 KB one and backup/restore the app, your data is no longer readable.
Which is the reason you need to format your data to experiment with 16k.
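A sketch of the kind of defensive check a file format could carry so it fails loudly instead of silently misreading data; all names here are illustrative, not from any real database library:

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical on-disk header that records the page size used at creation.
struct db_header {
    uint32_t magic;
    uint32_t page_size;
};

// Refuse to open a file created with a different page size.
static int db_check(FILE *f) {
    struct db_header h;
    if (fread(&h, sizeof h, 1, f) != 1) return -1;
    if (h.page_size != (uint32_t)sysconf(_SC_PAGESIZE))
        return -1; // created on a device with another page size
    return 0;
}

int main(int argc, char **argv) {
    if (argc < 2) return 2;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 2;
    int ok = db_check(f);
    fclose(f);
    return ok == 0 ? 0 : 1;
}
```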
Also ELF segment alignment, which defaults to 4k.
Only on Android, for what it's worth; most "vanilla" Linux aarch64 linkers chose 64K defaults several years ago. But yes, most Android applications with native (NDK) binaries will need to be rebuilt with the new 16 KB max-page-size (e.g. by passing `-Wl,-z,max-page-size=16384` to the linker).
For example, code that uses mmap and just assumes 4 kB pages.
Page sizes are often important to code that relies on low-level details of the environment it's running in, like language runtimes. They might do things like mark some sections of code as writable or executable, and thus need to know the granularity at which those requests can be made. It's also important to things like allocators that hand out memory backed by mmap pages: if they keep, say, a bit field marking which 16-byte regions of a page have been used, that bit field will change in size in ways they can detect.
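A toy version of that allocator-metadata point: the bitmap's size is derived from the page size, so an allocator that hardcoded 4096 would size it wrongly on a 16 KB kernel (a sketch, not any particular allocator's layout):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);

    size_t chunks = (size_t)page / 16; // 16-byte regions per page
    size_t bitmap_bytes = chunks / 8;  // one "used" bit per region

    // 128 bytes of metadata per page at 4 KB, 512 bytes at 16 KB.
    uint8_t *bitmap = calloc(1, bitmap_bytes);
    if (!bitmap) return 1;
    printf("%ld-byte pages -> %zu-byte bitmaps\n", page, bitmap_bytes);
    free(bitmap);
    return 0;
}
```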
I don't know if this fits, but I've seen code that allocated, say, 32 bytes from a function that allocated 1 MB under the hood. Not knowing that's what was happening, the app quickly ran out of memory. Arguably it was not the app's fault: the API it was calling into was poorly designed and poorly named, such that the fact that you might need to know the block size to use the function was in no way indicated by the name of the function or the names of any of its parameters.
If you use mmap/munmap directly within your application you could probably get into trouble by hardcoding the page size.
jemalloc bakes in page-size assumptions; see e.g. https://github.com/jemalloc/jemalloc/issues/467.
When the app has a custom memory allocator, the allocator might have hardcoded the page size for performance. Otherwise you have to load a static variable (knocking out a cache line you could've used for something else) and then do a multiplication (or a bit shift, if you assume a power of 2) by a runtime value instead of a shift by a constant, which can be slower.
No idea if Android apps are ever this performance sensitive, though.
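For what it's worth, the runtime-value cost can be cut down to a one-time setup; a sketch of caching the page shift at startup (assuming a power-of-two page size and a GCC/Clang builtin):

```c
#include <stdio.h>
#include <unistd.h>

static unsigned page_shift; // cached once, then conversions are a shift

static void init_page_shift(void) {
    long page = sysconf(_SC_PAGESIZE);
    page_shift = (unsigned)__builtin_ctzl((unsigned long)page); // 12 for 4 KB, 14 for 16 KB
}

int main(void) {
    init_page_shift();
    size_t npages = 37;
    printf("%zu pages = %zu bytes\n", npages, npages << page_shift);
    return 0;
}
```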