
Adding 16 kb page size to Android

a1o
47 replies
1d

The very first 16 KB enabled Android system will be made available on select devices as a developer option. This is so you can use the developer option to test and fix

once an application is fixed to be page size agnostic, the same application binary can run on both 4 KB and 16 KB devices

I am curious about this. When could an app NOT be agnostic to this? Like, what would an app have to be doing for this to be noticeable?

o11c
28 replies
1d

The fundamental problem is that system headers don't provide enough information. In particular, many programs need both "min runtime page size" and "max runtime page size" (and by this I mean non-huge pages).

If you call `mmap` without constraint, you need to assume the result will be aligned to at least "min runtime page size". In practice it is probably safe to assume 4K for this for "normal" systems, but I've seen it down to 128 bytes on some embedded systems, and I don't have much breadth there (this will break many programs though, since there are more errno values than that). I don't know enough about SPARC binary compatibility to know if it's safe to push this up to 8K for certain targets.

But if you want to call `mmap` (etc.) with full constraint, you must work in terms of "max runtime page size". This is known to be up to at least 64K in the wild (aarch64), but some architectures have "huge" pages not much beyond that so I'm not sure (256K, 512K, and 1M; beyond that is almost certainly going to be considered huge pages).

Besides a C macro, these values also need to be baked into the object file and the linker needs to prevent incompatible assumptions (just in case a new microarchitecture changes them)
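
As a baseline, here is a minimal sketch (not from the comment above) of the page-size-agnostic pattern: query the page size at runtime instead of assuming 4K, and round lengths up to it.

    /* Minimal sketch: query the runtime page size rather than hardcoding 4096,
       then round an arbitrary length up to a page multiple. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);   /* 4096 on most systems, 16384 on a 16K kernel */
        size_t want = 100000;                /* arbitrary example length */
        size_t len = (want + page - 1) & ~((size_t)page - 1);
        printf("page size = %ld, rounded length = %zu\n", page, len);
        return 0;
    }

Note that sysconf() only reports the page size of the kernel you happen to be running on, which is exactly the gap described above: there is no standard way to ask for the largest page size a binary might ever encounter.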

lanigone
15 replies
23h38m

you can also do 2M and 1G huge pages on x86, it gets kind of silly fast.

ignoramous
13 replies
19h54m

What? Any pointers on how 1G speeds things up? I'd have expected a bigger page size to wreak havoc on process scheduling and the filesystem.

afr0ck
9 replies
19h19m

Because of the virtual address translation [1] speed-up. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, contention on this cache is very high, especially if the workload has a very large working set, causing frequent evictions and slow page table walks. With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB. For example, a TLB with 1024 entries can cover at most 4 MiB of working set with 4 KiB pages; with 2 MiB pages, it can cover up to 2 GiB. Often, the CPU has a different number of entries for each page size.
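
Spelling out that TLB-reach arithmetic (reach = number of entries × page size):

    1024 entries x 4 KiB = 4 MiB of TLB reach
    1024 entries x 2 MiB = 2 GiB of TLB reach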

However, it is known that larger page sizes have higher internal fragmentation and thus lead to memory wastage. It's a trade-off. But generally speaking, for modern systems, the overhead of managing memory in 4 KiB units is very high and we are at a point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, but transparent 2 MiB pages for heap memory are enabled by default on most major Linux distributions, aka THP [2].

Source: my PhD is on memory management and address translation on large memory systems, having worked both on hardware architecture of address translation and TLBs as well as the Linux kernel. I'm happy to talk about this all day!

[1] https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memo... [2] https://docs.kernel.org/admin-guide/mm/transhuge.html

versteegen
2 replies
17h50m

I'm happy to talk about this all day!

Oh really :)

I'd like to ask how applications should change their memory allocation or usage patterns to maximise the benefit of THP. Do memory allocators (glibc mainly) need config tweaking to coalesce tiny mallocs into 2MB+ mmaps, will they just always do that automatically, do you need to use a custom pool allocator so you're doing large allocations, or are you never going to get the full benefit of huge pages without madvise/libhugetlbfs? And does this apply to Mac/Windows/*BSD at all?

[Edit: ouch, I see /sys/kernel/mm/transparent_hugepage/enabled is default set to 'madvise' on my system (Slackware) and as a result doing nearly nothing. But I saw it enabled in the past. Well that answers a lot of my questions: got to use madvise/libhugetlbfs.]

I read you also need to ensure ELF segments are properly aligned to get transparent huge pages for code/data.

Another question. From your link [2]:

An application may mmap a large region but only touch 1 byte of it, in that case a 2M page might be allocated instead of a 4k page for no good.

Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been page-faulted in or even initialised? Or is that a possibility that's unlikely to happen in practice?

afr0ck
1 replies
16h38m

In current Linux systems, there are two main ways to benefit from huge pages. 1) There is the explicit, user-managed approach via hugetlbfs. That's not very common. 2) Transparently managed by the kernel via THP (userspace is completely unaware and any application using mmap() and malloc() can benefit from that).

As I mentioned before, most major Linux distributions ship with THP enabled by default. THP automatically allocates huge pages for mmap memory whenever possible (that is, when the region is at least 2 MiB in size and 2 MiB aligned). There is also a separate kernel thread, khugepaged, that opportunistically tries to coalesce/promote base 4K pages into 2 MiB huge pages whenever possible.

Library support is not really required for THP, but the lack of it can be detrimental to performance and huge page availability in the long run. A library that is not aware of kernel huge pages may employ suboptimal memory management strategies, resulting in inefficient utilization, for example by unintentionally breaking those huge pages (e.g. via unaligned unmapping), or by failing to properly release them to the OS as one full unit, undermining their availability in the long run. Afaik, TCMalloc from Google is the only library with extensive huge page awareness [1].

Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been page-faulted in or even initialised? Or is that a possibility that's unlikely to happen in practice?

Linux allocates huge pages on first touch. As for khugepaged, it only coalesces the pages if all the base pages covering the 2 MiB virtual region exist in some form (not necessarily faulted in; for example, some of those base pages could be in swap space, and Linux will first fault them in and then migrate them).
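
To make the cooperative path concrete, here is a minimal sketch (not from the comment above) that reserves a 2 MiB-aligned anonymous region and asks the kernel to back it with THP via madvise(MADV_HUGEPAGE); this works even when the system-wide policy is "madvise" rather than "always":

    /* Hedged sketch: over-allocate, align to 2 MiB, then request THP backing.
       A real allocator would also munmap() the unaligned head and tail. */
    #define _GNU_SOURCE              /* for MADV_HUGEPAGE on some libcs */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define TWO_MIB (2UL * 1024 * 1024)

    int main(void) {
        size_t want = 64 * TWO_MIB;
        void *raw = mmap(NULL, want + TWO_MIB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
            return 1;
        uintptr_t aligned = ((uintptr_t)raw + TWO_MIB - 1) & ~(TWO_MIB - 1);
        if (madvise((void *)aligned, want, MADV_HUGEPAGE) != 0)
            perror("madvise");       /* the kernel may still fall back to 4 KiB pages */
        return 0;
    }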

[1] https://www.usenix.org/system/files/osdi21-hunter.pdf

Dylan16807
0 replies
12h27m

Mimalloc has support for huge pages. It also has an option to reserve 1GB pages on program start, and I've had very good performance results using that setting and replacing Factorio's allocator, on both Windows and Linux.

ignoramous
2 replies
19h5m

Thanks!

I'm happy to talk about this all day!

With noobs, too? ;)

Often, the CPU has a different number of entries for each page size.

- Does it mean userspace is free to allocate up to a maximum of 1G? I took pages to have a fixed size.

- Or, you mean CPUs reserve TLB sizes depending on the requested page size?

With 2 MiB or 1 GiB pages, there is less contention and more of the working set is covered by the TLB

- Would memory allocators / GCs need to be changed to deal with blocks of 1G? Would you say, the current ones found in popular runtimes/implementations are adept at doing so?

- Does it not adversely affect databases accustomed to smaller page sizes now finding themselves paging in 1G at once?

my PhD is on memory management and address translation on large memory systems

If the dissertation is public, please do link it, if you're comfortable doing so.

afr0ck
1 replies
16h20m

- Does it mean userspace is free to allocate up to a maximum of 1G? I took pages to have a fixed size.

- Or, you mean CPUs reserve TLB sizes depending on the requested page size?

The TLB is a hardware cache with a limited number of entries that cannot dynamically change. Your CPU is shipped with a fixed number of entries dedicated to each page size. Translations of base 4 KiB pages could, for example, have 1024 entries. Translations of 2 MiB pages could have 512 entries, and those of 1 GiB usually have a very limited number of only 8 or 16. Nowadays, most CPU vendors have increased their 2 MiB TLBs to have the same number of entries as those dedicated to 4 KiB pages.

If you're wondering why they have to be separate caches, it's because, for any page in memory, you can have mappings at both sizes at the same time, from different processes or different parts of the same process, with possibly different protections.

- Would memory allocators / GCs need to be changed to deal with blocks of 1G? Would you say, the current ones found in popular runtimes/implementations are adept at doing so?

- Does it not adversely affect databases accustomed to smaller page sizes now finding themselves paging in 1G at once?

Runtimes and databases have full control, and Linux allows per-process policies via the madvise() system call. If a program is not happy with huge pages, it can ask the kernel to leave it alone, or it can choose to be cooperative.

If the dissertation is public, please do link it, if you're comfortable doing so.

I'm still in the PhD process, so no cookies atm :D

gmokki
0 replies
8h30m

I think modern Intel/AMD CPUs have the same number of dTLB entries for all page sizes. For example, a modern CPU with 3k TLB entries can access at most:

- 12 MB with a 4k page size

- 6 GB with a 2M page size

- 3 TB with a 1G page size

If the working set per core is bigger than the numbers above, you get 10-20% slower memory accesses due to the TLB miss penalty.

CalChris
2 replies
17h46m

Are huge pages expected to share code (X) and data (RW)?

versteegen
0 replies
17h33m

There's probably no good reason to put code and data on the same page; it costs just one extra TLB entry to use two pages instead, so the data page can be marked non-executable.

nullindividual
0 replies
13h54m

Quoting Windows Internals 7th Edt Part 1:

    There is an unfortunate side effect of large pages. Each page (whether huge, large, or small) must be mapped with a single protection that applies to the entire page. This is because hardware memory protection is on a per-page basis. If a large page contains, for example, both read-only code and read/write data, the page must be marked as read/write, meaning that the code will be writable. As a result, device drivers or other kernel-mode code could, either maliciously or due to a bug, modify what is supposed to be read-only operating system or driver code without causing a memory access violation.

monocasa
1 replies
17h57m

It's nice for type 1 hypervisors when carving up memory for guests. When page walks for guest virtual to host physical end up taking sixteen levels, a 1G page short-circuits that, cutting it in half to eight.

afr0ck
0 replies
16h16m

That's what most hypervisors (e.g. QEMU) do on Linux when THP is enabled and allowed for the process.

SloopJon
0 replies
4h3m

Search for huge pages in the documentation of a DBMS that implements its own caching in shared memory: Oracle [1], PostgreSQL [2], MySQL [3], etc. When you're caching hundreds of gigabytes, it makes a difference. Here's a benchmark comparing PostgreSQL performance with regular, large, and huge pages [4].

There was a really bad performance regression in Linux a couple of years ago that killed performance with large memory regions like this (can't find a useful link at the moment), and the short-term mitigation was to increase the huge page size from 2MB to 1GB.

[1] https://blogs.oracle.com/exadata/post/huge-pages-in-the-cont...

[2] https://www.postgresql.org/docs/current/kernel-resources.htm...

[3] https://dev.mysql.com/doc/refman/8.4/en/large-page-support.h...

[4] https://www.percona.com/blog/benchmark-postgresql-with-linux...

ShroudedNight
0 replies
20h33m

1G huge pages had (have?) performance benefits on managed runtimes for certain scenarios (Both the JIT code cache and the GC space saw uplift on the SpecJ benchmarks if I recall correctly)

If you're using relatively large quantities of memory, 2M should enable much higher TLB hit rates, assuming the CPU doesn't do something silly like only having 4 slots for pages larger than 4k ¬.¬

dotancohen
11 replies
23h43m

Yes, but the context here is Java or Kotlin running on Android, not embedded C.

Or do some Android applications run embedded C with only a Java UI? I'm not an Android dev.

saagarjha
3 replies
23h39m

Android apps can call into native code via JNI, which the platform supports.

ignoramous
2 replies
19h51m

Wonder if Android apps can also be fully native (C++)?

fensgrim
0 replies
18h55m

It is possible to have a project set up with a manifest which contains only a single activity with android.app.NativeActivity pointing to a .so, and zero lines of java/kotlin/flutter/whatever else - though your app initialization will go through usual hoops of spawning a java-based instance.

Minimal example would be https://github.com/android/ndk-samples/blob/master/native-ac..., though there are well established Qt based apps as well

extraduder_ire
0 replies
2h30m

I saw a project posted on here a while back about writing android apps with no java, only c.

There is no good reason to do it, but it is apparently possible.

https://github.com/cnlohr/rawdrawandroid

warkdarrior
1 replies
22h18m

Apps written in Flutter/Dart and React Native/Javascript both compile to native code with only shims to interface with the Java UI framework.

farmerbb
0 replies
14h44m

Flutter/Dart, yes, React Native/Javascript, no. With RN the app's code runs via an embedded JavaScript engine, and even when, say, Hermes is being used, it's still executing bytecode not native machine code.

Also important to note that any code that runs on Android's ART runtime (i.e. Kotlin and/or Java) can get some or all of its code AOT-compiled to machine code by the OS, either upon app install (if the app ships with baseline profiles) or in the background while the device is idle and charging.

fpoling
1 replies
21h25m

Chrome browser on Android uses the same code base as Chrome on desktop, including the multi-process architecture. But its UI is in Java, communicating with C++ using JNI.

dotancohen
0 replies
3h23m

I had no idea, thank you!

dotancohen
0 replies
3h24m

Good to know, thank you!

orf
0 replies
23h39m

Yes, Android apps can and do have native libraries. Sometimes this can be part of a SDK, or otherwise out of the developers control.

sweeter
7 replies
1d

Wine doesn't work on 16 KB page size among other things.

mananaysiempre
6 replies
23h21m

This seems especially peculiar given Windows has a 64K mapping granularity.

tredre3
5 replies
22h46m

Windows uses 4KB pages.

mananaysiempre
3 replies
22h26m

Right (on x86-32 and -64, because you can’t have 64KB pages there, though larger page sizes do exist and get used). You still cannot (e.g.) MapViewOfFile() on an address not divisible by 64KB, because Alpha[1]. As far as I understand, Windows is mostly why the docs for the Blink emulator[2] (a companion project of Cosmopolitan libc) tell you any programs under it need to use sysconf(_SC_PAGESIZE) [aka getpagesize() aka getauxval(AT_PAGESZ)] instead of assuming 4KB.

[1] https://devblogs.microsoft.com/oldnewthing/20031008-00/?p=42...

[2] https://github.com/jart/blink/blob/master/README.md#compilin...

alephr
2 replies
7h37m

this is no longer true with MapViewOfFile3: Third time's a charm, now you can map to page boundaries

mananaysiempre
1 replies
7h19m

TIL about MapViewOfFile3 and NtMapViewOfSectionEx, thanks! Still, the Microsoft docs say[1]:

[in, optional] BaseAddress

The desired base address of the view (the address is rounded down to the nearest 64k boundary).

[...]

[in] Offset

The offset from the beginning of the section.

The offset must be 64k aligned.

The peculiar part is where base address and offset must be divisible by 64K (also referred to as the “allocation granularity”) but the size only needs to be divisible by the page size. Maybe you’re right and the docs are wrong?..

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryap...

alephr
0 replies
6h39m

The new behavior works under the MEM_REPLACE_PLACEHOLDER flag; you can create those regions with VirtualAlloc2.

nullindividual
0 replies
22h30m

4K, 2M ("large page"), or 1G ("huge page") on x86-64. A single allocation request can consist of multiple page sizes. From Windows Internals 7th Edt Part 1:

    On Windows 10 version 1607 x64 and Server 2016 systems, large pages may also be mapped with huge pages, which are 1 GB in size. This is done automatically if the allocation size requested is larger than 1 GB, but it does not have to be a multiple of 1 GB. For example, an allocation of 1040 MB would result in using one huge page (1024 MB) plus 8 “normal” large pages (16 MB divided by 2 MB).

growse
1 replies
22h33m

If you use a database library that does mmap to create a db file with SC_PAGE_SIZE (4KB) pages, and then upgrade your device to a 16KB one and backup/restore the app, now your data isn't readable.
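
One concrete way this can bite (a hedged illustration, not necessarily what any particular db library does): mmap() requires the file offset to be a multiple of the runtime page size, so offsets computed as multiples of a hardcoded 4096 stop working on a 16 KB kernel.

    /* Sketch: mapping a record at a 4 KiB-multiple offset fails with EINVAL
       once the kernel's page size is 16 KiB, because 3*4096 is no longer
       page-aligned. "example.db" is a hypothetical file name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("example.db", O_RDONLY);
        if (fd < 0)
            return 1;
        off_t offset = 3 * 4096;                    /* aligned only for 4 KiB pages */
        void *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, offset);
        if (p == MAP_FAILED)
            perror("mmap");                         /* EINVAL on a 16 KiB-page system */
        close(fd);
        return 0;
    }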

phh
0 replies
11h32m

Which is the reason you need to format your data to experiment with 16k

dmytroi
1 replies
1d

Also ELF segment alignment, which defaults to 4k.

bri3d
0 replies
21h24m

Only on Android, for what it's worth; most "vanilla" Linux aarch64 linkers chose 64K defaults several years ago. But yes, most Android applications with native (NDK) binaries will need to be rebuilt with the new 16kb max-page-size.

vardump
0 replies
1d

For example use mmap and just assume 4 kB pages.

saagarjha
0 replies
23h39m

Page sizes are often important to code that relies on low-level details of the environment it’s running in, like language runtimes. They might do things like mark some sections of code as writable or executable and thus would need to know what the granularity of those requests can be. It’s also of importance to things like allocators that hand out memory backed by mmap pages. If they have, say, a bit field for each 16-byte region of a page that has been used, that will change in size in ways they can detect.
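
A hedged sketch of that last point (the struct and sizes are made up for illustration): per-page metadata sized as one bit per 16-byte chunk quadruples when the page size goes from 4 KB to 16 KB, so a hardcoded constant silently covers only a quarter of the page.

    /* One bit per 16-byte chunk: 4096/16 = 256 bits = 32 bytes of bitmap.
       With 16 KiB pages the same map needs 1024 bits = 128 bytes, so the
       hardcoded constant below under-sizes the bitmap by 4x. */
    #include <stdint.h>
    #include <stdio.h>

    #define ASSUMED_PAGE_SIZE 4096
    #define CHUNK_SIZE        16

    struct page_meta {
        uint8_t used_bitmap[ASSUMED_PAGE_SIZE / CHUNK_SIZE / 8];  /* 32 bytes */
    };

    int main(void) {
        printf("bitmap bytes assumed: %zu\n", sizeof(struct page_meta));
        return 0;
    }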

nox101
0 replies
21h9m

I don't know if this fits, but I've seen code that allocated, say, 32 bytes from a function that allocated 1 MB under the hood. Not knowing that's what was happening, the app quickly ran out of memory. It arguably was not the app's fault. The API it was calling into was poorly designed and poorly named, such that the fact that you might need to know the block size to use the function was in no way indicated by the name of the function nor the names of any of its parameters.

mlmandude
0 replies
1d

If you use mmap/munmap directly within your application you could probably get into trouble by hardcoding the page size.

dataflow
0 replies
17h37m

When could an app NOT be agnostic to this

When the app has a custom memory allocator, the allocator might have hardcoded the page size for performance. Otherwise you have to load a static variable (knocks out a cache line you could've used for something else) and then do a multiplication (or bit shift, if you assume power of 2) by a runtime value instead of a shift by a constant, which can be slower.

No idea if Android apps are ever this performance sensitive, though.
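
For illustration (a hedged sketch, not any real allocator's code), the difference between shifting by a compile-time constant and shifting by a runtime-loaded page shift:

    /* The fixed variant folds to a single shift-immediate; the runtime variant
       must load a global and shift by a register value. */
    #include <stddef.h>
    #include <unistd.h>

    #define PAGE_SHIFT_FIXED 12                 /* hardcoded 4 KiB assumption */

    static size_t page_shift_runtime;           /* filled in once at startup */

    void allocator_init(void) {
        long ps = sysconf(_SC_PAGESIZE);
        size_t shift = 0;
        while ((1L << shift) < ps)
            shift++;
        page_shift_runtime = shift;
    }

    size_t page_index_fixed(size_t offset)   { return offset >> PAGE_SHIFT_FIXED; }
    size_t page_index_runtime(size_t offset) { return offset >> page_shift_runtime; }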

devit
27 replies
23h27m

Seems pretty dubious to do this without adding support for having both 4KB and 16KB processes at once to the Linux kernel, since it means all old binaries break and emulators which emulate normal systems with 4KB pages (Wine, console emulators, etc.) might dramatically lose performance if they need to emulate the MMU.

Hopefully they don't actually ship a 16KB default before supporting 4KB pages as well in the same kernel.

Also it would probably be reasonable, along with making the Linux kernel change, to design CPUs where you can configure a 16KB pagetable entry to map at 4KB granularity and pagefault after the first 4KB or 8KB (requires 3 extra bits per PTE or 2 if coalesced with the invalid bit), so that memory can be saved by allocating 4KB/8KB pages when 16KB would have wasted padding.

Veserv
14 replies
22h37m

Having both 4KB and 16KB simultaneously is either easy or hard depending on which hardware feature they are using for 16KB pages.

If they are using the configurable granule size, then that is a system-wide hardware configuration option. You literally can not map at smaller granularity while that bit is set.

You might be able to design a CPU that allows your idea of partial pages, but there be dragons.

If they are not configuring the granule size, instead opting for software enforcement in conjunction with always using the contiguous hint bit, then it might be possible.

However, I am pretty sure they are talking about hardware granule size, since the contiguous hint is most commonly used to support 16 contiguous entries (though the CPU designer is technically allowed to do whatever grouping they want), which would be 64KB.

stingraycharles
8 replies
21h48m

I’m a total idiot, how exactly is page size a CPU issue rather than a kernel issue? Is it about memory channel protocols / communication?

Disks have been slowly migrating away from the 4kb sector size, is this the same kind of thing going on? That you need the actual drive to support it, because of internal structuring (i.e. how exactly the CPU aligns things in RAM), and on some super low level 4kb / 16kb being the smallest unit of memory you can allocate?

And does that then mean that there's less overhead in all kinds of memory (pre)fetchers in the CPU, because more can be achieved in fewer clock cycles?

IshKebab
4 replies
21h25m

The CPU has hardware that does a page table walk automatically when you access an address for which the translation is not cached in the TLB. Otherwise virtual memory would be really slow.

Since the CPU hardware itself is doing the page table walk it needs to understand page tables and page table entries etc. including how big pages are.

Also you need to know how big pages are for the TLB itself.

The value of 4kB itself is pretty much arbitrary. It has to be a small enough number that you don't waste a load of memory by mapping memory that isn't used (e.g. if you ask for 4.01kB you're actually going to get 8kB), but a large enough number that you aren't spending all your time managing tiny pages.

That's why increasing the page size makes things faster but wastes more memory.

4kB arguably isn't optimal anymore since we have way more memory now than when it was de facto standardised so it doesn't matter as much if we waste a bit. Maybe.

quotemstr
3 replies
21h7m

As an aside, it's a shame that hardware page table walking won out over software-filled TLBs, as some older computers had. I wonder what clever and wonderful hacks we might have been able to invent had we not needed to give the CPU a raw pointer to a data structure the layout of which is fixed forever.

Denvercoder9
1 replies
20h16m

Page table layout isn't really fixed forever; x86 has changed its layout multiple times.

quotemstr
0 replies
6h58m

Not without revving the hardware though

IshKebab
0 replies
20h45m

Yeah maybe, though in practice I think it would be just too slow.

s_tec
0 replies
19h56m

Each OS process has its own virtual address space, which is why one process cannot read another's memory. The CPU implements these address spaces in hardware, since literally every memory read or write needs to have its address translated from virtual to physical.

The CPU's address translation process relies on tables that the OS sets up. For instance, one table entry might say that the 4K memory chunk with virtual address 0x21000-0x21fff maps to physical address 0xf56e3000, and is both executable and read-only. So yes, the OS sets up the tables, but the hardware implements the protection.

Since memory protection is a hardware feature, the hardware needs to decide how fine-grained the pages are. It's possible to build a CPU with byte-level protection, but this would be crazy-inefficient. Bigger pages mean less translation work, but they can also create more wasted space. Sizes in the 4K-64K range seem to offer good tradeoffs for everyday workloads.

pwg
0 replies
20h46m

I’m a total idiot, how exactly is page size a CPU issue rather than a kernel issue?

Because the size of a page is a hardware-defined size for Intel and ARM CPUs (well, more modern Intel and ARM CPUs give the OS a choice of sizes from a small set of options).

It (page size) is baked into the CPU hardware.

And does that then mean that there’s less overhead in all kinds of memory (pre)fetchers in the CPU, because more can be achieved in less clock cycles?

For the same size TLB (Translation Look-aside Buffer -- the CPU hardware that stores the "referencing info" for the currently active set of pages being used by the code running on the CPU) a larger page size allows more total memory to be accessible before taking a TLB miss and having to replace one or more of the entries in the TLB. So yes, it means less overhead, because CPU cycles are not used up in replacing as many TLB entries as often.

fpoling
0 replies
21h16m

Samsung SSDs still report to the system that their logical sector size is 512 bytes. In fact, one of the recent models even removed the option to reconfigure the disk to use 4k logical sectors. Presumably Samsung figured that since the physical sector is much larger and they need complex mapping of logical sectors in any case, they might as well not support the 4K option and stick with 512 bytes.

sweetjuly
4 replies
20h6m

Hmm, I'm not sure that's quite right. ARMv8 supports per TTBR translation granules [1] and so you can have 4K and 16K user processes coexisting under an arbitrary page size kernel by just context switching TCR.TG0 at the same time as TTBR0. There is no such thing as a global granule size.

[1]: https://arm.jonpalmisc.com/2023_09_sysreg/AArch64-tcr_el2#fi...

Veserv
2 replies
18h35m

Well, if you want to run headfirst into the magical land of hardware errata, I guess you could go around creating heterogeneous, switched mappings.

I doubt the TCRs were ever intended to support rapid runtime switching or that the TLBs were ever intended to support heterogeneous entries even with ASID tagging.

saagarjha
0 replies
12h49m

Apple has been literally doing this for years.

quotemstr
0 replies
17h17m

You've listed things that could go wrong without citing specific errata. Should we just assume that hardware doesn't work as documented? It seems premature to deem the feature buggy without having tried it.

jonpalmisc
0 replies
1h32m

cool site you linked there :)

mgaunard
4 replies
23h16m

Why does it break userland? If you need to know the page size, you should query sysconf(_SC_PAGESIZE).

ndesaulniers
0 replies
22h15m

Ossification.

If the page size has been 4k for decades for most OS' and architectures, people get sloppy and hard code that literal value, rather than query for it.

fweimer
0 replies
22h19m

It should not break userland. GNU/Linux (not necessarily Android though) has supported 64K pages pretty much from the start because that was the original page size chosen for server-focused kernels and distributions. But there are some things that need to be worked around.

Certain build processes determine the page size at compile time and assume it's the same at run time, and fail if it is not: https://github.com/jemalloc/jemalloc/issues/467

Some memory-mapped file formats have assumptions about page granularity: https://bugzilla.redhat.com/show_bug.cgi?id=1979804

The file format issue applies to ELF as well. Some people patch their toolchains (or use suitable linker options) to produce slightly smaller binaries that can only be loaded if the page size is 4K, even though the ABI is pretty clear in that you should link for compatibility with up to 64K pages.

Dwedit
0 replies
22h25m

Emulating a processor with 4K pages becomes much faster if you can use real addresses directly.

username81
1 replies
23h20m

Shouldn't there be some kind of setting to change the page size per program? AFAIK AMD64 CPUs can do this.

saagarjha
0 replies
12h46m

Yes, ARM CPUs can do it too.

phh
1 replies
22h54m

Google/Android doesn't care much about backward compatibility and broke programs released for the Pixel 3 on the Pixel 7 (the ban on 32-bit-only apps on the Play Store dates to 2019, the Pixel 7 is the first 64-bit-only device, while Google still released a 32-bit-only device in 2023...). They quite regularly break apps in new Android versions (despite their infrastructure to handle backward compatibility), and app developers are used to bracing themselves around Android & Pixel releases.

reissbaker
0 replies
20h43m

Generally I've found Google to care much more about not breaking old apps compared to Apple, which often expects developers to rebuild apps for OS updates or else the apps stop working entirely (or buy entirely new machines to get OS updates at all, e.g. the Intel/Apple Silicon transition). Google isn't on the level of Windows "we will watch for specific binaries and re-introduce bugs in the kernel specifically for those binaries that they depend on" in terms of backwards compatibility, but I wouldn't go so far as to say they don't care. I'm not sure whether that's better or worse: there's definitely merit to Apple's approach, since it keeps them able to iterate quickly on UX and performance by dropping support for the old stuff.

lxgr
0 replies
21h49m

all old binaries break and emulators which emulate normal systems with 4KB pages

Would it actually affect the kind of emulators present on Android, i.e. largely software-only ones, as opposed to hardware virtualizers making use of a CPU's vTLB?

Wine is famously not an emulator and as such doesn't really exist/make sense on (non-x86) Android (as it would only be able to execute ARM binaries, not x86 ones).

For the downvote: Genuinely curious here on which type of emulator this could affect.

fouronnes3
0 replies
23h19m

Could they upstream that or would that require a fork?

Zefiroj
0 replies
21h40m

The support for mTHP exists in upstream Linux, but the swap story is not quite there yet. THP availability also needs work and there are a few competing directions.

Supporting multiple page sizes well transparently is non-trivial.

For a recent summary on one of the approaches, TAO (THP Allocation Optimization), see this lwn article: https://lwn.net/Articles/974636/

twoodfin
25 replies
1d

A little additional background: iOS has used 16KB pages since the 64-bit transition, and ARM Macs have inherited that design.

arghwhat
22 replies
1d

A more relevant bit of background is that 4KB pages lead to quite a lot of overhead due to the sheer number of mappings needing to be configured and cached. Using larger pages reduces overhead, in particular TLB misses, as fewer entries are needed to describe the same memory range.

While x86 chips mainly support 4K, 2M and 1G pages, ARM chips tend to support a more practical 16K page size - a nice balance between performance and wasting memory due to lower allocation granularity.

Nothing in particular to do with Apple and iOS.

jsheard
18 replies
1d

Makes me wonder how much performance Windows is leaving on the table with its primitive support for large pages. It does support them, but it doesn't coalesce pages transparently like Linux does, and explicitly allocating them requires special permissions and is very likely to fail due to fragmentation if the system has been running for a while. In practice it's scarcely used outside of server software which immediately grabs a big chunk of large pages at boot and holds onto them forever.
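
For reference, the explicit path on Windows looks roughly like this (a hedged sketch; error handling and the AdjustTokenPrivileges call to enable SeLockMemoryPrivilege are omitted, and the allocation fails without that privilege or without enough contiguous free memory):

    /* Sketch of an explicit large-page allocation on Windows. */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T min_large = GetLargePageMinimum();   /* typically 2 MiB on x86-64 */
        void *p = VirtualAlloc(NULL, min_large,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        printf("large page minimum: %zu bytes, allocation: %p\n",
               (size_t)min_large, p);
        return 0;
    }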

andai
11 replies
1d

A lot of low-level stuff is a lot slower on Windows, let alone the GUI. There are also entire blogs cataloging an abundance of pathological performance issues.

The one I notice the most is the filesystem. Running Linux in VirtualBox, I got 7x the host speed for many small file operations. (On top of that Explorer itself has its own random lag.)

I think a better question is how much performance they're leaving on the table by bloating the OS so much. Like they could have just not touched Explorer for 20 years and it would be 10x snappier now.

I think the number is closer to 100x actually. Explorer on XP opens (fully rendered) after a single video frame... also while running virtualized inside Win10.

Meanwhile Win10 Explorer opens after a noticeable delay, and then spends the next several hundred milliseconds painting the UI elements one by one...

nullindividual
4 replies
23h13m

The one I notice the most is the filesystem.

This is due to the extensible file system filter model; I'm not aware of another OS that implements this feature. It's primarily used for antivirus, but can be used by any developer for any purpose.

It applies to all file systems on Windows.

DevDrive[0] is Microsoft's current solution to this.

Meanwhile Win10 Explorer opens after a noticeable delay

This could be, again, largely due to 3rd party hooks (or 1st party software that doesn't ship with Windows) into Explorer.

[0] https://devblogs.microsoft.com/visualstudio/devdrive/

redleader55
1 replies
22h14m

I'm not aware of another OS that implements this feature

I'm not sure this is exactly what you mean, but Linux has inotify and all sorts of BPF hooks for filtering various syscalls, for example file operations.

rincebrain
0 replies
21h16m

FSFilters are basically a custom kernel module that can and will do anything they want on any filesystem access. (There's also network filters, which is how things like WinPcap get implemented.)

So yes, you could implement something similar in Linux, but there's not, last I looked, a prebuilt toolkit and infrastructure for them, just the generic interfaces you can use to hook anything.

(Compare the difference between writing a BPF module to hook all FS operations, and the limitations of eBPF, to having an InterceptFSCalls struct that you define in your custom kernel module to run your own arbitrary code on every access.)

andai
1 replies
22h18m

I'm glad you mentioned that. I noticed when running a "Hello world" C program on Windows 10 that Windows performs over 100 reads of the Registry before running the program. Same thing when I right-click a file...

A few of those are 3rd party, but most are not.

nullindividual
0 replies
22h4m

Remember that Win32 process creation is expensive[0]. And on NT, processes don't run, threads do.

The strategy of applications, like olde-tymey Apache using multiple processes to handle incoming connections is fine on UN*X, but terrible on Windows.

[0] https://fourcore.io/blogs/how-a-windows-process-is-created-p...

Const-me
2 replies
23h29m

The one I notice the most is the filesystem

I’m not sure it’s the file system per se, I believe the main reason is the security model.

NT kernel has rather sophisticated security. The securable objects have security descriptors with many access control entries and auditing rules, which inherit over file system and other hierarchies according to some simple rules e.g. allow+deny=deny. Trustees are members of multiple security groups, and security groups can include other security groups so it’s not just a list, it’s a graph.

This makes access checks in NT relatively expensive. The kernel needs to perform access check every time a process creates or opens a file, that’s why CreateFile API function is relatively slow.

temac
1 replies
21h22m

I've been trying to use auditing rules for a usage that seems completely in scope and obvious to prioritize from a security point of view (tracing access to EFS files and/or the keys allowing the access) and my conclusion was that you basically can't: the doc is garbage, the implementation is probably ad hoc with lots of holes, and MS probably hasn't prioritised the maintenance of this feature in decades (too busy adding ads to the start menu, I guess).

The NT security descriptors are also so complex that they are probably a little useless in practice too, because they're too hard to use correctly. On top of that, the associated Win32 API is also too hard to use correctly, to the point that I found an important bug in the usage model described in MSDN, meaning that the doc writer did not know how the function actually works (in tons of cases you probably don't hit this case, but if you start digging into all internal and external users, who knows what you could find...)

NT was full of good ideas but the execution is often quite poor.

nullindividual
0 replies
13h47m

From an NTFS auditing perspective, there’s no difference between auditing a non-EFS file or EFS file. Knowing that file auditing works just fine having done it many times, what makes you say it doesn’t work?

saagarjha
1 replies
23h44m

None of this has to do with page size.

pantalaimon
0 replies
23h27m

Death by 1000 cuts

hinkley
0 replies
20h34m

The one I notice the most is the filesystem. Running Linux in VirtualBox, I got 7x the host speed for many small file operations. (On top of that Explorer itself has its own random lag.)

That’s a very old problem. In the early days of Subversion, the metadata for every directory existed in the directory. The rationale was that you could check out just a directory in svn. It was disastrously slow on Windows and the Subversion maintainers had no answer for it, except insulting ones like “turn off virus scanning”. Telling a Windows user to turn off virus scanning is equivalent to telling someone to play freeze tag in traffic. You might as well just tell them, “go fuck yourself with a rusty chainsaw”.

Someone reorganized the data so it all lived at the root directory and the CLI just searched upward until it found the single metadata file. If memory serves, that made large checkouts and updates about 2-3 times faster on Linux and 20x faster on Windows.

arghwhat
4 replies
1d

Quite a bit, but 2M is an annoying size and the transparent handling is suboptimal. Without userspace cooperating, the kernel might end up having to split the pages at random due to an unfortunate unaligned munmap/madvise from an application not realizing it was being served 2M pages.

Having Intel/AMD add 16-128K page support, or making it common for userspace to explicitly ask for 2M pages for their heap arenas is likely better than the page merging logic. Less fragile.

1G pages are practically useless outside specialized server software, as it is very difficult to find 1G of contiguous memory to back them on a “normal” system that has been running for a while.

jsheard
1 replies
23h47m

Would a reasonable compromise be to change the base allocation granularity to 2MB, and transparently sub-allocate those 2MB blocks into 64KB blocks (the current Windows allocation granularity) when normal pages are requested? That feels like it should keep 2MB page fragmentation to a minimum without breaking existing software, but given they haven't done it there's probably some caveat I'm overlooking.

bewaretheirs
1 replies
20h50m

Intel's menu of page sizes is an artifact of its page table structure.

On x86 in 64-bit mode, page table entries are 64 bits each; the lowest level in the hierarchy (L1) is a 4K page containing 512 64-bit PTEs which in total map 2M of memory, which is not coincidentally the large page size.

The L1 page table pages are themselves found via a PTE in a L2 page table; one L2 page table page maps 512*2M = 1G of virtual address space, which is again, not coincidentally, the huge page size.
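
Spelled out (512 eight-byte entries fit in each 4K table page):

    512 L1 PTEs x 4 KiB = 2 MiB   (one L1 table's reach = the large page size)
    512 L2 PDEs x 2 MiB = 1 GiB   (one L2 table's reach = the huge page size)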

Large pages are mapped by a L2 PTE (sometimes called a PDE, "page directory entry") with a particular bit set indicating that the PTE points at the large page rather than a PTE page. The hardware page table walker just stops at that point.

And huge pages are similarly mapped by an L3 PTE with a bit set indicating that the L3 PTE is a huge page.

Shoehorning an intermediate size would complicate page table updates or walks or probably both.

Note that an OS can, of its own accord and independent of hardware, maintain allocations at a coarser granularity and sometimes get some savings out of this. For one historic example, the VAX had a tiny 512-byte page size; IIRC, BSD unix pretended it had a 1K page size and always updated PTEs in pairs.

arghwhat
0 replies
11h33m

Hmm? Pretending the page size is larger than it is would not yield the primary performance benefits of reduced TLB misses. Unless I am missing something, that seems more like a hack to save a tiny bit of kernel memory on a constrained system by having two PTEs backed by the same internal page structure.

Unless we can change the size of the smallest page entry on Intel, I doubt there is room to do anything interesting there. If we could do like ARM and just multiply all the page sizes by 4 you would avoid any “shoehorning”.

tedunangst
0 replies
22h44m

I've lost count of how many blog posts about poor performance ended with the punchline "so then we turned off page coalescing".

sweetjuly
0 replies
15h47m

It's a little weirder. At least one translation granule is required but it is up to the implementation to choose which one(s) they want. Many older Arm cores only support 4KB and 64KB but newer ones support all three.

The size of the translation granule determines the size of the block entries at each level. So 4K granules have super pages of 2MB and 1GB, 16KB granules have 32MB super pages, and 64K has 512MB super pages.

CalChris
0 replies
2h51m

Armv8-A also supports 4K pages: FEAT_TGran4K. So Apple did indeed make a choice to instead use 16K, FEAT_TGran16K. Microsoft uses 4K for AArch64 Windows.

HumblyTossed
1 replies
23h35m

How is this "additional background"? This was a post by Google regarding Android.

Kwpolska
0 replies
14h17m

That this isn't the only 4K→16K transition in recent history? Some programs that assumed 4K had to be fixed as part of that transition; this can provide insights into the work required for Android.

monocasa
24 replies
1d

I wonder how much help they had by asahi doing a lot of the kernel and ecosystem work anablibg 16k pages.

RISC-V being fixed to 4k pages seems to be a bit of an oversight as well.

ashkankiani
14 replies
1d

It's pretty cool that I can read "anablibg" and know that means "enabling." The brain is pretty neat. I wonder if LLMs would get it too. They probably would.

evilduck
7 replies
1d

Question I wrote:

I encountered the typo "anablibg" in the sentence "I wonder how much help they had by asahi doing a lot of the kernel and ecosystem work anablibg 16k pages." What did they actually mean?

GPT-4o and Sonnet 3.5 understood it perfectly. This isn't really a problem for the large models.

For local small models:

* Gemma2 9b did not get it and thought it meant "analyzing".

* Codestral (22b) did not it get it and thought it meant "allocating".

* Phi3 Mini failed spectacularly.

* Phi3 14b and Qwen2 did not get it and thought it was "annotating".

* Mistral-nemo thought it was a portmanteau "anabling" as a combination of "an" and "enabling". Partial credit for being close and some creativity?

* Llama3.1 got it perfectly.

jandrese
1 replies
1d

Seems like there is a bit of a roll of the dice there. The ones that got it right may have just been lucky.

HeatrayEnjoyer
0 replies
23h25m

Ran it a few times in new sessions, 0 failures so far.

Alifatisk
1 replies
21h22m

Is there any task Gemma is better at compared to others?

evilduck
0 replies
18h18m

Local LLM topics are a treadmill of “what’s best and what is preferred” changing basically weekly to monthly, it’s a rapidly evolving field, but right now I actually tend to gravitate to Gemma2 9b for coding assistance for Typescript work or general question and answer stuff. Its embedded knowledge and speed on the computers that I have (32GB M2 Max, 16GB M1 Air, 4080 gaming desktop) make for a good balance while also using the computer for other stuff, bigger models limit what else I can run simultaneously and are slower than my reading speed, smaller models have less utility and the speed increase is pointless if they’re dumb.

treyd
0 replies
19h40m

I wonder if they'd do better if there was the context that it's in a thread titled "Adding 16 kb page size to Android"? The "analyzing" interpretation is plausible if you don't know what 16k pages, kernels, Asahi, etc are.

slaymaker1907
0 replies
23h51m

I wonder how much of a test this is for the LLM vs whatever tokenizer/preprocessing they're doing.

Retr0id
0 replies
23h39m

fwiw I failed to figure it out as a human, I had to check the replies.

mrob
3 replies
1d

LLMs are at a great disadvantage here because they operate on tokens, not letters.

platelminto
2 replies
1d

I remember reading somewhere that LLMs are actually fantastic at reading heavily mistyped sentences! Mistyped to a level where humans actually struggle.

(I will update this comment if I find a source)

thanatropism
1 replies
1d

Tihs probably refers to comon mispelllings an typo's.

HeatrayEnjoyer
0 replies
21h35m

It's actually not. You can scramble every letter within words and it can mostly unscramble it. Keep the first letter and it recovers almost 100%.

mrbuttons454
0 replies
1d

Until I read your comment I didn't even notice...

im3w1l
0 replies
1d

I asked chatgpt and it did get it.

Personally, when I read the comment my brain kinda skipped over the word since it contained the part "lib" I assumed it was some obscure library that I didn't care about. It doesn't fit grammatically but I didn't give it enough thought to notice.

saagarjha
3 replies
23h43m

Probably very little, since the Android ecosystem is quite divorced from the Linux one.

nabla9
2 replies
23h19m

Android kernel is a mainstream Linux kernel, with additional drivers, and other functionality.

temac
0 replies
21h13m

The linux kernel already works perfectly fine with various base page sizes.

saagarjha
0 replies
17h22m

I am aware. This does not change what I said.

IshKebab
3 replies
1d

Probably wouldn't be too hard to add a 16 kB page size extension. But I think the Svnapot extension is their solution to this problem. If you're not familiar it lets you mark a set of pages as being part of a contiguously mapped 64 kB region. No idea how the performance characteristics vary. It relieves TLB pressure, but you still have to create 16 4kB page table entries.

monocasa
2 replies
1d

Svnapot is a poor solution to the problem.

On one hand it means that that each page table entry takes up half a cache line for the 16KB case, and two whole cache lines in the 64KB case. This really cuts down on the page walker hardware's ability to effectively prefetch TLB entries, leading to basically the same issues as this classic discussion about why tree based page tables are generally more effective than hash based page tables (shifted forward in time to today's gate counts). https://yarchive.net/comp/linux/page_tables.html This is why ARM shifted from a Svnapot like solution to the "translation granule queryable and partially selectable at runtime" solution.

Another issue is the fact that a big reason to switch to 16KB or even 64KB pages is to allow for more address range for VIPT caches. You want to allow high performance implementations to be able to look up the cache line while performing the TLB lookup in parallel, then compare the tag with the result of the TLB lookup. This means that practically only the untranslated bits of the address can be used by the set selection portion of the cache lookup. When you have 12 bits untranslated in an address, combined with 64-byte cachelines, that gives you 64 sets; multiply that by 8 ways and you get the 32KB L1 caches very common in systems with 4KB page sizes (sometimes with some heroic effort to throw a ton of transistors/power at the problem to make a 64KB cache by essentially duplicating large parts of the cache lookup hardware for that extra bit of address). What you really want is for the arch to be able to disallow 4KB pages like on apple silicon which is the main piece that allows their giant 128KB and 192KB L1 caches.
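
Working through the set-selection arithmetic above:

    4 KiB pages  -> 12 untranslated bits; with 64 B lines, 2^(12-6) = 64 sets
                    64 sets x 8 ways x 64 B = 32 KiB L1 (the common size)
    16 KiB pages -> 14 untranslated bits -> 256 sets
                    256 sets x 8 ways x 64 B = 128 KiB L1 (Apple-class sizes)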

wren6991
0 replies
17h34m

This means that practically only the untranslated bits of the address can be used by the set selection portion of the cache lookup

It's true that this makes things difficult, but Arm have been shipping D caches with way size > page size for decades. The problem you get is that virtual synonyms of the same physical cache block can become incoherent with one another. You solve this by extending your coherence protocol to cover the potential synonyms of each index in the set (so for example with 16 kB/way and 4 kB pages, there are four potential indices for each physical cache block, and you need to maintain their coherence). It has some cost and the cost scales with the ratio of way size : page size, so it's still desirable to stay under the limit, e.g. by just increasing the number of cache ways.

aseipp
0 replies
22h36m

What you really want is for the arch to be able to disallow 4KB pages like on apple silicon which is the main piece that allows their giant 128KB and 192KB L1 caches.

Minor nit but they allow 4k pages. Linux doesn't support 16k and 4k pages at the same time; macOS does but is just very particular about 4k pages being used for scenarios like Rosetta processes or virtual machines e.g. Parallels uses it for Windows-on-ARM, I think. Windows will probably never support non-4k pages I'd guess.

But otherwise, you're totally right. I wish RISC-V had gone with the configurable granule approach like ARM did. Major missed opportunity but maybe a fix will get ratified at some point...

wren6991
0 replies
17h32m

RV64 has some reserved encoding space in satp.mode so there's an obvious path to expanding the number of page table formats at a later time. Just requires everyone to agree on the direction (common issue with RISC-V).

For RV32 I think we are probably stuck with Sv32 4k pages forever.

CalChris
6 replies
17h11m

  iOS has had 16K pages since forever.
  OSX switched to 16K pages in 2020 with the M1.
  Windows is stuck on 4K pages, even for AArch64.
  Linux has various page sizes. Asahi is 16K.

nullindividual
5 replies
16h51m

Windows has 4K, 2M, and 1G page sizes on x86-64.

nullindividual
0 replies
13h43m

I was confused as to why you were posting incorrect information when this thread already contained the correct information.

Kwpolska
0 replies
14h13m

Microsoft wanted to make x86 compatibility as painless as possible. They adopted an ABI in which registers can be generally mapped 1:1 between the two architectures.

baby_souffle
1 replies
16h37m

and 1G page sizes on x86-64.

I wonder who requested the 1G page size be implemented and what they use it for...

Kwpolska
0 replies
14h16m

Another thread says virtual machines.

lxgr
3 replies
20h34m

Now I wonder: Does increased page size have any negative impacts on I/O performance or flash lifetime, e.g. for writebacks of dirty pages of memory-mapped files where only a small part was changed?

Or is the write granularity of modern managed flash devices (such as eMMCs as used in Android smartphones) much larger than either 4 or 16 kB anyway?

tadfisher
2 replies
20h0m

Flash controllers expose blocks of 512 or 4096 bytes, but the actual NAND chips operate in terms of "erase blocks" which range from 1MB to 8MB (or really anything); within a block, an individual bit can only be flipped from "1" to "0", and flipping any bit back to "1" requires erasing the entire block (resetting every bit to "1") and re-programming the desired bits back to "0" [0].

All of this is hidden from the host by the NAND controller, and SSDs employ many strategies (including DRAM caching, heterogeneous NAND dies, wear-leveling and garbage-collection algorithms) to avoid wearing the storage NAND. Effectively you must treat flash storage devices as block devices of their advertised block size because you have no idea where your data ends up physically on the device, so any host-side algorithm is fairly worthless.

[0]: https://spdk.io/doc/ssd_internals.html

lxgr
1 replies
19h54m

Writes on NAND happen at the block, not the page level, though. I believe the ratio between the two is usually something like 1:8 or so.

Even blocks might still be larger than 4KB, but if they’re not, presumably a NAND controller could allow such smaller writes to avoid write amplification?

The mapping between physical and logical block address is complex anyway because of wear leveling and bad block management, so I don’t think there’s a need for write granularity to be the erase block/page or even write block size.

to11mtm
0 replies
19h26m

Even blocks might still be larger than 4KB, but if they’re not, presumably a NAND controller could allow such smaller writes to avoid write amplification?

Look at what SandForce was doing a decade+ ago. They had hardware compression to lower write amp and some sort of 'battery backup' to ensure operations completed. Various bits of this sort of tech is in most decent drives now.

The mapping between physical and logical block address is complex anyway because of wear leveling and bad block management, so I don’t think there’s a need for write granularity to be the erase block/page or even write block size.

The controller needs to know what blocks can get a clean write vs what needs an erase; that's part of the trim/gc process they do in background.

Assuming you have sufficient space, it works kinda like this:

- Writes are done to 'free-free' area, i.e. parts of the flash it can treat like SLC for faster access and less wear. If you have less than 25%-ish of drive free this becomes a problem. Controller is tracking all of this state.

- When it's got nothing better to do for a bit, controller will work to determine which old blocks to 'rewrite' with data from the SLC-treated flash into 'longer lived' but whatever-Level-cell storage. I'm guessing (hoping?) there's a lot of fanciness going on there, i.e. frequently touched files take longer to get a full rewrite.

TBH sounds like a fun thing to research more

eyalitki
3 replies
23h8m

RHEL tried that in the past with 64KB pages on AArch64; it led to MANY bugs all across the software stack, and they eventually reverted it - https://news.ycombinator.com/item?id=27513209.

I'm impressed by the effort on Google's side, yet I'll be surprised if this effort pays off.

rincebrain
0 replies
21h14m

I didn't realize they had reverted it, I used to run RHEL builds on Pi systems to test for 64k page bugs because it's not like there's a POWER SBC I could buy for this.

nektro
0 replies
22h58m

Apple's M-series chips use a 16kb page size by default, so the state of things has improved significantly with software wanting to support Asahi and other related endeavors.

kcb
0 replies
22h54m

Nvidia is pushing 64KB pages on their Grace-Hopper system.

ein0p
3 replies
22h39m

No mention of Apple on the page. Apple has been using 16K pages for years now.

zahlman
2 replies
17h26m

Why would an "Android news" blog mention what competitors are doing?

ein0p
1 replies
17h15m

Because the whole thing sounds like they're doing something new, but they're just catching up to something Apple did back when they switched to aarch64.

deadlydose
0 replies
15h8m

A page right out of Apple's playbook then.

lostmsu
2 replies
1d

Not entirely related (except the block size), but I am considering making and standardizing a system-wide content-based cache with default block size 16KB.

The idea is that you'd have a system-wide (or not) service that can do two or three things:

- read 16KB block by its SHA256 (also return length that can be <16KB), if cached

- write a block to cache

- maybe pin a block (e.g. make it non-evictable)

It would be like block-level file content dedup + eviction to keep the size limited.

Should reduce storage used by various things due to dedup functionality, but may require internet for corresponding apps to work properly.

With a peer-to-peer sharing system on top, it may significantly reduce storage requirements.

The only disadvantage is the same as with shared website caches prior to cache isolation introduction: apps can poke what you have in your cache and deduce some information about you from it.
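
For concreteness, a hypothetical sketch of the service interface being described (every name here is invented; nothing like this exists today):

    /* Hypothetical content-addressed block cache API, ~16 KiB blocks. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    #define CACHE_BLOCK_SIZE 16384

    /* Look up a block by its SHA-256; returns the number of bytes copied into
       buf (may be < CACHE_BLOCK_SIZE for a file's final block), or -1 if the
       block is not cached. */
    ssize_t cache_read(const uint8_t sha256[32], void *buf);

    /* Insert a block; the service hashes the data itself so callers cannot
       poison the content-addressed namespace. */
    bool cache_write(const void *data, size_t len);

    /* Pin a block so the eviction policy skips it. */
    bool cache_pin(const uint8_t sha256[32]);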

treyd
0 replies
19h36m

I would go for higher than 16K. I believe BitTorrent's default minimum chunk size is 64K, for example. It really depends on the use case in question though, if you're doing random writes then larger chunk sizes quickly waste a ton of bandwidth, especially if you're doing recursive rewrites of a tree structure.

Would a variable chunk size be acceptable for whatever it is you're building?

monocasa
0 replies
23h56m

I'd probably pick a size greater than 16KB for that. Windows doesn't expose mappings at less than 64KB granularity in their version of mmap, and internally their file cache works in increments of 256KB. And these were numbers they picked back in the 90s.

daghamm
2 replies
22h43m

Can someone explain those numbers to me?

5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?

On the other hand 9% increase in memory usage also sounds huge. How did this affect memory usage that much?

scottlamb
1 replies
22h29m

5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?

It's pretty typical for large programs to spend 15+% of their "CPU time" waiting for the TLB. [1] So larger pages really help, including changing the base 4 KiB -> 16 KiB (4x reduction in TLB pressure) and using 2 MiB huge pages (512x reduction where it works out).

I've also wondered why the TLB isn't larger.

On the other hand 9% increase in memory usage also sounds huge. How did this affect memory usage that much?

This is the granularity at which physical memory is assigned, and there are a lot of reasons most of a page might be wasted:

* The heap allocator will typically cram many things together in a page, but it might, say, only use a given page for allocations in a certain size range, so not all allocations will snuggle in next to each other.

* Program stacks each use at least one distinct page of physical RAM because they're placed in distinct virtual address ranges with guard pages between. So if you have 1,024 threads, they use at least 4 MiB of RAM with 4 KiB pages, 16 MiB of RAM with 16 KiB pages.

* Anything from the filesystem that is cached in RAM ends up in the page cache, and true to the name, it has page granularity. So caching a 1-byte file would take 4 KiB before, 16 KiB after.

[1] If you have an Intel CPU, toplev is particularly nice for pointing this kind of thing out. https://github.com/andikleen/pmu-tools

95014_refugee
0 replies
20h23m

I've also wondered why the TLB isn't larger.

Fast CAMs are (relatively) expensive, is the excuse I always hear.

taeric
1 replies
22h1m

I see they have measured improvements in the performance of some things. In particular, the camera app starts faster. Small percentage, but still real.

Curious if there are any other changes you could do based on some of those learnings? The camera app, in particular, seems like a good one to optimize to start instantly. Especially so with the "double power key" shortcut that many phones/people have set up.

Specifically, I would expect you should be able to do something like the lisp norm of "dump image?" Startup should then largely be loading the image, not executing much if any initialization code? (Honestly, I mostly assume this already happens?)

saagarjha
0 replies
12h42m

A big part of the challenge for launching the camera app is getting the hardware ready and quickly freeing up RAM for image processing.

quotemstr
1 replies
21h12m

Good. It's about time. 4KB pages come down to us from 32-bit time immemorial. We didn't bump the page size when we doubled the sizes of pointers and longs for the 64-bit transition. 4KB has been way too small for ages, and I'm glad we're biting the minor compatibility bullet and adopting a page size more suited to modern computing.

jeffbee
0 replies
2h26m

Now do 512B LBAs on NVMe devices.

pflanze
0 replies
17m

I would expect that this increases the gap between new and old phones / makes old phones unusable more quickly: new phones will typically have enough RAM and can live with the 9% less efficient memory use, and will see the 5-10% speedup. Old phones are typically bottlenecked on RAM, which they will now hit 9% earlier, and reloading pages from disk (or swapping if enabled) has a much higher overhead than 5-10%.

iam-TJ
0 replies
10h52m

In the Debian kernel we've very recently enabled building an ARM64 kernel flavour with a 16KiB page size, and we've discussed adding a 64KiB flavour at some point, as is already the case for PowerPC64.

This will likely reveal bugs that need fixing in some of the 70,000+ packages in the Debian archive.

That ARM64 16KiB page size is interesting with respect to the Apple M1, where Asahi [0] identified that the DART IOMMU has a minimum page size of 16KiB, so using that page size as a minimum for everything is going to be more efficient.

[0] https://asahilinux.org/2021/10/progress-report-september-202...

dboreham
0 replies
22h27m

Time to grab some THP popcorn...