Nice work. Sometimes I wonder if there's any way to trade away accuracy for speed? Like, often I don't care _exactly_ how many bytes the biggest user of space takes up, I just want to see some orders of magnitude.
Maybe there could be an iterative breadth-first approach, where you first quickly identify and discard the small, unimportant items, passing over anything that can't be counted quickly. Then, with what's left, you identify and discard the smallest of those, and then the smallest of what remains, and repeat and repeat. Each pass gives you a higher-resolution picture of which directories and files are using the most space; you just wait until you have the level of detail you need, but you get to watch the tally develop across the board. Does this exist?
Wish modern filesystems maintained per-directory usage as an attribute on the directory instead of leaving tools to do this basic job.
This is an excellent point and I wholeheartedly agree!
Is it? That would require any update to any file to cascade into a bunch of directory updates, amplifying the write, and for what? Do you run “du” in your shell prompt?
Not to mention it would likely be unable to handle the hardlink problem so it would consistently be wrong.
You can be a little lazy about updating parents this way and get O(1) updates and O(1) amortized reads, with an O(n) worst case (same as now anyway).
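To make "a little lazy" concrete, here's a toy model (purely hypothetical, not how any real filesystem does it): each directory keeps a propagated total plus a pending delta; a write only touches the file's own directory, and deltas are pushed toward the root once they get big or when a total is read.

```python
from collections import defaultdict

THRESHOLD = 1 << 20  # push a pending delta upward once it exceeds 1 MiB

class LazyTree:
    """Toy model of lazily propagated directory sizes (hypothetical sketch)."""

    def __init__(self):
        self.total = defaultdict(int)   # propagated subtree sizes
        self.delta = defaultdict(int)   # changes not yet propagated
        self.parent = {}                # dir -> parent dir (None for root)

    def add_dir(self, d, parent=None):
        self.parent[d] = parent

    def file_changed(self, d, nbytes):
        # O(1) per write: record the change locally, occasionally push it up.
        self.delta[d] += nbytes
        if abs(self.delta[d]) >= THRESHOLD:
            self._push(d)

    def _push(self, d):
        # Move this directory's pending delta into its own total and into
        # every ancestor's total; each flushed byte climbs each level at
        # most once, which keeps the amortized cost low.
        while d is not None:
            pending, self.delta[d] = self.delta[d], 0
            self.total[d] += pending
            p = self.parent.get(d)
            if p is not None:
                self.delta[p] += pending
            d = p

    def usage(self, d):
        # Reads can miss deltas still pending in descendants (each bounded
        # by THRESHOLD); forcing an exact answer would mean flushing the
        # whole subtree, which is the O(n) worst case mentioned above.
        self._push(d)
        return self.total[d]
```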
CephFS does that.
You can use getfattr to ask it for the recursive number of entries or bytes in a given directory.
Querying it is constant time, and updates propagate with a few seconds of delay.
Extremely useful when you have billions of files on spinning disks, where running du/ncdu would take a month just for the stat()s.
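For reference, CephFS exposes these as virtual extended attributes, so any xattr API works, not just getfattr. A minimal Python sketch; the mount path is a placeholder, and ceph.dir.rbytes / ceph.dir.rentries are the attribute names I believe CephFS uses for recursive bytes and entries:

```python
import os

path = "/mnt/cephfs/some/dir"   # placeholder mount point
rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))      # recursive bytes
rentries = int(os.getxattr(path, "ceph.dir.rentries"))  # recursive entries
print(f"{rentries} entries, {rbytes} bytes under {path}")
```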
Thanks!
What you described is a neat idea, but it's not possible with any degree of accuracy AFAIK. To give you a picture of the problem, calculating the disk usage of a directory requires calling statx(2) on every file in that directory, summing up the reported sizes, and then recursing into every subdirectory and starting over. The problem with doing a partial search is that all the data is at the leaves of the tree, so you'll miss some potentially very large files.
Picture if your program only traversed the first, say, three levels of subdirectories to get a rough estimate. If there were a 1TB file one level further down, your program would miss it completely and get a very inaccurate estimate of the disk usage, so it wouldn't be useful at all for finding the biggest culprits. You have the same problem if you decide to stop counting after seeing N files, since file N+1 could be gigantic and you'd never know.
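To make that concrete, here's roughly what any du-like tool ends up doing, sketched in Python with os.scandir/lstat standing in for statx(2); it ignores hardlinks and errors for brevity:

```python
import os

def disk_usage(path):
    """Exact recursive usage: stat every entry, sum the allocated bytes,
    and recurse into every subdirectory. There is no way to skip the
    leaves without risking missing a huge file."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            total += st.st_blocks * 512          # allocated space, like du
            if entry.is_dir(follow_symlinks=False):
                total += disk_usage(entry.path)
    return total
```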
Yeah, maybe approximation is not really possible. But it still seems like, if you could do, say, up to 1000 stats per directory per pass, then running totals could be accumulated incrementally and reported along the way.
So after just a second or two, you might know with certainty that a bunch of small directories are small, and that a handful of others are at least as big as what has been counted so far. That could be all you need, or you could wait longer to see how the bigger directories play out.
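A hypothetical sketch of one such pass, with a per-directory stat budget; nothing here comes from an existing tool:

```python
import os
from collections import deque

def bounded_pass(dirs, budget_per_dir=1000):
    """One refinement pass: stat at most `budget_per_dir` files in each
    directory, returning the bytes counted so far (a lower bound) plus
    the files that were skipped, for a later pass to pick up."""
    counted_bytes = 0
    skipped = []
    queue = deque(dirs)
    while queue:
        d = queue.popleft()
        statted = 0
        with os.scandir(d) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                elif statted < budget_per_dir:
                    st = entry.stat(follow_symlinks=False)
                    counted_bytes += st.st_blocks * 512
                    statted += 1
                else:
                    skipped.append(entry.path)   # count it in a later pass
    return counted_bytes, skipped
```

A driver loop would print the running total after each pass and then stat the skipped files in batches until the picture is detailed enough for what you're after.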
You would still have to getdents() everything, but this way you might indeed save on stat() operations, which access information stored separately on disk; eliminating those would likely help uncached runs.
You could sample files in a directory or across directories to get an average file size and use the total number of files from getdents to estimate a total size. This does require you to know if a directory entry is a file or directory, which the d_type field gives you depending on the OS, file system and other factors. An average file size could also be obtained from statvfs().
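A rough sketch of the per-directory sampling variant (hypothetical; it leans on d_type being available so that DirEntry.is_file() doesn't itself need a stat):

```python
import os
import random

def estimate_dir_bytes(path, sample_size=32):
    """Estimate the total size of regular files directly in `path`:
    count files cheaply via d_type, stat only a random sample, and
    scale the sample mean by the file count."""
    with os.scandir(path) as entries:
        files = [e for e in entries if e.is_file(follow_symlinks=False)]
    if not files:
        return 0
    sample = random.sample(files, min(sample_size, len(files)))
    mean = sum(e.stat(follow_symlinks=False).st_size for e in sample) / len(sample)
    return int(mean * len(files))
```

For the statvfs() route, a filesystem-wide average file size could be approximated as (f_blocks - f_bfree) * f_frsize divided by (f_files - f_ffree), though that lumps directories and every other kind of inode in together.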
Another trick is based on the fact that the link count of a directory is 2 + the number of subdirectories. Once you have seen that many subdirectories, you know there are no more you need to descend into. This could let you abort a getdents() of a very large directory, using e.g. the directory size to estimate the total number of entries.
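A tiny sketch of how a tool might use that (hypothetical; the 2 + n convention holds on most Unix filesystems, but notably not on btrfs, where directories report a link count of 1; the reason for the convention is explained just below):

```python
import os

def subdir_count(path):
    """Subdirectory count from the link count alone, with no listing:
    st_nlink == 2 + number of subdirectories on conventional filesystems."""
    return os.stat(path).st_nlink - 2
```

While walking the directory you would decrement this count each time a subdirectory turns up and stop expecting more once it hits zero, which is what lets you cut a huge getdents() short.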
For anyone who doesn't know why this is: when you create a directory it starts with 2 hard links, its entry in the parent directory and its own "." entry. When you add a subdirectory, that subdirectory's ".." entry adds one more link. So each subdirectory adds one more to the original 2.
Something like that exists for btrfs; it's called btdu. It has the accuracy/time trade-off you're interested in, but the implementation is quite different. It samples random points on the disk and finds out which file path they belong to. The longer it runs, the more accurate it gets. The readme is good at explaining why this approach makes sense for btrfs and what its limitations are.
https://github.com/CyberShadow/btdu
Damn, `ext4` is organized entirely differently. You can't get anything useful from:
and recursing. That's a clever technique given btrfs structs.
That's so cool.
This seems difficult since I'm not aware of any way to get approximate file sizes, at least with the usual FS-agnostic system calls: to get any size info you are pretty much calling something in the `stat` family and at that point you have the exact size.
I thought files can be sparse and have holes in the middle where nothing is allocated, so the file size is not what's used to calculate usage; it's the sum of the extents or some such.
Yes, files can be sparse but the actual disk usage information is also returned by these stat-family calls, so there is no special cost to handling sparse files.
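For the curious, both numbers come back from the same call; a minimal illustration (the path is a placeholder):

```python
import os

st = os.stat("/path/to/some/file")   # placeholder path
apparent = st.st_size                # logical length, holes included
allocated = st.st_blocks * 512       # space actually allocated (POSIX
                                     # defines st_blocks in 512-byte units)
print(apparent, allocated)           # these diverge for sparse files
```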