Nice work. Sometimes I wonder if there's any way to trade away accuracy for speed? Like, often I don't care _exactly_ how many bytes the biggest user of space takes up, I just want to see some orders of magnitude.
Maybe there could be an iterative breadth-first approach, where you first quickly identify and discard the small, unimportant items, passing over anything that can't be counted quickly. Then, with what's left, you identify and discard the smallest of those, and then the smallest of what remains, and repeat and repeat. Each pass gives you a higher-resolution picture of which directories and files are using the most space; you just wait until you have the level of detail you need, but you get to watch the tally develop across the board. Does this exist?
Wish modern filesystems maintained per-directory usage as an attribute on the directory instead of leaving tools to do this basic job.
This is an excellent point and I wholeheartedly agree!
Is it? That would require any update to any file to cascade into a bunch of directory updates, amplifying the write, and for what? Do you run “du” in your shell prompt?
Not to mention it would likely be unable to handle the hardlink problem so it would consistently be wrong.
You can be a little lazy about updating parents this way and get O(1) updates and O(1) amortized reads, with an O(n) worst case (same as now anyway).
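To make "a little lazy" concrete, here's a toy model (purely hypothetical, not how any real filesystem does it): each directory keeps a propagated total plus a pending delta; a write only touches the file's own directory, and deltas are pushed toward the root once they get big or when a total is read.

```python
from collections import defaultdict

THRESHOLD = 1 << 20  # push a pending delta upward once it exceeds 1 MiB

class LazyTree:
    """Toy model of lazily propagated directory sizes (hypothetical sketch)."""

    def __init__(self):
        self.total = defaultdict(int)   # propagated subtree sizes
        self.delta = defaultdict(int)   # changes not yet propagated
        self.parent = {}                # dir -> parent dir (None for root)

    def add_dir(self, d, parent=None):
        self.parent[d] = parent

    def file_changed(self, d, nbytes):
        # O(1) per write: record the change locally, occasionally push it up.
        self.delta[d] += nbytes
        if abs(self.delta[d]) >= THRESHOLD:
            self._push(d)

    def _push(self, d):
        # Move this directory's pending delta into its own total and into
        # every ancestor's total; each flushed byte climbs each level at
        # most once, which keeps the amortized cost low.
        while d is not None:
            pending, self.delta[d] = self.delta[d], 0
            self.total[d] += pending
            p = self.parent.get(d)
            if p is not None:
                self.delta[p] += pending
            d = p

    def usage(self, d):
        # Reads can miss deltas still pending in descendants (each bounded
        # by THRESHOLD); forcing an exact answer would mean flushing the
        # whole subtree, which is the O(n) worst case mentioned above.
        self._push(d)
        return self.total[d]
```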
CephFS does that.
You can use getfattr to ask it for the recursive number of entries or bytes in a given directory.
Querying it is constant time, and updates propagate with a few seconds of delay.
Extremely useful when you have billions of files on spinning disks, where running du/ncdu would take a month just for the stat()s.
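For reference, CephFS exposes these as virtual extended attributes, so any xattr API works, not just getfattr. A minimal Python sketch; the mount path is a placeholder, and ceph.dir.rbytes / ceph.dir.rentries are the attribute names I believe CephFS uses for recursive bytes and entries:

```python
import os

path = "/mnt/cephfs/some/dir"   # placeholder mount point
rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))      # recursive bytes
rentries = int(os.getxattr(path, "ceph.dir.rentries"))  # recursive entries
print(f"{rentries} entries, {rbytes} bytes under {path}")
```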
Thanks!
What you described is a neat idea, but it's not possible with any degree of accuracy AFAIK. To give you a picture of the problem, calculating the disk usage of a directory requires calling statx(2) on every file in that directory, summing up the reported sizes, and then recursing into every subdirectory and starting over. The problem with doing a partial search is that all the data is at the leaves of the tree, so you'll miss some potentially very large files.
Picture if your program only traversed the first, say, three levels of subdirectories to get a rough estimate. If there were a 1TB file one level further down, your program would miss it completely and get a very inaccurate estimate of the disk usage, so it wouldn't be useful at all for finding the biggest culprits. You have the same problem if you decide to stop counting after seeing N files, since file N+1 could be gigantic and you'd never know.
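To make that concrete, here's roughly what any du-like tool ends up doing, sketched in Python with os.scandir/lstat standing in for statx(2); it ignores hardlinks and errors for brevity:

```python
import os

def disk_usage(path):
    """Exact recursive usage: stat every entry, sum the allocated bytes,
    and recurse into every subdirectory. There is no way to skip the
    leaves without risking missing a huge file."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            total += st.st_blocks * 512          # allocated space, like du
            if entry.is_dir(follow_symlinks=False):
                total += disk_usage(entry.path)
    return total
```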
Yeah, maybe approximation is not really possible. But it still seems like, if you could do, say, up to 1000 stats per directory per pass, then running totals could be accumulated incrementally and reported along the way.
So after just a second or two, you might know with certainty that a bunch of small directories are small, and that a handful of others are at least as big as what has been counted so far. That could be all you need, or you could wait longer to see how the bigger directories play out.
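A hypothetical sketch of one such pass, with a per-directory stat budget; nothing here comes from an existing tool:

```python
import os
from collections import deque

def bounded_pass(dirs, budget_per_dir=1000):
    """One refinement pass: stat at most `budget_per_dir` files in each
    directory, returning the bytes counted so far (a lower bound) plus
    the files that were skipped, for a later pass to pick up."""
    counted_bytes = 0
    skipped = []
    queue = deque(dirs)
    while queue:
        d = queue.popleft()
        statted = 0
        with os.scandir(d) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                elif statted < budget_per_dir:
                    st = entry.stat(follow_symlinks=False)
                    counted_bytes += st.st_blocks * 512
                    statted += 1
                else:
                    skipped.append(entry.path)   # count it in a later pass
    return counted_bytes, skipped
```

A driver loop would print the running total after each pass and then stat the skipped files in batches until the picture is detailed enough for what you're after.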
You would still have to getdents() everything, but this way you might indeed save on stat() operations, which access information stored separately on disk; eliminating those would likely help uncached runs.
You could sample files in a directory or across directories to get an average file size and use the total number of files from getdents to estimate a total size. This does require you to know if a directory entry is a file or directory, which the d_type field gives you depending on the OS, file system and other factors. An average file size could also be obtained from statvfs().
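A rough sketch of the per-directory sampling variant (hypothetical; it leans on d_type being available so that DirEntry.is_file() doesn't itself need a stat):

```python
import os
import random

def estimate_dir_bytes(path, sample_size=32):
    """Estimate the total size of regular files directly in `path`:
    count files cheaply via d_type, stat only a random sample, and
    scale the sample mean by the file count."""
    with os.scandir(path) as entries:
        files = [e for e in entries if e.is_file(follow_symlinks=False)]
    if not files:
        return 0
    sample = random.sample(files, min(sample_size, len(files)))
    mean = sum(e.stat(follow_symlinks=False).st_size for e in sample) / len(sample)
    return int(mean * len(files))
```

For the statvfs() route, a filesystem-wide average file size could be approximated as (f_blocks - f_bfree) * f_frsize divided by (f_files - f_ffree), though that lumps directories and every other kind of inode in together.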
Another trick is based on the fact that the link count of a directory is 2 + the number of subdirectories. Once you have seen that many subdirectories, you know there are no more you need to descend into. This could let you abort a getdents() of a very large directory, using e.g. the directory size to estimate the total number of entries.
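A tiny sketch of how a tool might use that (hypothetical; the 2 + n convention holds on most Unix filesystems, but notably not on btrfs, where directories report a link count of 1; the reason for the convention is explained just below):

```python
import os

def subdir_count(path):
    """Subdirectory count from the link count alone, with no listing:
    st_nlink == 2 + number of subdirectories on conventional filesystems."""
    return os.stat(path).st_nlink - 2
```

While walking the directory you would decrement this count each time a subdirectory turns up and stop expecting more once it hits zero, which is what lets you cut a huge getdents() short.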
For anyone who doesn't know why this is: when you create a directory it starts with 2 hard links, its entry in the parent directory and its own "." entry. When you add a subdirectory, that subdirectory's ".." entry adds one more link. So each subdirectory adds one more to the original 2.
Something like that exists for btrfs; it's called btdu. It has the accuracy/time trade-off you're interested in, but the implementation is quite different. It samples random points on the disk and finds out which file path they belong to. The longer it runs, the more accurate it gets. The readme is good at explaining why this approach makes sense for btrfs and what its limitations are.
https://github.com/CyberShadow/btdu
Damn, `ext4` is organized entirely differently. You can't get anything useful from:
and recursing. That's a clever technique given btrfs structs.
That's so cool.
This seems difficult since I'm not aware of any way to get approximate file sizes, at least with the usual FS-agnostic system calls: to get any size info you are pretty much calling something in the `stat` family and at that point you have the exact size.
I thought files can be sparse and have holes in the middle where nothing is allocated, so the file size is not what's used to calculate usage; it's the sum of the extents or some such.
Yes, files can be sparse but the actual disk usage information is also returned by these stat-family calls, so there is no special cost to handling sparse files.
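For the curious, both numbers come back from the same call; a minimal illustration (the path is a placeholder):

```python
import os

st = os.stat("/path/to/some/file")   # placeholder path
apparent = st.st_size                # logical length, holes included
allocated = st.st_blocks * 512       # space actually allocated (POSIX
                                     # defines st_blocks in 512-byte units)
print(apparent, allocated)           # these diverge for sparse files
```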