Difftastic, a structural diff tool that understands syntax

mlavrent
15 replies
4h2m

I’m almost not sure why tools like git don’t ship with this as default. Been using difft for about a year now, and my main complaint is that it makes it hard to go back and use other diff tools when I don’t have difft available :).

I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same). It seems like an intractable problem in general, but maybe it’s doable and/or useful for smaller DSLs or subsets of some languages?

ruined
8 replies
3h45m

I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same).

if you do this your difftool becomes a compiler

mlavrent
3 replies
3h29m

Sorry, I should've been clearer. I'm interested if there's any tool that does this kind of thing statically, without running the code. I guess a simple approach is to compile both programs and see if the generated code is the same, but I'd guess reasoning at the generated-code level will probably produce a lot more false positives (i.e. tool will report a change when there isn't one) than if you reason about the original program.

jerf
1 replies
1h54m

This gets really hard, really fast. That is, yes, reasonably obviously doing this completely 100% accurately requires a solution to the halting problem, but even getting to "useful" is really really hard. Even the Haskell world doesn't try to solve the "equivalence of functions" problem, and it's even more complicated in imperative languages.

You probably have a mental image of catching something really simple, and, yeah, "1 + 1" -> "2" is reasonably easy, but in reality there aren't a lot of those super easy changes. Most of the time there is something confounding the situation.

Truly neutral refactorings are pretty uncommon in their own right. You can see that when someone is discussing semantic versioning and pointing out that if you define a "major version" as "there exists at least one possible use of the code whose behavior will be changed as a result of this library change", almost any API change is automatically a major version change, which isn't really what anyone wants. E.g., in Python, the mere fact that introspecting on an object's methods will show one more method than it used to isn't really what we want a major version change for. In general, proving refactorings are actually 100% safe is equally difficult; even simple arithmetic changes can result in things overflowing at different times or in different ways, it's virtually impossible to rewrite an expression involving floats without the change being witnessable somehow, extracting a function could make it so that code that previously didn't overflow the stack now does, memory allocation changes can be the difference between OOMing and not and may interact with GC in unpredictable ways if you get really precise, etc.
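To make the float point concrete, here is a minimal sketch in Python (chosen only for illustration; the values are the usual textbook ones). Regrouping a sum looks like a neutral refactoring, but the result is observably different:

```python
# Floating-point addition is not associative, so "just regrouping"
# an expression can change the value a caller sees.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # original expression
right = a + (b + c)   # "equivalent" refactoring

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

A semantic diff tool that treated these two expressions as identical would hide a change that a test comparing exact values could witness.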

brabel
0 replies
55m

The Unison language (https://www.unison-lang.org/) knows how to compute whether the semantic meaning of the code has changed (though I don't think it's possible to get the actual diff to visualize it).

You can edit a function you've committed into the Unison code repo, and if you didn't change the semantics of the function, it's actually stored under the exact same hash... All places using the function refer to it by its hash, so nothing needs to be recompiled either, and no tests need to be rerun.

Things like renaming variables, reordering code whose order doesn't matter (common in functional programming) and things like that do NOT change the hash.
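As a rough illustration of why renaming need not change a content hash, here is a toy sketch using Python's `ast` module. This is not Unison's actual algorithm; `structural_hash` and `_Canonicalize` are hypothetical names, and the sketch naively treats every identifier as renameable, which would be wrong for globals in real code:

```python
import ast
import hashlib


class _Canonicalize(ast.NodeTransformer):
    """Rename parameters and variables to v0, v1, ... in order of first use."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        if name not in self.names:
            self.names[name] = f"v{len(self.names)}"
        return self.names[name]

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node


def structural_hash(source: str) -> str:
    """Hash the syntax tree after canonicalizing names (toy example)."""
    tree = _Canonicalize().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


a = "def area(w, h):\n    return w * h\n"
b = "def area(width, height):\n    return width * height\n"
c = "def area(w, h):\n    return w + h\n"

print(structural_hash(a) == structural_hash(b))  # True: only names differ
print(structural_hash(a) == structural_hash(c))  # False: the operator changed
```

Anything beyond renaming (reordering independent definitions, for example) would need more normalization than this sketch does.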

I believe this is only possible because Unison is a Pure Functional Language. If it's not, it becomes an NP problem to decide if two programs are exactly equivalent, probably.

I wonder if Unison could provide the actual semantic diff you're thinking of, it's probably not much more complex than actually knowing the meaning of the code did change. Maybe create a Feature Request :) https://github.com/unisonweb/unison

hobs
2 replies
3h42m

That's exactly what I have done with diffing SQL in lazy mode - just use a server and diff the AST/plan.

slotrans
1 replies
3h28m

Two semantically equivalent SQL statements can plan differently...
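A small sketch of this with SQLite, via Python's built-in `sqlite3` module. The table `t`, the index `idx_t_a`, and the two queries are made up for illustration, and the exact plan text depends on the SQLite version, so the sketch only compares the plans rather than printing an expected one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.execute("CREATE INDEX idx_t_a ON t (a)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(100)]
)

# For integer values of a these two predicates select the same rows,
# but the "+ 0" keeps SQLite from using the index on a.
q1 = "SELECT b FROM t WHERE a = 1"
q2 = "SELECT b FROM t WHERE a + 0 = 1"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


same_rows = conn.execute(q1).fetchall() == conn.execute(q2).fetchall()
same_plan = plan(q1) == plan(q2)

print(same_rows)  # True
print(same_plan)  # False
```

So a plan-level diff would flag a change here even though the result set is identical.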

rrrrrrrrrrrryan
0 replies
2h32m

The exact same SQL statement can plan differently if table statistics change.

Chris_Newton
0 replies
3h24m

if you do this your difftool becomes a compiler

Some linters and formatters are effectively compilers already, so that doesn’t seem completely implausible in itself. Finding canonical representations of common coding patterns so you can quickly and reliably determine that they are equivalent is a different question, though.

rob74
2 replies
2h50m

I am curious if there’s been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

So when using such a diff tool you can spend hours refactoring something, and then git will refuse to commit your changes because your refactoring was successful in not changing the behavior of the code? I understand what you mean, but if we arrive at that point maybe we should stop calling it "diff", to avoid confusion...

kstrauser
1 replies
2h34m

Git doesn't use the output of `diff` to determine whether anything has changed.
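For anyone curious, Git identifies file contents by a hash of the bytes, so any byte-level change is a change regardless of what a diff tool would print. A minimal sketch of the blob ID computation, assuming the default SHA-1 object format (`git_blob_id` is a hypothetical helper name):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    # A blob is hashed as: "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a  (same as `git hash-object`)

# Two files that mean the same thing still get different IDs:
print(git_blob_id(b"x = 1 + 1\n") == git_blob_id(b"x = 2\n"))  # False
```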

samatman
0 replies
38m

True, although not widely known it would seem.

It does use diff to generate patches, however. I know in today's GitHub-dominated landscape, that's considered a bit of a dusty feature, but it would be a pity to break it.

otherjason
1 replies
3h29m

Difftastic is a useful tool, but in my experience, it's far too slow to be suitable as the default selection for a ubiquitous tool like git.

drcongo
0 replies
22m

I'm finding it instantaneous here on a large dirty codebase. In what way is it slow for you?

kstrauser
0 replies
3h15m

I think shipping good ol' diff as the default makes sense. It's going to be there already on any system you might want to run git on, it's fast, it's tiny, and everyone knows the basics of how to use it.

But I'm glad it's easy to change that default.

kstrauser
8 replies
3h8m

For those who don't already know, this is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/) which does for parsing what LSP does for analysis. That is, it provides a standard interface for turning code into an AST and then making that AST available to clients like editors and diff tools. Instead of a neat tool like this having to support dozens of languages, it can just support tree-sitter and automatically work with anything that tree-sitter supports. And if you're developing a new language, you can create a tree-sitter parser for it, and now every tool that speaks tree-sitter knows how to support your language.

Those 2 massive innovations are leading to an explosion of tooling improvements like this. Now every editor, diff tool, or whatever can support dozens or hundreds of languages without having to duplicate all the work of every other similar tool. That's freaking amazing.

bfrog
2 replies
1h42m

While I agree tree-sitter is an amazing tool, I found that writing the grammar out can be incredibly difficult. I tried writing a grammar and highlighting query set for VHDL with tree-sitter, and found that there were a lot of difficulties in expressing VHDL grammar in tree-sitter.

kstrauser
1 replies
1h40m

No argument from me on that. The upside is that one person, somewhere, has to get it right one time and then we can all use it.

grub5000
0 replies
45m

Seems like something LLMs should be useful for, if not now then soon enough

ievans
1 replies
1h41m

Absolutely agreed, and copying from a comment I wrote last year: I think the fact that tree-sitter is dependency-free is worth highlighting. For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).

Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in Python, if you're writing a parser using the builtin `ast` module, you need to match the language version & features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.
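A quick sketch of the version coupling, shown in the reverse direction on the assumption that you are running Python 3: the builtin `ast` module only understands the grammar of the interpreter it ships with, so a Python 2 `print` statement fails to parse.

```python
import ast

# Valid Python 2, but not valid in the Python 3 grammar.
py2_source = 'print "hello"\n'

try:
    ast.parse(py2_source)
    parsed = True
except SyntaxError:
    parsed = False

print(parsed)  # False when run on Python 3
```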

Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated. Highly recommend Max Brunsfeld's Strange Loop talk if you want to learn more about the design choices behind tree-sitter: https://www.youtube.com/watch?v=Jes3bD6P0To

And this has resulted in a bunch of new tools built on tree-sitter. Off the top of my head, in addition to difftastic: neovim, Zed, Semgrep, and GitHub code search!

drcongo
0 replies
44m

Don't forget Zed! https://zed.dev

epistasis
1 replies
1h33m

I'm imagining what I could have done in my compilers class with something like tree-sitter...

It feels kind of as foundational as YACC.

ivanjermakov
0 replies
1h28m

It is literally an alternative to YACC and other parser generators.

duped
0 replies
2m

I don't believe this is correct - there's no such thing as "speaking tree-sitter." Every tree-sitter parser emits a different concrete syntax tree, not a standard abstract syntax tree.

bloopernova
8 replies
4h7m

Related, updating difftastic and friends if you installed via cargo:

  cargo install cargo-update
  cargo install-update --list
  cargo install-update --all

Other fun Rust projects available via cargo:

https://mise.jdx.dev/ mise-en-place, a drop-in replacement for asdf https://asdf-vm.com/ that is really fast and flexible.

https://github.com/ajeetdsouza/zoxide is a fantastic cd replacement, which stores where you cd to, and you can then do a partial match like "z hel" might take you to "~/projects/helloworld".

https://github.com/bootandy/dust is a complement to "du", shows which directories are using the most disk space.

IshKebab
2 replies
3h34m

ncdu is the best du replacement by far.

polygamous_bat
1 replies
1h46m

I've always used dust as a replacement, and so I am curious to know if you have tried both tools: do you have thoughts on what makes ncdu better?

IshKebab
0 replies
1h15m

Dust is probably the best you can get without interactivity, so it's good for logs.

But ncdu is a fully interactive file browser that lets you navigate through the tree, and crucially it lets you delete things without requiring a full rescan. It's amazing for freeing up disk space by deleting things you don't need anymore, which is probably 95% of the reasons I run `du`.

qmmmur
1 replies
1h51m

Wow, I installed mise-en-place. It's exactly what I wanted asdf to be.

bloopernova
0 replies
29m

It's so much faster than asdf, the dev did a really great job.

kstrauser
1 replies
3h36m

I love zoxide! Also for your list: lsd, a prettier ls.

bloopernova
0 replies
2h51m

so... many... colours!

Looks great, thank you for the recommendation.

hrdwdmrbl
2 replies
3h45m

It seems like a major lapse in product innovation that GitHub has not come out with something like this. They don't even have something to help you when the indentation changes, they usually just show it as a giant add & remove. Their diff viewer can and should be smarter.

sroussey
1 replies
3h1m

GitHub has the option to ignore whitespace in a diff.

mbork_pl
0 replies
1h8m

Which is useful, but too crude.

sanxchit
1 replies
1h48m

What an amazing tool, wish it had a GUI version as well.

layer8
0 replies
32m

From the screenshot examples in the readme, I’m not sure how substantial the benefits are over GUI tools like Kdiff3 or WinMerge that have existed for ages.

sanity
1 replies
4h5m

Interesting, I found Semantic Merge [1] years ago but it was never open source.

This just does diff but not merge, but at least it's open source - and the diffs look a lot nicer, I've already made it my default.

Any plans to extend it to merging?

[1] https://docs.plasticscm.com/semanticmerge

rideontime
0 replies
2h28m

Was going to suggest this myself, this was a godsend when I was working with a big team on a C# project going through a messy refactor.

adamtaylor_13
1 replies
3h8m

Does anyone know how to enable this for .html.erb files? I found it doesn't work properly in Ruby .erb files, which makes it fall back to just regular ol' diff behavior.

coldbrewed
0 replies
2h34m

That may require a tree-sitter implementation for erb templated html; it may exist but if so it's less of a mainstream thing.

Some quick googling turns up https://github.com/tree-sitter/tree-sitter-embedded-template which may or may not meet your needs.

adamc
1 replies
2h50m

Doesn't seem to have a Debian install.

Night_Thastus
1 replies
2h41m

No MSYS install, sadly. :(

quasarj
0 replies
22m

It's just a cargo package. Is there a working rust/cargo toolchain under MSYS?

xyzelement
0 replies
1h3m

I don't write enough code / write it professionally anymore to integrate it into my life BUT MAN this is a great idea.

In general, we're overflowing in TMI which makes it hard to suss out what matters. For example at work I often read docs that describe what we do for customer X vs customer Y and it takes a ton of work to suss out the 1% of text that is different between those two, which is really what you want to understand and validate.

So anything that makes just the impactful change stand out is beyond welcome.

pmayrgundter
0 replies
3h58m

"Do you know how to read @@ -5,6 +5,7 @@ syntax? Difftastic shows the actual line numbers from your files, both before and after."

Preach!

Just dropped it in and did a git diff.. works like a charm!
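For anyone who doesn't read that syntax: `@@ -5,6 +5,7 @@` means the hunk covers 6 lines starting at line 5 of the old file and 7 lines starting at line 5 of the new file. A small sketch with Python's `difflib` reproduces that exact header (the file names and contents are made up):

```python
import difflib

# Ten lines in the old file; the new file gains one line after line 7.
old = [f"line {i}\n" for i in range(1, 11)]
new = old[:7] + ["inserted line\n"] + old[7:]

diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))

# diff[0] and diff[1] are the ---/+++ file headers; diff[2] is the hunk header.
hunk_header = diff[2].rstrip("\n")
print(hunk_header)  # @@ -5,6 +5,7 @@
```

With the default three lines of context, the hunk spans old lines 5 to 10 (6 lines) and new lines 5 to 11 (7 lines).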

pjturpeau
0 replies
2h14m

It seems to be a great tool, however on the few checks I did on big XML files, it shows modified lines in normal green and modified attributes in bold green, which makes them difficult to detect visually.

I didn't find in the documentation how it is possible to change the style of the diff, or to ask for another color in the bold case.

Any idea?

nibab
0 replies
2h6m

This is great! I wish my PR review tools allowed me to plug in something like this. Hopefully one day we will go back to the world of customizable/plugin-based software. Most of my web tools are very prescriptive about the user experience and don't let me tailor my tools.

mnw21cam
0 replies
2h39m

No package for Debian-like systems yet.

markrages
0 replies
29m

Does the output work with patch(1)? Or does this use a different patch?

keybored
0 replies
1h28m

I think I use this indirectly through the git-delta pager which is a great pager replacement for git.

jmholla
0 replies
57m

I tried switching to this, but I found it noisy, and it uses weird formatting for things that didn't change. I went back to using icdiff[0].

[0]: https://github.com/jeffkaufman/icdiff

drcongo
0 replies
26m

I love this so much. I hate reading cli diffs, but this is instantly understandable.

blackfawn
0 replies
2h10m

Difftastic seems really nice! Unfortunately it shows some changed binary files which makes it sort of unusable. `file` reports these files as "ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped" and the MIME type/encoding is "application/x-sharedlib; charset=binary" so not sure why difftastic is trying to show them as thousands of changed lines of text...

aus10d
0 replies
2h25m

Really cool idea!

akkartik
0 replies
1h14m

Is there a way to make the output more familiar to diff users? I've turned on --inline. I also mostly don't care enough about line numbers to want them on every line, so prefer the '<' and '>' leaders.

Also, on Arch there doesn't seem to be a man page.

airstrike
0 replies
2h16m

Fantastic tool. Now we just need the vscode extension ;-)

abledon
0 replies
1h30m

Only found out about this because it was an option to view diffs when installing git using Nix