You can also convert the data to JSON using sqlite_to_json
This approach works by storing the actual SQLite binary files in Git and then using a custom "diff" configuration to dump each file as SQL and compare the result.
It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.
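For anyone who wants to try it without reading the article, the setup looks roughly like this (the `*.sqlite3` pattern and the driver name are just examples; adjust to taste):

```
# .gitattributes (tracked in the repo)
*.sqlite3 diff=sqlite3

# per-clone config (git deliberately won't run arbitrary commands from tracked files)
git config diff.sqlite3.textconv "echo .dump | sqlite3"
```

Git appends the file name to the textconv command, so each side of the diff ends up being the SQL dump rather than the raw binary.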
I built my own tooling to solve this problem: https://datasette.io/tools/sqlite-diffable - which outputs a “diffable” copy of the data in a SQLite database, precisely so you can store it in Git and look at the differences later.
I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).
Here’s an example diff: https://github.com/simonw/simonwillisonblog-backup/commit/72...
I don't think there will be any huge difference between a text dump and binary formats here; git doesn't really care about the underlying data when it stores and packs it.
Maybe a text format that is sorted before being saved would compress a lot better though, both with zlib compression and git's delta compression. You can't really sort a binary file and put it back together!
I'm more wondering about whether or not this SQLite DB is actually a "source file", that is, something that is not generated by something else, which is normally the base material for both make and Git.
Obviously an SQLite DB cannot be created ex nihilo, so it is not a source file. Isn't it a bit dangerous to store the database file in the repo, and not the means to generate it? If both are included in the repo, it then seems redundant (reproducible builds concerns, maybe?).
Seeing that many people are interested in this topic, I wonder in which cases they would however want to track that file in Git.
> I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-
I'm curious, what is the reason you chose not to use pgdump, but instead opted to convert to sqlite and then dump the DB using sqlite-diffable?
On a project I'm working on, I'd like to dump our Postgres schema into individual files for each object (i.e., one file for each table, function, stored proc, etc.), but haven't spent enough time to see if pgdump could actually do that. We're just outputting files by object type for now (one file each for tables, functions, and stored procs). It looks like sqlite-diffable does something similar to what we're looking for (separate files for each object).
pgdump can do that if you go with the directory format output. According to the docs, it will "create a directory with one file for each table and large object being dumped, plus a so-called Table of Contents file describing the dumped objects in a machine-readable format that pg_restore can read."
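A minimal sketch of that invocation (the database and directory names are placeholders):

```
# one file per table/large object plus a machine-readable TOC,
# restorable with pg_restore
pg_dump --format=directory --file=dump_dir mydatabase
```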
Mainly it's because all of my other projects are SQLite and this felt like a good opportunity to try out some of my own tooling against the main PostgreSQL database I still use for an active project.
Does this custom diff driver also help git do a merge of two dbs?
> It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.
Why is that? I would have expected git's delta compression to work well on sqlite files, since small table changes usually cause only a small number of page changes.
I would think so too. As a quick test I added chinook.db to a fresh repository, committed it, added a row to a table, committed again, and then did a git gc.
The resulting pack file, with two revisions of the database, is 329k. The raw database is 864k and gzips to 328k.
So it does look like git's delta encoding works with sqlite's blocks.
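For anyone who wants to reproduce it, the quick test was roughly along these lines (the table and column names depend on which Chinook build you grab, so treat them as placeholders):

```
git init sqlite-pack-test && cd sqlite-pack-test
cp /path/to/chinook.db .
git add chinook.db && git commit -m "initial import"
sqlite3 chinook.db "INSERT INTO artists (Name) VALUES ('Test Artist');"
git add chinook.db && git commit -m "add one row"
git gc
du -h .git/objects/pack/*.pack   # compare against the raw file and a gzip of it
```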
Is a similar approach possible for Postgres and MySQL?
Not for this technique, because this relies on the fact that SQLite databases are a single file that can be checked into Git.
MySQL and PostgreSQL use a whole directory full of files. You could try storing that whole thing in Git and then writing a custom diff command that loads those directories into a temporary database server and dumps out SQL for comparison, but it would be very slow and brittle, if you could even get it to work at all.
Instead, a better strategy would be to dump your MySQL or PostgreSQL database to plain SQL and store that in your Git repo. Or use the trick I'm using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171
> Or use the trick I'm using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171
I wonder how this approach handles converting decimal fields, considering SQLite does not have a decimal field type and only has floats (i.e., you could potentially lose some precision during the conversion)?
Probably! I don't deal with decimal fields in my projects so I've not explored that yet.
Is the dump order-preserving? It technically doesn't have to be....
does no one know about _A tables anymore? This has been a solved thing since the 1970s
Can you elaborate? I've been coding since the 90s and have never heard the term _A table before.
sometimes called audit or journal tables.
Every time something is updated or deleted, the entire previous record is inserted into its corresponding _A table with who did it and when (and optionally for what transaction number)
So a delete from foo results in an insert into foo_A before the delete occurs, and an update to bar results in an insert into bar_A before the update occurs. Rolling back a delete means picking the old values out of the _A table; rolling back an update means picking the old values out of the _A table and inserting the current values into the _A table before updating the main table. A sketch of this with triggers is below.
I guess there are plenty of technical solutions to track changes, like SCD or snapshots or audit tables. What I find interesting about the git approach is that it's much more user-friendly than SQL archaeology when you want to understand what changed.
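A rough sketch of what that looks like with SQLite triggers (the table, columns and 'app_user' are made up for illustration; how you capture "who" depends on your system):

```
CREATE TABLE foo (id INTEGER PRIMARY KEY, value TEXT);

CREATE TABLE foo_A (
    id         INTEGER,
    value      TEXT,
    changed_by TEXT,
    changed_at TEXT
);

-- copy the old row into the audit table before it is changed or removed
CREATE TRIGGER foo_audit_update BEFORE UPDATE ON foo BEGIN
    INSERT INTO foo_A VALUES (OLD.id, OLD.value, 'app_user', datetime('now'));
END;

CREATE TRIGGER foo_audit_delete BEFORE DELETE ON foo BEGIN
    INSERT INTO foo_A VALUES (OLD.id, OLD.value, 'app_user', datetime('now'));
END;
```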
This is a super neat trick! At work, we're scaling git to terabyte scale and I'm constantly on the search for new ways we can showcase diffs for different file types:
It would be nice if we finally could have a database that we could use git on to track changes.
Dolt looks very promising in this area but only works with MySQL, would be nice to have a SQLite solution.
The tricky thing about keeping files in git that don't diff well (but can be converted into a representation that does) for me has always been lack of support for filter or hooks in the GitHub web interface. This is a frequent problem for me at work with big xml files.
Oh wow. This is indeed a very neat use of git config!
Although just storing the text output would probably be nicer? You could probably achieve something similar with some git hooks?
I wonder if Fossil (SCM used to develop SQLite and written by SQLite devs) has a similar feature built-in
Repeating myself from several days ago (https://news.ycombinator.com/item?id=37958006):
4 years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical json files, which are then managed in a normal git repo. Each record is serialized to json; "root" entities in the hierarchy are serialized so as to contain all their child entities. This is done as the user works - the database is updated and then the correct json file is read, updated and written back. The current SHA is always updated in the database, so there's a way to recognize that users worked outside our app with git on the repo; we can detect that and automatically update the database based on the diff. There's also an opposite sync.
This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".
In fact users now complain that this duality of database <-> repo is confusing and we should ditch the database. Our users are unfortunately not very technical and sometimes require a lot of hand holding.
In theory they're correct but in practice this would mean ditching our ORM and rewriting one from scratch that will work on top of the json files. While keeping the same API outwards. Which is potentially a lot of work. Maybe that will happen, we haven't decided yet.
I wrote much more about it here:
From time to time I'm searching for a database engine for small databases that stores the data as diffable text, just provides up to date indexing and sql on top of that to make the operations convenient and fast.
Didn't find any. I keep encountering BerkeleyDB but I don't know how good it is for that purpose.
HSQLDB does that, but 1. ew, Java, and 2. there are reasons people don't do that.
The major problem with text formats is that it means either you have to rewrite the whole file on every change ... or else you have to append arbitrarily and be able to survive that (if you're committing to git you probably are capable of dealing with the storage, at least).
If the former is feasible for you (because incremental updates are rare), just use your favorite database's load/dump around every session, and do some fsync testing (or just keep full backups).
If the latter compromise is acceptable there might be something you can do with streaming logical replication on some DBs, but again there's a reason people generally don't.
You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils
For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)
For what use-cases do you need this?
Have you considered Fossil (link below)? It is an SCM tool built on top of sqlite.
I have not used it myself, but I would assume it stores file revisions as diffs, and since its all just sqlite you should be able to run arbitrary queries on it.
My usecase is silly. Basically small to medium database-backed CMS websites. I just want to evolve them as the client already is using them to serve and store content. Also have an up to date complete backup of them in git.
I could do it easily if I just kept data in text files instead of sqlite, but I really like the speed of the database queries.
Why not set the text files as the SSOT and have a hook that populates the database when the files change?
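A minimal sketch of that hook, assuming the SQL dump is the tracked file (the file names here are made up):

```
#!/bin/sh
# .git/hooks/post-merge (could also be wired up as post-checkout)
# rebuild the working database from the tracked SQL dump
rm -f site.sqlite3
sqlite3 site.sqlite3 < site.sql
```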
Maybe the other way around? Use database triggers on inserts an updates to store the data in files as well? And create a procedure to update the database from files after git pull?
Sharing a personal project to display git based CSV :D
https://app.data-drift.io/41231518/samox/local-datadrift-rep...
Not sure how it could be plugged on this stuff, but sharing anyway :D
When you hijack someone else’s thread with something unrelated, I find it impossible to trust you enough to get excited about what you did. Smiling faces aside, this is kind of disrespectful to the author.
Whereas with a show HN, I would focus on you and do my best to find something positive to say.
I felt it was related (database + git is quite a narrow scope), and not concurrent to the post (I did not comment on the DoltgreSQL post for instance). But, I admit I was looking for some personal gratification. Sorry it bothered you.
Because it’s fun, here’s an alternative spelling of `echo .dump | sqlite3`:
<<<.dump sqlite3
This avoids invoking echo unnecessarily. Every time you do it you might just save a picowatt-hour or two! (It shouldn’t be more expensive. But be sure to spend a few watt-hours benchmarking it rigorously.)

This uses two things that shell users are commonly unfamiliar with:
• <<<: as `> filename` redirects stdout to the named file and `< filename` pipes the named file into stdin, `<<< word` pipes the word into stdin. (The whitespace is optional, and word follows normal argument splitting so you can do things like `<<<"Hello, world!"`.)
• Ordering: although most commonly written at the end of commands, redirections can be placed at the start as well. So if you do things like `2>/dev/null >&2` to suppress all output¹, that bit can go at the start or the end. I’ve become increasingly partial to putting redirections at the start of the line, especially in cases where I’m appending to a file for a while, because a leading `>>filename` supports visual alignment better than a trailing one.²
And since we’re talking about different ways: as given, this is feeding the SQL to sqlite3 via stdin, but you can also pass it on the command line, after the filename. But things like this in Git (aliases, diff textconv, probably more) work by taking the string and appending the filename to the end, so you need a way of reordering the arguments. The solution is an immediately-invoked function:
f() { sqlite3 "$@" .dump; }; f
So when you diff mydb.sqlite, it runs `sh -c 'f() { sqlite3 "$@" .dump; }; f mydb.sqlite'` or equivalent, which winds up executing `sqlite3 mydb.sqlite .dump`, as desired.

I use this technique a number of times in my Git aliases, saving the bother of putting them in separate shell scripts somewhere where path management is a bother, at the cost of maintaining a one-liner with sometimes too many semicolons.³
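For the SQLite case at hand, wiring that up as the diff driver could look something like this (the driver name is arbitrary):

```
git config diff.sqlite3.textconv 'f() { sqlite3 "$@" .dump; }; f'
```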
—⁂—
¹ “Take stderr (2) and redirect it (>) to /dev/null, then take stdout (default/implicit, could also write 1 explicitly) and redirect it (>) to stderr (&2).” There are plenty of other ways of writing this!
² Lists can be a better solution for this specific case, allowing you to redirect to the file only once for a whole bunch of commands:
{
<<<"Line one"
some-command
<<<"End of $thing"
} > filename
³ My longest is thirteen lines, though half of them barely count as lines. The line from my ~/.config/git/config, within [alias]:

# Revise into the commit that last changed File
rf = "!f() { if [ $# -eq 0 ]; then REV=\"$(git status --porcelain --untracked-files=no | sed '/^ /d;s/^.. //' | xargs -n1 git rev-list -1 HEAD -- | uniq)\"; NUM_REVS=\"$(echo \"$REV\" | wc -l)\"; if [ $NUM_REVS -ne 1 ]; then >&2 echo Files in the index were not all last modified in the same commit; exit 1; fi; else REV=\"$(git rev-list -1 HEAD -- \"$1\")\"; shift; fi; git revise \"$REV\" \"$@\"; }; f"
Seems a bit over the top when you could also just provide it as a command line argument to sqlite3 which is much more obvious than <<< and works in every shell.
sqlite3 db.sqlite3 .dump
Review what’s going on. Git gives the file name as the last argument, so you can’t pass .dump as the last argument like that without the contortion of an immediately-invoked function.
"echo" is typically part of the shell, it doesn't run another executable, so I don't think that would save anything.
Shh! Don’t spoil things! Picowatt-hours, I say. Picowatt-hours!
But seriously, although echo is typically a shell built-in and distinctly faster than /usr/bin/echo, it’s still much slower than <<<, presumably because it still has to set up a pipe and an extra… shall we say pseudoprocess.
Comparing behaviours for feeding text into `true` (typically a shell built-in, so that process spawn times don’t drown the signal):
try() {
    echo -e "\e[32;1m$1\e[m"
    for (( run = 0; run < 5; run++ )); do
        time (for (( i = 0; i < 1000; i++ )); do
            $2
        done)
    done
}
a() { /usr/bin/echo .dump | true; }; try /usr/bin/echo a
b() { echo .dump | true; }; try echo b
c() { <<<.dump true; }; try '<<<' c
d() { true; }; try "no piping" d
My best times of the five runs, under bash/zsh, expressed in time per iteration:

• /usr/bin/echo: 750μs/845μs
• echo: 469μs/371μs
• <<<: 11μs/31μs
• No piping: 3μs/10μs
So… yeah, on a very slightly older or slower machine than mine, using <<< may save you more than half a millisecond. That’s a much bigger difference than I expected—I was expecting it to be well under 200μs, maybe under 100μs, though the more I think about it the more I realise my expectation may have been unreasonable.
Improved benchmarking script (I just wasn’t thinking carefully at the time, just getting something out quick; but eval is obviously better than requiring a function):
try() {
    echo -e "\e[32;1m$1\e[m"
    for (( run = 0; run < 5; run++ )); do
        time (for (( i = 0; i < 1000; i++ )); do
            eval $1
        done)
    done
}
try "/usr/bin/echo .dump | true"
try "echo .dump | true"
try "<<<.dump true"
try "true"
There is another difference that users may care about here. zsh will create a temp file for each here-string in the here-string version, and bash may do too¹. Whether those files hit a disk is a matter of system configuration, and whether such a disk is magically quick or swirling rust is a different issue too.
[Just noting that the implementation here isn't equivalent for nerdsnipe-ery, not arguing for real attempts at optimisation of simple pipelines.]
¹ Always with older versions, only for large strings with newer versions.
Interesting, didn’t know about that. Hadn’t thought about it too deeply.
python <<<'import subprocess; subprocess.run(["ls", "-la", "/proc/self/fd"])'
Running under zsh, link 0 -> '/tmp/zshqqjTS0 (deleted)'; under bash, link 0 -> 'pipe:[1509353]'.

(I know <(…) is substituted in zsh with /proc/self/fd/… for a fd corresponding to a pipe:[…], and regular | piping makes 0 be a pipe:[…].)
I know some common configurations use concrete /tmp. Mine is tmpfs.
Thanks for the info! I like sharing these kinds of things because I have found them interesting and expect a few others will too, and people often add to them details I hadn’t known and like to know!
Fellow shell trivia enthusiast, hi!
$ bash -c 'realpath /dev/stdin <<< small'
/proc/79631/fd/pipe:[5282632]
$ bash -c 'realpath /dev/stdin <<< $(printf "%65536s")'
/tmp/sh-thd.yZ2L6p (deleted)
You can see the cutover on buffer size from my install of bash 5.2.15. Anything under 64k of here-string will still use a pipe.

In case you weren't aware, you can also force a "real" file with zsh process substitution by using =(…)¹. It can be useful when the tool you're using doesn't behave correctly with <(…), if it wishes to seek across it for example. Sadly, =(…) isn't supported in bash [yet?].
¹ In your case it still wouldn't hit a disk as it is in /tmp, but it at least becomes seekable.
It's rather annoying that part of the config is stored in your local checked out copy of the repository (.git/config) and part of it is in a configuration file that is checked into version control (.gitattributes).
That's because part of it is for you personally (the bit that dictates exactly which program gets run to render the diff), and part of it is for everybody (the bit that specifies the diff-related file type).
Sure, but unless I'm missing something, adding `sqlite3` into my `.gitattributes` is useless without manually doing extra work to specify the config. I get that it would be nice to be able to override this locally, but having no way to specify a default way to make this work feels broken.
Hmm, I'm sure the question was originally simply about the data being stored in two separate places? - but, I checked again, and it's very clearly not.
The usual workaround is to create a config file, check it in, and symlink it to .git/config.
Even better - use an include/includeIf in your config
That’s not a usual workaround. That’s a crippled technique that doesn’t scale beyond single-user, single-branch usage, because .git/config includes things like the remotes and branches.
That sucks because then you can't have project and user-specific configs (e.g. setting your email address for a particular project).
How often does that happen for you in practice? I’ve worked across hundreds of git repos with other teams, and I can count the ones where I had to fiddle with the git config on one hand.
It’s still a workaround with limitations, but I have found it to work ok most of the time.
All the time because I occasionally do extra curricular work on my work laptop and I want to use my Gmail address for that, not my work email address.
There's been occasional talk of implementing something like this on the Git ML & developer summits (and I've probably been the most excited about it), but for now this is deliberate, and symlinking a tracked file to .git/config is a big exploit vector: you're basically giving everyone who's got access to the repository shell access on your computer.
That's because some configuration will allow you to execute arbitrary commands on the user's computer (aliases, but also various ..cmd config, etc.). Then there's other config that's also dangerous, but short of arbitrary code execution.
So, if git ever does start supporting a "tracked config" it would involve being very careful about the security implications. The approach I've advocated for is similar to how Emacs handles this, i.e. to have the user whitelist certain configuration keys/value types as "safe".
But for now it's not supported at all, one workaround for what you may be trying to do is to do path-based includes. I.e. if the reason you're setting the user and/or E-Mail is e.g. for a personal or work configuration, then you could clone those projects in ~/git/personal or ~/git/work, and have your ~/.gitconfig set the configuration for those accordingly.
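A sketch of that setup (paths and file names are placeholders; note the trailing slash on the gitdir patterns, which makes them match everything underneath):

```
# ~/.gitconfig
[includeIf "gitdir:~/git/work/"]
    path = ~/.gitconfig-work
[includeIf "gitdir:~/git/personal/"]
    path = ~/.gitconfig-personal

# ~/.gitconfig-personal
[user]
    email = me@example.com
```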
This is a fair point. Allowing a specific diff program to be specified in the repository configuration would basically allow anyone with push access to your repository the possibility of executing arbitrary commands on your local machine.
This is an excellent and very clever trick; however what it does is store the sql database in binary format, and use the dump only for the diff.
Would it not be simpler to only store the text dump each time? (The dump can be used trivially to rebuild the db should the need arise, and it's human readable).
So maybe a pre-commit or pre-staging hook would be even better for this use case?
It kinda depends on performance; a large database may take a while to rebuild depending on indices and the like. But then, the same can be said for the output of this dump command (and diffing the result). It's probably manageable for a gnucash database (1000 - 100000 record order of magnitude if I were to guess), but for anything larger the dump command will take ages.
If the database is huge you don't want to store every version as a whole blob in git...
I don't have experience with truly huge dbs in the gigabytes or terabytes, but up to, say, 250 MB the dump feels very fast.
I love these sort of articles. No new tools are needed here, it shows you how to get the most out of the tools you already know and use. Awesome!
The filter[0] attribute with "smudge" and "clean" would be a better solution for storing SQLite into git.
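An untested sketch of what that could look like; both filters read stdin and write stdout, so they go through a temp file because sqlite3 wants a real file (and see the caveats about lost metadata below):

```
# .gitattributes
*.sqlite3 filter=sqlite3

# per-clone config: "clean" turns the binary DB into an SQL dump on the way
# into git, "smudge" rebuilds the binary DB on checkout
git config filter.sqlite3.clean  't=$(mktemp); cat > "$t"; sqlite3 "$t" .dump; rm -f "$t"'
git config filter.sqlite3.smudge 't=$(mktemp); sqlite3 "$t"; cat "$t"; rm -f "$t"'
```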
You have to be careful with this, as using .dump as your filter will not necessarily get you an equivalent database on restore. For example, .dump will not include database metadata such as application_id¹ or user_version² (perhaps other things too but these are the ones that have bitten me before).
The custom diff driver in the OP will not reflect changes to those values either obviously, but at least it won't lose them in the actual database file stored in the git repository.
Is there no capability in SQLite to dump these things as well? Seems like a pretty big thing missing if that's the case, a thing I would expect SQLite to support pretty well.
Not as far as I'm aware. You can query much of that stuff using the CLI with the .dbinfo command, but support for that appears to be dependent on build options so may not be available. Another option might be to provide multiple commands ("sqlite3 my.db 'PRAGMA user_version' .dump"), but you'd have to remember to update scripts if you suddenly started using other things in your app too.
I'll note that the metadata is handled correctly by the CLI's .backup command or you can "VACUUM INTO <somewhere>"¹, so if you're just attempting to use .dump for the sake of backup there are probably better options available anyway.
But yeah, I was surprised when I chased down a bug that was caused by a missing application_id following a restore from .dump output. I did a search at the time and noticed it had been reported to the sqlite folks on their forums a few times, so presumably I'm missing the reasoning behind not including it.
Not sure if this actually does more, but remember you can do:
sqlite> select * from pragma_application_id();
0
sqlite> select * from pragma_user_version();
0
Yeah, you could use that as in "sqlite3 t.db 'SELECT format(…) FROM pragma_…' .dump" to create a file that sets application_id/$others and then dumps the tables, which would make restoration from the dump file work in more cases.
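Something along these lines, for the record (a sketch; it needs a reasonably recent sqlite3 CLI that accepts multiple command-line arguments, and the file names are placeholders):

```
sqlite3 my.db \
  "SELECT printf('PRAGMA application_id=%d;', application_id) FROM pragma_application_id();" \
  "SELECT printf('PRAGMA user_version=%d;', user_version) FROM pragma_user_version();" \
  .dump > my.sql
```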
It is just another thing to be aware of if you're trying to store your database in your git repository with a filter, and want it to roundtrip correctly.
Is there a web interface like GitHub that supports this? I wanted to use it with GitHub Enterprise a few years ago and found that it meant I couldn't see diffs properly in GitHub.
Ah yes, I made a similar observation here https://news.ycombinator.com/item?id=38111408
It's better to store text in git IMHO.
> It's better to store text in git IMHO.
I think the bias towards text files in Git is only a reflection of Git configuring its defaults to handle text files, such as the diff tool and also file type. If you add a custom file type and configure it to not be handled as text, which includes specifying git attributes to prevent it from updating things like newline characters, then Git works just as well.
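For example, something like this in .gitattributes marks the files as opaque binary (the built-in `binary` macro expands to `-diff -merge -text`; the extension is just an example):

```
# .gitattributes
*.sqlite3 binary
```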
Diffs are purely a UI mechanic of Git. Computed on demand, but they don't really exist at the data layer, where blobs are stored. That's the real spot where all the problems are.
Binaries can't realistically be packed and compressed most of the time in Git's design, so the net result is there's basically a copy of every version of that file in your repo, forever. That 10MiB binary got modified 5 times? That's 50MiB of bloat that is stuck in your git repository, forever, until the end of time. It doesn't matter if you delete it. Space inefficiency like this is a core issue that compounds many problems in practice.
Then there's the fact a lot of the algorithms start falling over. For example, Git rebases touch the disk multiple times (patch application, update the index) for every commit in the series you rebase. This gets very expensive when the working tree is filled with tons of blobs and the repository is large (many files, even 99.9% small ones) and the series is long.
I sort of doubt most programmers want to write/rewrite 50TiB of binary files in their Git repo every day. Some certainly do I guess, but I suspect most of us just want to shove a few dozen reference PDFs and a CAD file or two into their repository, maybe some zip files or .so files that get auto-updated, and use their basic workflows without things having performance cliffs. Some will want to store game assets, which is harder. But today it's mostly unsatisfying for anything but the smallest and most glacially-moving binary files.
It doesn't. What you see when you display the content is only half of it; the other half is how the system is able to store the content through delta compression.
Here "text" or "binary" is a bit of a red herring, what's really important is whether it's diff-able. But in 99% of cases "binary" and "text" are synonyms for "un-diffable" and "diffable".
But the key disadvantage is that now getting started with the repository requires more effort: you must have sqlite3 installed locally, and you must apply certain config values manually (and “remember to run this helper script” or “the first time you try running it it’ll⸺” are still manual).
(This is for security reasons—you don’t want commands like clone and diff to be able to execute arbitrary code without your knowledge. Hence the attributes/config split.)
Otherwise, you’ll find that your .sqlite file is SQL instead of a SQLite database, and if you try running stuff while it’s in that condition you’ll get “file is not a database” errors from SQLite.
You are right, there is a bootstrap cost and potential security implications. In general, also, .gitattributes is pretty evil, but:
1. Don't store binary files in Git.

2. Don't run programs that keep a git-tracked file open for too long, otherwise clowntown. For example, you run git pull while GnuCash keeps the file open and then you lose all your updates when you save (also a problem with the code below, but fixable).
The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:
```
#!/bin/bash
# Rebuild the SQLite file from the tracked SQL dump ("smudge")
rm -f "$1.sqlite3"
sqlite3 "$1.sqlite3" < "$1.sql"
gnucash "$1.sqlite3"
# Dump the database back to the tracked SQL file ("clean")
echo ".dump" | sqlite3 "$1.sqlite3" > "$1.sql"
# optionally:
# git add "$1.sql"
# git commit -m "Update $(date) $1.sql"
```
".gitattributes is pretty evil"
Could you please expand on this? I've been working with git for many years, and IME .gitattributes files are useful, sometimes essential, as the standard way to manage repo-level settings -- especially helpful for larger projects with heterogenous contributors. What am I missing?
In general, I believe in not giving people enough rope to hang themselves, but hiding the rope under a lock (Porcelain commands) that you open by entering complicated commands which weird syntax (Plumbing commands).
.gitattributes are "evil", because they "lie" about what's in the repo. You could use .gitattributes for simple things like line endings on different platforms (Windows vs Unix), but you could also use it for more complex flows where the file you have locally and the file you store in the repo are completely different - filter feature. This would be "magic"/hidden/implicit, rather than explicit for the user who won't understand why github.com shows something and their local filesystem shows something else.
Why is this better?
You won't be storing the sqlite file in git. So you won't be storing indexes or un-vacuumed data in git. You also won't be storing a database with some of its rows locked or anything like that.
Git will be able to compress the text better, and compress between versions better.
Merging and resolving conflicts would have a chance of working.
Maintaining a big, binary blob will make the git history grow in proportion to the size of the binary blob, not the size of the logical/abstract diff, regardless of how comprehensible the diff looks.
To minimize the disk space to only the changes, you want to store the changes and not the complete, binary before/after images. By filtering, you reduce the stored object to the difference.
This is what I was expecting. Store the dump in git but then have the binary in the working directory.
I used to do this trick a couple of years ago to diff the credentials file in Rails codebases, which is encrypted, to keep track of changes to passwords and tokens.
Worth keeping in mind that the text representation is virtual, and thus the command is run on every operation like a git-diff or a git-show. Therefore, if it takes a while to convert to text, it will also slow down Git and other tools depending on it.
We do the exact same thing to keep track of some credentials: we use sops[1] and AWS KMS to separate credentials by sensitivity, then use the git differ to view the diffs between the encrypted secrets.
Definitely not best practice security-wise, but it works well
I was also exploring something like this a few years back but for Office files. This exact approach seemed like an absolute win to me, but I ended up not using it, because this won't work in the GitHub web UI. This won't be a deal-breaker to many, but people should be aware of it still. In the end I ended up doing this: https://github.com/TomasHubelbauer/modern-office-git-diff/
does it work with xlsm? (Excel macro-enabled?)
I am not sure the internal format of XLSM and how the macros are represented within it, but I would assume they are stored in the form of their source code as a plain text file within the ZIP archive that is the XLSM file and if that's the case, my tool would work with them just fine.
The really neat part about this is that it still uses the original SQLite files, so it will work retroactively once you add the configuration to your repo. Great article!
Is this only visual?
Ooh, that's a neat trick! The head fake, so to speak is this is a git trick rather than anything special to sqlite. Thanks for sharing.
Awesome...! Thanks for sharing!
You can also use this trick to diff encrypted files, if you have an automatic way of decrypting them
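For example, for GPG-encrypted files it could look something like this (assuming the key is available non-interactively; the extension and driver name are examples):

```
# .gitattributes
*.gpg diff=gpg

# per-clone config
git config diff.gpg.textconv "gpg --quiet --batch --decrypt"
```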
Are people really adding their database dumps to git? Does that scale? Without this trick, what was the point?
Out of the box no, but you can solve the scaling git problem by improving the deduplication: https://xethub.com/assets/docs/faq/how-xet-deduplication-wor...
We're focused on this problem at XetHub.
- Llama2 git repo with ~600 GB of model files: https://xethub.com/XetHub/Llama2
- RedPajama git repo with 3.4 terabytes of data: https://xethub.com/XetHub/RedPajama-Data-1T
Not scale, no, but git also tends to work pretty well for certain data projects; given git is just a VCS it can track any kind of file (to varying degrees of usefulness - git really doesn't like files that massively reorder themselves when the editor saves them like MSWord or certain FXML generators, it completely mucks up the diffs).
For hobby projects, I've absolutely committed an SQLite database in the past to serve as an HTTP cache; I was relying on an external API with some pretty hefty usage limits but at the same time was also certain that the data I was requesting from that API would never change, so I could just keep all responses around and reuse them forever (and it was easier to cache the response than to write a bespoke serialization format for converting them). Yeah, it's technically a build artifact (because it's not code) but it's better than getting 429 errors on every new dev machine for an hour + it was just hobbyist code, nothing meant for professional use.
The entire database, no. But a table with slowly changing dimensions, or a table that "should not" change, git is a simple solution for snapshotting.
Version control? I have a pile of static-ish SQLite databases for unit tests to run against. I plan on using this for when we modify those databases, it’s legible what the change is.
I had no idea you could do this. Phenomenal!
Alternatively: https://www.sqlite.org/sqldiff.html
That takes two files as input, don’t you have to checkout, copy etc manually?
Whereas (IIUC) in the article git runs a text dump on both dbs automatically during git diff, which is an insane UX improvement. (Given that you’re like me and didn’t know about this).
You can probably configure git to use this tool when it diffs sqlite files.
True, I suppose the utility is slightly different. The trick in the OP article is definitely neat.
That’s a really cool trick! I’m going to have to see if I can apply it to other things at work as well. Maybe an image-to-ASCII conversion and then a diff would be useful…
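In the same spirit, textconv can already point at anything that turns a binary into text; a sketch using ImageMagick metadata (an actual image-to-ASCII tool would slot in the same way, and the pattern/driver name are just examples):

```
# .gitattributes
*.png diff=image

# per-clone config: diff the metadata rather than the pixels
git config diff.image.textconv "identify -verbose"
```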
Here’s an idea for the AI folks: ask it to explain the differences. Imagine something like like:
“The same screenshot of an iOS app, but the Subscribe button is clear blue and more prominent”
(Although it wouldn’t work directly, since it looks like git runs “text dump” on both independently and then compares the text naively).
If the process of image generation / description was fully reversible we could store image descriptions instead of a list of pixels...
But if one feeds an image description from chatGPT to Dall-e and back in a loop, how many steps does it take to revert to pure noise? (surely this has been tried? but I couldn't find it)
I mean there are billions of perceptually distinct images that map to the same “text description”. So text would generally be both lossy and inefficient.
> instead of a list of pixels
We don’t store lists of pixels. Not even lossless formats like PNG do that. Good ole JPEG has 1:10 - 1:20 compression ratio, ballpark.
""""AI""""
But it's not really a diff of the database itself, just a diff of a full dump that you can use to rebuild the db, but not change an existing one. For example, when you do a DELETE, the diff instead shows one INSERT less in the dump, which is not exactly a database diff. Depending on the use case that might still be ok.