DuckDB Doesn't Need Data to Be a Database

mbreese
54 replies
1d21h

Back in the day (early 2000’s), I worked with a DB2 instance that had similar functionality. At the time, they called this feature federated databases. If you had the appropriate wrapper, you could use any data source in a query. Even output from other programs. At the time I used it for including dynamic DNA sequence alignments in queries.

IIRC, SQLite can do similar things with virtual tables (with a more limited set of data file types).

I always liked this way of working, but I also wonder why it never really took off. Data discovery can be an issue, and I can see the lack of indexing as being a problem.

I guess that’s a long-winded way to ask: as interesting as this is, what are the use cases where one would really want (or need) to use it?

abraae
45 replies
1d19h

I always liked this way of working, but I also wonder why it never really took off.

In today's newfangled world, a lot of developers don't use a lot of the great stuff that RDBMS can provide - stored procedures, SQL constraints, even indexes. The modern mindset seems to be that that stuff belongs in the code layer, above the database. Sometimes it's justified as keeping the database vanilla, so that it can be swapped out.

In the old days you aimed to keep your database consistent as far as possible, no matter what client was using it. So of course you would use SQL constraints, otherwise people could accidentally corrupt the database using SQL tools, or just with badly written application code.

So it's not hard to see why more esoteric functions are not widely used.

Arch-TK
20 replies
1d19h

I've heard people call the old approach "outdated" and even "wrong" before.

It's honestly the dumbest situation, where people eagerly use extremely complex databases as if they were indexed KV stores, completely ignoring about 97% of the features in the process.

What's especially funny is that half the time a basic KV store would perform better given all the nonsense layered on top.

And then there's this whole mentality of "we can't interface with this database unless we insert a translation layer which converts between relationships between sets of tuples and a graph structure".

It's like people have unearthed an ancient technology and have no idea how it's intended to be used.

dalyons
17 replies
1d17h

Stored procs, triggers and suchlike actually are outdated and mostly wrong. Have you considered that people moved away from them for good reasons, not just ignorance? Hard to source control, impossible to debug, written in weird/bad languages, massive blast radius, don’t scale. Usability matters! I have written more TSQL than I would care to admit, and it’s a hell you couldn’t pay me enough to go back to.

The “consistency must be enforced at all costs” mindset just turned out not to be true. Since my old DBA days I’ve worked at many places at moderate scale that don’t use FKs. It just doesn’t actually matter; I can’t recall any serious bugs or outages related to dangling rows. Plus, you end up having multiple databases anyway for scale and organizational reasons, so FKs are useless in that environment anyway.

On the other hand, I’m all for indexes and complex querying, not just K-V.

leoqa
7 replies
1d16h

I think you touched on the real reason: we separate services now and they all have different databases. We shifted up to the application layer not because of technical reasons but social reasons.

dalyons
6 replies
1d16h

Your timing is off, we as an industry ditched stored procs and friends before (micro)services became a trend. For the reasons I said.

I’m not sure why it’s so hard to accept that we stopped using them because they were bad, not for some other gotcha reason (social, ignorance, fads, etc)

btreecat
5 replies
1d15h

Maybe because claiming they are bad without explaining why, giving explicit examples, or really digging into the nuance of the several features mentioned makes the claim feel unsubstantiated.

dalyons
4 replies
1d15h

Pretty sure I did say why: “Hard to source control, impossible to debug, written in weird/bad languages, massive blast radius, don’t scale. Usability matters!” And I forgot “generates non-obvious, non-local changes” for triggers.

ako
3 replies
1d12h

What is so hard about source controlling stored procedures? I’ve been writing them in text files for over 30 years, version controlled in RCS, CVS, SVN and git. No problem; it’s just the deployment automation that works a bit differently.

Also, Oracle PL/SQL is way better than MS T-SQL; it mostly feels like Pascal/Modula-2 with embedded SQL statements.

dalyons
2 replies
1d1h

Yes, you can hack something together that keeps them in source control, and then some automation to run them in a CD way. But now your business logic change is half in application code and half in sprocs, and you have a versioning and deployment coordination problem. Which you can work around too. But why? Making deployments and debugging of business logic hard, to what end? Just keep all the business logic in your app code; then it's deployed and tested as one artifact, and everyone's happier.

Also I hope you're not suggesting anyone pick up Oracle for a new project these days :)

ako
1 replies
23h58m

What do you mean hack to keep them in source control? They're just sql source code files, like any other source code you can simply commit to version control?

Wouldn't advise anyone to use Oracle, but neither would I advise them to use SQL Server. Usually Postgres is good enough, although Oracle PL/SQL is still nicer than Postgres PL/pgSQL (e.g., packages).

dalyons
0 replies
22h32m

I feel like I addressed this in a different subthread https://news.ycombinator.com/reply?id=40526231&goto=threads%...

You can source control them, but then you have a deployment coordination problem. Often changes to business logic will be half in application code and half in the sproc, and you have to coordinate the deployment of that change. You have to somehow reason about what version of the sprocs are on your server. Add on that usually in environments that use sprocs, changes to them are gatekept by DBAs. Just write your business logic in application code.

The_Colonel
3 replies
1d14h

Hard agree.

But I like foreign keys (ON DELETE RESTRICT), they are the last layer of defense against application bugs messing up the database.
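
A minimal generic-SQL sketch of that last line of defense (the orders/customers tables here are hypothetical, not from the article):

    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        -- deleting a customer that still has orders now fails
        -- instead of silently leaving dangling rows behind
        FOREIGN KEY (customer_id)
            REFERENCES customers (id) ON DELETE RESTRICT
    );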

dalyons
2 replies
1d14h

I totally get that position. I used to be in that camp, but in my last 3 jobs at companies of >10 mil users, we had to get rid of FKs for other reasons (perf, dist system stuff). I kind of expected bad data corruption issues... but they just never happened ¯\_(ツ)_/¯. Not sure if that was strong cultures of automated testing, or what. So now I no longer think they're worth the tradeoffs.

YMMV with this one; I could see how it might pan out differently in other environments.

pbowyer
1 replies
1d11h

Did these involve financial transactions? In other situations I'll accept dangling rows without issues, but whenever money comes into the picture I like to have as much consistency enforced as I can. But I'm aware that doesn't play nicely with distributed systems.

dalyons
0 replies
1d2h

Yes actually. Financial transactions are often used as an example that demands strong consistency, but it's not true in the real world for the most part. For any given money movement (a debit and a credit) there's a good chance one of those legs is over the network - you're either interacting with a third-party API, or a proprietary internal product, or similar. So you can't have transactional consistency, or FKs. Further, for every movement there are layers of reconciliation that happen hours or days later and fix any inconsistency. Finance / fintech is actually among the most eventually consistent systems out there, so it's somewhat ironic that it shows up in so many transactional examples.

The exception is true banking ledgers maintained at banks, but even then you'd be surprised. Entirely different systems handle loans and checking accounts.

Arch-TK
2 replies
1d7h

Most of the actual drawbacks you describe are caused by how databases are being used, not by their inherent design. (Lack of source control and "weird/bad language" are just wrong: you can source control this stuff just fine, and if you don't like the language nothing stops you from transpiling.)

That's what I meant when I said it's weird people keep picking these particular databases for projects which don't end up using any of the features.

Stored procedures, triggers and all the other features people seem to refuse to use work mostly fine if you actually design the rest of the product around them and put a modicum of effort into them.

"It doesn't scale" - that can be said of the fundamental database design itself. If you need to scale hard then you need to pick a distributed database.

dalyons
1 replies
1d1h

I don't really know why I'm arguing this; there have been hundreds of blog posts over the decades explaining why sprocs are considered bad. The marketplace of ideas has spoken, and we've moved on. I'll likely never convince you, but here goes my take, expanded some more.

You can source control them, but then you have a deployment coordination problem. Often changes to business logic will be half in application code and half in the sproc, and you have to coordinate the deployment of that change. You have to somehow reason about what version of the sprocs are on your server. Add on that usually in environments that use sprocs, changes to them are gatekept by DBAs. Just write your business logic in application code.

Transpile? Why on earth would anyone bother with that; it just makes it even more complicated and impossible to debug. Without that, TSQL/PSQL/PGSQL are objectively very awkward languages for imperative business logic. People are not familiar with them, and it makes it hard to jump from regular code to *SQL code to read the logic flow, never mind debugging them with breakpoints and suchlike. Splitting your business logic between database and app makes it much more awkward to test too. Just write your business logic in application code.

Scale - making the hardest-to-scale piece of your architecture, the traditional RDBMS, an execution environment for arbitrarily complex business logic is going to add a lot of extra load. Meaning you're going to have scaling problems sooner than you otherwise would. Just write your business logic in application code; you can scale those pods/servers horizontally easy peasy, they're cheap.

Look, you can make any technology work if you try hard enough, but sprocs are almost all downsides these days. The one minor upside is some atomicity, but you can get that with a client-side transaction and a few more round trips. There's just, IMO, no reason to pick them anymore and pay these costs, given what we've learned.

BTW, I could have made this clearer in the original comment, but I love RDBMSs - I love schemas, indexes, complex querying, and transactions. It is not weird to keep picking these databases to get those things.

I just will never again use them as a programming environment.

Arch-TK
0 replies
19h58m

[appeal to popularity]

I think you put too much weight on what people currently do. Every year it seems the previous year's trendy ideas are "terrible" and something new is "the way it should have always been done from the start".

[stored procedures, business logic splits, and DBA gatekeeping]

You probably shouldn't be using stored procedures to implement more than you need to keep things internally consistent (i.e., almost never). I actually don't understand why you're so hung up on stored procedures. I've seen people try to implement whole applications on top of stored procedures; that's certainly not anything I'm recommending.

Yes if you split your business logic like that then you're going to have trouble, so don't. If instead you treat your database as an isolated unit and keep it internally consistent, then it's not really that difficult to keep things always working even when migrations happen.

As for the DBAs, nobody says you have to have DBAs, but if you're going to use a technology, it's worth having someone who understands it. (or, you know, just pick a different database).

[transpile?]

I mean, lots of people use query generators which let you write e.g. native python and generate equivalent SQL (SQLAlchemy Core).

Anyway, your complaint seems to be that people don't understand SQL databases well enough to use them properly. Sure, but this is like complaining that git is bad because nobody knows how to use it. People who don't know or want to learn git, SQL, or a tool they're using, should pick a different tool.

[it scales worse if you try to implement an application in it]

well yes

[more complaining about abuse of stored procedures]

sure

...

I think at some point you read something I wrote as: "And you should attempt to find out a way to shoehorn your entire application into the RDBMS such that your application is just a frontend which calls stored procedures on the RDBMS" but that's certainly not anything close to what I said.

Designing your database such that it stays internally consistent when you perform operations on it is a good goal, and doesn't require filling it with stored procedures and making it your backend.

psadri
0 replies
1d14h

I think the implementations had a bad DX - but the concepts were not incorrect. Not enforcing referential integrity, or forgoing schemas altogether (hello document stores, and then doing the equivalent of SQL joins and aggregations in nested loops within code) is just plain lazy/ignorant and comes back to bite you in the ass later.

myaccountonhn
0 replies
1d15h

It makes me think that for many projects, recutils would be a better fit if you’re not gonna use the advanced functionality anyway. It should do just fine on medium sized DBs and is so much simpler.

jacob019
1 replies
1d16h

Yep, that's me. I use sqlite as a key value store. Usually just two columns per table, a key and value, that I map to a python dict, and pickle the value if I need something more complex.
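
For anyone curious, the whole "schema" for that pattern is roughly the following (a sketch; the table name is made up, and the pickling happens on the Python side):

    -- one table per namespace: a key plus an opaque (pickled) value
    CREATE TABLE IF NOT EXISTS cache (
        key   TEXT PRIMARY KEY,
        value BLOB
    );

    -- writes are upserts
    INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?);

    -- reads are single-row lookups on the primary key
    SELECT value FROM cache WHERE key = ?;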

actionfromafar
0 replies
1d8h

SQLite is obscenely performant as a key-value store, though.

vosper
7 replies
1d18h

In today's newfangled world, a lot of developers don't use a lot of the great stuff that RDBMS can provide - stored procedures, SQL constraints, even indexes. The modern mindset seems to be that that stuff belongs in the code layer, above the database.

Implementing constraints in application code is certainly a lot easier (and easier to test) than in the database. What I want is a database with far stronger constraint capabilities than e.g. MySQL or Postgres provide, so that my application-level constraints can live in the database where they belong, without compromising ease of development and maintenance.

mbreese
5 replies
1d17h

> Implementing constraints in application code is certainly a lot easier

I can’t imagine doing this. You’re basically saying that you will only ever have one program writing to your database.

vineyardmike
3 replies
1d12h

This is, for many, considered a best practice.

The idea being that a single service/codebase is controlling interaction with the database. Any read/writes should go through that service's APIs. Basically a "CreateFoo(a, b, c)" API is a better way to enforce how data is written/read vs every service writing their own queries.

mkleczek
2 replies
1d12h

It incurs huge costs though:

1. Costs of API maintenance - rigid APIs need constant change. To mitigate that people create APIs like GraphQL, which is a reimplementation of SQL (i.e. a query language).

2. Costs of data integration - (micro)systems owning databases need data synchronisation and duplication. What's worse - most of the time data synchronisation bypasses service APIs (using tools like CDC etc.) so all efforts to decouple services via APIs are moot.

A single data platform (a database) with well governed structure enforced by a DBMS is a compelling alternative to microservices in this context.

vineyardmike
1 replies
1d11h

Costs of API maintenance - rigid APIs need constant change.

Rigidity is the point...

If you have a "CRUD a Foo" set of APIs, then how you create/read a "foo" is defined in that API. Sometimes change is necessary; sometimes the API contract changes, but sometimes just the internal implementation changes (e.g. new or refactored tables/columns). The rigidity of the API ensures that every downstream user of a foo creates/reads in the exact same way. It centralizes permissions, overrides (e.g. a/b testing), rate limiting, transactions, etc. to be homogeneous for everyone. If you want to create/read a foo via database queries alone, and the database changes or the business logic changes, then the same issue occurs where all clients need to change, but now it needs to be coordinated everywhere, and you can't benefit from a centralized change hiding the implementation details.

Many people prefer to keep all the logic around enforcing consistency and object lifecycle (application behavior) in the application layer. This allows a single codebase to manage it, and it can be uniformly guarded with tests. Exposing the database itself is really just an example of leaking implementation details.

To mitigate that people create APIs like GraphQL

If you need raw flexible queries, then this is probably the wrong sort of solution. Ideally, the developer of a service already knows what queries will be made, and clients don't typically need detailed custom queries. Analytics (typically read-only) should already occur in an offline or read-replica version of the database to not impact production traffic.

(micro)systems owning databases need data synchronisation and duplication.

What do you mean? Foo service exposes "CRUD-Foo" apis and is the only service that calls the Foo-storing database. If Foo service is horizontally scaled, then it's all the same code, and can safely call foo-db in parallel. If the database needs horizontal scaling, you can just use whatever primitives exist for the database you picked, and foo-service will call it as expected, and the transaction governs the lifecycle of the records.

...all efforts to decouple services via APIs are moot.

To be clear, different services wouldn't operate on their own duplicated version of the same shared data, in their own database. They'd call a defined API to get that data from another service. The whole point of this is to allow each service to define the interface for their data.

mkleczek
0 replies
1d10h

Exposing the database itself is really just an example of leaking implementation details.

This is a mistake that a lot of people make - treating the database as an implementation detail. The relational model and relational calculus/algebra were invented for _sharing_ data (not just storing it). The relational model _is_ the interface. Access to data is abstracted away - swapping the storage layer or introducing different indexes or even using foreign data wrappers is transparent to applications.

Security and access policies are defined not against API operations but against _data_ - because it is data that you want to protect regardless of what API is used to access it.

To be clear, different services wouldn't operate on their own duplicated version of the same shared data, in their own database. They'd call a defined API to get that data from another service.

You mean synchronous calls? This actually leads to what industry calls "distributed spaghetti" and is the worst of both worlds.

The whole point of this is to allow each service to define the interface for their data.

The point is that well defined relational database schema and SQL _is_ a very good interface for data. Wrapping it in JSON over HTTP is a step backwards.

tomrod
0 replies
1d16h

That's the idea of microservices.

galkk
5 replies
1d17h

IMHO stored procedures are evil and they shouldn’t have a place in modern development. They are an old solution for the client-server world.

ako
1 replies
1d12h

I don’t get this mentality. They’re just code that lives close to your data, and can do bulk processing on data in your database way more efficiently than anything living in the application layer.

FpUser
0 replies
1d3h

"can do bulk processing on data in your database way more efficiently than anything living in the application layer."

Not necessarily. It all depends on the particulars. Very complex processing of a reasonably sized chunk of data, especially when it can be done in multiple threads / distributed, can be way more efficient.

I've had a very real case: a consulting company was hired by a telco to write the code that would take the contents of the database and create one huge file containing invoices to clients. Said file would then be sent to a print house that could print the actual invoices and mail them to clients.

They tried to do it with stored procedures and failed miserably - it would run for a couple of days, give or take, and then crash.

I was called in to help. I created 2 executables: one is a manager/dispatcher and the other is a multithreaded calculator that did all the math. The calculator was put on 300 client care workstations as a service. The dispatcher would do an initial query of the accounts table for the current bill cycle, split account IDs into a few arrays and send those arrays to the calculators. The calculators would pull the needed data in bulk from the database, do all the calculations and send the results back to the dispatcher for merging into the final file. The telco people were shocked when they realized they could have the print file in less than 2 hours.

mbreese
0 replies
1d17h

Sometimes you need to have atomic updates for a function. Stored procedures aren’t inherently evil, but they are a holdover from when DB admins were a thing and they didn’t trust app developers.

Don’t worry, just wait another 5-10 years, they’ll be back in vogue again.
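
For the atomic-updates point, a minimal PostgreSQL-flavoured sketch (the accounts table and the transfer procedure are hypothetical):

    CREATE PROCEDURE transfer(src BIGINT, dst BIGINT, amt NUMERIC)
    LANGUAGE plpgsql AS $$
    BEGIN
        -- both updates commit or roll back together on the server,
        -- no matter which client calls this
        UPDATE accounts SET balance = balance - amt WHERE id = src;
        UPDATE accounts SET balance = balance + amt WHERE id = dst;
    END;
    $$;

    CALL transfer(1, 2, 100.00);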

LunaSea
0 replies
1d11h

- References are triggers

- Audit tables are often implemented thanks to triggers

- Soft deletes can be easily implemented thanks to triggers
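
For the audit-table case, a minimal sketch in SQLite's trigger dialect (the orders and orders_audit tables are hypothetical; Postgres needs a trigger function, but the idea is the same):

    CREATE TRIGGER orders_audit_trg
    AFTER UPDATE ON orders
    BEGIN
        -- every change gets recorded without any application code knowing about it
        INSERT INTO orders_audit (order_id, changed_at)
        VALUES (NEW.id, CURRENT_TIMESTAMP);
    END;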

LaGrange
0 replies
1d9h

We live in a client-server world.

banku_brougham
3 replies
1d17h

Stored procedures are awful. There, I said it.

It's storing transform code in an inscrutable way, hiding pipeline components from downstream users. The rest of your alarms and DQ checks have to be extended to look into these...

Oh, and PL/SQL requires a huge context switch from whatever you were doing.

dalyons
1 replies
1d15h

It’s not a controversial take. They are awful, and thus haven't been used for new development for at least a decade, going on two.

mvc
0 replies
1d9h

I generate my pl/pgsql using Lisp, which is another technology that has been dying for decades. The money it generates seems to work just as well as the money I earned writing JS.

xarope
0 replies
1d15h

I'm a bit surprised about the flak that stored procs have. Treat them just like any other, hopefully well documented, API, and they work perfectly fine. There are pros and cons, just like whether you decide to use react or htmx, python or golang, erlang or lisp, row based database vs column based vs kv, etc.

On the other hand, going back to the days of poorly written and documented Oracle PL/SQL stored procs, yes, I shudder too. But then again, the same can be said of a number of public APIs that have been hacked or had side effects exposed.

whizzter
2 replies
1d17h

People do still use indexes and the basic FK/UQ constraints because they add value even when using ORM mappers, etc. I don't think the idea of swapping databases is that prevalent any more (apart from perhaps unit testing, but since DBs can often be dockerized that's also solved now).

People don't hate stored procedures (or other custom SQL constraints/triggers) per se, but more the amount of headache they bring when it comes to debugging and keeping in sync with version control.

jasonkester
1 replies
1d12h

.sql files go into version control just as well as .java files. And because your build will be blowing all your stored procedures away and rebuilding them, you'll know at compile time if Junior Dev Jimmy dropped a null constraint or renamed something but didn't fix one of the procs.

That's way easier to debug than if Jimmy also had his query living as a string somewhere in your backend code.

Honestly, if this stuff is hard, it's because you've made it hard. I can only assume most people griping about stored procedures don't have a /sql folder living next to their /python or /java folder in source control.

dalyons
0 replies
16h23m

Or… you just keep all your business logic in /java and call it a day. It’s stored procs that are making things harder.

victor106
0 replies
1d14h

Sometimes it's justified as keeping the database vanilla, so that it can be swapped out.

I understand if it's a prototype, but I have seen developers of even mature products follow this mentality that they want the DB to be portable. It's insane. The reality is that rarely happens, and when it does it's mostly because your business is growing. But not knowing (or caring to know) the features your DB offers you and not exploiting them is just laziness.

dalyons
0 replies
1d19h

The approach in the article is for analytics / DW type use cases, not OLTP. So a dig about kids these days not understanding constraints is not particularly relevant.

LunaSea
0 replies
1d11h

The reason a lot of more advanced database features aren't used is simply because a lot (most?) web developers don't know them and have a very light understanding of databases in general.

solidsnack9000
2 replies
1d21h

A related functionality is "SQL/MED", a SQL specification for federated databases that historically has some kind of relationship to medical data (I believe one of the use cases is data hosted at one site that another site is allowed to query, in a limited way, but that may not be moved from where it lives).

sixdimensional
1 replies
1d12h

The MED stands for Management of External Data; it has nothing to do with medical data.

It’s actually a chapter in the ISO/ANSI SQL specification.

solidsnack9000
0 replies
2h56m

My understanding from reading the literature around SQL/MED and related technologies is that they were, indeed, inspired by data management problems in medicine. The use cases are like those I outlined -- we can't move the data from this physical location, but we want to allow other sites to query it.

As a result, a specification was added as part of the ISO/ANSI SQL specification process.

The MED term is an acronym and not a shortening of "medical".

refset
1 replies
1d20h

Steampipe demonstrates a rather impressive range of scenarios for using FDWs + SQL in place of more traditional ETL and API integrations: https://steampipe.io/

banku_brougham
0 replies
1d17h

I liked this when I tried it out. The AWS API plugin was neat: query resources with SQL.

However, I like to limit my toolset to three things; these days that's DuckDB, Julia for the analysis side, and... OK, two things.

Oh yeah Trino for our distributed compute.

noisy_boy
0 replies
1d10h

Similar timeline - Oracle had a feature called external tables where you could create a "table" on top of an external flat file and run queries against it. Prior to that we were using awk/perl etc. and found the external-table approach to be more performant, plus all the benefits that SQL brings.
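
Going from memory, the shape was roughly this; treat it as a sketch rather than verified Oracle syntax (the directory object, columns and file name are all made up):

    CREATE TABLE rides_ext (
        ride_id NUMBER,
        fare    NUMBER
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY data_dir
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
        )
        LOCATION ('rides.csv')
    );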

banku_brougham
0 replies
1d17h

SQLite is awesome, obviously, but the API isn't elegant and easy the way DuckDB (or Postgres) is. Handling dates in SQLite has me googling everything.

youngbum
11 replies
1d13h

This is the exact reason we brought DuckDB and duckdb-wasm into our service.

Our team is currently building a form builder SaaS. Most forms have under 1,000 responses, but some have more than 50,000.

So, when a user tries to explore all responses in our “response sheet” feature, they would usually be loaded via infinite scrolling (load as they scroll).

This uses up to 100 MB of network traffic in total if they have to get object arrays of 50,000 rows of data with 50 columns.

That was where DuckDB kicked in: just store the responses in S3 as a parquet file (in our case Cloudflare R2).

Then, load the whole file into duckdb-wasm on the client. So when you scroll through the sheet, instead of getting rows from the server, you query rows from the local DB.

This made our sheet feature very efficient and consistent in terms of speed and memory usage.

If network speed and memory are your bottleneck when loading “medium” data into your client, you definitely should give it a try.

PS. If you have any questions, feel free to ask!

PS. Our service is called Walla, check it out at https://home.walla.my/en

hamandcheese
5 replies
1d13h

you query rows from the local DB.

But the data is still remote (in object storage), right? If I understand correctly, this works better than the first solution because parquet is a much more efficient format to query?

foota
2 replies
1d12h

It's probably part of it, but also overhead from small requests and latency from round trips.

youngbum
1 replies
1d11h

Great point.

The advantages of loading parquet on the client side are that 1) you only have to load data once from the server and 2) parquet files are surprisingly well compressed.

1) If you load once from the server, there are no more small network requests while you are scrolling a table. Moreover, you can use the same DuckDB table to visualize data or show raw data.

2) Sending the whole dataset as a parquet file is faster over the network than receiving the data as JSON responses.

foota
0 replies
15h18m

I wonder how much of the benefit is from just not using json vs reducing round trips. I guess if you had a large table you could stream (smaller than normal) row groups from parquet? Not sure how much work that would be though.

I'm not sure what the optimal size is for an HTTP response, but probably there are diminishing efficiency returns above a MB or two, and more of a latency hit for reading the whole file. So if you used row groups of a couple of MB and then streamed them in you'd probably get the best of both worlds.

youngbum
0 replies
1d11h

My comment was a bit ambiguous. For sheets where we have to load all the data, we load all columns at once as a parquet file. (I'll cover the advantages of this approach in the next comment.)

On the other hand, let's say we have to draw a chart from a column. The chart type can be changed by the user - it could be a pie chart, mean, time series chart, median, table or even dot products. To achieve this, we fetch just that column from S3 using DuckDB, apply SQL queries on the client side, and render the appropriate UI.

youngbum
0 replies
1d11h

Long story short, you can either 1) query specific columns using the S3-parquet-DuckDB stack, or 2) load the parquet file over the network and put it inside a local duckdb-wasm instance so that you can run queries on the client side.
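
Roughly, the two options look like this (a sketch; the bucket path and column names are made up, and option 2 assumes the parquet file has already been fetched/registered with the wasm instance):

    -- 1) pull only the column a chart needs, straight from S3 (column pruning)
    SELECT answer
    FROM read_parquet('s3://bucket/forms/form_123/responses.parquet');

    -- 2) or materialize the whole file once in the local duckdb-wasm instance,
    --    then page through it without touching the network again
    CREATE TABLE responses AS
    SELECT * FROM read_parquet('responses.parquet');

    SELECT * FROM responses LIMIT 100 OFFSET 200;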

LunaSea
2 replies
1d11h

You wouldn't need DuckDB for this; you can simply store the parquet files in S3 and read them using a parquet NPM package.

youngbum
1 replies
1d11h

Exactly.

We have also tried arrow js and parquet-wasm, and they were much lighter than the duckdb-wasm worker.

DuckDB, however, was useful in our case: being a form builder service, we had to provide statistics features. It was cool to have an OLAP engine inside a web worker that could handle (as far as we checked) more than 100,000 rows with ease.

LunaSea
0 replies
1d1h

I'm still unconvinced.

A regular JavaScript array can also handle 100k object rows very fast.

laurels-marts
0 replies
22h20m

So you have duckdb running on the server (e.g. node.js) and duckdb-wasm running on the client? Or are you hitting S3 directly with duckdb-wasm?

remram
9 replies
1d18h

Why wouldn't you mail your co-worker the SQL text of this query? What is the advantage of the binary database file with no data?

ramraj07
6 replies
1d18h

Or if all your data is in a place like snowflake, you just create a view and send them the name..

wild_egg
3 replies
1d16h

Not a whole lot of operational difference between sending them a snowflake view name or sending an S3 URL but there's probably two orders of magnitude cost difference

ramraj07
2 replies
1d12h

Don’t know what you’re talking about. Used correctly, Snowflake can be as cost-effective or cheaper for most orgs’ data needs, compared to spinning up and down clusters of Spark (and maintaining teams to do this work, or paying Databricks).

wild_egg
1 replies
1d2h

Not talking about spark clusters anywhere here. I wouldn't be surprised if that's similar in cost to snowflake.

What I'm talking about is snowflake vs a $50 EC2 instance running DuckDB reading data from S3.

Try it out sometime — the results might surprise you

ramraj07
0 replies
17h21m

I have, and haven’t come away impressed. I have in fact done this multiple times. Perhaps this doesn’t work with actual complex analytics and stuff? DuckDB doesn’t even do proper SIMD, for example.

To be clear, I have used DuckDB in some serverless ideas and it has worked well, but nothing has beat hyper-optimized engines like Snowflake and Redshift.

banku_brougham
1 replies
1d17h

well yeah when you pay the big bucks you can expect all the bells and whistles.

MR4D
0 replies
1d16h

The future is here, it's just not evenly distributed. Wait a decade (or less) and this will be in open source. Heck, it might already be.

wild_egg
1 replies
1d18h

You then also have to mail them every update to the SQL and hope they're handling versioning well

With this approach, next time they attach the DB, they automatically see the latest version
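
Concretely, the receiving side is just something like this (a sketch; the bucket path and view name are invented, and the remote attach assumes the httpfs extension is loaded with S3 credentials configured):

    -- attach the shared, data-free database file straight from S3
    ATTACH 's3://robotaxi-inc/shared/weird_rides.db' AS shared (READ_ONLY);

    -- the view inside it resolves to whatever the parquet layout is today
    SELECT * FROM shared.weird_rides LIMIT 10;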

remram
0 replies
1d16h

Oh I missed the part where they put that DB file on S3 too and sent a link. I thought they were attaching the DB to the email.

Still, I would more readily send people a link to a Gist or playground than a binary DB on S3.

countvonbalzac
9 replies
1d21h

Does DuckDB cache the S3 downloads? Otherwise it could get pretty expensive, no?

ayhanfuat
3 replies
1d18h

I asked this some time ago on their Discord in relation to AWS lambda and the Python client and the answer was that you need to handle caching on your own but it is easy to do with fsspec. I haven’t tried it yet though.

wenc
2 replies
1d10h

Do you have any details on this?

DuckDB over vanilla S3 has latency issues because S3 is optimized for bulk transfers, not random reads. The new AWS S3 Express One Zone supports low latency, but there's a cost.

Caching Parquet reads from vanilla S3 sounds like a good intermediate solution. Most of the time, Parquet files are Hive-partitioned, so it would only entail caching several smaller Parquet files on-demand and not the entire dataset.

wenc
0 replies
22h16m

I really appreciate it! Thanks.

davesque
2 replies
1d20h

If the parquet file includes any row group stats, then I imagine DuckDB might be able to use those to avoid scanning the entire file. It's definitely possible to request specific sections of a blob stored in S3. But I'm not familiar enough with DuckDB to know whether or not it does this.

chrisjc
0 replies
1d15h

DuckDB can do some pushdowns to certain file formats like parquet, but every release seems to be getting better and better at doing it.

Parquet pushdowns combined with Hive structuring is a pretty good combination.

There are some HTTP and metadata caching options in DuckDB, but I haven't really figured out how and when they really make a difference.

akdor1154
0 replies
1d20h

It does do that. I can't answer OP's question about caching though.

orthecreedence
0 replies
1d18h

To my understanding, it does not cache (but I haven't reviewed the code). "Expensive" here really just means expensive in time. If you're running DuckDB on an EC2 instance to query S3 on the same account, it's practically free of any other cost. I wouldn't bother doing it outside AWS unless it's a one-time thing.

Running a simple analytics query on ~4B rows across 6.6K parquet files in S3 on an m6a.xl takes around 7 minutes. And you can "index" these queries somewhat by adding dimensions in the path (s3://my-data/category=transactions/month=2024-05/rows1.parquet) which DuckDB will happily query on. So yeah, fairly expensive in time (but cheap for storage!). If you're just firehosing data into S3 and can add somewhat descriptive dimensions to your paths, you can optimize it a bit.
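
For reference, with that path layout the pruning looks something like this (a sketch; only the path scheme comes from the comment above, the aggregate and column list are made up):

    SELECT month, count(*) AS rides
    FROM read_parquet('s3://my-data/*/*/*.parquet', hive_partitioning = true)
    WHERE category = 'transactions'
      AND month = '2024-05'   -- only matching prefixes ever get fetched
    GROUP BY month;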

elchief
0 replies
1d3h

it doesn't cache, nor does it respect HTTP_PROXY which is kind of annoying

jhoechtl
5 replies
1d22h

Waiting for DDb to be able to read delta tables natively.

noone_important
3 replies
1d21h

I tried to use the extension, but unfortunately I couldn't resolve my problems with it. I always run into errors when I try to execute queries on delta tables.

aleatorisch
2 replies
1d20h

I'm curious what errors you were running into? Mind posting an issue in the repo, or here? Thanks!

noone_important
0 replies
11h47m

Sorry for the late answer. I get the following errors if I try to execute queries on delta tables (accessing them with spark works fine)

InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result Error: IO Error: Hit DeltaKernel FFI error (from: kernel_scan_data_next in DeltaSnapshot GetFile): Hit error: 2 (ArrowError) with message (Invalid argument error: Incorrect datatype for StructArray field "partitionValues", expected Map(Field { name: "entries", data_type: Struct([Field { name: "keys", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "values", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) got Map(Field { name: "key_value", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false))

zie
4 replies
1d17h

For things like this I just export to a sqlite DB file and send that along. This is interesting for AWS/S3 datastores though.

tobilg
3 replies
1d9h

You can do the exact same thing with DuckDB as well...

zie
2 replies
1d4h

I didn't know that (I don't use DuckDB), but SQLite is very standard and anyone can play with it. Query/UX interfaces exist for almost every imaginable platform, even. So I'm still a SQLite recommender for stuff like this.

chrisjc
1 replies
22h44m

DuckDB is really for OLAP/Columnar (analytical workloads) whereas SQLite is more for OLTP/Row-based (transactional).

When you start running aggregations/windows over large amounts of data, you'll soon see the difference in performance.

zie
0 replies
21h30m

Sure, but the OP is using DuckDB as a way to export some smaller amount of data for someone. That's a great use-case for SQLite, from whatever your real DB is, which for the OP is also DuckDB.

brutuscat
4 replies
1d20h

Does it work with some format that supports indexes, like Apache CarbonData, rather than parquet?

https://github.com/apache/carbondata

Jgrubb
3 replies
1d18h

I admit I'm not crazy deep in this space but I'm _pretty_ into this space and I've never heard of Carbon.

Parquet is kind of winning the OSS columnar format race right now.

wenc
2 replies
1d10h

Parquet is the most popular columnar format (owing to support in Spark and various other big data tools, as well as local tools like pandas, polars and duckdb).

It's technically not the very best format (ORC has some advantages), but it's so ubiquitous and good enough -- still far better than CSV or the next best competing format. I have not heard of Carbon -- it sounds like an interesting niche format; hopefully it's gaining ground.

It's the VHS, not the betamax.

stdbrouw
1 replies
1d10h

"Good enough" makes it sound like barely a step up from a CSV file. I'd say its support for various encodings [1] including a great default (dictionary + run length encoding on the indices) and compression algorithms that can be set for each individual column, columnar access, partitioning, a parallelized reader out of the box, in-memory filtering and other ops concurrently with loading in the data (thanks to Arrow) etc. etc. are all really wonderful when working with medium-sized data.

[1] https://parquet.apache.org/docs/file-format/data-pages/encod...

wenc
0 replies
1d

Agreed. On a scale of 10 in terms of current technology, CSV is a 1 while Parquet is 7. ORC is maybe 7.2. But parquet is far more ubiquitous than ORC (I’ve never seen ORC in prod but I also have limited sample sizes)

I’m sure there are more advanced formats.

RyanHamilton
4 replies
1d6h

I recently took a big gamble on DuckDB. I've added it to the core of qStudio a free SQL editor. Now within the editor you can save table results from any database to the local duckdb instance or you can even perform joins amongst any server and I do mean almost any server. Out of the box DuckDB supports mysql/postgresql but I've added support for 20+ other database though it probably doesn't support all their types. If you want to get started with DuckDB qStudio is a good IDE: https://www.timestored.com/qstudio/duckdb-file-viewer#qduckd...
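
Under the hood this builds on DuckDB's ATTACH support; a cross-server join looks roughly like this (a sketch; the connection strings, schemas and tables are invented, and the postgres/mysql extensions need to be installed):

    ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres);
    ATTACH 'database=crm host=localhost user=app' AS my (TYPE mysql);

    -- join a Postgres table against a MySQL table in one statement
    SELECT o.id, c.email
    FROM pg.public.orders o
    JOIN my.crm.customers c ON c.id = o.customer_id;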

netcraft
3 replies
1d4h

qStudio looks very cool and I'm right in your target demo - and I hate to complain about free, but just a bit of feedback: a jar on Mac is a lot of friction to be able to try it. I just spent about 5 minutes downloading and trying to get it to work - Mac says it can't open because it's from an unidentified developer, I went into security to get it to open anyway, it gave me another scary warning, and then it still couldn't launch because I don't have a JVM installed somehow?

I'll probably try again later because it really does look cool and as much as I absolutely love my intellij/datagrip I'm always looking for alternatives and have thought several times about building a tool with these automatic analysis tools built in.

RyanHamilton
2 replies
1d2h

Damn, that's disappointing, as I actually just recorded a Mac demo this week. I got the false impression all Macs have Java, but I guess it was just the one I had. I'm happy to try creating an installer if you are willing to test it. If so please email me ryan A_T timestored.com.

robertlagrant
0 replies
23h43m

Macs come with a weird thing where they have a Java in the path, but when you run it it takes you to a Java download page. It might be worth adding that to your instructions rather than bundling Java yourself?

irq-1
0 replies
1d

Take a look at https://www.hydraulic.dev/

Build self-updating desktop app packages in minutes. Deploy to every OS from any OS. Platform native formats, fully signed and notarized. Electron, Flutter, JVM and native.

And free:

Conveyor is free for open source projects.

robertlagrant
3 replies
1d1h

Changes in format, different partitioning strategies, schema changes – through all of it the receiver’s view remains the same.

I don't understand this - if I start saving those files in a different format how will it continue to work? Why would the view remain the same if I just rename columns, even?

snthpy
2 replies
19h57m

I think what was implied was that the user just references the view, so as long as you update the view in concert with the data format change, the user is none the wiser about the change.
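
As a sketch (reusing the article's bucket name; the view name and the later CSV layout are invented), the producer just re-points the view and the consumer's query never changes:

    -- day 1: the view reads parquet
    CREATE OR REPLACE VIEW daily_rides AS
    SELECT * FROM read_parquet('s3://robotaxi-inc/daily-ride-data/*.parquet');

    -- later: the layout moves to gzipped CSV under a new prefix; only the
    -- view definition changes, consumers keep SELECTing from daily_rides
    CREATE OR REPLACE VIEW daily_rides AS
    SELECT * FROM read_csv_auto('s3://robotaxi-inc/v2/daily-ride-data/*.csv.gz');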

robertlagrant
1 replies
10h32m

Oh - I see. Because the .db file is centralised, you can connect to it from somewhere and do stuff to it. I think I missed that the DB is like that.

Is it like dumping a SQLite database somewhere with a view in it, and connecting over that as well? Or does DuckDB have more magic to transfer less data in the query work?

snthpy
0 replies
9h41m

Yeah, I guess you could equivalently put a SQLite database with a view or virtual table in S3 which would give you the same level of indirection and API abstraction provided by this mechanism.

Where DuckDB will have an advantage is in the processing speed for OLAP type queries. I don't know what the current state of SQLite Virtual Tables for parquet files on S3 is, but DuckDB has a number of optimisations for it like only reading the required columns and row groups through range queries. SQLite has a row oriented processing model so I suspect that it wouldn't do that in the same way unless there is a specific vtable extension for it.

You can get a comparable benefit for data in a sqlite db itself with the project from the following blogpost but that wouldn't apply to collections of parquet files: https://phiresky.github.io/blog/2021/hosting-sqlite-database...

wodenokoto
2 replies
1d13h

Is the `weird_rides.db` file python or some sort of duckdb script?

I'm quite intrigued by DuckDB's ability to read parquet files off of buckets. How good is it at simply ignoring files (e.g., filtering based on info in data file paths)?

rcaught
0 replies
1d13h

I've successfully used its glob support on S3.

mvc
0 replies
1d9h

If the objects in S3 are named according to hive conventions, it will do what you would hope and skip any files whose path can't contain data matching the query.

It will even use range requests in S3 to avoid fetching the entire blob.

See here for the details https://duckdb.org/2021/06/25/querying-parquet.html#automati...

philipwhiuk
2 replies
1d9h

That’s right – most databases no longer have any data in them.

That's a pretty outlandish statement.

rcaught
0 replies
1d8h

In the full context of their statement, many modern databases separate the data storage component. RDS Aurora, etc.

Night_Thastus
0 replies
22h52m

I swear there's a skit this reminds me of, but I cannot for the life of me remember what it is.

filleokus
2 replies
1d11h

How stable is the DuckDB file format nowadays? One annoying thing I've experienced (like a year ago) was when I upgraded the rust bindings that produced my DuckDB files, which then couldn't be opened with DBeaver because they hadn't released a new version yet.

I guess you could run into something similar with this solution?

wodenokoto
0 replies
1d8h

No, because parquet is quite stable.

astronautas
2 replies
21h17m

I am confused...

It's a cool example, but really an antipattern. Nowadays everyone gets that analysts want access to raw data, since they know best which aggregations they need, whereas data engineers stay away from pre-aggregating and focus on building self-service data access tooling. Win-win this way.

How about building a DuckDB-accessible catalog on top of S3? Instead of read_parquet, you would select from tables, which themselves would be mapped to S3 paths, aka external tables.

astronautas
0 replies
3h21m

Not a catalog though, still need to input s3 path...

zokier
1 replies
1d6h

Not saying that this technique doesn't have any use, but the example scenario is pretty weak. Why not just send the analysts the S3 link:

    s3://robotaxi-inc/daily-ride-data/*.parquet
Or just the query:

    SELECT pickup_at, dropoff_at, trip_distance, total_amount
    FROM 's3://robotaxi-inc/daily-ride-data/*.parquet'
    WHERE fare_amount > 100 AND trip_distance < 10

Does the intermediate database and view really provide much value?

Thinking more, I feel this boils down to ownership. Does it make sense for you to own this abstraction layer? Or does it make sense to shift the ownership towards the receiving end, and just provide the raw data.

To me, maintaining this sort of view over the data sounds more like the responsibility of data analysts than of random devs. They are the experts in wrangling data after all. But of course there is no single right or wrong answer either.

astronautas
0 replies
21h16m

This

valval
1 replies
1d12h

Up until this point in my career, I've always wanted to incorporate some of these new age SQL tools into my workflows, but I just never seem to realize the utility.

What's going on here that wouldn't warrant building a processing pipeline that places the data in a more permanent data warehouse and creating all the necessary views there?

rcaught
0 replies
1d8h

This is not showing what you would always do, it's showing something novel that you could do.

unnouinceput
1 replies
1d

Quote: "How do you share that dataset with an analyst?"

...and blah blah about the sharing from a developer perspective.

In reality the analyst is some higher-up who only knows how to import/view CSV in Excel, so that's exactly what they will ask for ("hey zzz, can you send me those daily parcels in a .csv file? thank you").

rcaught
0 replies
16h10m

COPY (SELECT * FROM view_name) TO 'output.csv' (HEADER, DELIMITER ',');

slachterman
1 replies
1d3h

How does authentication work in this scheme? The view uses separate S3 credentials, which are configured per external user consuming the database?

mr-karan
1 replies
1d13h

Curious, why not use httpfs extension and directly read multiple parquet files as demonstrated here[1]?

    SELECT *
    FROM read_parquet([
      's3://bucket/file1.parquet',
      's3://bucket/file2.parquet'
    ]);
[1]: https://duckdb.org/docs/extensions/httpfs/s3api#reading

rcaught
0 replies
1d12h

The article shows dynamically fetching a database, with database level views, that dynamically fetch data.

You're just showing a query that dynamically fetches data.

jewelry
1 replies
1d14h

This feels just like Presto/Trino to be honest

willvarfar
0 replies
1d12h

They are similar in that you have some data stored somewhere, and you can query it as though it were data stored inside a 'conventional' database that has its own storage that it manages itself.

But whereas Presto/Trino/Bigquery/etc are server-based where queries execute on a cluster of compute nodes, duckdb is something you run locally, in-process.

halfcat
1 replies
1d19h

Conceptually is this sort of a database-agnostic SQL view? Seems like these could be stacked to simplify (or complicate?) an entire ETL pipeline.

I haven’t decided where I land on this. In some ways, stacking SQL views looks like it simplifies a bunch of ETL jobs, but I also fear a few things:

* It either breaks catastrophically when there’s a change in the source data

* Fails silently and just passes the incorrect data along

* More challenging to debug than an ETL pipeline where we have a clear point of error, can see the input and output of each stage, etc

* Source control of SQL views seems less great than code. Often when we have too many views, you can’t update one without dropping all of the dependencies and recreating them all

But I also wonder if I feel this way because I know programming better than SQL

banku_brougham
0 replies
1d16h

No, you are right: views have a use case but they get abused as ersatz orchestration, creating a nested structure that multiplies the amount of needed compute.

Edit: now that I've read your question more carefully, I think the S3 data is meant to be managed by other orchestration. This is a quick, easy way to share a data source with an analyst, PM or end consumer.

I do not expect the poster is advocating this as an intermediate stage in a data pipeline.

dangoodmanUT
1 replies
1d20h

Does view creation still list all files? IME even if not queried, the view would do a lot of S3 calls.

vgt
0 replies
23h31m

Shameless plug - MotherDuck[0] is a serverless managed DuckDB data warehouse with some interesting properties:

- Managed storage with zero-copy clone (and upcoming time travel)

- Secure Sharing

- Hybrid Mode that allows folks to combine client (WASM, UI, CLI, etc) with cloud data

- Approved and supported ecosystem of third party vendors

- Since we're running DuckDB in production, we're working very closely with the DuckDB team to improve both our service and open source DDB in terms of reliability, semantics, and capabilities

[0]https://motherduck.com/

(co-founder at MotherDuck)

tobilg
0 replies
1d10h

You can try DuckDB in your browser via DuckDB WASM with https://sql-workbench.com

mukiblejlok
0 replies
10h12m

X n it is my l

dforsber
0 replies
23h18m

Shameless plug - Data Tap[1] is a custom-made Lambda function with embedded DuckDB and an AWS-managed ingestion URL where you can HTTP POST your data as much as you like. It will buffer your data and then use a DuckDB SQL clause to land the data as Parquet on S3.

- Deploy to your own account (BYOC)

- "S3 first": Partitioned and compressed Parquet on S3

- Secure sharing of write access to the Data Tap URL

- 50x more cost efficient than e.g. "Burnhose" whilst also having unmatched scalability of Lambda (1000 instances in 10s).

[1] https://www.taps.boilingdata.com/ (founder)

dforsber
0 replies
23h37m

Nobody says the obvious: Using DuckDB as the Catalog. You can easily do snapshots, and thus also time travel.

clumsysmurf
0 replies
1d19h

DuckDB has Swift bindings, but unfortunately, afaik, nothing official for Android. If anyone has gotten it working on Android I'd love to hear about it.

blyry
0 replies
1d21h

This is a great feature. We've been able to significantly extend the scope and usefulness of our on-prem SQL Cluster for analytics and reporting with PolyBase by building new transactional systems with cheaper postgres, doing ETLs of third-party data to delta tables in azure storage, and then federating access to them with PolyBase so that nobody in the business has to change how they actually query the data. I'm sure in another decade we'll be fully migrated to some cloud platform but for now, federating the queries is a huge win.