HN comments for: Pipe Syntax in SQL

Ciantic

83 replies

5d2h

2024-08-24 16:16:12 UTC

Here is one example from the PDF:

    FROM r JOIN s USING (id)
    |> WHERE r.c < 15
    |> AGGREGATE sum(r.e) AS s GROUP BY r.d
    |> WHERE s > 3
    |> ORDER BY d
    |> SELECT d, s, rank() OVER (order by d)

Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.

andy800

23 replies

5d2h

2024-08-24 16:29:10 UTC

The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.

singron

10 replies

5d1h

2024-08-24 16:34:25 UTC

In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.

0cf8612b2e1e

5 replies

5d1h

2024-08-24 16:50:46 UTC

Indeed. Just as I think git’s N different ways to refer to the same operation was a blunder.

andy800

4 replies

5d1h

2024-08-24 16:57:29 UTC

But pre- and post- aggregation filtering is not really "the same" operation.

0cf8612b2e1e

3 replies

5d1h

2024-08-24 17:01:29 UTC

If I use a CTE and filter the aggregate, feels the same to me.

andy800

2 replies

2024-08-24 18:20:20 UTC

If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.

RaftPeople

1 replies

4d21h

2024-08-24 21:21:58 UTC

HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.

Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:

1-I don't like repeating/duplicating a bunch of complex calcs, easier to just do WHERE in outer query on result

2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)

sgarland

0 replies

4d4h

2024-08-25 13:54:17 UTC

The main (only?) task I routinely use HAVING for is finding duplicates.

andy800

3 replies

5d1h

2024-08-24 16:55:38 UTC

I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.

yen223

0 replies

4d19h

2024-08-24 23:09:37 UTC

I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.

singron

0 replies

4d13h

2024-08-25 05:06:31 UTC

I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write

    SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...

and they decided this was common enough that they would introduce a HAVING clause for this case

    SELECT * FROM ... GROUP BY ... HAVING ...

But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.

Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.

0cf8612b2e1e

0 replies

5d1h

2024-08-24 17:10:04 UTC

I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.

wvenable

9 replies

5d1h

2024-08-24 17:29:49 UTC

The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.

andy800

8 replies

4d23h

2024-08-24 18:53:44 UTC

Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.

wvenable

7 replies

4d23h

2024-08-24 19:01:58 UTC

A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.

I will already use subqueries to avoid issues with HAVING.

magicalhippo

6 replies

4d21h

2024-08-24 21:18:15 UTC

A lot of SQL engines don't support aliases in the HAVING clause

We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.

I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.

wvenable

4 replies

4d20h

2024-08-24 22:09:58 UTC

I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.

magicalhippo

3 replies

4d19h

2024-08-24 23:02:47 UTC

I don't recall all off the top of my head.

One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.

One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.

There were other issues using plain joins, but I can't recall them right now.

RaftPeople

2 replies

4d3h

2024-08-25 15:02:29 UTC

CTE's (at least in MS SQL land) are a syntax level operation, meaning CTE's get expanded to be as if you wrote the same subquery at each place a CTE was, which frequently impacts the optimizer and performance.

I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.

wvenable

1 replies

4d1h

2024-08-25 17:12:48 UTC

If you use temp tables you're subverting the optimizer. Sometimes that's what you want but often it's not.

RaftPeople

0 replies

3d21h

2024-08-25 20:51:47 UTC

I use them on purpose to "help" the optimizer by reducing the search space for query plan ((knowing that query plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).

JoelJacobson

0 replies

4d8h

2024-08-25 10:27:52 UTC

Can you please share the SQL queries? If tables/columns are sensitive, maybe it can be anonymized replacing tables with t1,t2,t3 and columns c1,c2,c3.

yen223

0 replies

4d22h

2024-08-24 20:12:29 UTC

Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?

To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.

cryptonector

0 replies

4d13h

2024-08-25 05:08:45 UTC

You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.

You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.

Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.

But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).

So all in all I agree with you: keep HAVING.

thiht

11 replies

4d23h

2024-08-24 18:38:50 UTC

Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?

PRQL gives me hope that we might finally get something nice some day

akira2501

4 replies

4d23h

2024-08-24 18:39:26 UTC

    SELECT 1+2;

FROM clauses aren't required, and using multiple tables in FROM doesn't seem to work out too well when that syntax is listed first.

yen223

1 replies

4d22h

2024-08-24 19:56:44 UTC

WITH clauses are optional and appear before SELECT. No reason why the FROM clause couldn't behave the same

akira2501

0 replies

4d21h

2024-08-24 20:45:38 UTC

Isn't that strictly for CTEs? In which case, you are SELECTing from the CTE.

thiht

0 replies

4d9h

2024-08-25 08:54:58 UTC

Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.

ElectricalUnion

0 replies

4d13h

2024-08-25 04:48:16 UTC

Beginning with Oracle Database Release 23 [released May 2, 2024], it is now optional to select expressions using the FROM DUAL clause.

scrlk

2 replies

4d22h

2024-08-24 20:14:08 UTC

The initial version of SQL was called "Structured English Query Language".

If the designers intended to create a query language that resembled an English sentence, it makes sense why they chose "SELECT FROM".

"Select the jar from the shelf" vs. "From the shelf, select the jar".

xigoi

1 replies

4d9h

2024-08-25 09:29:12 UTC

“Go to the shelf and select the jar”. You’re describing a process, so it’s natural to formulate it in chronological order.

mr_toad

0 replies

13h51m

2024-08-29 04:40:34 UTC

SQL is a declarative language not a procedural one. You tell the query planner what you want, not how to do it.

gfody

1 replies

4d23h

2024-08-24 18:56:58 UTC

what’s it gonna autocomplete?

otoh if you selected something the from clause and potentially some joins could autocomplete

thiht

0 replies

4d9h

2024-08-25 08:57:39 UTC

Not reliably, especially if you alias tables. Realistically, you need to know what you’re selecting from before knowing what you’re selecting.

parpfish

0 replies

4d22h

2024-08-24 20:07:16 UTC

I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.

But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative

tmoertel

10 replies

5d1h

2024-08-24 16:55:44 UTC

The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)

The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins:

From the paper:

Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.

Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)

quietbritishjim

1 replies

4d22h

2024-08-24 19:43:26 UTC

The point of SQL pipe syntax is that there is no reordering.

If you're referring to this in the comment you're replying to:

Can we call this SQL anymore after this? This re-ordering of things ...

Then they're clearly just saying that this is a reordering compared to SQL, which is undeniably true (and the while point).

tmoertel

0 replies

4d22h

2024-08-24 19:57:35 UTC

The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head because the order in which the the vanilla syntax executes (inside out) is the order in which pipes syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.

aidos

1 replies

4d22h

2024-08-24 20:07:28 UTC

Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.

mrbungie

0 replies

4d18h

2024-08-24 23:42:45 UTC

Is not common at all, is a non ANSI SQL clause that afaik was created by Teradata, syntactic sugar for filtering using window functions directly without CTEs or temp tables, especially useful for dedup. In most cases at least, for example you can't do a QUALIFY in an query that is aggregating data just as you can't use a window function when aggregating.

Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, Clickhouse, and duckdb (only OSS implementation I now). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.

Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.

vendiddy

0 replies

4d11h

2024-08-25 06:35:40 UTC

This is an interesting point.

All these years I've been doing that reordering and didn't even realize!

thwarted

0 replies

4d20h

2024-08-24 21:48:59 UTC

The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins

I already think about SQL like this (as operation on lists/sets), however thinking of it like that, and having previous operations feed into the next, which is conceptually nice, seems to make it hard to do, and think about:

*(the query engine is free to optimize the execution plan as long as the semantics are preserved)

since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.

mattashii

0 replies

4d22h

2024-08-24 19:46:01 UTC

Vanilla SQL [...] QUALIFY is after analytic window functions

Isn't that FILTER (WHERE), as in SELECT avg(...) FILTER (WHERE ...) FROM ...?

cryptonector

0 replies

4d13h

2024-08-25 05:01:05 UTC

The point of SQL pipe syntax is that there is no reordering.

But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.

camgunz

0 replies

4d10h

2024-08-25 08:04:03 UTC

This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.

_a_a_a_

0 replies

4d18h

2024-08-24 23:43:18 UTC

'qualify' is now standard? Thought it was a vendor extension currently.

richbell

8 replies

5d2h

2024-08-24 16:30:33 UTC

IMO having SELECT before FROM is one of SQL's biggest mistakes. I would gladly welcome a new syntax that rectifies this. (Also https://duckdb.org/2022/05/04/friendlier-sql.html)

I don't love the multiple WHEREs.

abelcha

4 replies

4d22h

2024-08-24 20:07:25 UTC

duckDB is what sql should be in 2024

https://duckdbsnippets.com/

xigoi

3 replies

4d9h

2024-08-25 09:31:40 UTC

The very first example on that page is vulnerable to injection.

richbell

2 replies

4d2h

2024-08-25 16:11:39 UTC

Which one?

xigoi

1 replies

4d2h

2024-08-25 16:24:17 UTC

  #!/bin/bash 
  function csv_to_parquet() {     
      file_path="$1"     
      duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);" }

richbell

0 replies

3d16h

2024-08-26 01:59:46 UTC

Eh, in the context of the site and other snippets that seems pedantic.

Could it be run on untrusted user input? Sure. Does it actually pose a threat? It's improbable.

viraptor

0 replies

4d18h

2024-08-25 00:10:12 UTC

Duckdb also supports prql with an extension https://github.com/ywelsch/duckdb-prql

mr_toad

0 replies

13h59m

2024-08-29 04:33:23 UTC

SQL was supposed to follow English grammar. Having FROM before SELECT is like having “Begun” before “these clone wars have.”

RaftPeople

0 replies

4d2h

2024-08-25 15:37:28 UTC

That's a great list of friendlier sql in DuckDB. For most of that list I either run into it regularly or have wanted the exact fix they have.

pajeets

8 replies

5d2h

2024-08-24 16:24:55 UTC

Yes, having |> isn't breaking SQL but rather enhancing it.

I really like this idea of piping SQL queries rather than trying to create the perfect syntax from the get go.

+1 for readability too.

oxym0ron

7 replies

2024-08-24 17:57:31 UTC

Honestly, it seems like a band-aid on legacy query language.

lpapez

6 replies

2024-08-24 18:20:45 UTC

SQL a legacy query language?

In order for a thing to be considered legacy, there needs to be a widespread successor available.

SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.

Calling it legacy is not realistic IMO.

Spivak

5 replies

4d21h

2024-08-24 20:57:51 UTC

I mean kinda? It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."

We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.

It has to be my biggest pain point dealing with SQL, having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit! is unbelievably frustrating.

wvenable

3 replies

4d21h

2024-08-24 21:16:35 UTC

By that definition JavaScript is also legacy.

having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit'

That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.

Spivak

2 replies

4d4h

2024-08-25 14:17:15 UTC

That's not in the domain of SQL.

That's my point, I think we've reached the point where SQL the langage can be more of a hindrance than help because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives it would be so nice.

mr_toad

1 replies

14h4m

2024-08-29 04:27:51 UTC

Sounds like you don’t want SQL at all. Some sort of non-SQL, or not-SQL, never-SQL. Something along those lines.

Spivak

0 replies

4h43m

2024-08-29 13:48:34 UTC

That's the thing though, I still want my data to be relational so NoSQL databases don't fit the bill. I want to interact with a relational database via something other than the SQL language and given that this language already exists (Postgres compiles your SQL into an IR that uses these primitives) I don't think it's a crazy ask.

lpapez

0 replies

4d20h

2024-08-24 21:34:08 UTC

It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."

I don't think that definition of legacy is useful because so many things which hardly anyone calls "legacy" fit the definition - for example: Javascript as the web standard, cars in cities and bipartisan democracy.

I think many of us would say that that none of these is an ideal solution for the problem being solved, but it's what we are stuck with and I cannot think anyone could call it "legacy systems" until a viable successor is widespread.

randomdata

2 replies

5d2h

2024-08-24 16:23:00 UTC

> Can we call this SQL anymore after this?

Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?

euroderf

0 replies

5d1h

2024-08-24 17:06:37 UTC

GoogleSQL

EssGyooGell: A Modest Proposal

esafak

0 replies

5d1h

2024-08-24 17:02:30 UTC

You show a good example. Many people would call that SQL, and if pipes become popular, they too might simply be called SQL one day.

nextaccountic

1 replies

4d8h

2024-08-25 10:17:13 UTC

At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.

RaftPeople

0 replies

4d3h

2024-08-25 15:27:06 UTC

And a simple keyword that does a GROUP BY on all columns in select that aren't aggregates, just a syntax level macro-ish type of thing.

jmull

1 replies

5d1h

2024-08-24 16:42:35 UTC

The proposal here adds pipe syntax to SQL.

So it would be reasonable to call it SQL, if it gets traction. You want to see some of the big dogs adopting it.

That should at least be possible since it looks like it could be added to an existing implementation without significant disruption/extra complexity.

cryptonector

0 replies

4d13h

2024-08-25 05:11:10 UTC

There may be trademark issues, but even if not, doing sufficient violence to the original thing argues for using a new name for the new thing.

hyperman1

1 replies

2024-08-24 17:46:36 UTC

This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:

  query | operator

Which results in a new query that can be piped again. So e.g. this would be valid too:

  SELECT id,a,b FROM  table WHERE id>1
  |WHERE id < 10

Personally, I can see this fix so much SQL pain.

larodi

0 replies

2024-08-24 18:28:38 UTC

okay, now I can see why this so much reminds of CTE

specialist

0 replies

4d22h

2024-08-24 19:43:49 UTC

My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.

The changes to my SQL grammar to accomodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (eg zero-or-more, in any order), removing the misc dialect specific alts, simplifying my grammar a bit.

Drop the pipes and this would be terrific.

sklivvz1971

0 replies

4d22h

2024-08-24 20:03:58 UTC

We can call it "Linq2SQL" and what a disaster it was...

setr

0 replies

4d22h

2024-08-24 19:45:37 UTC

this is consistent, non-pseudo-english, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think would be fundamentally incorrect to call it SQL

Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changes up some keywords

racecar789

0 replies

5d1h

2024-08-24 17:14:28 UTC

In that example, "s" has two meanings: 1. A table being joined. 2. A column being summed.

For clarity, they should have assigned #2 to a different variable letter.

iblaine

0 replies

4d22h

2024-08-24 20:16:32 UTC

Looking at this reminds me of Apache Pig. That’s not a compliment.

ibash

0 replies

5d2h

2024-08-24 16:18:37 UTC

Yes we can call it sql.

Language syntax changes all the time. Their point is that sql syntax is a mess and can be cleaned up.

gajus

0 replies

2d23h

2024-08-26 18:55:47 UTC

What's the SQL equivalent of this?

extr

0 replies

5d2h

2024-08-24 16:19:02 UTC

Not bad, very similar to dplyr syntax. Personally i’m too used to classic SQL though and this would be more readable as CTEs. In particular how would this syntax fair if it was much more complicated with with 4-5 tables and joins?

Glyptodon

0 replies

5d1h

2024-08-24 17:06:03 UTC

In the example would there a difference between `|> where s > 3` and `|> having s > 3` ?

Edit: nope, just that you don't need having to exist with the pipe syntax.

simonw

24 replies

5d2h

2024-08-24 16:00:38 UTC

Google: you are a web company. Please learn to publish your research papers as web pages.

orangepanda

9 replies

5d2h

2024-08-24 16:10:55 UTC

I expected to see some eldritch css monstrosity, but no, its just a pdf. A well formatted one, at that.

What’s your issue there?

simonw

7 replies

5d2h

2024-08-24 16:13:43 UTC

Reading two column PDFs on a mobile phone sucks.

Plus I can't use web tools, like "Read this page" in Mobile Safari.

And copying and pasting is harder.

And I can't link to individual sections.

I'm honestly baffled by people who prefer PDFs for this kind of information. Are they printing them out on paper and going at them with a highlighter or something?

tmoertel

4 replies

5d1h

2024-08-24 17:15:36 UTC

Just my personal take, but when I have to read something carefully, I find it easier to do on paper.

For example, I recently wrote an article about taking random samples using SQL. Even though I was writing it for my blog, which is HTML, I proofread the article by rendering it as a PDF doc, printing it out, and reviewing it with a blue pen in hand.

What surprised me is that I also found it easier to review the article on the screen when it was in PDF format. TeX just does a way better job of putting words on a page than does a web browser.

Actually, if you want to do the comparison yourself, I'll put both versions online:

HTML: https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

PDF: https://blog.moertel.com/images/public_html/blog/pix-2024060...

I don't think either version is hard to read, but if I had my choice, I'd read the PDF version. But maybe that's just me.

Let me know which you prefer.

LeonB

3 replies

2024-08-24 18:16:42 UTC

On mobile phone, as a reader with photophobia, the pdf causes physical pain, and is illegible, whereas the html is perfectly readable via reader mode (where text can be enlarged and dark mode settings are respected.

tmoertel

2 replies

2024-08-24 18:28:16 UTC

Thanks for sharing this perspective! HTML is a lot more accessible in general than PDF documents.

QQ: Do the math formulas render properly in reader mode for you? (On my test with Chrome, the answer seems to be no.)

LeonB

1 replies

4d23h

2024-08-24 18:54:53 UTC

I don’t think the formulas are rendered in reader view. (iOS Safari)

In the browser (iOS Safari) I use an extension (dark reader) to give it a dark theme, and the formulas render just fine there.

LeonB

0 replies

4d18h

2024-08-25 00:22:51 UTC

(Minor correction, the plugin is called ‘dark night’)

samatman

0 replies

4d18h

2024-08-25 00:20:21 UTC

Personally, it's sending it to GoodReader on a 13" iPad.

I don't know that I'd go so far as to say I 'prefer' this, but there are a lot of PDFs out there, this works fine, and it's a nice change of pace given how much time I spend in front of a monitor / laptop screen.

lrem

0 replies

5d2h

2024-08-24 16:27:24 UTC

Indeed. That’s the easiest way to show your students/professor/coworkers which are the crucial bits.

VoodooJuJu

0 replies

5d2h

2024-08-24 16:15:28 UTC

PDF content is not web-indexed. Their Google Scholar link doesn't even work either.

jml7c5

6 replies

5d1h

2024-08-24 17:27:07 UTC

I really wish that browsers had developed first-class support for offline web page bundles. There's no way to share a page that is guaranteed to be self-contained and not hit the network, especially if you want to use javascript. It's particularly frustrating since browsers supported offline mode as far back as the 90s; it just needed to be combined with support for loading from zipped folders.

That simple change would've largely solved the academic paper problem decades ago. It's bizarre that it still isn't a feature.

simonw

3 replies

5d1h

2024-08-24 17:29:49 UTC

One option her is to inline all assets - images etc - as bas64 URIs. The HTML page ends up huge but it will at least be self-contained.

jml7c5

2 replies

2024-08-24 17:42:35 UTC

Yes, but it's not guaranteed to be self-contained. I wouldn't want to open a random HTML file knowing that it could phone home, or that the content might break one day without me realizing. There's a practical and psychological aspect to sharing `steves_paper_2014.html` versus `steves_paper_2014.offlinesitebundle`. The latter feels safe and immutable.

irq-1

1 replies

4d22h

2024-08-24 19:40:34 UTC

What you want is an HTML tag or response header that restricts network access, which the browser can then enforce. Offline or a list of allowed domains, this would be great for security in general. Not so great for advertisers though.

mewpmewp2

0 replies

11h1m

2024-08-29 07:30:47 UTC

Then you have to verify that the tag is there, right? But if it has another extension like .offlinebundle you can know thay browsers will not make any extra requests.

abelcha

1 replies

4d22h

2024-08-24 20:19:01 UTC

this is what you are looking for i believe (web archive format) https://en.wikipedia.org/wiki/WARC_(file_format)?oldformat=t...

jml7c5

0 replies

4d21h

2024-08-24 21:17:12 UTC

Browsers don't have native support for opening WARC. It doesn't solve the safety problem either: you can still construct a WARC that phones home, AFAIK.

It's a great format for the problem it solves, but if browsers supported offline-only files the container format wouldn't (and shouldn't) need to be that complicated.

estebarb

4 replies

5d2h

2024-08-24 16:15:57 UTC

Conference papers use templates. It's not like Google can choose.

simonw

3 replies

5d2h

2024-08-24 16:16:43 UTC

They can choose to publish it in both HTML and PDF.

lrem

0 replies

5d1h

2024-08-24 16:32:56 UTC

Maybe. Maybe not. Depends on the publisher’s terms.

goodfight

0 replies

16h37m

2024-08-29 01:54:51 UTC

if you replace in an arxiv.org link with ar5iv.org it will auto translate to html if possible

DonaldPShimoda

0 replies

5h10m

2024-08-29 13:22:26 UTC

Translating LaTeX to HTML is not a straightforward process, unfortunately. Many people have tried to implement automated translation systems, but nothing has really worked out yet.

I think it's unfair to expect the research team to invest additional hours in learning how to make good websites, so to solve your problem would require hiring additional talent whose only job is to translate academic PDFs into accessible web pages. I don't think that's a bad idea, and certainly Google has the funds to do something like that, but I don't imagine they'd find it to be a good use of money. Accessibility is an afterthought for most major companies these days.

lrem

0 replies

5d2h

2024-08-24 16:24:44 UTC

That’s not a blog post. This is an academic preprint, I imagine the format is as prescribed.

irrelative

0 replies

5d2h

2024-08-24 16:32:22 UTC

Seriously. It’s not like that was the actual purpose of html or anything.

mocamoca

12 replies

4d22h

2024-08-24 20:27:19 UTC

Question for people writing highly complex SQL queries.

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?

Syntax is one thing, but actual performance (and safety/maintenance) is another deal?

sagarm

4 replies

4d21h

2024-08-24 21:03:16 UTC

I've worked on a few SQL systems used for analytics and ETL.

My users fell into (for the purposes of this discussion) three categories:

1. Analysts who prefer sheets

2. Data scientists that prefer pandas

3. Engineers who prefer C++/Java/JavaScript/Python

I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.

Even as a query engine developer, I would prefer using a SQL engine. Performing even the basic optimizations a modern engine would perform -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc -- would be at least a week of work for me. A bit less if I built up a large library to assist.

xpe

3 replies

4d14h

2024-08-25 03:37:12 UTC

Re #2: I prefer https://pola.rs over Pandas

sagarm

2 replies

4d12h

2024-08-25 06:03:24 UTC

I've heard great things about Pola.rs performance. To get there, they have a lazy evaluation so they can see more of the computation at once, allowing them to implement optimizations similar to those in a SQL engine.

xpe

1 replies

2d5h

2024-08-27 12:44:10 UTC

In the early days, even as I appreciated what Pandas could do, I never found its API sane. Pandas has too many special cases and foot-guns. It is a notorious case of poor design.

My opinion is hardly uncommon. If you read over https://www.reddit.com/r/datascience/comments/c3lr9n/am_i_th... you will find many in agreement. Of those who "like" Pandas, it is often only a relative comparison to something worse.

The problems of the Pandas API were not intrinsic nor unavoidable. They were poor design choices probably caused by short-term thinking or a lack of experience.

Polars is a tremendous improvement.

sagarm

0 replies

1d17h

2024-08-28 00:55:41 UTC

Hey, I agree with you.

On eager vs lazy evaluation -- pytorch defaulting to eager seemed to be part of the reason it was popular. Adding optional lazy evaluation to improve performance later seems to have worked for them.

wvenable

1 replies

4d20h

2024-08-24 21:32:36 UTC

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

As others have said, yes. In most cases, the more complex the query the better the result.

Doesn't working without reusable blocks / tests / logs make development harder?

SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want and then then the RDBMS goes off and finds the most performant way to retrieve the data that matches that shape. It doesn't compare well with procedural programming.

The closest "languages" that comparable to SQL are HTML and CSS. However you manage HTML and CSS is also how you can manage SQL.

crazygringo

0 replies

4d20h

2024-08-24 22:22:12 UTC

> Doesn't working without reusable blocks / tests / logs make development harder?

SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want...

Exactly this. Generally speaking, your SQL queries won't have "bugs" the way that you can create bugs when writing a function.

The challenging parts of building a complex query are usually 1) getting it to function at all (just being a valid SQL statement for your tables that gives you your desired output fields), and 2) making sure it runs performantly on realistically large table sizes (usually in milliseconds as opposed to seconds), which may involve rewriting things like joins vs. subqueries and/or adding indexes.

A lot of bugs in functions come from edge cases or different combinations of paths through code or unexpected combinations of parameter values or math formula errors or whatever... but a SQL query won't usually really have any of those things. It's just a single transformation that basically either works or doesn't.

fifilura

1 replies

4d21h

2024-08-24 20:43:15 UTC

One reason SQL has become more popular lately is as an API for map/reduce.

Before you would write a Java/c++ class that would do the map/reduce job for you distributed over 100s of CPUS. And you would feel like you were on the bleeding edge doing innovative stuff.

Turns out that SQL is a perfect API for map/reduce.

Everything you write in SQL can be transformed into a massively parallel job. And you don't even know about it.

This is the secret behind BigQuery and Trino/Presto/Athena.

mr_toad

0 replies

13h19m

2024-08-29 05:12:32 UTC

If your data is relational. SQL can get pretty ugly with semi-structured data, graph traversal etc.

javcasas

0 replies

4d21h

2024-08-24 20:36:51 UTC

I'm not interested in loading all the data in memory, or swapping it out if it doesn't fit, but also it may be a lot of data, and just moving it over the network is hard.

I have tests. I have reusable blocks (SQL functions, WITH blocks and views). I don't have logging though.

I can put the result in a non-materialized view and have it update in real time as the data changes. Or I can toggle it to materialized view and now it's snapshotted data.

Finally, views that depend on views that depend on views get automatically optimized by the query planner. You need a lot of very tricky custom code to start approaching that.

crazygringo

0 replies

4d21h

2024-08-24 20:48:51 UTC

Performance. A million times performance.

If I need to join a million rows to another table and then calculate an aggregate value, and do it all in a single query, it's fast. It might only take 0.01 seconds.

Whereas if I make separate queries to retrieve a million rows, and a million rows from another table, then it's incredibly slow just due to the data transfer. No matter how conceptually easier another language may be. So it might take 30 seconds for repeated sequential serialization and network and deserialization.

And even if you're looking up individual rows (not a million), with lots of joins that would be lots of round-trips to the database which multiplies latency and load -- so even if you can still get your final result quickly (e.g. 0.01 seconds rather than 0.001 seconds), the load you can handle drops by the same multiple (e.g. you can now only serve 10% as many users).

The general rule of thumb with databases is that they can be super-fast when everything is done on the database machine in a single query and your final result is a small amount of data (assuming everything is indexed properly and the query is written properly). But they become super-slow when you're doing a bunch of queries in a row, and where those intermediate queries can return massive amounts of data, or (even worse) need to send back massive amounts of data.

SahAssar

0 replies

4d21h

2024-08-24 20:33:51 UTC

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

Yes. With knowledge of the data (like indices) and statistics of the data it is usually very much more efficient than piping the data to another process to handle the same job.

Of course that requires you to write a good query and know how to tell what makes a good query, but if your data is in a relational database it is usually more efficient to do your filtering/matching/aggregating there.

slt2021

10 replies

5d2h

2024-08-24 16:13:54 UTC

reminds me PRQL https://prql-lang.org

and SPL from Splunk

mark_and_sweep

7 replies

5d1h

2024-08-24 16:35:01 UTC

Reminds me of KQL (Kusto Query Language) by Microsoft

LeonB

6 replies

2024-08-24 18:28:04 UTC

Yeh, when I saw the example above I thought it was KQL.

But they do sort of acknowledge it in the paper. Eg on the first page it says:

…we present a solution – adding pipe-structured data flow syntax to SQL. This makes SQL more flexible, extensible and easy to use. This paradigm works well in other languages like Kusto’s KQL[5]

Strange typo though, to say “Kusto’s KQL” instead of “Microsoft’s KQL”

Kusto is allegedly named after (sort of in reference to) Jacques Cousteau, so “Kusto’s” doesn’t make sense.

_huayra_

2 replies

4d23h

2024-08-24 19:02:23 UTC

Does KQL still enforce no blank lines in the `let` clauses to the expression?

When I last used KQL, it was infuriating that I could create my `let` clauses in chunks separated by whitespace because a blank line would be considered a terminated statement (and Kusto would say "Hey where's your expression?!"). This meant every Kusto file was a sea of text with no clear differentiation between subsequent clauses. I ended up using 3 blank comment lines as a "fake empty line" just to maintain my sanity.

Again, hope they fixed that by now...

beoberha

0 replies

4d22h

2024-08-24 19:53:01 UTC

I’m nearly positive that’s an editor “feature”. If you submit queries via the API, you can have blank lines.

LeonB

0 replies

4d23h

2024-08-24 19:27:17 UTC

I’m not sure if that’s fixed, it seems pretty much “by design”, with the way the editor (in the browser) selects the “current query” for you.

Nathanba

2 replies

4d23h

2024-08-24 19:12:23 UTC

unfortunately KQL doesn't seem to have INSERT, UPDATE etc. support, it seems to be a pure query language for querying. Unless this strange different .insert syntax is what they intended for their language from the start? I don't know: https://learn.microsoft.com/en-us/kusto/management/data-inge...

trobertson

0 replies

4d11h

2024-08-25 07:12:46 UTC

unfortunately KQL doesn't seem to have INSERT, UPDATE etc.

(Disclaimer: I'm an engineer at Microsoft, and I use Kusto basically every day)

This seems to me to be a deliberate design choice. Microsoft doesn't want engineers mutating the databases by hand. There are mechanisms to do that (mostly outside of Kusto, and usually to resolve privacy incidents), but the common case of querying is not supposed to allow for arbitrary changes to the database.

LeonB

0 replies

4d23h

2024-08-24 19:24:50 UTC

Good question, I don’t know i’ve only used it for querying.

For SQL to Have a good/improved syntax for insert/update/delete/merge would be very nice (in theory)

wizardist

0 replies

5d1h

2024-08-24 16:38:47 UTC

Also OpenSearch PPL.

feldrim

0 replies

2024-08-24 18:18:26 UTC

BTW, PRQL is mentioned in the article as well. In sum, they decided that instead of a new language against SQL, extending SQL with pipes would be better for user experience, learning curve, etc.

tmoertel

9 replies

5d1h

2024-08-24 16:40:48 UTC

I'm glad to see pipe syntax moving forward! In my time at Google, I wrote a lot of SQL queries that required a long sequence of intermixed filtering, grouping, aggregation, and analytic window operations. But the ordering of those operations rarely matched SQL's inside-out syntax, so I commonly had to chop up my logic and glue it into multiple chained CTEs. This busywork made my code's logic harder to follow and was frustrating, especially since I was familiar with alternative coding models, such as R's dplyr, that would have allowed me to express my logic as a simple linear pipeline.

I hope SQL pipes become commonplace so that I can use them anywhere I have to write SQL queries.

geertj

6 replies

4d21h

2024-08-24 21:16:24 UTC

What’s wrong with CTEs though? I have never thought of them as busywork and start with them if I know my query is going to be a multi-step process. I already thought them as “pipes” (in the abstract, not related to this paper). If the query flow is linear so is the sequence of CTEs.

The only slightly annoying thing with CTEs is that you always have to name them. You might consider this an overhead if the flow is strictly linear. But when the flow is not linear (eg when doing two different aggregations over the same data, to join them later) you need a way to refer to the intended CTE, so always requiring a name does add some consistency and doesn’t seem a big deal overall.

magicalhippo

3 replies

4d21h

2024-08-24 21:24:04 UTC

What’s wrong with CTEs though?

Depends on DB engines I suppose. I've come across that certain operations were not allowed in CTEs, and they can be an optimization barrier.

However if your query is dynamically modified at runtime, then CTEs can be a no-go. For example, we have a grid component which first does a count and then only selects the visible rows. This is great if you have expensive subselects as columns in a large table. However to do the counting it turns the main query into a sub-query, and it doesn't handle CTEs.

geertj

2 replies

3d3h

2024-08-26 14:59:02 UTC

Understood. I should have asked my question a bit more specifically: what's wrong with CTEs that wouldn't be an issue with this new pipe syntax. I briefly scanned the paper and it appears there aren't any specific benefits to the pipe syntax that would make optimization easier. So we can expect that if a SQL engine doesn't optimize CTEs well it would likely have the same limitations for the pipe syntax.

Section 2.1.4 the paper lists the benefits of the pipe syntax over CTEs, and they are all based on ergonomics. As someone who has never had issues with the ergonomics of CTEs I must say I am not convinced that proposed syntax is better. It may be that I've been doing SQL for so long that I don't see its warts. Overall SQL feels like a very well designed and consistent language to me. The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.

tmoertel

1 replies

23h25m

2024-08-28 19:06:31 UTC

The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.

It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output. It's still purely functional, but it has the advantage of making the execution order obvious. That is, the order is a linear top-down flow, not the inside-out flow implicit in vanilla SQL. Further, when your wanted flow doesn't match vanilla SQL's implicit ordering, you don't have to invent CTEs to wire up your flow. You just express it directly.

As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items. You must create a CTE to hold the top items and then wire it into a second SELECT statement to compute the stats. That's busywork. With pipe syntax, there's no need to invent that intermediate CTE.

geertj

0 replies

16h4m

2024-08-29 02:28:15 UTC

It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output.

Maybe I used the wrong term. In my mental model, the query planner decides the order in which the query is evaluated based on what table stats predict is most efficient query plan, and I actually don't really want to think about the order too much. For example, if I create a CTE, I don't necessarily want it to be executed in that order. Maybe a condition on the later query can be pushed back into the earlier CTE so that less data can be scanned.

I will admit that technically there should be no difference in how a query planner handles either. But to me the pipe syntax does not hint as much at these non-linear optimizations than CTEs do. I called the CTE syntax more functional as it implies less to me.

but it has the advantage of making the execution order obvious.

So we're back to ergonomics which I just never had an issue with...

As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items.

Could I not compute the stats over all values, then order and limit them, and depend on the query planner to not do the stat calculation for items outside the limit? If the order/limit does not depend on a computed statistic that should be possible? Or does that not happen in practice?

tmoertel

0 replies

4d20h

2024-08-24 22:17:09 UTC

CTEs are not inherently busywork. I rather like them. What is busywork is having to chop a linear flow of operations into chunks and then wrap those chunks in CTEs that you must wire together. All this, simply because the SQL syntax doesn't let you express that flow directly.

RaftPeople

0 replies

4d2h

2024-08-25 15:42:03 UTC

What’s wrong with CTEs though?

At least in SQL Server CTE's are syntax level, so multiple uses of a CTE in a query causes it to get expanded in each of those places, which typically increases the complexity of the query and can cause issues with the optimizer and performance.

svat

0 replies

4d5h

2024-08-25 13:05:37 UTC

The paper's section 5.1 (“Usage at Google”) and its Figure 4 (page 9) shows the usage growing steadily, “despite limited documentation and incomplete tooling” — of course 1600 users is still small relative to the number of SQL users at Google, and it's possible the growth will plateau at some point, but I for one adopted pipe syntax the very day I saw it (even sent an excited email to their mailing list), and have been using it ever since. As it's an extension, it's always possible to use regular SQL for parts of the same query (copying others' queries etc), but often I find myself rewriting queries into pipe SQL and find it significantly simplifies them (CTEs can be removed, etc).

humanfromearth9

0 replies

3d5h

2024-08-26 13:28:28 UTC

I upvote you for using CTEs. It can be so powerful.

10000truths

9 replies

2024-08-24 17:51:52 UTC

The pipeline syntax as presented is nicer than the status quo, but I'd prefer a syntax that models query execution as a directed graph of operations. Doing so would not only make some of the more complex SQL query constructs much more straightforward to represent:

* Joins can be modelled as a "cross-referencing" operation that consume two (or more) data streams and produce a single data stream

* CTEs can be modelled as producing multiple data streams

* Recursive CTEs can be modelled as cycles in the execution graph

almostgotcaught

7 replies

4d23h

2024-08-24 18:50:15 UTC

a directed graph of operations

What syntax do you know that can represent a dag in text?

RyanHamilton

2 replies

4d22h

2024-08-24 20:30:17 UTC

A = select * from tbla

B = select * from tblb

C = select * from A join B

wtetzner

0 replies

4d21h

2024-08-24 20:59:23 UTC

I guess CTEs already provide that (even if they're a bit clunky).

    WITH a AS (select * from tbla),
    b AS (select * from tblb)
    select * from a join b

almostgotcaught

0 replies

4d20h

2024-08-24 21:38:22 UTC

Yes we already have that, it's called `async def`.

mr_toad

1 replies

13h33m

2024-08-29 04:58:35 UTC

Something like the DOT language used in GraphViz

https://graphviz.org/doc/info/lang.html

almostgotcaught

0 replies

12h25m

2024-08-29 06:07:06 UTC

Yes DOT (and the other UML whatever languages) are absolutely the only extant examples that make an attempt. But again, if you look at DOT you'll see it doesn't actually do anything syntactically - it just has syntax for edge lists.

cha42

1 replies

4d22h

2024-08-24 20:30:55 UTC

Any syntax with a let operator to name stuff or a lambda abstraction.

almostgotcaught

0 replies

4d20h

2024-08-24 21:36:53 UTC

That only gives you trees not DAGs - you can't do fan-in (there's no way to "share" let bound names).

aranke

0 replies

4d22h

2024-08-24 20:05:17 UTC

Check out Substrait, it sounds like what you’re describing.

ku1ik

6 replies

5d2h

2024-08-24 16:14:37 UTC

I would welcome this! I’ve been using SQL for over 20 years and I still sometimes have trouble expressing certain queries, which in my head are well understood. I typically get there but I wish the amount of trial and error was shorter, and the process of turning what I have in my head inside out disappeared.

atum47

3 replies

5d2h

2024-08-24 16:16:34 UTC

Same here. I often connect to mariadb via console and try my queries there. I confess that I've used gpt to help me with some hairy ones and it worked quite good.

vinnymac

2 replies

5d2h

2024-08-24 16:29:57 UTC

It's difficult for me to tell how sincere your confession was, I am guessing you meant it in jest.

But it is worth stating nonetheless. No one confesses when they search StackOverflow for something they've solved 100 times before, or when they ask Google a question a dozen times a day. Asking ChatGPT for some insight should equally not be considered sinful :)

pajeets

0 replies

2024-08-24 17:43:28 UTC

I guess I'm shameless for generating 98% of my SQLs

atum47

0 replies

4d3h

2024-08-25 15:02:02 UTC

There are variations between mysql and mariadb, I've used mysql for a long time now, when I switched to mariadb I found those differences. I find writing complex queries on the console first and then committing it to the code a good strategy. I can see the results right away and some error that I might have done.

pajeets

0 replies

5d2h

2024-08-24 16:27:17 UTC

Absolutely as soon as I start doing multiple joins and throwing up all sorts of trailing syntaxes to further refine/filter/group I get lost

But piping is both intuitive and very easy to understand without having to get everything right from the beginning.

I really hope Postgres 16 can implement pipes, might be interesting to be able to use triggers and functions to pipe to and from, actually might simplify that aspect of it too.

ibash

0 replies

5d2h

2024-08-24 16:22:04 UTC

I do find common table expressions (eg with clause) lets me express what I want way easier.

Probably because it makes it easy to express a bunch of operators that are then pipelined…

DaiPlusPlus

6 replies

5d2h

2024-08-24 16:16:31 UTC

The first-page of the paper has 13 co-authors listed - but all with the same affiliation ("Google, Inc") - so this is ultimately a single-vendor making a unilateral proposal to break with the past - which means I'm confident this proposal won't be gracing the pages of the ISO/IEC 9075 (ISO SQL) standards in my lifetime - no matter how badly we all need QoL improvements to SQL.

...okay, if I dial-back my feelings of resignation to mediocrity, then I'll admit that Google probably does have enough clout to make this go somewhere - but they'd need to add this to all their database offerings (BigQuery, Spanner, Firebase's SQL mode) and contribute patches to Postgres and MySQL/Maria - maybe after Microsoft relents a decade later to add it to MSSQL we'll maybe start to see Oracle's people refer to it vaguely as a nice-to-have they'll implement only after they start losing more blue-chip customers[1].

Also, it's giving me M (Excel PowerQuery) vibes too.

-------

[1]For context, Oracle's DB lacked a `bit`/`bool` column type for the past 40 years until last year. People had to use `char(1)` columns with CHECK constraints to store '0'/'1' - or worse: 'T'/'F' or 'Y'/'N' (see https://stackoverflow.com/a/3726846/159145 )

0cf8612b2e1e

4 replies

5d1h

2024-08-24 16:46:40 UTC

… People had to use `char(1)` columns with CHECK constraints to store '0'/'1' - or worse: 'T'/'F' or 'Y'/'N'

If you are truly blessed you get to see all of these in a single database. Also “Yes”/“No” with that specific casing.

orthoxerox

2 replies

2024-08-24 17:32:39 UTC

"Check constraints make the database run slower, just don't generate any values other than YN in the application layer"

You can guess how many foreign keys that database had.

datadrivenangel

1 replies

4d23h

2024-08-24 18:39:11 UTC

Must have been super fast though

DaiPlusPlus

0 replies

4d20h

2024-08-24 22:08:27 UTC

Not just fast, it's web-scale

https://www.youtube.com/watch?v=b2F-DItXtZs

DaiPlusPlus

0 replies

5d1h

2024-08-24 17:00:53 UTC

Not to be outdone: https://thedailywtf.com/articles/what_is_truth_0x3f_

bhawks

0 replies

2024-08-24 17:39:12 UTC

And to be fair it is not like any database implementations implement the entire spec, or that the spec itself is nearly as long as the C++ but still very underspecified.

Piped SQL fits in perfectly with the overall SQL pot-luck buffet! I for one welcome Google to the table, enjoy the language that works everywhere and nowhere but is the best there is.

gfody

5 replies

4d23h

2024-08-24 19:09:16 UTC

I seriously wonder if the people who are so adament that sql is flawed have spent as much time using at as they have trying to "fix" it. After 20 years of sequeling I have come to believe that this language is so far ahead of its time that we're only just beginning to see what a proper tooling for it looks like. Azure Data Studio w/Copilot makes starting queries with "select" the most natural thing in the world and this pipe syntax is barbaric in contrast.

parpfish

0 replies

4d22h

2024-08-24 20:15:11 UTC

I think that engineers and analysts each have very different relationships with SQL.

When I was doing data science, all the other DS folks would be perfectly content to read and write queries that were hundreds of lines long. There were plenty of minor bits to pick, but it was a great lingua franca for describing data processing.

But engineers hate SQL because they generally only need a tiny little subset of the feature to enable transactional data updates. So they write an ORM to do the subset of SQL they need and never get the opportunity to be indoctrinated into the SQuLt

cl3misch

0 replies

4d22h

2024-08-24 19:55:56 UTC

I think their claim isn't that it's impossible to be efficient in existing SQL but rather that pipe syntax is more natural and approachable to a lot of people?

brikym

0 replies

4d18h

2024-08-25 00:13:45 UTC

I disagree. KQL is much nicer. It reads better having the source first and this also offers autocompletion in the editor.

beart

0 replies

16h29m

2024-08-29 02:03:08 UTC

After 10 years of sequeling, I disagree. Maybe I'll feel differently in 10 more years, but I somehow doubt it.

Using an AI model that ate up half the internet to produce an efficient autocomplete is one path... Improving the language is another...

abelcha

0 replies

4d22h

2024-08-24 20:09:15 UTC

well maybe people dont want to suffer for 20 years

sklivvz1971

3 replies

4d22h

2024-08-24 20:02:29 UTC

SQL doesn't have problems. It's arguably one of the oldest and most successful languages of all times.

Some developers have problems, because they don't understand its abstraction. Periodically they try to re-implement feature to "fix" SQL. Every time it's an absolute disaster.

I realize this is how the world works, but it's just a waste of time.

tmoertel

2 replies

4d22h

2024-08-24 20:10:34 UTC

Some developers have problems, because they don't understand its abstraction.

I think the point of this new syntax is to lower the cost of understanding the abstraction. You keep the familiar clause syntax but let clauses be chained into pipelines in which the actual (semantic) execution order is the same as the written order.

sklivvz1971

1 replies

4d22h

2024-08-24 20:14:45 UTC

I think this new syntax is more imperative (you tell the DB "how" to get you the data). The old syntax is more declarative (you tell the DB "what" you need).

The big advantage is that the SQL engine can optimize the how if you focus on the what -- especially after so much research has been don building query compilers.

If you want to focus on the "how" then perhaps a "bare bones" DB like Redis is more in line with this kind of thinking.

creatonez

0 replies

4d21h

2024-08-24 20:34:59 UTC

I think this new syntax is more imperative (you tell the DB "how" to get you the data). The old syntax is more declarative (you tell the DB "what" you need).

I think this syntax could accidentally give the impression that this is what's happening, but it's still an abstraction thanks to the optimizer. The "how" a pipe syntax describes roughly aligns with how a human would naturally want to construct a machinated process to sort through a file cabinet. Which may have been why it's been such a big hit for Unix and functional programming, despite poorly mapping to the machine model before the optimizations of modern OSes and functional compilers came along.

Any choice whatsoever might give a false impression of the machine model. Maybe SQL is doing the right thing by essentially having its syntax elements in random order, thwarting any attempts by developers to relate what they are typing to what is actually happening, but I think the authors of this paper are right in their desire to make it more ergonomic.

typedef_struct

2 replies

4d19h

2024-08-24 23:07:50 UTC

You can tell the authors realized this was a bad idea when they had to add the 'OVER' keyword, which isn't documented and hardly mentioned in the paper.

roryokane

0 replies

4d16h

2024-08-25 01:49:41 UTC

ZetaSQL’s docs for the ‘OVER’ keyword you mention: https://github.com/google/zetasql/blob/master/docs/window-fu...

I disagree that the paper not mentioning ‘OVER’ implies that the paper authors secretly think pipe syntax is a bad idea. They probably just wanted to keep the paper concise, or forgot about that one less-used bit of syntax.

Do you think that ‘OVER’ keyword implies something fundamentally wrong about pipe syntax? If so, how?

nhinck2

0 replies

3d16h

2024-08-26 02:09:36 UTC

https://en.m.wikipedia.org/wiki/Window_function_(SQL)

They didn't add anything.

simonw

2 replies

5d2h

2024-08-24 16:14:26 UTC

To see how well it works, I uploaded the PDF to Google AI Studio and ran the prompt "Convert this document to neatly styled semantic HTML" against the gemini-1.5-pro-exp-0801 model - the result is actually pretty good!

https://static.simonwillison.net/static/2024/Pipe-Syntax-In-...

simonw

0 replies

4d19h

2024-08-24 23:03:37 UTC

I wrote up a few extra notes about how I did this here: https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/

dang

0 replies

4d21h

2024-08-24 21:23:00 UTC

(This was originally a reply to https://news.ycombinator.com/item?id=41339138 but I detached it so the useful link could float higher in the thread - thanks!)

ccorcos

2 replies

4d19h

2024-08-24 23:05:26 UTC

Now we just need a healthy SQL transpiler ecosystem so we can all live in the future!

r1b

0 replies

4d3h

2024-08-25 15:12:50 UTC

This was my first thought as well. I like the language, but I’m hesitant to adopt it without tooling that can extract an AST, ideally interoperable with the hard-won tooling we already have for SQL.

Improved UX at query development time is nice, but data teams need so much more than this. Stuff like lineage, linting, cross-dialect interop - these are no longer niche concerns, they are table stakes, and they come from having a shared syntax for data transformation.

ec109685

0 replies

4d14h

2024-08-25 03:42:00 UTC

Yes, and to try new features to see if they gain traction before standardization.

snidane

1 replies

4d16h

2024-08-25 01:53:51 UTC

This style is familiar to those writing dataframe logic in df libraries with sql semantics - spark, polars or duckdb relational (https://duckdb.org/docs/api/python/relational_api.html).

It definitely makes things easier to follow, but only for linear, ie. single table, transformations. The moment joins of multiple tables come into the picture things become hairy quick and then you actually start to appreciate the plain old sql which accounts for exactly this and allows you to specify column aliases in the entire cte clause. With this piping you lose scope of the table aliases and then you have to use weird hacks like mangling names of the joined in table in polars.

For single table processing the pipes are nice though. Especially eliminating the need for multiple different keywords for filter based on the order of execution (where, having, qualify (and pre-join filter which is missing)).

A missed opportunity here is the redundant [AGGREGATE sum(x) GROUP BY y]. Unless you need to specify rollups, [AGGREGATE y, sum(x)] is a sufficient syntax for group bys and duckdb folks got it right in the relational api.

nickpeterson

0 replies

4d15h

2024-08-25 03:27:34 UTC

It’s even more familiar to F# developers, which it almost looks identical to.

scop

1 replies

4d22h

2024-08-24 20:00:06 UTC

Reminds me of Elixir’s Ecto library.

https://hexdocs.pm/ecto/crud.html#pipe-based-queries

freedomben

0 replies

4d21h

2024-08-24 20:51:14 UTC

That was my first thought too. I think working with ecto really primed me to like this syntax. I like it quite a bit

saltcured

1 replies

4d21h

2024-08-24 21:08:19 UTC

I don't have time to read this closely or ponder the grammar right now. They have a section on "complex queries" that acknowledges my first concern, but doesn't seem to really address it. Namely, that SQL allows tree-like composition of queries which is more general than a linear composition.

Has anybody figured out whether they are proposing this pipeline syntax to be mixed with regular compositional forms like CTEs and subqueries? Or is it another limited, top-level syntax, similar to how some DB engines do not allow nesting of CTE syntax?

svat

0 replies

4d5h

2024-08-25 13:08:36 UTC

this pipeline syntax to be mixed with regular compositional forms like CTEs and subqueries?

Yes, it's an extension (available by default), which means you can freely mix with regular SQL and use pipes for just parts of your query.

https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-s...

Pipe syntax can be mixed with standard syntax in the same query. For example, subqueries can use different syntax from the parent query.

r1b

1 replies

4d22h

2024-08-24 20:09:23 UTC

It’s strange to me that the interop story here only considers compatibility with GoogleSQL.

These days, we have tooling that can generate an IR for SQL and transpile it to any dialect (see: sqlglot).

Doesn’t coupling SQL syntax to dialect-specific SQL semantics throw a wrench in this?

sagarm

0 replies

4d21h

2024-08-24 20:41:54 UTC

You could implement a family of poorly defined dialects using SQLGlot, sure, in the same way you could define a meta-language targeted at Algol-family languages. That's not a substitute for adding language features.

nsonha

1 replies

4d21h

2024-08-24 20:55:04 UTC

What's wrong with single character pipe (|)? Not everyone uses ligatures even if they like that particular symbol. Just keep things simple please

svat

0 replies

4d18h

2024-08-24 23:33:42 UTC

Great question, answered in section 4.1.5 of the paper (page 6): it turns out that the single character | is already used for bitwise OR, and although that's rarely used and in most cases there would be no confusion, there are at least a few cases, and at this time, “More drastic solutions requiring users to migrate queries did not seem desirable.”

mrjin

1 replies

4d18h

2024-08-24 23:41:48 UTC

Starting with SELECT was a mistake in SQL as you need to know where before what to select from. What we really need is something like

FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *

But the thing is, such changes will break pretty much all existing code, so the author added |> to distinguish, but why not use | instead? Don't make people typing one more character please.

svat

0 replies

4d5h

2024-08-25 13:14:39 UTC

Yes, being able to start with FROM is what makes this so intuitive to use (and autocomplete work better); your hypothetical

    FROM r JOIN s on r.Id = s.Id ORDER BY r.Id SELECT *

would indeed be pretty much that (linebreaks optional):

    FROM r JOIN s on r.Id = s.Id 
    |> ORDER BY r.Id 
    |> SELECT *

The question about “typing one more character” is answered in the paper's section 4.1.5 “Why use ‘|>’ for the pipe character?” (page 6): “The most natural and obvious choice would be to use ‘|’. Unfortunately…” — they don't want to break existing queries that use | for bitwise OR.

(I wonder if one day, if the majority of users are using pipe syntax and don't mind updating their old queries or adding a special directive to enable single pipe…)

misiek08

1 replies

5d1h

2024-08-24 16:38:44 UTC

Looks like from Elixir. If old SQL syntax will be supported then why not - but this one will introduce a lot less readable queries when multiple JOINs, subqueries and aggs are involved - it is very easy in plain SQL and here they will probably look bad.

ibash

0 replies

5d1h

2024-08-24 16:51:18 UTC

Think it’ll be up to the query author to decide what’s most expressive.

Given it’s adding the pipe operator you can always only add it to the part of the query where it makes sense.

A nice sql linter for legibility could be great for this.

kmoser

1 replies

5d1h

2024-08-24 16:38:01 UTC

I understand the desire to reorder clauses in a way that makes sense to the user, but they seem to discount the value in the formality of enforcing a particular order to the clauses: you're assured that if the query contains, say an ORDER BY clause, it will always appear at the end, no need to hunt for it.

Also, why the need to match semantic evaluation order when there are far more important things happening under the hood that affect the execution plan (indexes, etc.)?

Side-effects at a distance...The same columns are cross-referenced in SELECT, GROUP BY and ORDER BY, and corresponding edits are often required in three places

Can't this be avoided by using aliases?

akira2501

0 replies

4d23h

2024-08-24 18:47:33 UTC

when there are far more important things happening under the hood that affect the execution plan

It feels like google is only concerned with large query engines like BigQuery where you're relying on partitioning and sort order within partitions more than you can indexes or complicated multi table reference semantics.

Can't this be avoided by using aliases?

In any language where variables are hard to declare and use pipes always feel like a good option; however, just adding some variables in is probably the better solution in most cases.

darksaints

1 replies

4d20h

2024-08-24 21:45:34 UTC

I've got plenty of complaints about SQL, but as one of the most useful programming languages to have ever been invented, I have to say that syntax complaints are one of lowest items I would have on my list of things to be prioritized for a change. Sure, the syntax could be better, but why do we care so much about it over the dozens of other problems with it?

How about we get a SQL successor with algebraic data types, true boolean logic (as opposed to SQL's ternary logic), or functional composition? Null values are the bane of any query writer's existence, and we should have a reasonable solution by now...we've already done it with other programming languages.

crabmusket

0 replies

4d14h

2024-08-25 04:24:02 UTC

It sounds like you want SQL to be more like a "real programming language", but I feel like there's a real chasm. SQL itself seems to be oriented towards "non-programmers" which is why it has declarative English syntax. But so many systems that interface with SQL databases are software systems written by programmers.

Why are our programming languages, which have rich ADTs, Boolean logic, etc. serialising queries into an English syntax written for ad-hoc business intelligence tasks? Why not have a binary query interface to the database that provides a programmatic, rather than human-readable, API?

The first time I got a "too many bind variables" error I was flabbergasted. All I wanted to do was insert a ton of rows into a table. But the database expects me to construct an English sentence containing a placeholder for each value of each row?

vendiddy

0 replies

4d11h

2024-08-25 06:37:46 UTC

Are there any gentle intros to how a simple SQL engine implemented under the hood?

I vaguely remember someone telling me that it has mathematical underpinnings.

thom

0 replies

4d21h

2024-08-24 21:22:34 UTC

Of all the problems with SQL, this solves one I absolutely don’t care about at all. Kill SQL or don’t kill SQL, but don’t add pointless syntax sugar to make normal queries look alien.

solidsnack9000

0 replies

5d1h

2024-08-24 17:17:46 UTC

This may be the most practical way to make things better for developers, analysts, data scientists...

There have been so many attempts to alleviate the difficulty of writing SQL -- from ORMs, to alternate syntaxes, to alternate databases -- and none have been successful. The authors identify many reasons for this. The approach the authors have taken is incremental and makes only slight demands one people already familiar with elementary SQL -- and even people familiar with advanced SQL will likely find this approach to be easier to use for advanced queries.

sitkack

0 replies

4d13h

2024-08-25 05:22:54 UTC

Reminds me of Pig. Or a pathological stack of CTEs.

Doesn't look compelling for the lockin.

singularity2001

0 replies

4d13h

2024-08-25 04:49:14 UTC

on iPad FF when I tried to view the PDF it downloads binary gibbous instead

simonw

0 replies

4d19h

2024-08-24 23:04:23 UTC

Google's open source ZetaSQL project added documentation covering pipe syntax last week, which I'm finding easier to follow than this paper:

https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-s...

sgarland

0 replies

4d4h

2024-08-25 14:14:48 UTC

I personally find the syntactical flow of SQL to be quite logical: you SELECT tuples FROM a table, JOINing others as needed, filtering WHERE necessary. It’s much like list comprehensions in Python, which I love, so maybe that's biasing me.

rawgabbit

0 replies

4d23h

2024-08-24 19:15:58 UTC

It looks great. When is Google going to create an IDE or extension that I can use?

ram_rar

0 replies

4d23h

2024-08-24 19:06:28 UTC

I appreciated the paper's effort to highlight data analysis techniques, but I was hoping for a more forward-thinking perspective.

The inclusion of pipe syntax or data processing using pipe-structured data flow syntax is not very novel at all. Splunk's SPL and similar languages have been a longstanding industry practice.

I wish the paper had provided more extensive references to the existing body of work in this domain, as it would have added depth and context to the discussion.

mingodad

0 replies

4d21h

2024-08-24 21:23:20 UTC

I've just update the EBNF for rarilroad diagram here https://github.com/google/zetasql/issues/134#issuecomment-14... with the full grammar.

jwilber

0 replies

4d14h

2024-08-25 03:34:29 UTC

This is almost exactly how R’s dplyr library works, and it’s honestly a pretty ergonomic, intuitive way to write queries.

imperfect_light

0 replies

4d16h

2024-08-25 01:57:02 UTC

But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend.

It's difficult to learn and use? Compared to what? The "difficult to extend" might be accurate, but I really question the initial premise here.

hintymad

0 replies

4d22h

2024-08-24 20:24:34 UTC

Piping syntax is particularly useful when querying time series too, especially when we need to write complex queries in a one-liner to share with others.

flusteredBias

0 replies

4d21h

2024-08-24 20:35:04 UTC

... so dplyr.

cryptonector

0 replies

4d13h

2024-08-25 04:58:27 UTC

This strikes me as a) yet another FROM-clause-first variant of SQL, b) syntax for LinkQ.

croes

0 replies

5d1h

2024-08-24 17:28:26 UTC

Can the first example not simply be done with a window function?

chucke1992

0 replies

4d20h

2024-08-24 22:09:18 UTC

Fix it by piping

chuckadams

0 replies

4d23h

2024-08-24 19:22:30 UTC

I wonder if the pipe operator will land in SQL before we see it in JS.

brikym

0 replies

4d18h

2024-08-25 00:11:14 UTC

It's been done before: Kusto Query Language