Here is one example from the PDF:
FROM r JOIN s USING (id)
|> WHERE r.c < 15
|> AGGREGATE sum(r.e) AS s GROUP BY r.d
|> WHERE s > 3
|> ORDER BY d
|> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.
The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.
In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.
Indeed. Just as I think git’s N different ways to refer to the same operation was a blunder.
But pre- and post- aggregation filtering is not really "the same" operation.
If I use a CTE and filter the aggregate, feels the same to me.
If you perform an aggregation query in a CTE, then filter on that in a subsequent query, that is different, because you have also added another SELECT and FROM. You would use WHERE in that case whether using a CTE or just an outer query on an inner subquery. HAVING is different from WHERE because it filters after the aggregation, without requiring a separate query with an extra SELECT.
Personally I rarely use HAVING and instead use WHERE with subqueries for the following reasons:
1-I don't like repeating/duplicating a bunch of complex calcs, easier to just do WHERE in outer query on result
2-I typically have outer queries anyway for multiple reasons: break logic into reasonable chunks for humans, also for join+performance reasons (to give the optimizer a better chance at not getting confused)
The main (only?) task I routinely use HAVING for is finding duplicates.
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.
I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.
I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write
and they decided this was common enough that they would introduce a HAVING clause for this case But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.
I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.
Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.
Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.
A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.
I will already use subqueries to avoid issues with HAVING.
We're moving from SQLAnywhere to MSSQL, and boy, we're adding 2-5 levels of subqueries to most non-trivial queries due to issues like that. Super annoying.
I had one which went from 2 levels deep to 9... not pleasant. CTEs had some issues so couldn't use those either.
I'm surprised you had issues with CTEs -- MS SQL has one of the better CTE implementations. But I could see how it might take more than just trivial transformations to make efficient use of them.
I don't recall all off the top of my head.
One issue, that I mentioned in a different comment, is that we have a lot of queries which are used transparently as sub-queries at runtime to get count first, in order to limit rows fetched. The code doing the "transparent" wrapping doesn't have a full SQL parser, so can't hoist the CTEs out.
One performance issue I do recall was that a lateral join of a CTE was much, much slower than just doing 5-6 sub-queries of the same table, selecting different columns or aggregates for each. Think selecting sum packages, sum net weight, sum gross weight, sum value for all items on an invoice.
There were other issues using plain joins, but I can't recall them right now.
CTE's (at least in MS SQL land) are a syntax level operation, meaning CTE's get expanded to be as if you wrote the same subquery at each place a CTE was, which frequently impacts the optimizer and performance.
I like the idea of CTE's, but I typically use temp tables instead of CTE's to avoid optimizer issues.
If you use temp tables you're subverting the optimizer. Sometimes that's what you want but often it's not.
I use them on purpose to "help" the optimizer by reducing the search space for query plan ((knowing that query plan optimization is a combinatorial problem and the optimizer frequently can't evaluate enough plans in a reasonable amount of time).
Can you please share the SQL queries? If tables/columns are sensitive, maybe it can be anonymized replacing tables with t1,t2,t3 and columns c1,c2,c3.
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?
To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.
You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.
Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.
But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).
So all in all I agree with you: keep HAVING.
Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?
PRQL gives me hope that we might finally get something nice some day
WITH clauses are optional and appear before SELECT. No reason why the FROM clause couldn't behave the same
Isn't that strictly for CTEs? In which case, you are SELECTing from the CTE.
Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.
Beginning with Oracle Database Release 23 [released May 2, 2024], it is now optional to select expressions using the FROM DUAL clause.
The initial version of SQL was called "Structured English Query Language".
If the designers intended to create a query language that resembled an English sentence, it makes sense why they chose "SELECT FROM".
"Select the jar from the shelf" vs. "From the shelf, select the jar".
“Go to the shelf and select the jar”. You’re describing a process, so it’s natural to formulate it in chronological order.
SQL is a declarative language not a procedural one. You tell the query planner what you want, not how to do it.
otoh if you selected something the from clause and potentially some joins could autocomplete
Not reliably, especially if you alias tables. Realistically, you need to know what you’re selecting from before knowing what you’re selecting.
I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.
But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative
The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)
The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins:
From the paper:
Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)
If you're referring to this in the comment you're replying to:
Then they're clearly just saying that this is a reordering compared to SQL, which is undeniably true (and the while point).
The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head because the order in which the the vanilla syntax executes (inside out) is the order in which pipes syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.
Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.
Is not common at all, is a non ANSI SQL clause that afaik was created by Teradata, syntactic sugar for filtering using window functions directly without CTEs or temp tables, especially useful for dedup. In most cases at least, for example you can't do a QUALIFY in an query that is aggregating data just as you can't use a window function when aggregating.
Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, Clickhouse, and duckdb (only OSS implementation I now). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.
Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
This is an interesting point.
All these years I've been doing that reordering and didn't even realize!
I already think about SQL like this (as operation on lists/sets), however thinking of it like that, and having previous operations feed into the next, which is conceptually nice, seems to make it hard to do, and think about:
since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.
Isn't that FILTER (WHERE), as in SELECT avg(...) FILTER (WHERE ...) FROM ...?
But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.
This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.
'qualify' is now standard? Thought it was a vendor extension currently.
IMO having SELECT before FROM is one of SQL's biggest mistakes. I would gladly welcome a new syntax that rectifies this. (Also https://duckdb.org/2022/05/04/friendlier-sql.html)
I don't love the multiple WHEREs.
duckDB is what sql should be in 2024
https://duckdbsnippets.com/
The very first example on that page is vulnerable to injection.
Which one?
Eh, in the context of the site and other snippets that seems pedantic.
Could it be run on untrusted user input? Sure. Does it actually pose a threat? It's improbable.
Duckdb also supports prql with an extension https://github.com/ywelsch/duckdb-prql
SQL was supposed to follow English grammar. Having FROM before SELECT is like having “Begun” before “these clone wars have.”
That's a great list of friendlier sql in DuckDB. For most of that list I either run into it regularly or have wanted the exact fix they have.
Yes, having |> isn't breaking SQL but rather enhancing it.
I really like this idea of piping SQL queries rather than trying to create the perfect syntax from the get go.
+1 for readability too.
Honestly, it seems like a band-aid on legacy query language.
SQL a legacy query language?
In order for a thing to be considered legacy, there needs to be a widespread successor available.
SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.
Calling it legacy is not realistic IMO.
I mean kinda? It's legacy in the "we would never invent this as the solution to the problem domain that's today asked of it."
We would invent the underlying engines for sure but not the language on top of it. It doesn't map at all to how it's actually used by programmers. SQL is the JS to WebAssembly, being able to write the query plan directly via whatever language or mechanism you prefer would be goated.
It has to be my biggest pain point dealing with SQL, having to hint to the optimizer or write meta-SQL to get it to generate the query plan I already know I want dammit! is unbelievably frustrating.
By that definition JavaScript is also legacy.
That's not in the domain of SQL. If you're not getting the most optimized query plan, there is something wrong with the DBMS engine or statistics -- SQL, the language, isn't supposed to care about those details.
That's my point, I think we've reached the point where SQL the langage can be more of a hindrance than help because in a lot of cases we're writing directly to the engine but with oven mitts on. If I could build the query from the tree with scan, filter, index scan, cond, merge join as my primitives it would be so nice.
Sounds like you don’t want SQL at all. Some sort of non-SQL, or not-SQL, never-SQL. Something along those lines.
That's the thing though, I still want my data to be relational so NoSQL databases don't fit the bill. I want to interact with a relational database via something other than the SQL language and given that this language already exists (Postgres compiles your SQL into an IR that uses these primitives) I don't think it's a crazy ask.
I don't think that definition of legacy is useful because so many things which hardly anyone calls "legacy" fit the definition - for example: Javascript as the web standard, cars in cities and bipartisan democracy.
I think many of us would say that that none of these is an ideal solution for the problem being solved, but it's what we are stuck with and I cannot think anyone could call it "legacy systems" until a viable successor is widespread.
> Can we call this SQL anymore after this?
Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?
EssGyooGell: A Modest Proposal
You show a good example. Many people would call that SQL, and if pipes become popular, they too might simply be called SQL one day.
At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.
And a simple keyword that does a GROUP BY on all columns in select that aren't aggregates, just a syntax level macro-ish type of thing.
The proposal here adds pipe syntax to SQL.
So it would be reasonable to call it SQL, if it gets traction. You want to see some of the big dogs adopting it.
That should at least be possible since it looks like it could be added to an existing implementation without significant disruption/extra complexity.
There may be trademark issues, but even if not, doing sufficient violence to the original thing argues for using a new name for the new thing.
This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:
Which results in a new query that can be piped again. So e.g. this would be valid too: Personally, I can see this fix so much SQL pain.okay, now I can see why this so much reminds of CTE
My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.
The changes to my SQL grammar to accomodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (eg zero-or-more, in any order), removing the misc dialect specific alts, simplifying my grammar a bit.
Drop the pipes and this would be terrific.
We can call it "Linq2SQL" and what a disaster it was...
this is consistent, non-pseudo-english, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think would be fundamentally incorrect to call it SQL
Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changes up some keywords
In that example, "s" has two meanings: 1. A table being joined. 2. A column being summed.
For clarity, they should have assigned #2 to a different variable letter.
Looking at this reminds me of Apache Pig. That’s not a compliment.
Yes we can call it sql.
Language syntax changes all the time. Their point is that sql syntax is a mess and can be cleaned up.
What's the SQL equivalent of this?
Not bad, very similar to dplyr syntax. Personally i’m too used to classic SQL though and this would be more readable as CTEs. In particular how would this syntax fair if it was much more complicated with with 4-5 tables and joins?
In the example would there a difference between `|> where s > 3` and `|> having s > 3` ?
Edit: nope, just that you don't need having to exist with the pipe syntax.