So it looks like it scores 76.5% on SQL-Eval [0], a bit behind GPT-4 at 83% and sqlcoder-15b at 78%.
What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?
As a programmer who always has to look stuff up when I SQL, I could definitely see asking something like this for a first draft of a query, but it seems like I'm slightly better off asking the bigger models in these one-off cases (and I can run a 15b easily on my 64GB M1). If I'm in a corporate setting I'm not going to leak my schema into OpenAI's training data, and there are definitely times when I'd want to run queries offline. Small/local models are great when you want to do a ton of queries (save $$).
A mini data scientist that could be queried by non-technical folks would be awesome, but I wonder if there's a way to determine whether the query is falling in the 25% "incorrect" case... maybe there's a RAID-like consensus algorithm where you have multiple models interrogate each other's answers to get a higher overall success rate.
Mostly thinking out loud :) but maybe y'all have more ideas. Congrats on the release, OP!
1. If you are a programmer I think you should learn SQL. It will give you a different perspective on programming that I think is invaluable. (I.e., programming without for loops.)
2. Combining and slicing data is a craft, and doing it subtly wrong in one step can lead to fatal errors in the outcome.
And most importantly, it can be very difficult to notice. Numbers don't smell.
That is why I would be very hesitant to give a slightly more than trivial task to an engine that fails 25% of the time.
But I guess that is the same as any other programming task. Just that other programming tasks require a lot of boilerplate where an AI can help. SQL is much more straight to it.
Maybe it could be useful to ask questions similar to writing test cases: "how can I verify that my query is doing the right thing?"
What makes you think that SQL doesn't have "for loops"?
Ever heard of LATERAL joins/CROSS APPLY?
TIL
They're incredibly useful -- any time that you want to both:
- Reference data from the previous part of the query (the "left-hand side")
- Return multiple columns
The only way you can achieve it is with LATERAL/CROSS APPLY.
Regular correlated subqueries can only return a single column, so something like this doesn't work:
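For instance (a sketch with hypothetical customers and orders tables, assuming PostgreSQL):

    SELECT c.name,
           (SELECT o.total, o.placed_at  -- two columns in a scalar subquery
              FROM orders o
             WHERE o.customer_id = c.id
             ORDER BY o.placed_at DESC
             LIMIT 1)
      FROM customers c;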
You'd get "ERROR: subquery must return only one column"; rewriting the subquery as a CROSS JOIN LATERAL lets it return both columns.

I love LATERALs, but this still fits within set theory: it's a bulk application rather than an iterative for-loop. It may even be implemented as a for-loop within the engine, but SQL, being declarative, abstracts that away from the query interface.
It's sets all the way down. A set of f(x) is still a set.
Let's get even more cursed then:
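Something in this spirit (a sketch, assuming PostgreSQL; the temp table and row count are hypothetical):

    -- An imperative for-loop smuggled into SQL via plpgsql,
    -- writing into a temp table that a plain SELECT then reads.
    CREATE TEMP TABLE IF NOT EXISTS numbers (x int);

    DO $$
    BEGIN
      FOR i IN 1..5 LOOP
        INSERT INTO numbers (x) VALUES (i);
      END LOOP;
    END $$;

    SELECT x FROM numbers;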
Ugh, you win, I guess. I've already started thinking about what I would write in that code review comment ;)
At the very least mention that re-running the SELECT in the same connection would include the prior results as well because they are preserved across commands within the same connection.
Ha! plpgsql's seemingly sole purpose is to inject imperative code into a set-based environment. Probably does it more smoothly than most pl languages, but that's at the cost of imperative clarity.
But you're right. Postgres does allow for-loops like this. (They're also slower than the equivalent set-oriented approach.)
If anyone is interested, I had GPT explain this SQL to me, and it was really helpful, as I couldn't parse that with my level of SQL...
https://chat.openai.com/share/931b1778-6393-4e86-94b4-b3b5a5...
What is the point of the cross join? This would work as well:
For this example, nothing. It would be useful where neither of the two SELECT queries is a subset/superset of the other. (Not saying you didn't know that.)
Could you give an example?
This will be useful if you have a table with some related history records, e.g., products with product price history, and you want to get the latest price. The lateral join would get all prices for the current product, sort them by date and then pick the top row.
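A sketch of that pattern (hypothetical products and price_history tables, PostgreSQL syntax):

    SELECT p.name, latest.price
      FROM products p
     CROSS JOIN LATERAL (
           SELECT ph.price
             FROM price_history ph
            WHERE ph.product_id = p.id
            ORDER BY ph.changed_at DESC
            LIMIT 1) AS latest;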
Is a lateral join more efficient than just saying WHERE product = current ORDER BY date DESC?
How does a lateral (cross) join compare to a window function in your example?
I don't think that the point was that there are no for loops in SQL. I think the point was that almost always using for loops is wrong and super inefficient and there is a much more efficient way to just use joins instead.
I wonder if anyone realizes that selects and joins in every database are implemented literally with for loops written in C.
The reason people use a relational database is because it has loops that are faster, safer, and more efficient than anything you can write.
What does it matter? Yeah, at the very bottom of it all there will be a loop. I wouldn't attribute that to C but to the von Neumann architecture.
The point is that by getting rid of loops you remove one way of telling the computer "How" to do it. Start here, do it in this order, combine the data this way.
When "How" is a solved problem, it is a waste of your brain to think about that again. "What" is a better use of your brain cycles.
Exactly, SQL is like a very specific natural language used to tell a database what data you need. The database contains AI (the query optimizer) that will use statistics to determine the fastest way to retrieve the data requested. Depending on your query and the data you have the fastest way to get your data will change. It’s hard to manually write this yourself.
LATERAL has been available since PostgreSQL 9.3 (2013), but it's still not available in SQLite.
SQLite has RECURSIVE, so you can generate a table with all numbers using something like:
    WITH RECURSIVE cnt(x) AS (
      SELECT 1
      UNION ALL
      SELECT x+1 FROM cnt
      LIMIT 5
    )
    SELECT x FROM cnt;
And then do a regular CROSS JOIN on that table.
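Something like this (a sketch; the table t is hypothetical, SQLite syntax):

    WITH RECURSIVE cnt(x) AS (
      SELECT 1
      UNION ALL
      SELECT x+1 FROM cnt
      LIMIT 5
    )
    SELECT t.id, cnt.x
      FROM t CROSS JOIN cnt;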
But generating a table with all numbers was just a throwaway example of the LATERAL JOIN syntax/use, and (hopefully) not what you'd actually use it for in the real world.
It's not clear to me that (mathematically) a lateral join can be reduced to a recursive CTE (or whether the performance of a recursive CTE would be acceptable for the cases where it does work as a substitute).
This is not what I call a for loop but more akin to a list comprehension.
And this is the point I was trying to make.
Instead of starting with the "how", learn to do the "what".
I’ve been working in this industry professionally for close to 25 years now. By far the most valuable and useful language I have learned is SQL.
I have been working in this industry for 30 years now. SQL is one of the ugliest programming languages I have ever worked with. Like Perl, it's a write-once, hope-you-never-need-to-read language.
SQL was created in the '60s. It has not really kept up with the pace of modern programming language ergonomics. It was made for a single person executing a batch-job query pulling data from the database.
On the other hand, I kind of agree SQL is good to learn. It's a counterexample: a lesson in how not to design a programming language.
There are preprocessors like PRQL that help with some (far from all) of the pain points of SQL as a language.
Hopefully we get over the fact that we still need to transpile something to the ugly SQL and databases themselves start to speak a better language.
Despite all the hate MongoDb deserves, it solved the problem how application developers can easily get data in and out of a database.
There are certainly valid syntactic criticisms you can make of SQL [1], but on the whole SQL is expressive and rather concise, while being solid enough to remain the default expression language of choice for fifty years.
1. The order of the phrases, the lack of trailing commas, the fact that an awful organization controls the standards and gatekeeps them to a bizarre extent.
Seriously? I just had a quick look at the MongoDB docs, but when I see that I'm glad to use SQL. All the brackets and dollar-prefixed expressions make MongoDB queries hard to read.
The IBM System R and SEQUEL paper was 1974, while Oracle 2 was the first commercial database which added it in 1979.
Agreed, it's limited, ugly, and I'd rather use something else. Every time I have to use it, it's cumbersome.
I actually work in the Data/AI space and SQL has been of limited use to me.
Whilst Snowflake is pretty popular, the days of elaborately modelled EDWs are long gone.
And so typically I find I am doing queries, transformations etc in some abstraction layer e.g. Spark on a data lake, ORM for web applications etc.
They're more prevalent than ever in my experience. Consider the popularity of dbt.
dbt is not elaborately modeled at all, it eschews all the up front work of data modeling with ... well basically a lot of code and poorly maintained balls of mud.
Don't forget the models others create for you, often hilariously slow code presenting a set of facets that barely align with your business delivery needs; and don't forget to sync it even more slowly with Fivetran, the DE platform of the future!
This doesn't make any sense, or I'm guessing you've never actually used it. Modeling is something you can do with dbt, not what dbt does (or is, or can be?). I've used it to create data marts and EDWs with hundreds of tables, no differently than I would have a decade ago with other tools.
As a counterpoint, I’ve also worked in this industry for 25 years, and SQL has had no relevance to my career. I write it very occasionally, and if I didn’t, it wouldn’t make any difference.
The C for loop on the other hand…
I am happy for you, you seem to have reached your local optimum!
> the most valuable and useful language ...
Programmers think the most valuable language is a programming language. Therefore, an LLM that can generate quality code in that language should also be extremely valuable.
I'd argue that the most valuable language is the natural language of the organization you're writing code for.
That language is vague and ambiguous, and encapsulates many hidden assumptions about how that organization works.
For example, if the organization is a retailer or a wholesaler, you might ask the LLM to generate SQL to return last month's sales and inventory for each site and product category.
The LLM can query the database system tables and guess which tables have sales transactions and inventory transactions. The LLM can look at foreign key definitions to make a good guess how to join these tables with metadata about items, sites, categories and the calendar.
But will the LLM know that inventory is a stock, and sales is a flow? Will it know it should sum the sales transactions but average the inventory balances, or take just the beginning or ending inventory balance?
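To make the stock/flow distinction concrete, here's roughly what a correct answer looks like (a sketch; the sales and inventory_balances tables and their columns are hypothetical, PostgreSQL syntax):

    -- Sales is a flow: sum the transactions over the month.
    SELECT site_id, category_id, SUM(amount) AS monthly_sales
      FROM sales
     WHERE sold_on >= date_trunc('month', current_date) - interval '1 month'
       AND sold_on <  date_trunc('month', current_date)
     GROUP BY site_id, category_id;

    -- Inventory is a stock: take the month-end balance; summing it is wrong.
    SELECT site_id, category_id, on_hand AS month_end_inventory
      FROM inventory_balances
     WHERE as_of = date_trunc('month', current_date) - interval '1 day';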
Many human engineers struggle to translate ambiguous requests in natural language into code which reflects the assumptions and mechanics of the organization.
An LLM that generates consistently useful code needs a world model of how things work, not just how to write code.
Honestly I would rather write stuff in code that's inefficient than deal with one more stored procedure. It's like a black box.
Just that deep inside that forest of functions you just wrote is embedded the inner join that the SQL query would do in a couple of lines of code.
More like: when it comes to complex data structures and logic, I will do that outside of SQL. I'll do a join with SQL no problem, but by the time we're doing multiple inner joins, I usually prefer to just do multiple SQL queries. I don't care about performance that badly.
That’ll often not scale to millions of records. Letting the database optimizer find the optimal execution path instead of doing it procedurally elsewhere might result in “finishes in 5 minutes”, versus “doesn’t fit in a night”.
This isn’t the 90s. Most hardware is way over-specced for the data sizes most people are dealing with.
The number of use cases which are too heavy to finish in hours but small enough to fit in a single instance is pretty limited.
SQL is popular because it can be run on a map/reduce backend. So once you have written your code it can run on any number of machines.
a) SQL is not that popular on map/reduce backends. Most people are doing it in code.
b) Only basic SQL works on any database and even then there are major differences in how they treat things like nulls, type coercion etc.
BigQuery? Athena/Redshift?
Costs are another reason to optimize queries, long running, inefficient queries will be a lot more expensive on things like snowflake than more efficient queries.
I usually only do one join at a time. But I separate them with CTEs ("WITH"). I can agree that many joins at once can make you grow grey hair.
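For example (a sketch with hypothetical orders and customers tables):

    WITH orders_with_customers AS (
      SELECT o.id, o.total, c.region
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
    ),
    regional_totals AS (
      SELECT region, SUM(total) AS revenue
        FROM orders_with_customers
       GROUP BY region
    )
    SELECT * FROM regional_totals ORDER BY revenue DESC;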
That will be interesting soon when 80% of your code will be ai-generated black boxes…
Ian Malcom gif
Also good to note that unlike many other forms of AI, the SQL query optimizer AI is not a black box. It will tell you exactly how it thinks your data can be retrieved in the fastest way if you ask it to explain its plan for your query.
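For example, in PostgreSQL (with a hypothetical orders table):

    EXPLAIN ANALYZE
    SELECT region, SUM(total)
      FROM orders
     GROUP BY region;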
For those people who would rather use an ORM, it's worth mentioning that ORMs write very bad, un-performant SQL under the hood.
They may save you a bit of time initially but when your company gets bigger, the ORMs will become a bottleneck.
"Bad" and "un-performant" are relative terms and as your company gets bigger, you're increasingly more and more likely to have colleagues who write even worse queries than an ORM would.
For example I've encountered queries that are not only slow, but they generate several hundred megabytes of output all of which is sent to the user's web browser where JavaScript selects the relevant two kilobytes of data to show the user.
The worst I've ever seen was a system where every single write to the database would be sent to every single web browser viewing certain webpages. 99.999999% of the writes were completely irrelevant, and JavaScript in the browser would simply disregard them. The server load was immense... and eventually our sysadmin brought it to someone's attention, at which point we found out it was leaking sensitive data.
I guess once you get to a certain point, you need to hire engineers that have strong SQL skills in order to scale.
At which point - but not earlier! - you just make your ORM print the queries and fix them manually (or write them from scratch). You can then ditch the ORM, use it as a query builder DSL only, or use escape hatches provided by the ORM to inject raw SQL.
Don't use ORMs as a crutch to get away with not knowing SQL, that's bad. However, saving "a bit" - and with good ORMs, that "bit" is quite large - of time in the beginning is often very valuable. Just have a clear "exit strategy" for when (and in 90% of projects, if) it's needed.
SQL is a programming language! It is just declarative. Most programmers associate "programming language" with imperative languages. [1]
I've found that while declarative and functional programming languages are not as often used, learning them made me a better programmer.
[1]: https://stackoverflow.com/questions/10925689/functional-prog...
Everyone (not just programmers) should learn SQL, and I have, many times! For me, programming languages are pretty different from bikes in that I have very little "muscle memory" for syntax when staring at a blank page, especially when the bikes I ride more often feel totally different. Having my intern instantly write an almost-correct draft of a function/query helps a lot.
that seems like a good thread to pull on!
SQL is like regex, one of those things I will keep unlearning if I'm not using it daily.
Also, the way each DB supports its own dialect is quite maddening.
do you not trust the setting OAI provides to exclude your conversation from training data?
1) OpenAI has consistently gone back on commitments it has made
2) Sam Altman has a shady track record publicly, and if you believe the things people say privately he has consistently done business very dishonestly throughout his career. He is the CEO, and virtually the entire executive team are people he brought in from his network. It’s his company.
3) To give one example of many, OpenAI recently changed the terms of ChatGPT so that web users' conversations can now be trained on (and if you want to save any chats you must opt in). Presumably this also applies to all conversations you had under the old policy, despite their having said they would never train on those conversations.
I could go on at length…
If you don't trust OpenAI, you can choose to trust Microsoft. Azure OpenAI is 3 months behind OpenAI and costs 3x as much, but you get more 'security' and better performance.
How is it 3 months behind if you get access to current OpenAI models?
You don't.
It took like 2 extra months for Azure OpenAI to get GPT-4 turbo. There's a noticeable time delay between OpenAI deploying their latest model and when Microsoft manages to shove it in Azure.
And it's not just the models. E.g., the Assistants API is not available yet in Azure, and there is no expected ship date for it. But I'm confident it's coming.
Great to know, thank you.
Yes and I recommend that people interested in GPT-4 use this service as it’s isolated from OpenAI and it is the absolute best model right now.
That said, there are 3 quite worrying future possibilities, all stemming from the level of investment and commitment by Microsoft in OpenAI: a huge percentage of Azure's value is bet on an exclusive partnership with them. That gives OpenAI a lot of leverage. And customer data is a very tempting cookie jar for these companies to build a long-term moat from.
Possibility 1: someone overtakes OpenAI's models and you're stuck on Azure, which won't offer that model.
Possibility 2: OpenAI decides to break with Microsoft and you’re stuck with a useless application. AWS Bedrock is far less likely to leave an AI application obsolete.
Possibility 3: OpenAI puts pressure on Microsoft to loosen the data protections around the service, or goes around them entirely. This is a particular concern as the systems in the service become more complex and agentic and more difficult for Microsoft to audit. Model weights are highly opaque, and Microsoft cannot trace the exact possible behaviour of these systems. What if GPT-6 changes its own weights during inference, for example? How can Microsoft ever tell whether that's a true critical piece of functionality or a proxy method to access customer data?
Good luck to OAI breaking with Microsoft when Microsoft has a 49% ownership share in the company and owns all of its GPUs.
I don't see how anyone could trust Microsoft or openai given their track records.
Hm. I disagree that executive level shenanigans translate to real world non-compliance with privacy law when there is an explicit request to delete/not store data.
Uber
Airbnb
Facebook
The entire financial industry
also, every huge corporation that paid a vast (but relative to their market cap, insignificant) settlement, long after incidents in question, without admitting guilt.
Corporations have essentially culled vast numbers of humans until forced to stop. What’s a little lucrative nosiness in that context?
The massive tide of legally gray (including very dark gray) media hoovered up by training data vacuums isn’t exactly an industry secret. Whether any known player is more serious about protecting data source interests over their own ambitions remains to be verifiably demonstrated.
Someone once said, “it is easier to ask forgiveness than permission”. Someone might add, “if ever, or only performatively for congressional testimony theatre purposes, and definitely only after requiring a more punitive legal/regulatory moat to lock in the benefits of your non-compliance”.
ANY sketchiness should be taken seriously. Not making accusations. But sooner or later, somebody is going to say, “it’s a lot easier to cry and complain than claw back information someone took from you, who has billions to spend on lawyers”.
None of these cases involve unambiguously claiming that you will delete data and then not doing it.
Closer would be FTX which unambiguously claimed one thing and did the opposite, but I do not think OAI has FTX level of dysfunction.
Lots of companies have interpreted “deleting data” in creative ways, or just declared “oopsie” when caught. [0][1][2]
[0] Google: lied about deleting data (Google as a verb)
[1] Google: lied about deleting data (Google as a noun) https://arstechnica.com/tech-policy/2023/02/us-says-google-r...
[2] Google: lied about deleting data (Google as a definition in Webster’s dictionary -> Rickroll)
Even if there weren't shenanigans, it's valid to be concerned. Incentives lead to outcomes. Companies do a cost-benefit analysis, if the legal/reputational costs are less than what they stand to gain, history shows that they'll do the thing and then lie about it. Sam might be uniquely resistant to this due to a personal ethical code, but it's impossible to know for sure given that I can't read his mind.
Your average SQL query is worthless, there no special sauce in SQL queries, their value comes from the dataset they run on.
These small SQL LLMs are indeed worthless: since LLM performance on SQL queries is quite limited right now, every percentage point of accuracy matters.
Did you mean to reply to a different comment?
With data privacy decisions I like to think about how the policy/reasoning would sound in a NYT headline or congressional testimony. Your stance does not inspire trust for me.
Zero employment contracts or user-facing ToS I've read make a distinction between sharing schema vs. data with unvetted third parties. In my opinion, schemas reveal pretty valuable info about an application (and they're absolutely valuable in aggregate, because you can use them to train AI data scientists).
Maybe I sound like a curmudgeon but since I have the option of running the AI locally with a 5 percentage point accuracy loss, I absolutely will. If GPT-4 was 100% that would be different because you could build totally different things, but 83% has most of the same design problems as 78%.
about as much as I trust 23andme to keep my genetic data private
I give their intent the benefit of the doubt, but I've been on the other side of too many data collection systems to trust that theirs is foolproof and more secure than my local machine.
To me there is a significant difference between permanently storing and "keeping private" (23andme is legally prohibited from deleting your genetic data) and not using the data and deleting it after 30 days.
fwiw 23andme genetic data (distinct from summary ancestry data) has not leaked afaik
It's not about trust: they will always be indirectly controlled by the US government, which could force them to leak/release your data. OpenAI is already working with the US military and removed its restrictions on allowing its AI to be used for military purposes.
https://time.com/6556827/openai-us-military-cybersecurity/
Yeah this is the issue I have with all of the SQL generation stuff. Not only should the SQL be valid, a prompt like "generate a query that pulls sales for the last quarter" should generate the same output for everyone without fail. Vanna's business logic embedding is a good first step but even then it is only correct like 90% of the time with GPT-4.
Even then, it will only work if there are strong standards and data governance structures in place that everyone within an organization is aligned on. For example, "sales" can mean different things to different people and all of that needs to be buttoned up as well.
Having an LLM be in charge of business logic is madness.
There cannot be any AI involved when processing the definition of a KPI. Otherwise you'll never be able to roll it out to thousands of users when there's always a 90% (or even 99%) chance that the business logic might not get applied correctly.
Check out what we do at Veezoo (https://www.veezoo.com) with the Knowledge Graph / Semantic Layer to mitigate that.
As someone who works with lots of analysts, I can guarantee that they also don't make the correct interpretations all the time, and that you have to sense-check the results back against reality.
In either case, validation is the key step - you can't just trust that your SQL query is correct regardless of if you have manually written it, you still have to go through the data and check it.
That's where the SQL generation stuff can save time - if 50% of the time you can get to an answer in half the time, then it's great! Normally, in my experience with current-gen LLMs, when they fail they fail quickly, so the other 50% of queries don't take twice as long to write manually.
Then there is the other use case - if you aren't sure why a particular SQL query is erroring, these LLMs are great at telling you why and fixing your code.
As an end consumer, you should be well versed in SQL (or whatever the subject is) yourself, and should proofread all the generated queries twice before you make a decision based on the data they pull.
That's the story of the LLMs in general.
The hype is free. The startup ecosystem is a bonus.
Yep, just use GPT-4
The 25% of SQL that covers everything in 99% of applications is a subset of the 76.5% of SQL that the LLM got right.
Imo stuff like this can have applications in a narrower set of problems, where you can also further fine-tune or simply prompt-engineer a model to reach much higher accuracy. I do not think these models can become the basis on which database work takes place in general, but they could be used to automate certain types of tasks that would be too hard/complicated to automate in a classical programmatic way, yet not so hard/complicated that real intelligence is needed to solve them.
This kind of approach could be a way to make these models reliable. Encode your business restrictions in the database; if they are rigorous enough and the query passes, then it's valid in your business world.
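A sketch of what that could look like (hypothetical schema; the constraints reject generated writes that break the rules):

    CREATE TABLE orders (
      id        bigint PRIMARY KEY,
      customer  bigint NOT NULL REFERENCES customers(id),
      quantity  int    NOT NULL CHECK (quantity > 0),
      status    text   NOT NULL CHECK (status IN ('pending', 'shipped', 'cancelled'))
    );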
I would wonder how many of the 25% are actually dangerous to run, considering the target audience might be someone who is rusty at SQL, etc.
I can definitely appreciate security concerns but I don't see how providing your schema is of any concern whatsoever.
I have written a bunch of more or less complicated SQL during my career. And I am pretty sure that if I need to write a SQL statement that's anything but select * from table, my output won't work 75% of the time.
I may be a special case, but typically if I work on a hard problem, it is not a single hard problem but a sh*tload of connected simple problems. If I can get someone to solve the simple problems correctly 75% of the time so that I can spend my time figuring out how those simple problems are to be connected, I'm more than happy. And that's exactly how I use ChatGPT. I have learned not to ask too-complex questions of it. But the simple ones it mostly aces, and when it does not, they are easy to spot, as it is not that I could not have solved them myself, I just did not want to spend time on that. Now, if only ChatGPT were not almost as lazy as me about producing long simple stuff, that would be awesome.
I know SQL pretty well, but every once in a while things like percentiles come up where I don't remember the exact incantations to get the window syntax right. Things like this can save some googling and documentation reading. I can read it to see if it makes sense, and execute it. I can create tests for the output to verify. Can save a lot of time. If it's wrong, I use the fallback of googling for examples.
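The kind of incantation I mean (PostgreSQL, hypothetical orders table; technically percentile_cont is an ordered-set aggregate rather than a window function, which is exactly the sort of detail I'd have to look up):

    -- Median and 90th percentile of order totals.
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY total) AS median_total,
           percentile_cont(0.9) WITHIN GROUP (ORDER BY total) AS p90_total
      FROM orders;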
The kind of stuff that it is very easy to validate if it works or not :)
I am building a warehouse management system at the moment, and it's great to quickly churn out lots of SQL views (particularly as the schema is changing/evolving slightly as I am writing it, so being able to go back to GPT4 to churn through the changes to some of the 'views' of my pages helps, even if it requires a little testing/validation).