So it looks like it scores 76.5% on SQL-Eval [0], a bit behind GPT-4 at 83% and sqlcoder-15b at 78%.
What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?
As a programmer who always has to look stuff up when I SQL, I could definitely see asking something like this for a first draft of a query, but it seems like I'm slightly better off asking the bigger models in these one-off cases (and I can run a 15b easily on my 64GB M1). If I'm in a corporate setting I'm not going to leak my schema into OpenAI's training data, and there are definitely times when I'd want to run queries offline. Small/local models are great when you want to do a ton of queries (save $$).
A mini data scientist that could be queried by non-technical folks would be awesome, but I wonder if there's a way to determine whether the query is falling in the 25% "incorrect" case... maybe there's a RAID-like consensus algorithm where you have multiple models interrogate each other's answers to get a higher overall success rate.
Mostly thinking out loud :) but maybe y'all have more ideas. Congrats on the release, OP!
1. If you are a programmer I think you should learn SQL. It will give you a different perspective on programming that I think is invaluable. (I.e., programming without for loops.)
2. Combining and slicing data is a craft, and doing it subtly wrong in one step can lead to fatal errors in the outcome.
And most importantly, it can be very difficult to notice. Numbers don't smell.
That is why I would be very hesitant to give a slightly more than trivial task to an engine that fails 25% of the time.
But I guess that is the same as any other programming task. Just that other programming tasks require a lot of boilerplate where an AI can help. SQL is much more straight to it.
Maybe it could be useful to ask questions similar to writing test cases: "how can I verify that my query is doing the right thing?"
What makes you think that SQL doesn't have "for loops"?
Ever heard of LATERAL joins/CROSS APPLY?
TIL
They're incredibly useful -- any time that you want to both:
- Reference data from the previous part of the query (the "left-hand side")
- Return multiple columns
The only way you can achieve it is with LATERAL/CROSS APPLY.
Regular correlated subqueries can only return a single column, so something like this doesn't work:
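For instance (a sketch with hypothetical customers and orders tables, assuming PostgreSQL):

    SELECT c.name,
           (SELECT o.total, o.placed_at  -- two columns in a scalar subquery
              FROM orders o
             WHERE o.customer_id = c.id
             ORDER BY o.placed_at DESC
             LIMIT 1)
      FROM customers c;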
You'd get "ERROR: subquery must return only one column"; rewriting the subquery as a CROSS JOIN LATERAL lets it return both columns.

I love LATERALs, but this still fits within set theory: it's a bulk application rather than an iterative for-loop. It may even be implemented as a for-loop within the engine, but SQL, being declarative, abstracts that away from the query interface.
It's sets all the way down. A set of f(x) is still a set.
Let's get even more cursed then:
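Something in this spirit (a sketch, assuming PostgreSQL; the temp table and row count are hypothetical):

    -- An imperative for-loop smuggled into SQL via plpgsql,
    -- writing into a temp table that a plain SELECT then reads.
    CREATE TEMP TABLE IF NOT EXISTS numbers (x int);

    DO $$
    BEGIN
      FOR i IN 1..5 LOOP
        INSERT INTO numbers (x) VALUES (i);
      END LOOP;
    END $$;

    SELECT x FROM numbers;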
Ugh, you win, I guess. I've already started thinking about what I would write in that code review comment ;)
At the very least mention that re-running the SELECT in the same connection would include the prior results as well because they are preserved across commands within the same connection.
Ha! plpgsql's seemingly sole purpose is to inject imperative code into a set-based environment. Probably does it more smoothly than most pl languages, but that's at the cost of imperative clarity.
But you're right. Postgres does allow for-loops like this. (They're also slower than the equivalent set-oriented approach.)
If anyone is interested, I had GPT explain this SQL to me, and it was really helpful, as I couldn't parse that with my level of SQL...
https://chat.openai.com/share/931b1778-6393-4e86-94b4-b3b5a5...
What is the point of the cross join? This would work as well:
For this example, nothing. It would be useful where neither of the two SELECT queries is a subset/superset of the other. (Not saying you didn't know that.)
Could you give an example?
This will be useful if you have a table with some related history records, e.g., products with product price history, and you want to get the latest price. The lateral join would get all prices for the current product, sort them by date and then pick the top row.
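A sketch of that pattern (hypothetical products and price_history tables, PostgreSQL syntax):

    SELECT p.name, latest.price
      FROM products p
     CROSS JOIN LATERAL (
           SELECT ph.price
             FROM price_history ph
            WHERE ph.product_id = p.id
            ORDER BY ph.changed_at DESC
            LIMIT 1) AS latest;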
Is a lateral join more efficient than just saying WHERE product = current ORDER BY date DESC?
How does a lateral (cross) join compare to a window function in your example?
I don't think that the point was that there are no for loops in SQL. I think the point was that almost always using for loops is wrong and super inefficient and there is a much more efficient way to just use joins instead.
I wonder if anyone realizes that selects and joins in every database are implemented literally with for loops written in C.
The reason people use a relational database is because it has loops that are faster, safer, and more efficient than anything you can write.
What does it matter? Yeah, at the very bottom of it all there will be a loop. I wouldn't attribute that to C but to the von Neumann architecture.
The point is that by getting rid of loops you remove one way of telling the computer "How" to do it. Start here, do it in this order, combine the data this way.
When "How" is a solved problem, it is a waste of your brain to think about that again. "What" is a better use of your brain cycles.
Exactly, SQL is like a very specific natural language used to tell a database what data you need. The database contains AI (the query optimizer) that will use statistics to determine the fastest way to retrieve the data requested. Depending on your query and the data you have the fastest way to get your data will change. It’s hard to manually write this yourself.
LATERAL has been available since PostgreSQL 9.3 (2013), but it's still not available in SQLite.
SQLite has RECURSIVE, so you can generate a table with all numbers using something like:
    WITH RECURSIVE cnt(x) AS (
      SELECT 1
      UNION ALL
      SELECT x+1 FROM cnt
      LIMIT 5
    )
    SELECT x FROM cnt;
And then do a regular CROSS JOIN on that table.
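Something like this (a sketch; the table t is hypothetical, SQLite syntax):

    WITH RECURSIVE cnt(x) AS (
      SELECT 1
      UNION ALL
      SELECT x+1 FROM cnt
      LIMIT 5
    )
    SELECT t.id, cnt.x
      FROM t CROSS JOIN cnt;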
But generating a table with all numbers was just a throwaway example of the LATERAL JOIN syntax/use, and (hopefully) not what you'd actually use it for in the real world.
It's not clear to me that (mathematically) a lateral join can be reduced to a recursive CTE (or whether the performance of a recursive CTE would be acceptable for the cases where it does work as a substitute).
This is not what I call a for loop but more akin to a list comprehension.
And this is the point I was trying to make.
Instead of starting with the "how", learn to do the "what".
I’ve been working in this industry professionally for close to 25 years now. By far the most valuable and useful language I have learned is SQL.
I have been working in this industry for 30 years now. SQL is one of the ugliest programming languages I have ever worked with. Like Perl, it's a write-once, hope-you-never-need-to-read language.
SQL was created in the '60s. It has not really kept up with the pace of modern programming language ergonomics. It was made for a single person executing a batch-job query pulling data from the database.
On the other hand, I kind of agree SQL is good to learn. It's a counterexample: a lesson in how not to design a programming language.
There are preprocessors like PRQL that help with some (far from all) of the pain points of SQL as a language.
Hopefully we get over the fact that we still need to transpile something to the ugly SQL and databases themselves start to speak a better language.
Despite all the hate MongoDb deserves, it solved the problem how application developers can easily get data in and out of a database.
There are certainly valid syntactic criticisms you can make of SQL [1], but on the whole SQL is expressive and rather concise, while being solid enough to remain the default expression language of choice for fifty years.
1. The order of the phrases, the lack of trailing commas, the fact that an awful organization controls the standards and gatekeeps them to a bizarre extent.
Seriously? I just had a quick look at the MongoDB docs, but when I see that I'm glad to use SQL. All the brackets and dollar-prefixed expressions make MongoDB queries hard to read.
The IBM System R and SEQUEL paper was 1974, while Oracle 2 was the first commercial database which added it in 1979.
Agreed, it's limited, ugly, and I'd rather use something else. Every time I have to use it, it's cumbersome.
I actually work in the Data/AI space and SQL has been of limited use to me.
Whilst Snowflake is pretty popular, the days of elaborately modelled EDWs are long gone.
And so typically I find I am doing queries, transformations etc in some abstraction layer e.g. Spark on a data lake, ORM for web applications etc.
They're more prevalent than ever in my experience. Consider the popularity of dbt.
dbt is not elaborately modeled at all, it eschews all the up front work of data modeling with ... well basically a lot of code and poorly maintained balls of mud.
Don't forget the models others create for you, often hilariously slow code presenting a set of facets that barely align with your business delivery needs; and don't forget to sync it even more slowly with Fivetran, the DE platform of the future!
This doesn't make any sense, or I'm guessing you've never actually used it. Modeling is something you can do with dbt, not what dbt does (or is, or can be?). I've used it to create data marts and EDWs with hundreds of tables, no differently than I would have a decade ago with other tools.
As a counterpoint, I’ve also worked in this industry for 25 years, and SQL has had no relevance to my career. I write it very occasionally, and if I didn’t, it wouldn’t make any difference.
The C for loop on the other hand…
I am happy for you, you seem to have reached your local optimum!
> the most valuable and useful language ...
Programmers think the most valuable language is a programming language. Therefore, an LLM that can generate quality code in that language should also be extremely valuable.
I'd argue that the most valuable language is the natural language of the organization you're writing code for.
That language is vague and ambiguous, and encapsulates many hidden assumptions about how that organization works.
For example, if the organization is a retailer or a wholesaler, you might ask the LLM to generate SQL to return last month's sales and inventory for each site and product category.
The LLM can query the database system tables and guess which tables have sales transactions and inventory transactions. The LLM can look at foreign key definitions to make a good guess how to join these tables with metadata about items, sites, categories and the calendar.
But will the LLM know that inventory is a stock, and sales is a flow? Will it know it should sum the sales transactions but average the inventory balances, or take just the beginning or ending inventory balance?
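To make the stock/flow distinction concrete, here's roughly what a correct answer looks like (a sketch; the sales and inventory_balances tables and their columns are hypothetical, PostgreSQL syntax):

    -- Sales is a flow: sum the transactions over the month.
    SELECT site_id, category_id, SUM(amount) AS monthly_sales
      FROM sales
     WHERE sold_on >= date_trunc('month', current_date) - interval '1 month'
       AND sold_on <  date_trunc('month', current_date)
     GROUP BY site_id, category_id;

    -- Inventory is a stock: take the month-end balance; summing it is wrong.
    SELECT site_id, category_id, on_hand AS month_end_inventory
      FROM inventory_balances
     WHERE as_of = date_trunc('month', current_date) - interval '1 day';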
Many human engineers struggle to translate ambiguous requests in natural language into code which reflects the assumptions and mechanics of the organization.
An LLM that generates consistently useful code needs a world model of how things work, not just how to write code.
Honestly I would rather write stuff in code that's inefficient than deal with one more stored procedure. It's like a black box.
Just that deep inside that forest of functions you just wrote is embedded the inner join that the SQL query would do in a couple of lines of code.
More like: when it comes to complex data structures and logic, I will do that outside of SQL. I'll do a join with SQL no problem, but by the time we're doing multiple inner joins, I usually prefer to just do multiple SQL queries. I don't care about performance that badly.
That’ll often not scale to millions of records. Letting the database optimizer find the optimal execution path instead of doing it procedurally elsewhere might result in “finishes in 5 minutes”, versus “doesn’t fit in a night”.
This isn’t the 90s. Most hardware is way over-specced for the data sizes most people are dealing with.
The number of use cases which are too heavy to finish in hours but small enough to fit in a single instance is pretty limited.
SQL is popular because it can be run on a map/reduce backend. So once you have written your code it can run on any number of machines.
a) SQL is not that popular on map/reduce backends. Most people are doing it in code.
b) Only basic SQL works on any database and even then there are major differences in how they treat things like nulls, type coercion etc.
BigQuery? Athena/Redshift?
Costs are another reason to optimize queries, long running, inefficient queries will be a lot more expensive on things like snowflake than more efficient queries.
I usually only do one join at a time. But I separate them with CTEs ("WITH"). I can agree that many joins at once can make you grow grey hair.
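For example (a sketch with hypothetical orders and customers tables):

    WITH orders_with_customers AS (
      SELECT o.id, o.total, c.region
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
    ),
    regional_totals AS (
      SELECT region, SUM(total) AS revenue
        FROM orders_with_customers
       GROUP BY region
    )
    SELECT * FROM regional_totals ORDER BY revenue DESC;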
That will be interesting soon when 80% of your code will be ai-generated black boxes…
Ian Malcom gif
Also good to note that unlike many other forms of AI, the SQL query optimizer AI is not a black box. It will tell you exactly how it thinks your data can be retrieved in the fastest way if you ask it to explain its plan for your query.
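For example, in PostgreSQL (with a hypothetical orders table):

    EXPLAIN ANALYZE
    SELECT region, SUM(total)
      FROM orders
     GROUP BY region;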
For those people who would rather use an ORM, it's worth mentioning that ORMs write very bad, un-performant SQL under the hood.
They may save you a bit of time initially but when your company gets bigger, the ORMs will become a bottleneck.
"Bad" and "un-performant" are relative terms and as your company gets bigger, you're increasingly more and more likely to have colleagues who write even worse queries than an ORM would.
For example I've encountered queries that are not only slow, but they generate several hundred megabytes of output all of which is sent to the user's web browser where JavaScript selects the relevant two kilobytes of data to show the user.
The worst I've ever seen was a system where every single write to the database would be sent to every single web browser viewing certain webpages. 99.999999% of the writes were completely irrelevant, and JavaScript in the browser would simply disregard them. The server load was immense... and eventually our sysadmin brought it to someone's attention, at which point we found out it was leaking sensitive data.
I guess once you get to a certain point, you need to hire engineers that have strong SQL skills in order to scale.
At which point - but not earlier! - you just make your ORM print the queries and fix them manually (or write them from scratch). You can then ditch the ORM, use it as a query builder DSL only, or use escape hatches provided by the ORM to inject raw SQL.
Don't use ORMs as a crutch to get away with not knowing SQL, that's bad. However, saving "a bit" - and with good ORMs, that "bit" is quite large - of time in the beginning is often very valuable. Just have a clear "exit strategy" for when (and in 90% of projects, if) it's needed.
SQL is a programming language! It is just declarative. Most programmers associate "programming language" with imperative languages. [1]
I've found that while declarative and functional programming languages are not as often used, learning them made me a better programmer.
[1]: https://stackoverflow.com/questions/10925689/functional-prog...
Everyone (not just programmers) should learn SQL, and I have, many times! For me, programming languages are pretty different from bikes in that I have very little "muscle memory" for syntax when staring at a blank page, especially when the bikes I ride more often feel totally different. Having my intern instantly write an almost-correct draft of a function/query helps a lot.
that seems like a good thread to pull on!
SQL is like regex, one of those things I will keep unlearning if I'm not using it daily.
Also, the way each DB supports its own dialect is quite maddening.
do you not trust the setting OAI provides to exclude your conversation from training data?
1) OpenAI has consistently gone back on commitments it has made
2) Sam Altman has a shady track record publicly, and if you believe the things people say privately he has consistently done business very dishonestly throughout his career. He is the CEO, and virtually the entire executive team are people he brought in from his network. It’s his company.
3) To give one example of many, OpenAI recently changed the terms of ChatGPT so that web users' conversations can now be trained on (and if you want to save any chats you must opt in). Presumably this also applies to all conversations you had under the old policy, despite their having said they would never train on those conversations.
I could go on at length…
If you don't trust OpenAI, you can choose to trust Microsoft. Azure OpenAI is 3 months behind OpenAI and costs 3x as much, but you get more 'security' and better performance.
How is it 3 months behind if you get access to current OpenAI models?
You don't.
It took like 2 extra months for Azure OpenAI to get GPT-4 turbo. There's a noticeable time delay between OpenAI deploying their latest model and when Microsoft manages to shove it in Azure.
And it's not just the models. E.g., the Assistants API is not available yet in Azure, and there is no expected ship date for it. But I'm confident it's coming.
Great to know, thank you.
Yes and I recommend that people interested in GPT-4 use this service as it’s isolated from OpenAI and it is the absolute best model right now.
That said, there are 3 quite worrying future possibilities, all stemming from the level of investment and commitment by Microsoft in OpenAI: a huge percentage of Azure's value is bet on an exclusive partnership with them. That gives OpenAI a lot of leverage. And customer data is a very tempting cookie jar for these companies to build a long-term moat from.
Possibility 1: someone overtakes OpenAI's models and you're stuck on Azure, which won't offer that model.
Possibility 2: OpenAI decides to break with Microsoft and you’re stuck with a useless application. AWS Bedrock is far less likely to leave an AI application obsolete.
Possibility 3: OpenAI puts pressure on Microsoft to loosen the data protections around the service, or goes around them entirely. This is a particular concern as the systems in the service become more complex and agentic and more difficult for Microsoft to audit. Model weights are highly opaque, and Microsoft cannot trace the exact possible behaviour of these systems. What if GPT-6 changes its own weights during inference, for example? How can Microsoft ever tell whether that's a true critical piece of functionality or a proxy method to access customer data?
Good luck to OAI breaking with Microsoft when Microsoft has a 49% ownership share in the company and owns all of its GPUs.
I don't see how anyone could trust Microsoft or openai given their track records.
Hm. I disagree that executive level shenanigans translate to real world non-compliance with privacy law when there is an explicit request to delete/not store data.
Uber
Airbnb
Facebook
The entire financial industry
also, every huge corporation that paid a vast (but relative to their market cap, insignificant) settlement, long after incidents in question, without admitting guilt.
Corporations have essentially culled vast numbers of humans until forced to stop. What’s a little lucrative nosiness in that context?
The massive tide of legally gray (including very dark gray) media hoovered up by training data vacuums isn’t exactly an industry secret. Whether any known player is more serious about protecting data source interests over their own ambitions remains to be verifiably demonstrated.
Someone once said, “it is easier to ask forgiveness than permission”. Someone might add, “if ever, or only performatively for congressional testimony theatre purposes, and definitely only after requiring a more punitive legal/regulatory moat to lock in the benefits of your non-compliance”.
ANY sketchiness should be taken seriously. Not making accusations. But sooner or later, somebody is going to say, “it’s a lot easier to cry and complain than claw back information someone took from you, who has billions to spend on lawyers”.
None of these cases involve unambiguously claiming that you will delete data and then not doing it.
Closer would be FTX which unambiguously claimed one thing and did the opposite, but I do not think OAI has FTX level of dysfunction.
Lots of companies have interpreted “deleting data” in creative ways, or just declared “oopsie” when caught. [0][1][2]
[0] Google: lied about deleting data (Google as a verb)
[1] Google: lied about deleting data (Google as a noun) https://arstechnica.com/tech-policy/2023/02/us-says-google-r...
[2] Google: lied about deleting data (Google as a definition in Webster’s dictionary -> Rickroll)
Even if there weren't shenanigans, it's valid to be concerned. Incentives lead to outcomes. Companies do a cost-benefit analysis, if the legal/reputational costs are less than what they stand to gain, history shows that they'll do the thing and then lie about it. Sam might be uniquely resistant to this due to a personal ethical code, but it's impossible to know for sure given that I can't read his mind.
Your average SQL query is worthless, there no special sauce in SQL queries, their value comes from the dataset they run on.
These small SQL LLMs are indeed worthless: since LLM performance on SQL queries is quite limited right now, every percentage point of accuracy matters.
Did you mean to reply to a different comment?
With data privacy decisions I like to think about how the policy/reasoning would sound in a NYT headline or congressional testimony. Your stance does not inspire trust for me.
Zero employment contracts or user-facing ToS I've read make a distinction between sharing schema vs. data with unvetted third parties. In my opinion, schemas reveal pretty valuable info about an application (and they're absolutely valuable in aggregate, because you can use them to train AI data scientists).
Maybe I sound like a curmudgeon but since I have the option of running the AI locally with a 5 percentage point accuracy loss, I absolutely will. If GPT-4 was 100% that would be different because you could build totally different things, but 83% has most of the same design problems as 78%.
about as much as I trust 23andme to keep my genetic data private
I give their intent the benefit of the doubt, but I've been on the other side of too many data collection systems to trust that theirs is foolproof and more secure than my local machine.
To me there is a significant difference between permanently storing and "keeping private" (23andme is legally prohibited from deleting your genetic data) and not using the data and deleting it after 30 days.
fwiw 23andme genetic data (distinct from summary ancestry data) has not leaked afaik
It's not about trust: they will always be indirectly controlled by the US government, which could force them to leak/release your data. OpenAI is already working with the US military and removed its restrictions on allowing its AI to be used for military purposes.
https://time.com/6556827/openai-us-military-cybersecurity/
Yeah this is the issue I have with all of the SQL generation stuff. Not only should the SQL be valid, a prompt like "generate a query that pulls sales for the last quarter" should generate the same output for everyone without fail. Vanna's business logic embedding is a good first step but even then it is only correct like 90% of the time with GPT-4.
Even then, it will only work if there are strong standards and data governance structures in place that everyone within an organization is aligned on. For example, "sales" can mean different things to different people and all of that needs to be buttoned up as well.
Having an LLM be in charge of business logic is madness.
There cannot be any AI involved when processing the definition of a KPI. Otherwise you'll never be able to roll it out to thousands of users when there's always a 90% (or even 99%) chance that the business logic might not get applied correctly.
Check out what we do at Veezoo (https://www.veezoo.com) with the Knowledge Graph / Semantic Layer to mitigate that.
As someone who works with lots of analysts, I can guarantee that they also don't make the correct interpretations all the time, and that you have to sense-check the results back against reality.
In either case, validation is the key step - you can't just trust that your SQL query is correct regardless of if you have manually written it, you still have to go through the data and check it.
That's where the SQL generation stuff can save time - if 50% of the time you can get to an answer in half the time, then it's great! Normally, in my experience with current-gen LLMs, when they fail they fail quickly, so the other 50% of queries don't take twice as long to write manually.
Then there is the other use case - if you aren't sure why a particular SQL query is erroring, these LLMs are great at telling you why and fixing your code.
As an end consumer, you should be well versed in SQL (or whatever the subject is) yourself, and should proofread all the generated queries twice before you make a decision based on the data they pull.
That's the story of the LLMs in general.
The hype is free. The startup ecosystem is a bonus.
Yep, just use GPT-4
The 25% of SQL that covers everything in 99% of applications is a subset of the 76.5% of SQL that the LLM got right.
Imo stuff like this can have applications in a narrower set of problems, where you can also further fine-tune or simply prompt-engineer a model to reach much higher accuracy. I do not think these models can become the basis on which database work takes place in general, but they could be used to automate certain types of tasks that would be too hard/complicated to automate in a classical programmatic way, yet not so hard/complicated that real intelligence is needed to solve them.
This kind of approach could be a way to make these models reliable. Encode your business restrictions in the database; if they are rigorous enough and the query passes, then it's valid in your business world.
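A sketch of what that could look like (hypothetical schema; the constraints reject generated writes that break the rules):

    CREATE TABLE orders (
      id        bigint PRIMARY KEY,
      customer  bigint NOT NULL REFERENCES customers(id),
      quantity  int    NOT NULL CHECK (quantity > 0),
      status    text   NOT NULL CHECK (status IN ('pending', 'shipped', 'cancelled'))
    );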
I would wonder how many of the 25% are actually dangerous to run, considering the target audience might be someone who is rusty at SQL, etc.
I can definitely appreciate security concerns but I don't see how providing your schema is of any concern whatsoever.
I have written a bunch of more or less complicated SQL during my career. And I am pretty sure that if I need to write a SQL statement that's anything but select * from table, my output won't work 75% of the time.
I may be a special case, but typically if I work on a hard problem, it is not a single hard problem but a sh*tload of connected simple problems. If I can get someone to solve the simple problems correctly 75% of the time so that I can spend my time figuring out how those simple problems are to be connected, I'm more than happy. And that's exactly how I use ChatGPT. I have learned not to ask too-complex questions of it. But the simple ones it mostly aces, and when it does not, they are easy to spot, as it is not that I could not have solved them myself, I just did not want to spend time on that. Now, if only ChatGPT were not almost as lazy as me about producing long simple stuff, that would be awesome.
I know SQL pretty well, but every once in a while things like percentiles come up where I don't remember the exact incantations to get the window syntax right. Things like this can save some googling and documentation reading. I can read it to see if it makes sense, and execute it. I can create tests for the output to verify. Can save a lot of time. If it's wrong, I use the fallback of googling for examples.
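The kind of incantation I mean (PostgreSQL, hypothetical orders table; technically percentile_cont is an ordered-set aggregate rather than a window function, which is exactly the sort of detail I'd have to look up):

    -- Median and 90th percentile of order totals.
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY total) AS median_total,
           percentile_cont(0.9) WITHIN GROUP (ORDER BY total) AS p90_total
      FROM orders;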
The kind of stuff that it is very easy to validate if it works or not :)
I am building a warehouse management system at the moment, and it's great to quickly churn out lots of SQL views (particularly as the schema is changing/evolving slightly as I am writing it, so being able to go back to GPT4 to churn through the changes to some of the 'views' of my pages helps, even if it requires a little testing/validation).