SQL for data scientists in 100 queries

Not to detract from the article, but: Wow the meaning of the term "data scientist" has changed since the days of "sexiest job". From the article description:

- Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.

- She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.

- Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.

Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon. Usually there was a bit of a trade-off between the quant/software aspects (really mathly people might be worse coders, really strong coders might need to freshen up on a few areas of mathematics), but generally they were fairly strong in each area.

In many orgs it's been reduced to "over paid data analysts" but I wouldn't even hire "Rachel" for a role like that.

No that's MLE. A DS rarely gets asked leetcode algos questions, an MLE would.

Depends on the company, a research level MLE would be asked to derive loss functions and perform partial differentiation on pen and paper. You have to answer questions like what Kullbeck Leiber divergence is and how it can be utilized etc.

Would that be a lesser known cousin of the better known Kullback–Leibler separation measure for distributions?

Full marks for snark, but points off for being incorrect.

https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_div...

From your link:

    a measure of how one probability distribution P is different from a second,

ie. literally it's a separation measure for distributions .. just as I recalled from my first encounter with the notion ~ 1984 (ish).

If you're sincere you should either add those points back or, preferably, expand upon your theory of how my snap take is incorrect.

( I'm aware it's not a metric due to triangle inequality, etc. )

The snap take came across as an argument about which of two names for the measure is better-known.

The wikipedia page implies the opposite of that argument.

Perhaps that’s changed since 1984, but the proposition was about current practices.

It's been Kullback since birth in 1907 to the best of my knowledge, never once Kullbeck.

As a fully anglicized US citizen born in Brooklyn, New York I don't think there's ever been any vowel confusion over the spelling of the name:

https://en.m.wikipedia.org/wiki/Solomon_Kullback

Admittedly I did check as it's not uncommon for mathematicians to have alternate spellings for their names.

Ditto Leibler, born Chicago, Illinois in 1914, no dropped L

https://en.m.wikipedia.org/wiki/Richard_Leibler

I literally was asked two leetcode questions verbatim when interviewing for a data science position at TikTok a few months ago. Dynamic programming (I won't mention which question) and then one regarding binary trees.

(I won't mention which question)

You must protect the corporate overlords.

Alternatively, protect themselves since giving away an individualized question could identify them.

The question does not matter at all.

All of the information in the knowledge of Leetcode + category.

(Does it really matter WHICH question?? They are different but all the same. That is the point.)

You must have worked at different places from me. Nearly every DS job I had (before wisely apparently) leaving that area had leetcode style algo questions during the interviews.

Again, things have apparently changed.

No, it hasn't.

The term has always gone in a half-dozen directions at once, and ranged anything from

* an idiot making PPT decks for business presentations based on sales data; to

* a statistician with very sophisticated mathematical background but minimal programming skills doing things in R or State; to

* a person with a random degree making random dashboard in Tableau; to

* a person with sophisticate background in software engineering, data engineering, and related fields who can kind of do math

* an expert in machine learning (of various calibers)

* a physicist using their quantitative skills to munge data

... and so on. That's been confusing people since the title came out. It depends on the industry, and there's a dozen overlapping titles too, some with well-defined meanings and some varying from company to company (business analyst, data engineering, etc.).

Relatedly - and I’m a lead data engineer at my current $JOB — I’ve yet to find a definition of what a data engineer is/does that I find easy to share with people. Of course I have flippant ones (YAML dev with a bit of Python) but nothing more than: Database Admins who learned Python and now care about more of the data lifecycle than the data that resides in the DBs they managed.

As a data engineer do you find it your job to transform and clean data? How much AI stuff do you implement that does data transformations?

That’s a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the api changed and added some new data new columns etc). But right now — at least I — don’t trust AI to do much of anything in terms of ingestion. When data is extracted from the source it’s got to be as close to a 1:1 of the source as possible. Any errors introduced will have a snowball effect down the line.

For data cleaning we do tend to write the same sort of things over and over. And that’s where I think things could improve. Though what makes a data engineer special in my mind is that they get to know the nuances of data in detail. They get familiar with the columns and their meanings to the business and the expected volume and all sorts of things. And when you get that deeply involved with the data you clearly see where things are jarringly and almost like a vet to a sick animal you write data cleaning things because you care about the data that much.

A joke I read shortly after the term Data Scientist was introduced:

Data Scientist - a statistics major living in San Francisco

Lets face it job titles are a lot of bullshit. I was a "programmer". I call myself a "software engineer". I probably do better SQL than many data engineers / scientists, which is getting annoying as I am shoehorned into roles where I plug API's together rather than deal with SQL. But Data engineer roles, always want a load of stuff I have never needed to deal with.

This is so true. Outside of maybe FAANG companies, a lot of places have wildly different expectations for that role. While one company may refer to the guy doing simple PPTs as a business analyst, others might call that a data analyst or a data scientist or something else. The pay probably mostly reflects the truth though outside of exceptions from office politics.

The term sharded into multiple different terms

Strong coder who can implement an LSTM = ML Engineer

Decent coder who can implement a recent paper with scaffolding code = Applied Scientist

Acceptable coder who is good enough at math to innovate and publish = Research Scientist

Strong coder who cares about data = Data Engineer

Acceptable coder who has lots of domain knowledge = Business analyst, Data Scientist.

If you're just a Data scientist without any domain knowledge...... then you're in a precarious career position.

I've seen a disturbing rise in the number of people who think data engineering isn't software engineering. I don't plan to play up that part of my experience the next time I'm applying.

It's because data engineering has been reduced to be able to login to a cloud provider and know which workflow to drag and drop. This is easily learned in a couple of weeks so that s why those skills might not be considered software engineering.

Well, that's the GP's point, I guess: this thing was called "Business analyst", and, honestly, I don't know what being a domain-expert with somewhat above-average computer skills has to do with "data science".

I guess I'm a data scientist then, that sounds better than business analyst.

i think you may be overestimating the avg. past data scientist's software engineering chops, but it's definitely true that the term has become more diluted than ever

you still find these kinds of people and roles at smaller companies but at largecorps, what's the point? the interesting modelbuilding you shunt off to your army of phd-holding research scientists. deploying models and managing infra goes to MLE. what's left is the data analyst stuff, which you repackage as "data science" because cmon, "analytics"? are we dinosaurs? this is modern tech, we have an image to uphold!

there's not really a need for, or supply of, people who can do everything (edit: _at largecorps_, obviously)

There's not really a need for

Oh sure, if you have teams of research scientists and machine learning engineers to shunt the work to. That's, like, what? 5% of companies out there? Less?

No need, indeed.

so why exactly did you skip the first sentence of the paragraph so that you could make a self-evident point?

anyway that 5% hires a disproportionately larger # of "data scientists"

Data scientists with a strong software engineering background , where are they hiding?

Jokes apart there used to be two categories of data scientists, those that came from a science/phd background where they duct taped their mathematical understanding to code which might work in production, and those those that come from a CS background that duct taped their mathematical/medium tutorial knowledge to an extravaganza of grid search and micro-services that made unscientific predictions in a scalable way.

So now we have the ml engineer (engineer) and the data scientist (science) with clear roles and expectations. Both are full time jobs, most people cannot to both.

all 5 of them are at Alphabet/Meta/OpenAI, no?

but more seriously, unless someone's explicitly doing ML research for most applications using something off-the-shelf-ish[0] and tinkering with it works best. and this mostly requires direct experience[1] with the stack.

and sure, of course, if said project/team/org/corp has so much money they even can train their own model, sure, they can then afford to have these separate roles with "more dedicated" domain experts.

[0] from YOLO to LLaMa to whatever's now on HuggingFace

[1] the more direct the better. you have used LLMs before? great. pyTorch? great. you can deploy stuff on k8s and played with ChatGPT? well, okay, that's ... also great. you know how to get stuff from Snowflake/Databricks/SQL to some training job? take my money!

I use the following table (edit: table turned out ugly, sorry)

  | domain knowledge | quantitative knowledge  | technical knowledge  |

--------------------------------------------------------------------------------------

data analyst | high | mid | low |

data engineer | low | mid | high |

data scientist | mid | high | mid |

Is quantitative knowledge "knowing stats"?

people with both a fairly strong quantitative background and a strong software engineering background.

In my experience this intersection is a null set. And not just that it's an extremely rare feat to pull off IMO, the mental bandwidth and time needed to be good at one of those two alone would consume one person fully. This is why quant/stat specialists were paired with ETL/data-pipeline specialists to build end to end solution.

One reason Data Science became such a hot role back in the day was that it was amorphously defined; because no one knew what exactly it entailed folks across a broad range of skill sets (stats, data engineers, NoSQL folks, visualisation and so on) jumped into the fray. But now companies have burnt their hands, they have learnt to call out exactly what's needed; even when they advertise for DS role they specify what's required of them. For example, this page on Coursera[1] is clear about emphasis on Quant, which is a welcome development IMO.

[1] https://www.coursera.org/articles/what-is-a-data-scientist

I hate to be "that guy" but I find it a little bit sexist that the noob is called "Rachel". OK OK I'm gone.

the meaning of the term "data scientist" has changed

http://i.stack.imgur.com/eLrhI.png

http://image.slidesharecdn.com/daml-150908205332-lva1-app689...

as the name implies a data scientist is a scientist that works on data. There is no reference to the need to be able to code a LSTM in one afternoon (and it would be absurd for most DS tasks)

Sadly in practice data scientist has always been person who can present data which supports what his/her boss expects.

edit: title has been updated: https://github.com/gvwilson/sql-tutorial/commit/14d1e57b94a8...

One funny aspect about the changing definition of "data scientist" is that I, currently a data scientist, spend most of my professional day working with the LLM/AI modeling areas nowadays and building custom models instead of building analyses and dashboards, since the former is more impactful.

Job positions still want the latter, though. If I ever left my job I'm not confident I could get another job with the Data Scientist title, nor could I get a "ML Engineer" job since those focus more on deployment than development.

My R is embarrassingly rusty nowadays and I miss making pretty charts with ggplot2.

In the Enterprise, the best qualified person for any specialisation you may want has always been whoever IBM/Oracle/Tata has sitting on their bench that week.

They have the courses and certifications to prove it, too. It's magic!

Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon.

As a field of "science" perhaps.

In real life (when it became hot) data scientists mostly meant "devs doing analytics" and a lot of it involved R and Python, or the term "big data" thrown around for 10GB logs, and things like Cassandra, with or without some background in math or statistics.

What it never has been, in practice, was a combination of strong math/statistics AND strong software engineering background. 99.9999% of the time it's one or the other.

I really wonder whose fault is it. Unfortunately, what I see the most are many companies expecting you to be a jack of all trades (you should have GenAI/LLM skills, ML, Data Engineering, and what not)

Yes, unfortunately when it was declared the Sexist Job, there was a tremendous influx of bootcamps promising you a six figure income after 3 months of part-time study. That has certainly lowered the overall quality of the Data Scientist title.

I remember when it was a pejorative, literally.

Before flights with patchy/no wifi, I often download a long, single-page tutorial. This is perfect. Curious if anyone knows of any for other languages/tech (e.g. beyond SQL).

Got any other favorite tutorials you've used for this kind of flight?

I made a couple of my own (they're incredibly scrappy, probably won't make sense to anyone but me):

- Ruby: https://github.com/stevecondylios/ruby-learning-resources/bl...

- Rails: https://github.com/stevecondylios/ruby-learning-resources/bl...

But this vim cheatsheet was great:

- https://vim.rtorr.com/

A failed attempt was to load (very) many (e.g. about 100) pages of javascript lessons from w3schools before the plane took off, but for some reason the pages tried to refresh during the flight and I lost them all, so that was a massive waste of time (opening them all before the flight took about 20 minutes).

Zig docs are all one page:

https://ziglang.org/documentation/master/

Wow - gave me a goofy smile.

Also cool 'Zig Zen': https://ziglang.org/documentation/master/#toc-Zen

If not familiar: https://learnxinyminutes.com/

Funny on that one, there is so much but not SQLite :-)

Shameless Plug:

- Not single page _per se_ but I have plenty of Jupyter Notebook based tutorials here: https://github.com/DataForScience/

Self contained with slide decks and notebooks.

These are great, thanks!

https://devdocs.io

each one of Peter Norvig's etudes at https://github.com/norvig/pytudes

from the sudoku solver http://norvig.com/sudoku.html to things like "NPL in python" https://colab.research.google.com/github/norvig/pytudes/blob...

...

https://cryptopals.com/ (a remake of the Matasano crypto challenges) understand crypto by actually building and then subsequently breaking it (it's not strictly single-page, but wget can mirror it nicely)

Learning outcomes:

* Explain the difference between a database and a database manager.

* Write SQL to select, filter, sort, group, and aggregate data.

* Define tables and insert, update, and delete records.

* Describe different types of join and write queries that use them to combine data.

* Use windowing functions to operate on adjacent rows.

* Explain what transactions are and write queries that roll back when constraints are violated.

* Explain what triggers are and write SQL to create them.

* Manipulate JSON data using SQL.

* Interact with a database using Python directly, from a Jupyter notebook, and via an ORM.

One thing that I have noticed confuses a lot of people is "timeseries joins" (I don't know the real term for this) I'm talking about where there is no "one to one" match between keys in the two tables.

I'm a non software type of engineer in my world a lot of tables are structured as timeseries data (such as readings from a device or instrument) which uses timestamp as a key.

Then we have other tables which log event or batch data (such as an alarm start and end time, or Machine start/machine stop etc).

So a lot of queries end up being of the form

Select A.AlarmId, B.Reading, B.Timestamp from Alarms A, Readings B where A.StartTime >= B.Timestamp and A.EndTime < B.Timestamp

A lot of people seem to have problems grasping these kinds of joins.

Duck and Clickhouse call this an AsOf join - https://duckdb.org/docs/guides/sql_features/asof_join.html

Yup https://clickhouse.com/docs/en/sql-reference/statements/sele...

You mean slowly changing dimensions? It’s not something there is much literature on, especially for outer joins.

Cool use case. They're just called "non equi-joins" - because the join condition is an inequality. In general a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, and people don't as often encounter joins where one input row can inform multiple output rows.

If you are interested in reading about JOINS in detail, ClickHouse has good blogs around it:

- https://clickhouse.com/blog/clickhouse-fully-supports-joins-...

I'm sorry but Clickhouse isn't SQL-compliant so it has no business teaching us JOIN.

No database is fully SQL compliant. Though clickhouse can at times be painfully so.

https://stackoverflow.com/questions/784900/why-does-no-datab...

these does seem to be a fairly good curriculum for an introduction to SQL

Shameless Plug: If anyone here wants to practice their SQL, they are welcome to try my Mac app: TextQuery [1]. I built it because I wanted to quickly import CSV datasets and run SQL queries on them. I don't think there could be a more fun way to learn SQL than to jump in and start analyzing thousands of public datasets. Sure, you can use CLI/Code as well, but GUI is often faster and easier. Currently, the app is in the beta period and free-to-use. When launched, you'll get to keep the latest beta version.

[1]: https://textquery.app/

Isn’t this just a ui over SQLite?

There's clearly some non-SQLite functionality that seems really neat!

Super cool! Thanks for sharing.

This is brilliant.

Been meaning to learn SQL, so will check this out

This is a very interesting idea! Great work

Why do you need to harvest email addresses?

Nice. I built some code that did this about a ~6 years ago, but it had no GUI and had to deal with some very irregular csv files. I always wondered why the default SQLite didn't have better support for reading CSV directly and auto ascertaining data types.

left outer join

A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.

This wording only works for identity equality join condition. It creates misleading mental model of left joins, and unfortunately is very common.

Can you clarify? Is the problem here that the "guaranteed to keep all rows from the left table" piece is accurate, but the section about the right table might not be a good mental model if the join condition is more complex than a simple equality?

There was discussion about this exact same thing about 17 days ago. It comes up surprisingly frequently. https://news.ycombinator.com/item?id=39071550

I'm not sure I understand, I think this definition still works for left outer joins on conditions other than identity equality, since joins on, say, inequalities or multiple conditions would still be "guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with....".

It kind of implies that the left join process works by first taking the left table and then filling in the right table with a match if one exists, and otherwise null.

That model obviously doesn't work because if there's more than one match as the matching left row is duplicated for each match. However I don't understand their point of this being a problem when you don't have a "identity equality join condition", since this can also occur for equality joins as long as you're not joining on a unique key.

Can you elaborate with a gotcha example?

I assume they mean that row multiplication can occur but otherwise not sure.

I'd say it's misleading at best. It should specify that for multiple matches in the right table, the row from the left table is duplicated per matching row in the right table.

Does anyone remember some kind of adventure or mystery novel that you progress through by solving sql challenges?

I remember seeing it once and I can never find it now.

There was one about space exploration based on Postgres, but I don’t remember the name.

Mastery with SQL. A paid course that‘s excellent.

Yeah it's excellent, but I am pretty sure OP was talking about something else:

https://sales.bigmachine.io/curious-moon

Damn, I meant Curious Moon and said the other great one.

People, do both! Worth every cent!

Sounds like https://mystery.knightlab.com/ SQL Murder Mystery by Northwestern University Knight Lab.

It's a different area of SWE, but https://deadlockempire.github.io/ is great except the final level.

Select Star SQL? https://selectstarsql.com/

It opens up with analyzing death row inmates, so significantly more real than classifying flowers.

Also, remember ChatGPT can help you write queries like a data scientist, without actually being one. Just tell it what you want in plain english. I've used this to get all sorts of useful metrics like conversion rates, messaging click through rates, etc...

Conversely, it’s very helpful at dissecting a 3k line insanity query and explaining it.

3k line insanity query

*laughs in PTSD*

I have always found SQL to be one of the easiest languages to read, despite not being the easiest to write. Though I have never had to read 3k of SQL, maybe a page or two

Yes. It'd be interesting to see what it makes of some of those ugly ORM generated queries.

ChatGPT can help you write queries like a data scientist

Badly then?

Personally I want my SQL queries to be written like a database professional.

Don't forget the venerable SQLZoo - I have referred a zillion people to it over the years. https://www.sqlzoo.net/wiki/SQL_Tutorial

Edit: also an inaccuracy that's minor but can bite you if you're not careful - they mention temporary tables are in memory not on disk - that's not true in almost all sql databases, they are just connection specific eg they don't persist after you disconnect.

Some databases are optimized for a temp table to be a throwaway, but that can be a good or bad thing depending on the use case.

Where is the datasources to practice them?

Each query has a built in data source/test but dbfiddle and friends are great if you want to do some setup.

I am also a fan of Mode's SQL Tutorial: https://mode.com/sql-tutorial

What is a data scientist, if they are not a statistician?

I am a grumpy old man, fed up with newspeak

Whats this newspeak? Why cant people just talk properly like we did before the war? Not that silly little war, the actual war, WW1.

I am a grumpy old man, fed up with all this newfangled nonsense.

Constantly evolving language and job titles. I consider myself a programmer. I market myself as a software engineer in LinkedIn. I am a generalist who uses mostly python / Django. I like SQL and feel that the database should be leveraged to do the heavy lifting. But software engineers plug APIs together these days.

I don't see how this is "for data scientists" it looks like a good summary of SQL in general.

agreed. and sqlite specifically. i’d really like to see an authoritative resource that shows me how to use sql and relational algebra to do some basic and intermediate data analysis. i can find like 8 different ways to calculate skew and kurtosis, but is there a trusted resource that can show me how to do it? what other interesting data analysis can i do using sql?

According to the article itself it's explicitly not anyway - it's for instructors (of data scientists potentially, sure).

Really excellent concise SQL guide, and great teaching by example. At most universities in the US this content gets taught over an entire semester (and probably shouldn't be). This guide is complete enough for 99% of industry SWE jobs outside of database optimization.

It's a great service when someone takes the time to document knowledge on a single page with quality examples, and trust the reader to follow along. Reminds me of the Rudin analysis book.

Exactly this, bookmarked.

There is a W3Schools online tutorial that is excellent. I had all the new engineers do it before they started running queries at work. This guide is great too.

Nice! It could be better if the queries were interactive.

That's what would change this from a lousy resource to a brilliant one.

Little SQLer ("Little Squealer")

I’m sure MIT Press is working the phone lines right now

full outer join (also called cross join) constructs their cross product

Full outer joins and cross joins are different types of joins. A cross join returns the Cartesian product of both tables, while a full outer join is like a combination of a left and right join.

Better explanation here: https://stackoverflow.com/questions/3228871/sql-server-what-...

Could have done with a visual representation of the starting tables and resulting tables to be a good description.It was 20 years ago that I was learning about cartesian joins. When I do SQL I visualize an excel like structure in my head.

I’ve found that chatGPT is excellent in helping with generating and testing SQL queries. This tutorial is likely helpful for its intended audience; I’d like to see a bit more discussion of query optimization, stored procedures, table design, and available options both in db and in bindings, but it’s hard to fit everything in one place and keep it clean, which is why LLMs will eventually win.

I've found the same. Especially dealing with sorta complex JSON column migrations where the JSON structure needs to be changed. Can describe in plain english what I want to do and get a working query for it.

A lot of these queries are only valid on SQLite (or maybe not "only" but certainly not generally available across DBMS's), including cases where the author doesn't point this out. For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.

For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.

HAVING is one of the standard clauses, I use it on mysql all the time and a quick search shows it exists for the others.

do not expect novices with no prior SQL experience to be able to learn from them

Please suggest this entry level thing.

SQL is enduring because it is logical and understandable. But there is still an initial vertical learning curve.

I am trying to teach a friend of mine SQL but I’m not sure how to construct a lesson plan.

Is there a canonical SQL 101?

Just above that it states

[what this is] notes and working examples that instructors can use to perform a lesson

So use the resource to teach your friend.

Also StrataScratch.com for leetcode for SQL.

What really distinguishes an SQL master is working with queries hundreds of lines long, and query optimization. For example, you can often make queries faster by taking out joins and replacing them with window functions. It's hard to practice these techniques outside of a legit enterprise dataset with billions of rows (maybe a good startup idea).

one of the most performance-killing practices i’ve seen (outside of brutal orm-generated queries) is the sql hero who wants to do everything all at once in a single splat of sql. at a certain level of complexity, breaking things up into logical units and even forcing materialization (instead of hoping for some good intermediate materialization) can go a long way

Glad to have learned about iif() from this!

It's really handy in conjunction with GROUP BY.

Watch out, you're only referring to MALE and FEMALE sex, that might trigger a few people here

Sorry for feeding the trolls, but: I feel like I observe way more people complaining about people getting triggered by M/F sex than I actually see people complain about it. That is why this person had to make up the triggered person, because you will not find them in this comment section.

Some useful SQL queries for me in this list of 100 queries, thanks!

Lots of great and very clear examples here. As someone who's been writing SQL for over 30 years, I wish I'd had something like this at the start.

Obviously there's a lot more to learn about many of these areas to really make the most of them, but this is a really good launchpad.

Item 70 doesn't look right.

This is from Greg Wilson (https://third-bit.com), co-editor of "The Architecture of Open Source Applications" and many other superb references.

Edit: He wrote about this project here: https://third-bit.com/2024/02/03/sql-tutorial/

The tutorial can be downloaded here: https://github.com/gvwilson/sql-tutorial/raw/main/sql-tutori...

(gives the penguins.db file necessary for the examples)

The diagrams at "check your understanding" sections are great.

I would expect someone that has the title of data scientists to at least be a master in the universal data query language SQL.

But im so used to ORM these days anything more complex then a sql join is already going over my head if i didn't do a sql refresher. As far as i have skimmed the article it seems like a very good refresher for even a SWE. I will definitely put this tutorial on my todo list.

Thanks for sharing this! Learn by example resources can be super helpful.

Why bother, just use ChatGPT

As mentioned by others, the author Dr. Greg Wilson has written/compiled many books/tutorials which I can recommend. I would especially laud Software Design by Example [1][2], The Architecture of Open Source Applications [3] and Teaching Tech Together [4].

[1] https://third-bit.com/sdxpy/ (python version) [2] https://third-bit.com/sdxjs/ (js version) [3] https://aosabook.org/ [4] http://teachtogether.tech/

Don’t have time to do it myself, but might be good to redo this guide but with DuckDB (more likely used by DS’es nowadays than SQLite)

The queries are solid, but I really appreciate throwing in the worlds most confusing diagrams here and there. It keeps me alert trying to find where to even start with them. So to be clear, tables must have a name, and must have rows and the rows must have the same name as the table, yes?

Fantastic summary, saving for just general tips when trying to show people SQL