return to table of content

SQL for data scientists in 100 queries

PheonixPharts
42 replies
18h59m

Not to detract from the article, but: Wow the meaning of the term "data scientist" has changed since the days of "sexiest job". From the article description:

- Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.

- She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.

- Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.

Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon. Usually there was a bit of a trade-off between the quant/software aspects (really mathly people might be worse coders, really strong coders might need to freshen up on a few areas of mathematics), but generally they were fairly strong in each area.

In many orgs it's been reduced to "over paid data analysts" but I wouldn't even hire "Rachel" for a role like that.

carabiner
11 replies
18h55m

No that's MLE. A DS rarely gets asked leetcode algos questions, an MLE would.

KRAKRISMOTT
5 replies
18h50m

Depends on the company, a research level MLE would be asked to derive loss functions and perform partial differentiation on pen and paper. You have to answer questions like what Kullbeck Leiber divergence is and how it can be utilized etc.

defrost
4 replies
17h59m

Would that be a lesser known cousin of the better known Kullback–Leibler separation measure for distributions?

c0pium
3 replies
17h24m

Full marks for snark, but points off for being incorrect.

https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_div...

defrost
2 replies
16h39m

From your link:

    a measure of how one probability distribution P is different from a second,
ie. literally it's a separation measure for distributions .. just as I recalled from my first encounter with the notion ~ 1984 (ish).

If you're sincere you should either add those points back or, preferably, expand upon your theory of how my snap take is incorrect.

( I'm aware it's not a metric due to triangle inequality, etc. )

elefanten
1 replies
14h43m

The snap take came across as an argument about which of two names for the measure is better-known.

The wikipedia page implies the opposite of that argument.

Perhaps that’s changed since 1984, but the proposition was about current practices.

defrost
0 replies
14h21m

It's been Kullback since birth in 1907 to the best of my knowledge, never once Kullbeck.

As a fully anglicized US citizen born in Brooklyn, New York I don't think there's ever been any vowel confusion over the spelling of the name:

https://en.m.wikipedia.org/wiki/Solomon_Kullback

Admittedly I did check as it's not uncommon for mathematicians to have alternate spellings for their names.

Ditto Leibler, born Chicago, Illinois in 1914, no dropped L

https://en.m.wikipedia.org/wiki/Richard_Leibler

timdellinger
3 replies
18h4m

I literally was asked two leetcode questions verbatim when interviewing for a data science position at TikTok a few months ago. Dynamic programming (I won't mention which question) and then one regarding binary trees.

pests
2 replies
17h39m

(I won't mention which question)

You must protect the corporate overlords.

l33t7332273
0 replies
15h11m

Alternatively, protect themselves since giving away an individualized question could identify them.

fn-mote
0 replies
14h3m

The question does not matter at all.

All of the information in the knowledge of Leetcode + category.

(Does it really matter WHICH question?? They are different but all the same. That is the point.)

PheonixPharts
0 replies
18h50m

You must have worked at different places from me. Nearly every DS job I had (before wisely apparently) leaving that area had leetcode style algo questions during the interviews.

Again, things have apparently changed.

blagie
6 replies
18h14m

No, it hasn't.

The term has always gone in a half-dozen directions at once, and ranged anything from

* an idiot making PPT decks for business presentations based on sales data; to

* a statistician with very sophisticated mathematical background but minimal programming skills doing things in R or State; to

* a person with a random degree making random dashboard in Tableau; to

* a person with sophisticate background in software engineering, data engineering, and related fields who can kind of do math

* an expert in machine learning (of various calibers)

* a physicist using their quantitative skills to munge data

... and so on. That's been confusing people since the title came out. It depends on the industry, and there's a dozen overlapping titles too, some with well-defined meanings and some varying from company to company (business analyst, data engineering, etc.).

gigatexal
2 replies
6h36m

Relatedly - and I’m a lead data engineer at my current $JOB — I’ve yet to find a definition of what a data engineer is/does that I find easy to share with people. Of course I have flippant ones (YAML dev with a bit of Python) but nothing more than: Database Admins who learned Python and now care about more of the data lifecycle than the data that resides in the DBs they managed.

kordlessagain
1 replies
4h23m

As a data engineer do you find it your job to transform and clean data? How much AI stuff do you implement that does data transformations?

gigatexal
0 replies
32m

That’s a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the api changed and added some new data new columns etc). But right now — at least I — don’t trust AI to do much of anything in terms of ingestion. When data is extracted from the source it’s got to be as close to a 1:1 of the source as possible. Any errors introduced will have a snowball effect down the line.

For data cleaning we do tend to write the same sort of things over and over. And that’s where I think things could improve. Though what makes a data engineer special in my mind is that they get to know the nuances of data in detail. They get familiar with the columns and their meanings to the business and the expected volume and all sorts of things. And when you get that deeply involved with the data you clearly see where things are jarringly and almost like a vet to a sick animal you write data cleaning things because you care about the data that much.

jimbokun
0 replies
3h4m

A joke I read shortly after the term Data Scientist was introduced:

Data Scientist - a statistics major living in San Francisco

collyw
0 replies
30m

Lets face it job titles are a lot of bullshit. I was a "programmer". I call myself a "software engineer". I probably do better SQL than many data engineers / scientists, which is getting annoying as I am shoehorned into roles where I plug API's together rather than deal with SQL. But Data engineer roles, always want a load of stuff I have never needed to deal with.

7thaccount
0 replies
18h1m

This is so true. Outside of maybe FAANG companies, a lot of places have wildly different expectations for that role. While one company may refer to the guy doing simple PPTs as a business analyst, others might call that a data analyst or a data scientist or something else. The pay probably mostly reflects the truth though outside of exceptions from office politics.

screye
4 replies
16h22m

The term sharded into multiple different terms

Strong coder who can implement an LSTM = ML Engineer

Decent coder who can implement a recent paper with scaffolding code = Applied Scientist

Acceptable coder who is good enough at math to innovate and publish = Research Scientist

Strong coder who cares about data = Data Engineer

Acceptable coder who has lots of domain knowledge = Business analyst, Data Scientist.

If you're just a Data scientist without any domain knowledge...... then you're in a precarious career position.

CalRobert
1 replies
12h42m

I've seen a disturbing rise in the number of people who think data engineering isn't software engineering. I don't plan to play up that part of my experience the next time I'm applying.

chupy
0 replies
10h7m

It's because data engineering has been reduced to be able to login to a cloud provider and know which workflow to drag and drop. This is easily learned in a couple of weeks so that s why those skills might not be considered software engineering.

krick
0 replies
14h5m

Well, that's the GP's point, I guess: this thing was called "Business analyst", and, honestly, I don't know what being a domain-expert with somewhat above-average computer skills has to do with "data science".

HPsquared
0 replies
8h59m

I guess I'm a data scientist then, that sounds better than business analyst.

bllguo
2 replies
18h36m

i think you may be overestimating the avg. past data scientist's software engineering chops, but it's definitely true that the term has become more diluted than ever

you still find these kinds of people and roles at smaller companies but at largecorps, what's the point? the interesting modelbuilding you shunt off to your army of phd-holding research scientists. deploying models and managing infra goes to MLE. what's left is the data analyst stuff, which you repackage as "data science" because cmon, "analytics"? are we dinosaurs? this is modern tech, we have an image to uphold!

there's not really a need for, or supply of, people who can do everything (edit: _at largecorps_, obviously)

itsoktocry
1 replies
18h23m

There's not really a need for

Oh sure, if you have teams of research scientists and machine learning engineers to shunt the work to. That's, like, what? 5% of companies out there? Less?

No need, indeed.

bllguo
0 replies
18h12m

so why exactly did you skip the first sentence of the paragraph so that you could make a self-evident point?

anyway that 5% hires a disproportionately larger # of "data scientists"

antman
1 replies
12h19m

Data scientists with a strong software engineering background , where are they hiding?

Jokes apart there used to be two categories of data scientists, those that came from a science/phd background where they duct taped their mathematical understanding to code which might work in production, and those those that come from a CS background that duct taped their mathematical/medium tutorial knowledge to an extravaganza of grid search and micro-services that made unscientific predictions in a scalable way.

So now we have the ml engineer (engineer) and the data scientist (science) with clear roles and expectations. Both are full time jobs, most people cannot to both.

pas
0 replies
5h26m

all 5 of them are at Alphabet/Meta/OpenAI, no?

but more seriously, unless someone's explicitly doing ML research for most applications using something off-the-shelf-ish[0] and tinkering with it works best. and this mostly requires direct experience[1] with the stack.

and sure, of course, if said project/team/org/corp has so much money they even can train their own model, sure, they can then afford to have these separate roles with "more dedicated" domain experts.

[0] from YOLO to LLaMa to whatever's now on HuggingFace

[1] the more direct the better. you have used LLMs before? great. pyTorch? great. you can deploy stuff on k8s and played with ChatGPT? well, okay, that's ... also great. you know how to get stuff from Snowflake/Databricks/SQL to some training job? take my money!

TrackerFF
1 replies
9h17m

I use the following table (edit: table turned out ugly, sorry)

  | domain knowledge | quantitative knowledge  | technical knowledge  |
--------------------------------------------------------------------------------------

data analyst | high | mid | low |

data engineer | low | mid | high |

data scientist | mid | high | mid |

collyw
0 replies
9m

Is quantitative knowledge "knowing stats"?

vishnugupta
0 replies
8h28m

people with both a fairly strong quantitative background and a strong software engineering background.

In my experience this intersection is a null set. And not just that it's an extremely rare feat to pull off IMO, the mental bandwidth and time needed to be good at one of those two alone would consume one person fully. This is why quant/stat specialists were paired with ETL/data-pipeline specialists to build end to end solution.

One reason Data Science became such a hot role back in the day was that it was amorphously defined; because no one knew what exactly it entailed folks across a broad range of skill sets (stats, data engineers, NoSQL folks, visualisation and so on) jumped into the fray. But now companies have burnt their hands, they have learnt to call out exactly what's needed; even when they advertise for DS role they specify what's required of them. For example, this page on Coursera[1] is clear about emphasis on Quant, which is a welcome development IMO.

[1] https://www.coursera.org/articles/what-is-a-data-scientist

sukruh
0 replies
9h45m

I hate to be "that guy" but I find it a little bit sexist that the noob is called "Rachel". OK OK I'm gone.

sampo
0 replies
11h55m
poulpy123
0 replies
8h29m

as the name implies a data scientist is a scientist that works on data. There is no reference to the need to be able to code a LSTM in one afternoon (and it would be absurd for most DS tasks)

mzs
0 replies
3h42m

Sadly in practice data scientist has always been person who can present data which supports what his/her boss expects.

edit: title has been updated: https://github.com/gvwilson/sql-tutorial/commit/14d1e57b94a8...

minimaxir
0 replies
15h52m

One funny aspect about the changing definition of "data scientist" is that I, currently a data scientist, spend most of my professional day working with the LLM/AI modeling areas nowadays and building custom models instead of building analyses and dashboards, since the former is more impactful.

Job positions still want the latter, though. If I ever left my job I'm not confident I could get another job with the Data Scientist title, nor could I get a "ML Engineer" job since those focus more on deployment than development.

My R is embarrassingly rusty nowadays and I miss making pretty charts with ggplot2.

hyperman1
0 replies
10h18m

In the Enterprise, the best qualified person for any specialisation you may want has always been whoever IBM/Oracle/Tata has sitting on their bench that week.

They have the courses and certifications to prove it, too. It's magic!

coldtea
0 replies
11h5m

Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon.

As a field of "science" perhaps.

In real life (when it became hot) data scientists mostly meant "devs doing analytics" and a lot of it involved R and Python, or the term "big data" thrown around for 10GB logs, and things like Cassandra, with or without some background in math or statistics.

What it never has been, in practice, was a combination of strong math/statistics AND strong software engineering background. 99.9999% of the time it's one or the other.

boredemployee
0 replies
18h15m

I really wonder whose fault is it. Unfortunately, what I see the most are many companies expecting you to be a jack of all trades (you should have GenAI/LLM skills, ML, Data Engineering, and what not)

ProjectArcturis
0 replies
18h45m

Yes, unfortunately when it was declared the Sexist Job, there was a tremendous influx of bootcamps promising you a six figure income after 3 months of part-time study. That has certainly lowered the overall quality of the Data Scientist title.

2devnull
0 replies
3h59m

I remember when it was a pejorative, literally.

nomilk
10 replies
19h31m

Before flights with patchy/no wifi, I often download a long, single-page tutorial. This is perfect. Curious if anyone knows of any for other languages/tech (e.g. beyond SQL).

nbbaier
1 replies
9h33m

Got any other favorite tutorials you've used for this kind of flight?

nomilk
0 replies
1h49m

I made a couple of my own (they're incredibly scrappy, probably won't make sense to anyone but me):

- Ruby: https://github.com/stevecondylios/ruby-learning-resources/bl...

- Rails: https://github.com/stevecondylios/ruby-learning-resources/bl...

But this vim cheatsheet was great:

- https://vim.rtorr.com/

A failed attempt was to load (very) many (e.g. about 100) pages of javascript lessons from w3schools before the plane took off, but for some reason the pages tried to refresh during the flight and I lost them all, so that was a massive waste of time (opening them all before the flight took about 20 minutes).

beeburrt
1 replies
19h20m
nomilk
0 replies
19h18m

Wow - gave me a goofy smile.

Also cool 'Zig Zen': https://ziglang.org/documentation/master/#toc-Zen

adamiscool8
1 replies
19h18m
belter
0 replies
18h22m

Funny on that one, there is so much but not SQLite :-)

Anon84
1 replies
19h24m

Shameless Plug:

- Not single page _per se_ but I have plenty of Jupyter Notebook based tutorials here: https://github.com/DataForScience/

Self contained with slide decks and notebooks.

blackhaj7
0 replies
13h29m

These are great, thanks!

vdm
0 replies
7h7m
pas
0 replies
5h13m

each one of Peter Norvig's etudes at https://github.com/norvig/pytudes

from the sudoku solver http://norvig.com/sudoku.html to things like "NPL in python" https://colab.research.google.com/github/norvig/pytudes/blob...

...

https://cryptopals.com/ (a remake of the Matasano crypto challenges) understand crypto by actually building and then subsequently breaking it (it's not strictly single-page, but wget can mirror it nicely)

vavooom
9 replies
19h51m

Learning outcomes:

* Explain the difference between a database and a database manager.

* Write SQL to select, filter, sort, group, and aggregate data.

* Define tables and insert, update, and delete records.

* Describe different types of join and write queries that use them to combine data.

* Use windowing functions to operate on adjacent rows.

* Explain what transactions are and write queries that roll back when constraints are violated.

* Explain what triggers are and write SQL to create them.

* Manipulate JSON data using SQL.

* Interact with a database using Python directly, from a Jupyter notebook, and via an ORM.

bigger_cheese
4 replies
17h24m

One thing that I have noticed confuses a lot of people is "timeseries joins" (I don't know the real term for this) I'm talking about where there is no "one to one" match between keys in the two tables.

I'm a non software type of engineer in my world a lot of tables are structured as timeseries data (such as readings from a device or instrument) which uses timestamp as a key.

Then we have other tables which log event or batch data (such as an alarm start and end time, or Machine start/machine stop etc).

So a lot of queries end up being of the form

Select A.AlarmId, B.Reading, B.Timestamp from Alarms A, Readings B where A.StartTime >= B.Timestamp and A.EndTime < B.Timestamp

A lot of people seem to have problems grasping these kinds of joins.

Jgrubb
1 replies
14h2m

Duck and Clickhouse call this an AsOf join - https://duckdb.org/docs/guides/sql_features/asof_join.html

pradeepchhetri
0 replies
12h24m
mr_toad
0 replies
16h4m

You mean slowly changing dimensions? It’s not something there is much literature on, especially for outer joins.

davery22
0 replies
15h24m

Cool use case. They're just called "non equi-joins" - because the join condition is an inequality. In general a join produces a row in the output table for each (left, right) pair of rows from the input tables that satisfies the join condition. It's just so common for joins to use a simple equality condition, where one or both sides is a unique id for its table, and people don't as often encounter joins where one input row can inform multiple output rows.

tucnak
1 replies
11h2m

I'm sorry but Clickhouse isn't SQL-compliant so it has no business teaching us JOIN.

392
0 replies
6h38m

No database is fully SQL compliant. Though clickhouse can at times be painfully so.

https://stackoverflow.com/questions/784900/why-does-no-datab...

LegitShady
0 replies
18h33m

these does seem to be a fairly good curriculum for an introduction to SQL

shubhamjain
8 replies
19h48m

Shameless Plug: If anyone here wants to practice their SQL, they are welcome to try my Mac app: TextQuery [1]. I built it because I wanted to quickly import CSV datasets and run SQL queries on them. I don't think there could be a more fun way to learn SQL than to jump in and start analyzing thousands of public datasets. Sure, you can use CLI/Code as well, but GUI is often faster and easier. Currently, the app is in the beta period and free-to-use. When launched, you'll get to keep the latest beta version.

[1]: https://textquery.app/

Atotalnoob
1 replies
16h22m

Isn’t this just a ui over SQLite?

aeturnum
0 replies
16h9m

There's clearly some non-SQLite functionality that seems really neat!

xp84
0 replies
18h32m

Super cool! Thanks for sharing.

saadatq
0 replies
19h21m

This is brilliant.

friendlynokill
0 replies
15h33m

Been meaning to learn SQL, so will check this out

_kush
0 replies
13h19m

This is a very interesting idea! Great work

Terretta
0 replies
15h37m

Why do you need to harvest email addresses?

7thaccount
0 replies
5h29m

Nice. I built some code that did this about a ~6 years ago, but it had no GUI and had to deal with some very irregular csv files. I always wondered why the default SQLite didn't have better support for reading CSV directly and auto ascertaining data types.

petalmind
7 replies
19h50m

left outer join

A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.

This wording only works for identity equality join condition. It creates misleading mental model of left joins, and unfortunately is very common.

simonw
1 replies
19h12m

Can you clarify? Is the problem here that the "guaranteed to keep all rows from the left table" piece is accurate, but the section about the right table might not be a good mental model if the join condition is more complex than a simple equality?

reaperman
0 replies
18h7m

There was discussion about this exact same thing about 17 days ago. It comes up surprisingly frequently. https://news.ycombinator.com/item?id=39071550

nomilk
1 replies
19h40m

I'm not sure I understand, I think this definition still works for left outer joins on conditions other than identity equality, since joins on, say, inequalities or multiple conditions would still be "guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with....".

orlp
0 replies
18h12m

It kind of implies that the left join process works by first taking the left table and then filling in the right table with a match if one exists, and otherwise null.

That model obviously doesn't work because if there's more than one match as the matching left row is duplicated for each match. However I don't understand their point of this being a problem when you don't have a "identity equality join condition", since this can also occur for equality joins as long as you're not joining on a unique key.

erehweb
1 replies
19h46m

Can you elaborate with a gotcha example?

hobs
0 replies
19h45m

I assume they mean that row multiplication can occur but otherwise not sure.

magicalhippo
0 replies
16h40m

I'd say it's misleading at best. It should specify that for multiple matches in the right table, the row from the left table is duplicated per matching row in the right table.

bilsbie
7 replies
17h55m

Does anyone remember some kind of adventure or mystery novel that you progress through by solving sql challenges?

I remember seeing it once and I can never find it now.

proamdev123
3 replies
17h4m

There was one about space exploration based on Postgres, but I don’t remember the name.

Tomte
2 replies
14h9m

Mastery with SQL. A paid course that‘s excellent.

natrys
1 replies
9h54m

Yeah it's excellent, but I am pretty sure OP was talking about something else:

https://sales.bigmachine.io/curious-moon

Tomte
0 replies
9h38m

Damn, I meant Curious Moon and said the other great one.

People, do both! Worth every cent!

simonw
0 replies
16h52m

Sounds like https://mystery.knightlab.com/ SQL Murder Mystery by Northwestern University Knight Lab.

orthoxerox
0 replies
7h31m

It's a different area of SWE, but https://deadlockempire.github.io/ is great except the final level.

fbdab103
0 replies
16h37m

Select Star SQL? https://selectstarsql.com/

It opens up with analyzing death row inmates, so significantly more real than classifying flowers.

aussieguy1234
5 replies
15h57m

Also, remember ChatGPT can help you write queries like a data scientist, without actually being one. Just tell it what you want in plain english. I've used this to get all sorts of useful metrics like conversion rates, messaging click through rates, etc...

jkrubin
3 replies
12h2m

Conversely, it’s very helpful at dissecting a 3k line insanity query and explaining it.

mdekkers
0 replies
11h5m

3k line insanity query

*laughs in PTSD*

collyw
0 replies
7m

I have always found SQL to be one of the easiest languages to read, despite not being the easiest to write. Though I have never had to read 3k of SQL, maybe a page or two

aussieguy1234
0 replies
11h12m

Yes. It'd be interesting to see what it makes of some of those ugly ORM generated queries.

blitzar
0 replies
8h51m

ChatGPT can help you write queries like a data scientist

Badly then?

Personally I want my SQL queries to be written like a database professional.

hobs
3 replies
19h57m

Don't forget the venerable SQLZoo - I have referred a zillion people to it over the years. https://www.sqlzoo.net/wiki/SQL_Tutorial

Edit: also an inaccuracy that's minor but can bite you if you're not careful - they mention temporary tables are in memory not on disk - that's not true in almost all sql databases, they are just connection specific eg they don't persist after you disconnect.

Some databases are optimized for a temp table to be a throwaway, but that can be a good or bad thing depending on the use case.

wanderingmind
1 replies
11h34m

Where is the datasources to practice them?

hobs
0 replies
4h37m

Each query has a built in data source/test but dbfiddle and friends are great if you want to do some setup.

vavooom
0 replies
19h51m

I am also a fan of Mode's SQL Tutorial: https://mode.com/sql-tutorial

worik
2 replies
18h5m

What is a data scientist, if they are not a statistician?

I am a grumpy old man, fed up with newspeak

blitzar
1 replies
8h47m

Whats this newspeak? Why cant people just talk properly like we did before the war? Not that silly little war, the actual war, WW1.

I am a grumpy old man, fed up with all this newfangled nonsense.

collyw
0 replies
15m

Constantly evolving language and job titles. I consider myself a programmer. I market myself as a software engineer in LinkedIn. I am a generalist who uses mostly python / Django. I like SQL and feel that the database should be leveraged to do the heavy lifting. But software engineers plug APIs together these days.

user3939382
2 replies
18h38m

I don't see how this is "for data scientists" it looks like a good summary of SQL in general.

swasheck
0 replies
17h29m

agreed. and sqlite specifically. i’d really like to see an authoritative resource that shows me how to use sql and relational algebra to do some basic and intermediate data analysis. i can find like 8 different ways to calculate skew and kurtosis, but is there a trusted resource that can show me how to do it? what other interesting data analysis can i do using sql?

OJFord
0 replies
15h31m

According to the article itself it's explicitly not anyway - it's for instructors (of data scientists potentially, sure).

throwaway99989
2 replies
17h2m

Really excellent concise SQL guide, and great teaching by example. At most universities in the US this content gets taught over an entire semester (and probably shouldn't be). This guide is complete enough for 99% of industry SWE jobs outside of database optimization.

It's a great service when someone takes the time to document knowledge on a single page with quality examples, and trust the reader to follow along. Reminds me of the Rudin analysis book.

flaie
0 replies
7h20m

Exactly this, bookmarked.

7thaccount
0 replies
5h31m

There is a W3Schools online tutorial that is excellent. I had all the new engineers do it before they started running queries at work. This guide is great too.

ddgflorida
2 replies
18h15m

Nice! It could be better if the queries were interactive.

blagie
1 replies
18h12m

That's what would change this from a lousy resource to a brilliant one.

Little SQLer ("Little Squealer")

selimthegrim
0 replies
5h18m

I’m sure MIT Press is working the phone lines right now

thehours
1 replies
15h8m

full outer join (also called cross join) constructs their cross product

Full outer joins and cross joins are different types of joins. A cross join returns the Cartesian product of both tables, while a full outer join is like a combination of a left and right join.

Better explanation here: https://stackoverflow.com/questions/3228871/sql-server-what-...

collyw
0 replies
23m

Could have done with a visual representation of the starting tables and resulting tables to be a good description.It was 20 years ago that I was learning about cartesian joins. When I do SQL I visualize an excel like structure in my head.

pama
1 replies
19h41m

I’ve found that chatGPT is excellent in helping with generating and testing SQL queries. This tutorial is likely helpful for its intended audience; I’d like to see a bit more discussion of query optimization, stored procedures, table design, and available options both in db and in bindings, but it’s hard to fit everything in one place and keep it clean, which is why LLMs will eventually win.

cyral
0 replies
16h36m

I've found the same. Especially dealing with sorta complex JSON column migrations where the JSON structure needs to be changed. Can describe in plain english what I want to do and get a working query for it.

mvdtnz
1 replies
16h29m

A lot of these queries are only valid on SQLite (or maybe not "only" but certainly not generally available across DBMS's), including cases where the author doesn't point this out. For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.

Izkata
0 replies
4h41m

For one example filtering on aggregates is not supported by MySQL, MS SQL, Oracle or most other DBMS's.

HAVING is one of the standard clauses, I use it on mysql all the time and a quick search shows it exists for the others.

mulmen
1 replies
8h21m

do not expect novices with no prior SQL experience to be able to learn from them

Please suggest this entry level thing.

SQL is enduring because it is logical and understandable. But there is still an initial vertical learning curve.

I am trying to teach a friend of mine SQL but I’m not sure how to construct a lesson plan.

Is there a canonical SQL 101?

meigwilym
0 replies
6h44m

Just above that it states

[what this is] notes and working examples that instructors can use to perform a lesson

So use the resource to teach your friend.

carabiner
1 replies
19h44m

Also StrataScratch.com for leetcode for SQL.

What really distinguishes an SQL master is working with queries hundreds of lines long, and query optimization. For example, you can often make queries faster by taking out joins and replacing them with window functions. It's hard to practice these techniques outside of a legit enterprise dataset with billions of rows (maybe a good startup idea).

swasheck
0 replies
17h24m

one of the most performance-killing practices i’ve seen (outside of brutal orm-generated queries) is the sql hero who wants to do everything all at once in a single splat of sql. at a certain level of complexity, breaking things up into logical units and even forcing materialization (instead of hoping for some good intermediate materialization) can go a long way

atseajournal
1 replies
19h46m

Glad to have learned about iif() from this!

webdoodle
0 replies
19h30m

It's really handy in conjunction with GROUP BY.

AmazingTurtle
1 replies
9h15m

Watch out, you're only referring to MALE and FEMALE sex, that might trigger a few people here

iNic
0 replies
7h43m

Sorry for feeding the trolls, but: I feel like I observe way more people complaining about people getting triggered by M/F sex than I actually see people complain about it. That is why this person had to make up the triggered person, because you will not find them in this comment section.

zubairq
0 replies
9h14m

Some useful SQL queries for me in this list of 100 queries, thanks!

simonhaven
0 replies
4h50m

Lots of great and very clear examples here. As someone who's been writing SQL for over 30 years, I wish I'd had something like this at the start.

Obviously there's a lot more to learn about many of these areas to really make the most of them, but this is a really good launchpad.

rqtwteye
0 replies
16h42m

Item 70 doesn't look right.

rgovostes
0 replies
13h41m

This is from Greg Wilson (https://third-bit.com), co-editor of "The Architecture of Open Source Applications" and many other superb references.

Edit: He wrote about this project here: https://third-bit.com/2024/02/03/sql-tutorial/

nomilk
0 replies
19h39m

The tutorial can be downloaded here: https://github.com/gvwilson/sql-tutorial/raw/main/sql-tutori...

(gives the penguins.db file necessary for the examples)

franole
0 replies
3h10m

The diagrams at "check your understanding" sections are great.

dragonelite
0 replies
9h18m

I would expect someone that has the title of data scientists to at least be a master in the universal data query language SQL.

But im so used to ORM these days anything more complex then a sql join is already going over my head if i didn't do a sql refresher. As far as i have skimmed the article it seems like a very good refresher for even a SWE. I will definitely put this tutorial on my todo list.

dinkleberg
0 replies
19h33m

Thanks for sharing this! Learn by example resources can be super helpful.

andrscyv
0 replies
17h9m

Why bother, just use ChatGPT

_diyar
0 replies
9h24m

As mentioned by others, the author Dr. Greg Wilson has written/compiled many books/tutorials which I can recommend. I would especially laud Software Design by Example [1][2], The Architecture of Open Source Applications [3] and Teaching Tech Together [4].

[1] https://third-bit.com/sdxpy/ (python version) [2] https://third-bit.com/sdxjs/ (js version) [3] https://aosabook.org/ [4] http://teachtogether.tech/

StarlaAtNight
0 replies
15h59m

Don’t have time to do it myself, but might be good to redo this guide but with DuckDB (more likely used by DS’es nowadays than SQLite)

SilverBirch
0 replies
19h30m

The queries are solid, but I really appreciate throwing in the worlds most confusing diagrams here and there. It keeps me alert trying to find where to even start with them. So to be clear, tables must have a name, and must have rows and the rows must have the same name as the table, yes?

Exuma
0 replies
13h50m

Fantastic summary, saving for just general tips when trying to show people SQL