Not to detract from the article, but: Wow the meaning of the term "data scientist" has changed since the days of "sexiest job". From the article description:
- Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.
- She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.
- Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.
Data Scientists, back in the day, were largely people with both a fairly strong quantitative background and a strong software engineering background. The kind of people who could build a demo LSTM in an afternoon. Usually there was a bit of a trade-off between the quant/software aspects (really mathly people might be worse coders, really strong coders might need to freshen up on a few areas of mathematics), but generally they were fairly strong in each area.
In many orgs it's been reduced to "overpaid data analysts" but I wouldn't even hire "Rachel" for a role like that.
No that's MLE. A DS rarely gets asked leetcode algos questions, an MLE would.
Depends on the company, a research level MLE would be asked to derive loss functions and perform partial differentiation on pen and paper. You have to answer questions like what Kullbeck Leiber divergence is and how it can be utilized etc.
Would that be a lesser known cousin of the better known Kullback–Leibler separation measure for distributions?
Full marks for snark, but points off for being incorrect.
https://en.m.wikipedia.org/wiki/Kullback%E2%80%93Leibler_div...
From your link:
ie. literally it's a separation measure for distributions .. just as I recalled from my first encounter with the notion ~ 1984 (ish). If you're sincere you should either add those points back or, preferably, expand upon your theory of how my snap take is incorrect.
( I'm aware it's not a metric due to triangle inequality, etc. )
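For anyone following along, here is a minimal sketch of the measure being argued about, for discrete distributions: D_KL(P || Q) = sum over i of p_i * log(p_i / q_i). The function name `kl_divergence` and the example distributions are made up for illustration. It also shows one reason it is not a metric: it is not even symmetric.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions (lists of probabilities)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)  # roughly 0.511 nats
reverse = kl_divergence(q, p)  # roughly 0.368 nats

print(f"D_KL(P || Q) = {forward:.3f}")
print(f"D_KL(Q || P) = {reverse:.3f}")
```

Swapping the arguments changes the value, so it measures separation in a directed sense only.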
The snap take came across as an argument about which of two names for the measure is better-known.
The wikipedia page implies the opposite of that argument.
Perhaps that’s changed since 1984, but the proposition was about current practices.
It's been Kullback since birth in 1907 to the best of my knowledge, never once Kullbeck.
As a fully anglicized US citizen born in Brooklyn, New York I don't think there's ever been any vowel confusion over the spelling of the name:
https://en.m.wikipedia.org/wiki/Solomon_Kullback
Admittedly I did check as it's not uncommon for mathematicians to have alternate spellings for their names.
Ditto Leibler, born Chicago, Illinois in 1914, no dropped L
https://en.m.wikipedia.org/wiki/Richard_Leibler
I literally was asked two leetcode questions verbatim when interviewing for a data science position at TikTok a few months ago. Dynamic programming (I won't mention which question) and then one regarding binary trees.
You must protect the corporate overlords.
Alternatively, protect themselves since giving away an individualized question could identify them.
The question does not matter at all.
All of the information is in the knowledge of Leetcode + category.
(Does it really matter WHICH question?? They are different but all the same. That is the point.)
You must have worked at different places from me. Nearly every DS job I had (before, apparently wisely, leaving that area) had leetcode-style algo questions during the interviews.
Again, things have apparently changed.
No, it hasn't.
The term has always gone in a half-dozen directions at once, and ranged anywhere from
* an idiot making PPT decks for business presentations based on sales data; to
* a statistician with very sophisticated mathematical background but minimal programming skills doing things in R or Stata; to
* a person with a random degree making random dashboards in Tableau; to
* a person with sophisticated background in software engineering, data engineering, and related fields who can kind of do math
* an expert in machine learning (of various calibers)
* a physicist using their quantitative skills to munge data
... and so on. That's been confusing people since the title came out. It depends on the industry, and there's a dozen overlapping titles too, some with well-defined meanings and some varying from company to company (business analyst, data engineering, etc.).
Relatedly - and I’m a lead data engineer at my current $JOB — I’ve yet to find a definition of what a data engineer is/does that I find easy to share with people. Of course I have flippant ones (YAML dev with a bit of Python) but nothing more than: Database Admins who learned Python and now care about more of the data lifecycle than the data that resides in the DBs they managed.
As a data engineer do you find it your job to transform and clean data? How much AI stuff do you implement that does data transformations?
That’s a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the API changed and added some new data, new columns, etc.). But right now I, at least, don’t trust AI to do much of anything in terms of ingestion. When data is extracted from the source it’s got to be as close to a 1:1 copy of the source as possible. Any errors introduced will have a snowball effect down the line.
For data cleaning we do tend to write the same sort of things over and over, and that’s where I think things could improve. Though what makes a data engineer special in my mind is that they get to know the nuances of the data in detail. They get familiar with the columns and their meanings to the business and the expected volume and all sorts of things. And when you get that deeply involved with the data you clearly see where things are jarringly off, and, almost like a vet with a sick animal, you write data cleaning things because you care about the data that much.
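To make the "as close to 1:1 as possible" point concrete, here is a rough sketch of one way to check an extract against its source: compare row counts plus an order-independent checksum. The function name `table_fingerprint` and the sample rows are hypothetical, not from any particular tool, and a real pipeline would likely push this down into the database instead.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    The checksum sums per-row SHA-256 digests modulo 2**256, so it does not
    depend on row order but does change if any value changes.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, checksum


# Hypothetical sample data: same rows in a different order, then one altered value.
source_rows = [(1, "alice", 10.5), (2, "bob", 7.0)]
extracted_rows = [(2, "bob", 7.0), (1, "alice", 10.5)]
corrupted_rows = [(1, "alice", 10.5), (2, "Bob", 7.0)]

matches = table_fingerprint(source_rows) == table_fingerprint(extracted_rows)
drifted = table_fingerprint(source_rows) != table_fingerprint(corrupted_rows)
print(matches, drifted)
```

A mismatch only says that something differs, not what; it is a cheap tripwire to run before any cleaning logic touches the data.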
A joke I read shortly after the term Data Scientist was introduced:
Data Scientist - a statistics major living in San Francisco
Let's face it, job titles are a lot of bullshit. I was a "programmer". I call myself a "software engineer". I probably do better SQL than many data engineers / scientists, which is getting annoying as I am shoehorned into roles where I plug APIs together rather than deal with SQL. But data engineer roles always want a load of stuff I have never needed to deal with.
This is so true. Outside of maybe FAANG companies, a lot of places have wildly different expectations for that role. While one company may refer to the guy doing simple PPTs as a business analyst, others might call that a data analyst or a data scientist or something else. The pay probably mostly reflects the truth though outside of exceptions from office politics.
The term sharded into multiple different terms
Strong coder who can implement an LSTM = ML Engineer
Decent coder who can implement a recent paper with scaffolding code = Applied Scientist
Acceptable coder who is good enough at math to innovate and publish = Research Scientist
Strong coder who cares about data = Data Engineer
Acceptable coder who has lots of domain knowledge = Business analyst, Data Scientist.
If you're just a Data scientist without any domain knowledge...... then you're in a precarious career position.
I've seen a disturbing rise in the number of people who think data engineering isn't software engineering. I don't plan to play up that part of my experience the next time I'm applying.
It's because data engineering has been reduced to being able to log in to a cloud provider and know which workflow to drag and drop. This is easily learned in a couple of weeks, so that's why those skills might not be considered software engineering.
Well, that's the GP's point, I guess: this thing was called "Business analyst", and, honestly, I don't know what being a domain-expert with somewhat above-average computer skills has to do with "data science".
I guess I'm a data scientist then, that sounds better than business analyst.
i think you may be overestimating the avg. past data scientist's software engineering chops, but it's definitely true that the term has become more diluted than ever
you still find these kinds of people and roles at smaller companies but at largecorps, what's the point? the interesting modelbuilding you shunt off to your army of phd-holding research scientists. deploying models and managing infra goes to MLE. what's left is the data analyst stuff, which you repackage as "data science" because cmon, "analytics"? are we dinosaurs? this is modern tech, we have an image to uphold!
there's not really a need for, or supply of, people who can do everything (edit: _at largecorps_, obviously)
Oh sure, if you have teams of research scientists and machine learning engineers to shunt the work to. That's, like, what? 5% of companies out there? Less?
No need, indeed.
so why exactly did you skip the first sentence of the paragraph so that you could make a self-evident point?
anyway that 5% hires a disproportionately larger # of "data scientists"
Data scientists with a strong software engineering background, where are they hiding?
Jokes apart, there used to be two categories of data scientists: those that came from a science/PhD background, where they duct taped their mathematical understanding to code which might work in production, and those that came from a CS background, who duct taped their mathematical/Medium tutorial knowledge to an extravaganza of grid search and micro-services that made unscientific predictions in a scalable way.
So now we have the ML engineer (engineer) and the data scientist (science) with clear roles and expectations. Both are full time jobs, most people cannot do both.
all 5 of them are at Alphabet/Meta/OpenAI, no?
but more seriously, unless someone's explicitly doing ML research for most applications using something off-the-shelf-ish[0] and tinkering with it works best. and this mostly requires direct experience[1] with the stack.
and sure, of course, if said project/team/org/corp has so much money they even can train their own model, sure, they can then afford to have these separate roles with "more dedicated" domain experts.
[0] from YOLO to LLaMa to whatever's now on HuggingFace
[1] the more direct the better. you have used LLMs before? great. pyTorch? great. you can deploy stuff on k8s and played with ChatGPT? well, okay, that's ... also great. you know how to get stuff from Snowflake/Databricks/SQL to some training job? take my money!
I use the following table (edit: table turned out ugly, sorry)
--------------------------------------------------------------------------------------
data analyst   | high | mid  | low  |
data engineer  | low  | mid  | high |
data scientist | mid  | high | mid  |
Is quantitative knowledge "knowing stats"?
In my experience this intersection is a null set. It's not just that it's an extremely rare feat to pull off, IMO; the mental bandwidth and time needed to be good at either one of the two would consume one person fully. This is why quant/stat specialists were paired with ETL/data-pipeline specialists to build end-to-end solutions.
One reason Data Science became such a hot role back in the day was that it was amorphously defined; because no one knew exactly what it entailed, folks across a broad range of skill sets (stats, data engineers, NoSQL folks, visualisation and so on) jumped into the fray. But now that companies have burnt their hands, they have learnt to call out exactly what's needed; even when they advertise for a DS role they specify what's required. For example, this page on Coursera[1] is clear about the emphasis on quant, which is a welcome development IMO.
[1] https://www.coursera.org/articles/what-is-a-data-scientist
I hate to be "that guy" but I find it a little bit sexist that the noob is called "Rachel". OK OK I'm gone.
http://i.stack.imgur.com/eLrhI.png
vs
http://image.slidesharecdn.com/daml-150908205332-lva1-app689...
as the name implies a data scientist is a scientist that works on data. There is no reference to the need to be able to code an LSTM in one afternoon (and it would be absurd for most DS tasks)
Sadly, in practice a data scientist has always been the person who can present data which supports what his/her boss expects.
edit: title has been updated: https://github.com/gvwilson/sql-tutorial/commit/14d1e57b94a8...
One funny aspect about the changing definition of "data scientist" is that I, currently a data scientist, spend most of my professional day working with the LLM/AI modeling areas nowadays and building custom models instead of building analyses and dashboards, since the former is more impactful.
Job positions still want the latter, though. If I ever left my job I'm not confident I could get another job with the Data Scientist title, nor could I get a "ML Engineer" job since those focus more on deployment than development.
My R is embarrassingly rusty nowadays and I miss making pretty charts with ggplot2.
In the Enterprise, the best qualified person for any specialisation you may want has always been whoever IBM/Oracle/Tata has sitting on their bench that week.
They have the courses and certifications to prove it, too. It's magic!
As a field of "science" perhaps.
In real life (when it became hot) data scientists mostly meant "devs doing analytics" and a lot of it involved R and Python, or the term "big data" thrown around for 10GB logs, and things like Cassandra, with or without some background in math or statistics.
What it never has been, in practice, was a combination of strong math/statistics AND strong software engineering background. 99.9999% of the time it's one or the other.
I really wonder whose fault it is. Unfortunately, what I see the most are many companies expecting you to be a jack of all trades (you should have GenAI/LLM skills, ML, Data Engineering, and what not)
Yes, unfortunately when it was declared the Sexiest Job, there was a tremendous influx of bootcamps promising you a six figure income after 3 months of part-time study. That has certainly lowered the overall quality of the Data Scientist title.
I remember when it was a pejorative, literally.