HN comments for: Big data is dead (2023)

kmarc

199 replies

7h17m

2024-05-27 11:11:17 UTC

When I was hiring data scientists for a previous job, my favorite tricky question was "what stack/architecture would you build" with the somewhat detailed requirements of "6 TiB of data" in sight. I was careful not to require overly complicated sums, I simply said it's MAX 6TiB

I patiently listened to all the big query hadoop habla-blabla, even asked questions about the financials (hardware/software/license BOM) and many of them came up with astonishing tens of thousands of dollars yearly.

The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a $199 enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it.

I am prone to the same fallacy: when I learn how to use a hammer, everything looks like a nail. Yet, not understanding the scale of "real" big data was a no-go in my eyes when hiring.

pdimitar

40 replies

6h44m

2024-05-27 11:44:08 UTC

Blows my mind. I am a backend programmer and a semi-decent sysadmin and I would have immediately told you: "make a ZFS or BCacheFS pool with 20-30% redundancy bits and just go wild with CLI programs, I know dozens that work on CSV and XML, what's the problem?".

And I am not a specialized data scientist. But with time I am wondering if such a thing even exists... being a good backender / sysadmin and knowing a lot of CLI tools has always seemed to do the job for me just fine (though granted I never actually managed a data lake, so I am likely over-simplifying it).

apwell23

24 replies

5h21m

2024-05-27 13:07:14 UTC

make a ZFS or BCacheFS pool with 20-30% redundancy bits and just go wild with CLI programs

Lol. Data management is about safety, auditablity, access control, knowledge sharing and who bunch of other stuff. I would've immediately shown you the door as someone who i cannot trust data with.

photonthug

10 replies

4h34m

2024-05-27 13:54:13 UTC

Lol. Data management is about safety, auditablity, access control, knowledge sharing and who bunch of other stuff. I would've immediately shown you the door as someone who i cannot trust data with.

No need to act smug and superior, especially since nothing about OP's plan here actually precludes having all the nice things you mentioned, or even having them inside $your_favorite_enterprise_environment.

You risk coming across as a person who feels threatened by simple solutions, perhaps someone who wants to spend $500k in vendor subscriptions every year for simple and/or imaginary problems... exactly the type of thing TFA talks about.

But I'll ask the question.. why do you think safety, auditablity, access control, and knowledge sharing are incompatible with CLI tools and a specific choice of file system? What's your preferred alternative? Are you sticking with that alternative regardless of how often the work load runs, how often it changes, and whether the data fits in memory or requires a cluster?

apwell23

8 replies

4h15m

2024-05-27 14:12:55 UTC

No need to act smug and superior

I responded with the same tone that gp responded with. "blows my mind" ( that people can be so stupid) .

pdimitar

5 replies

3h30m

2024-05-27 14:58:26 UTC

The classic putting words in people's mouths technique it is then. The good old straw man.

If you really must know: I said "blows my mind [that people don't try simpler and proven solutions FIRST]".

I don't know what do you have to gain to come here and pretend to be in my head. Now here's another thing that blows my mind.

apwell23

4 replies

3h27m

2024-05-27 15:01:41 UTC

that people don't try simpler and proven solutions FIRST

Well why don't people do that according to you ?

Its not 'mind blowing' to me because you can never guess what angle interviewer is coming at you. Especially when they use the words like ' data stack'.

pdimitar

2 replies

3h9m

2024-05-27 15:19:24 UTC

I don't know why and this is why I said it's mind-blowing. Because to me trying stuff that can work on most laptops comes naturally in my head as the first viable solution.

As for interviews, sure, they have all sorts of traps. It really depends on the format and the role. Since I already disclaimed that I am not actual data scientist and just a seasoned dev who can make some magic happen without a dedicated data team (if/when the need arises) then I wouldn't even be in a data scientist interview in the first place. ¯\_(ツ)_/¯

apwell23

1 replies

2h56m

2024-05-27 15:32:15 UTC

Thats fair. My comment wasn't directed at you. I was trying to be smart and write an inverse of original comment. Where I as an interviewer was looking for a proper 'data stack' and interviewee responded with a bespoke solution.

"not understanding the scale of "real" big data was a no-go in my eyes when hiring."

pdimitar

0 replies

2h51m

2024-05-27 15:37:40 UTC

Sure, okay, I get it. My point was more like "Have you tried this obvious thing first that a lot of devs can do for you without too much hassle?". If I were to try for a dedicated data scientist position then I'd have done homework.

StrLght

0 replies

2h47m

2024-05-27 15:40:43 UTC

you can never guess what angle interviewer is coming at you

Why would you guess in that situation though?

It’s an interview, there’s at least 1 person talking to you — you should talk to them, ask them questions, share your thoughts. If you talking to them is a red flag, then high chances that you wouldn’t want to work there anyway.

photonthug

1 replies

3h49m

2024-05-27 14:38:44 UTC

Another comment mentions this classic meme:

Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

A lot of industry work really does fall into this category, and it's not controversial to say that going the wrong way on this thing is mind-blowing. More than not being controversial, it's not confrontational, because his comment was essentially re: the industry, whereas your comment is directed at a person.

Drive by sniping where it's obvious you don't even care to debate the tech itself might get you a few "sick burn, bro" back-slaps from certain crowds, or the FUD approach might get traction with some in management, but overall it's not worth it. You don't sound smart or even professional, just nervous and afraid of every approach that you're not already intimately familiar with.

apwell23

0 replies

3h36m

2024-05-27 14:51:59 UTC

i repurposed the parent comment

"not understanding the scale of "real" big data was a no-go in my eyes when hiring." , "real winner" ect.

But yea you are right. I shouldn't have directed it at commenter. I was miffed at interviewers who use "tricky questions" and expect people to read their minds and come up with their preconceived solution.

HelloNurse

0 replies

4h1m

2024-05-27 14:27:13 UTC

Abstractly, "safety, auditablity, access control, knowledge sharing" are about people reading and writing files: simplifying away complicated management systems improves security. The operating system should be good enough.

zaphar

5 replies

4h57m

2024-05-27 13:31:08 UTC

What about his answer prevents any of that? As stated the question didn't require any of what you outline here. ZFS will probably do a better job of protecting your data than almost any other filesystem out there so it's not a bad foundation to start with if you want to protect data.

Your entire post reeks of "I'm smarter than you" smugness while at the same time revealing no useful information or approaches. Near as I can tell no one should trust you with any data.

apwell23

4 replies

4h14m

2024-05-27 14:14:37 UTC

Your entire post reeks of "I'm smarter than you"

unlike "blows my mind" ?

As stated the question didn't require any of what you outline here.

Right. OP mentioned it was "tricky question" . What makes it tricky is that all those attributes are implicitly assumed. I wouldn't interview at google and tell them my "stack" is "load it on your laptop". I would never say that in an interview even if I think that's the right "stack" .

zaphar

3 replies

4h1m

2024-05-27 14:26:46 UTC

"blows my mind" is similar in tone yes. But I wasn't replying to the OP. Further the OP actually goes into some detail about how he would approach the problem. You do not.

You are assuming you know what the OP meant by tricky question. And your assumption contradicts the rest of the OP's post regarding what he considered good answers to the question and why.

pdimitar

2 replies

3h4m

2024-05-27 15:24:23 UTC

Honest question: was "blows my mind" so offensive? Thought it was quite obvious I meant that "it blows my mind people don't try the simpler stuff first, especially having in mind that it works for much bigger percentage than cloud providers would have you believe"?

I guess it wasn't but even if so, it would be legitimately baffling how people manage to project so much negativity in three words that are slightly tongue-in-cheek casual comment on the state of affairs in an area whose value is not always clear (in my observations, only after you start having 20+ data sources it starts to pay off to have dedicated data team; I've been in teams only 3-4 devs and we still managed to have 15-ish data dashboards for the executives without too much cursing).

An anecdote, surely, but what isn't?

zaphar

1 replies

1h48m

2024-05-27 16:40:24 UTC

I generally don't find that sort of thing offensive when combined with useful alternative approaches like your post provided. However the phrase does come with a connotation that you are surprised by a lack of knowledge or skill in others. That can be taken as smug or elitist by someone in the wrong frame of mind.

pdimitar

0 replies

1h13m

2024-05-27 17:15:21 UTC

Thank you, that's helpful.

pdimitar

3 replies

4h56m

2024-05-27 13:32:28 UTC

I already qualified my statement quite well by stating my background but if it makes you feel better then sure, show me the door. :)

I was never a data scientist, just a guy who helped whenever it was necessary.

apwell23

2 replies

4h17m

2024-05-27 14:11:41 UTC

I already qualified my statement quite well by stating my background

No. You qualified it with "blows my mind" . Why would it 'blow your mind' if you don't have any data background.

zaphar

0 replies

3h51m

2024-05-27 14:37:18 UTC

He didn't say he didn't have any data background. He's clearly worked with data on several occasions as needed.

pdimitar

0 replies

3h35m

2024-05-27 14:53:24 UTC

Are you trolling? Did you miss the part where I said I worked with data but wouldn't say I'm a professional data scientist?

This negative cherry picking does not do your image any favors.

koverstreet

1 replies

4h45m

2024-05-27 13:43:07 UTC

this is how you know when someone takes themself too seriously

buddy, you're just rolling off buzzwords and lording it over other people

apwell23

0 replies

4h4m

2024-05-27 14:24:42 UTC

buddy you suffer from NIH syndrome upset that no one wants your 'hacks'.

apwell23

0 replies

2h52m

2024-05-27 15:36:11 UTC

Edit: for above comment.

My comment wasn't directed at parent. I was trying to be smart and write an inverse of original comment. Opposite scenario Where I as an interviewer was looking for a proper 'data stack' and interviewee responded with a bespoke solution.

"not understanding the scale of "real" big data was a no-go in my eyes when hiring."

i was trying to point out that you can never know where the interviewer is coming from. Unless i know interviewer personally i would bias towards playing it safe and go with 'enterpisey stack'

nevi-me

6 replies

6h28m

2024-05-27 11:59:52 UTC

To be fair on candidates, CLI programs create technical debt the moment they're written.

A good answer that strikes a balance between size of data, latency and frequency requirements is a candidate who is able to show that they can choose the right tool that the next person will be comfortable with.

pdimitar

4 replies

6h25m

2024-05-27 12:03:40 UTC

True on the premise, yep, though I'm not sure how using CLI programs like LEGO blocks creates a tech debt?

ImPostingOnHN

3 replies

5h53m

2024-05-27 12:35:01 UTC

I remember replacing a CLI program built like Lego blocks. It was 90-100 LEGO blocks, written over the course of decades, in: Cobol; Fortran; C; Java; Bash; and Perl, and the Legos "connected" with environmental variables. Nobody wanted to touch it lest they break it. Sometimes it's possible to do things too smartly. Apache Spark runs locally (and via CLI).

pdimitar

2 replies

5h36m

2024-05-27 12:52:33 UTC

No no, I didn't mean that at all. I meant a script using well-known CLI programs.

Obviously organically grown Frankenstein programs are a huge liability, I think every reasonable techie agrees on that.

actionfromafar

1 replies

5h7m

2024-05-27 13:20:53 UTC

Well your little CLI-query is suddenly in production and then... it easily escalates.

pdimitar

0 replies

4h57m

2024-05-27 13:31:28 UTC

I already said I never managed a data lake and simply got stuff when it was needed but if you need to criticize then by all means, go wild.

__MatrixMan__

0 replies

4h53m

2024-05-27 13:35:35 UTC

True but it's typically less debt than anything involving a gui, pricetag, or separate server.

WesolyKubeczek

6 replies

6h33m

2024-05-27 11:55:09 UTC

just go wild with CLI programs, I know dozens that work on CSV and XML

...or put it into SQLite for extra blazing fastness! No kidding.

pdimitar

5 replies

6h30m

2024-05-27 11:57:46 UTC

That's included in CLI tools. Also duckdb and clickhouse-local are amazing.

WesolyKubeczek

3 replies

6h26m

2024-05-27 12:01:45 UTC

I need to learn more about the latter for some log processing...

fijiaarone

2 replies

5h42m

2024-05-27 12:46:04 UTC

Log files aren’t data. That’s your first problem. But that’s the only thing that most people have that generates more bytes than can fit on screen in a single spreadsheet.

thfuran

0 replies

5h35m

2024-05-27 12:53:31 UTC

Of course they are. They just aren't always structured nicely.

WesolyKubeczek

0 replies

5h9m

2024-05-27 13:18:45 UTC

Everything is data if you are brave enough.

c0brac0bra

0 replies

5h39m

2024-05-27 12:48:54 UTC

clickhouse-local had been astonishingly fast for operating on many GB of local CSVs.

I had a heck of a time running the server locally before I discovered the CLI.

ImPostingOnHN

0 replies

5h57m

2024-05-27 12:31:31 UTC

> But with time I am wondering if such a thing even exists

Check out "data science at the command line":

https://jeroenjanssens.com/dsatcl/

palata

24 replies

4h50m

2024-05-27 13:38:19 UTC

One thing that may have an impact on the answers: you are hiring them, so I assume they are passing a technical interview. So they expect that you want to check their understanding of the technical stack.

I would not conclude that they over-engineer everything they do from such an answer, but rather just that they got tricked in this very artificial situation where you are in a dominant position and ask trick questions.

I was recently in a technical interview with an interviewer roughly my age and my experience, and I messed up. That's the game, I get it. But the interviewer got judgemental towards my (admittedly bad) answers. I am absolutely certain that were the roles inverted, I could choose a topic I know better than him and get him in a similarly bad position. But in this case, he was in the dominant position and he chose to make me feel bad.

My point, I guess, is this: when you are the interviewer, be extra careful not to abuse your dominant position, because it is probably counter-productive for your company (and it is just not nice for the human being in front of you).

ufo

23 replies

4h30m

2024-05-27 13:58:40 UTC

From the point of view of the interviewee, it's impossible to guess if they expect you to answer "no need for big data" or if they expect you to answer "the company is aiming for exponential growth so disregard the 6TB limit and architect for scalability"

kmarc

19 replies

4h10m

2024-05-27 14:18:22 UTC

FWIW, it's a 2.5 second extra to say "Although you don't need big data, but if you insist, ..." and gimme the hadoop answer.

palata

6 replies

3h8m

2024-05-27 15:20:23 UTC

Sure, but as you said yourself: it's a trick question. How often does the employee have to answer trick questions without having any time to think in the actual job?

As an interviewer, why not asking: "how would you do that in a setup that doesn't have much data and doesn't need to scale, and then how would you do it if it had a ton of data and a big need to scale?". There is no trick here, do you feel you lose information about the interviewee?

hirsin

2 replies

1h26m

2024-05-27 17:01:50 UTC

Trick questions (although not known as such at the time) are the basis of most of the work we do? XY problem is a thing for a reason, and I cannot count the number of times my teams and I have ratholed on something complex only to realize we were solving for the wrong problem, i.e. A trick question.

As a sibling puts it though, it's a matter of level. Senior/staff and above? Yeah, that's mostly what you do. Lower than that, then you should be able to mostly trust those upper folks to have seen through the trick.

palata

1 replies

1h6m

2024-05-27 17:21:53 UTC

are the basis of most of the work we do?

I don't know about you, but in my work, I always have more than 3 seconds to find a solution. I can slowly think about the problem, sleep on it, read about it, try stuff, think about it while running, etc. I usually do at least some of those for new problems.

Then of course there is a bunch of stuff that is not challenging and for which I can start coding right away.

In an interview, those trick questions will just show you who already has experience with the problem you mentioned and who doesn't. It doesn't say at all (IMO) how good the interviewee is at tackling challenging problem. The question then is: do you want to hire someone who is good at solving challenging problems, or someone who already knows how to solve the one problem you are hiring them for?

theamk

0 replies

2024-05-27 18:27:53 UTC

[delayed]

zdragnar

0 replies

2h52m

2024-05-27 15:36:15 UTC

Depends on the level you're hiring for. At a certain point, the candidate needs to be able to identify the right tool for the job, including when that tool is not the usual big data tools but a simple script.

theamk

0 replies

2024-05-27 18:23:25 UTC

[delayed]

coryrc

0 replies

2024-05-27 18:23:24 UTC

Once had a coworker write a long proposal to rewrite some big old application from Python to Go. I threw in a single comment: why don't we use the existing code as a separate executable?

Turns out he was laid off and my suggestion was used.

(Okay, I'm being silly, the layoff was a coincidence)

whamlastxmas

5 replies

3h42m

2024-05-27 14:46:36 UTC

Is this like interviewing for a chef position for a fancy restaurant and when asked how to perfectly cook a steak, you preface it with “well you can either go to McDonald’s and get a burger, or…”

It may not be reasonable to suggest that in a role that traditionally uses big data tools

hnfong

1 replies

3h21m

2024-05-27 15:06:49 UTC

I see it more like "it's 11pm and a family member suddenly wants to eat a steak at home, what would you do?"

The person who says "I'm going drive back to the restaurant and take my professional equipment home to cook the steak" is probably offering the wrong answer.

I'm obviously not a professional cook, but presumably the ability to improvise with whatever tools you currently have is a desirable skill.

palata

0 replies

3h13m

2024-05-27 15:15:35 UTC

Hmm I would say that the equivalent to your 11pm question is more something like "your sister wants to backup her holiday pictures on the cloud, how do you design it?". The person who says "I ask her 10 millions to build a data center" is probably offering the wrong answer :-).

tored

0 replies

2h49m

2024-05-27 15:39:26 UTC

I think more like, how would you prepare and cook the best five course gala dinner for only $10. That requires true skill.

dkz999

0 replies

3h32m

2024-05-27 14:56:21 UTC

Idk, in this instance I feel pretty strongly that cloud, and solutions with unecessary overhead, are the fast food. The article proposes not eating it all the time.

bee_rider

0 replies

2h13m

2024-05-27 16:15:18 UTC

I’m not sure if you are referencing it intentionally or not, but some chefs (Gordon Ramsey for one) will ask an interviewee to make some scrambled eggs; something not super niche or specialized but enough to see what their technique is.

It is a sort of “interview hack” example that’s been used to emphasize the idea of a simple unspecialized skill-test that went around a while ago. I guess upcoming chefs probably practice egg scrambling nowadays, ruining the value of the test. But maybe they could ask to make a bit of steak now.

jancsika

4 replies

3h31m

2024-05-27 14:57:03 UTC

That's great, but it's really just desiderata about you and your personal situation.

E.g., if a HN'er takes this as advice they're just as likely to be gated by some other interviewer who interprets hedging as a smell.

I believe the posters above are essentially saying: you, the interviewer, can take the 2.5 seconds to ask the follow up, "... and if we're not immediately optimizing for scalability?" Then take that data into account when doing your assessment instead of attempting to optimize based on a single gate.

Edit: clarification

coffeebeqn

2 replies

3h15m

2024-05-27 15:13:39 UTC

This is the crux of it. Another interviewer would’ve marked “run on a local machine with a big SSD” - as: this fool doesn’t know enough about distributed systems and just runs toy projects on one machine

dartos

1 replies

3h3m

2024-05-27 15:25:27 UTC

That is what I think interviewers think when I don’t immediately bring up kubernetes and sqs in an architecture interview

theamk

0 replies

11m

2024-05-27 18:17:37 UTC

depending on the shop? For some kinds of tasks, jumping to kubernets right away would be a minus during interview.

antisthenes

0 replies

2h26m

2024-05-27 16:02:10 UTC

E.g., if a HN'er takes this as advice they're just as likely to be gated by some other interviewer who interprets hedging as a smell.

If people in high stakes environments interpret hedging as a smell - run from that company as fast as you can.

Hedging is a natural adult reasoning process. Do you really want to work with someone who doesn't understand that?

llm_trw

0 replies

3h21m

2024-05-27 15:07:06 UTC

I once killed the deployment of a big data team in a large bank when I laid out in excruciating details exactly what they'd have to deal with during an interview.

Last I heard theyd promoted one unix guy on the inside to baby sit a bunch of chron jobs on the biggest server they could find.

valenterry

0 replies

2h35m

2024-05-27 15:53:03 UTC

It doesn't matter. The answer should be "It depends, what are the circumstances - do we expect high growth in the future? Is it gonna stay around 6TB? How and by whom will it be used and what for?"

Or, if you can guess what the interviewer is aiming for, state the assumption and go from there "If we assume it's gonna stay at <10TB for the next couple of years or even longer, then..."

Then the interviewer can interrupt and change the assumptions to his needs.

layer8

0 replies

1h8m

2024-05-27 17:20:05 UTC

You shouldn’t guess what they expect, you should say what you think is right, and why. Do you want to work at a company where you would fail an interview due to making a correct technical assessment? And even if the guess is right, as an interviewer I would be more impressed by an applicant that will give justified reasons for a different answer than what I expected.

drubio

0 replies

1h19m

2024-05-27 17:08:44 UTC

It's almost a law "all technical discussions devolve into interview mind games", this industry has a serious interview/hiring problem.

mrtimo

22 replies

4h50m

2024-05-27 13:38:27 UTC

.parquet files are completely underrated, many people still do not know about the format!

.parquet preserves data types (unlike CSV)

They are 10x smaller than CSV. So 600GB instead of 6TB.

They are 50x faster to read than CSV

They are an "open standard" from Apache Foundation

Of course, you can't peek inside them as easily as you can a CSV. But, the tradeoffs are worth it!

Please promote the use of .parquet files! Make .parquet files available for download everywhere .csv is available!

riku_iki

6 replies

3h1m

2024-05-27 15:26:46 UTC

They are 50x faster to read than CSV

I actually benchmarked this and duckdb CSV reader is faster than parquet reader.

wenc

3 replies

2h18m

2024-05-27 16:09:54 UTC

I would love to see the benchmarks. That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

CSV underperforms in almost every other domain, like joins, aggregations, filters. Parquet lets you do that lazily without reading the entire Parquet dataset into memory.

riku_iki

2 replies

2h7m

2024-05-27 16:21:04 UTC

That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

Yes, I think duckdb only reads CSV, then projects necessary data into internal format (which is probably more efficient than parquet, again based on my benchmarks), and does all ops (joins, aggregations) on that format.

wenc

1 replies

1h56m

2024-05-27 16:32:02 UTC

Yes, it does that, assuming you read in the entire CSV, which works for CSVs that fit in memory.

With Parquet you almost never read in the entire dataset and it's fast on all the projections, joins, etc. while living on disk.

riku_iki

0 replies

1h48m

2024-05-27 16:40:18 UTC

which works for CSVs that fit in memory.

what? Why CSV is required to fit in memory in this case? I tested CSVs which are far larger than memory, and it works just fine.

xnx

1 replies

2h16m

2024-05-27 16:12:22 UTC

For how many rows?

riku_iki

0 replies

2h12m

2024-05-27 16:16:33 UTC

10B

thesz

5 replies

4h28m

2024-05-27 13:59:48 UTC

Parquet is underdesigned. Some parts of it do not scale well.

I believe that Parquet files have rather monolithic metadata at the end and it has 4G max size limit. 600 columns (it is realistic, believe me), and we are at slightly less than 7.2 millions row groups. Give each row group 8K rows and we are limited to 60 billion rows total. It is not much.

The flatness of the file metadata require external data structures to handle it more or less well. You cannot just mmap it and be good. This external data structure most probably will take as much memory as file metadata, or even more. So, 4G+ of your RAM will be, well, used slightly inefficiently.

(block-run-mapped log structured merge tree in one file can be as compact as parquet file and allow for very efficient memory mapped operations without additional data structures)

Thus, while parqet is a step, I am not sure it is a step in definitely right direction. Some aspects of it are good, some are not that good.

maxnevermind

0 replies

45m

2024-05-27 17:42:43 UTC

7.2 millions row groups

Why would you need 7.2 mil row groups?

Row group size when stored in HDFS is usually equal to HDFS bock size by default, which is 128MB

7.2 mil * 128MB ~ 1PB

You have a single parquet file 1PB in size?

imiric

0 replies

3h54m

2024-05-27 14:33:52 UTC

What format would you recommend instead?

datadeft

0 replies

4h16m

2024-05-27 14:12:31 UTC

Nobody is forcing you to use a single Parquet file.

apwell23

0 replies

4h12m

2024-05-27 14:16:12 UTC

some critiques of parquet by andy pavlo

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf

Renaud

0 replies

3h54m

2024-05-27 14:34:01 UTC

Parquet is not a database, it's a storage format that allows efficient column reads so you can get just the data you need without having to parse and read the whole file.

Most tools can run queries across parquet files.

Like everything, it has its strengths and weaknesses, but in most cases, it has better trade-offs over CSV if you have more than a few thousand rows.

jjgreen

3 replies

2h37m

2024-05-27 15:51:03 UTC

Please promote the use of .parquet files!

  apt-cache search parquet
  <nada>

Maybe later

seabass-labrax

2 replies

1h44m

2024-05-27 16:44:24 UTC

Parquet is a file format, not a piece of software. 'apt install csv' doesn't make any sense either.

jjgreen

0 replies

53m

2024-05-27 17:35:32 UTC

There is no support for parquet in Debian, by contrast

  apt-cache search csv | wc -l
  259

fhars

0 replies

50m

2024-05-27 17:37:56 UTC

If you want to shine with snide remarks, you should at least understand the point being made:

    $ apt-cache search csv | wc -l
    225
    $ apt-cache search parquet | wc -l
    0

sph

2 replies

4h43m

2024-05-27 13:45:33 UTC

Third consecutive time in 86 days that you mention .parquet files. I am out of my element here, but it's a bit weird

ok_computer

0 replies

4h27m

2024-05-27 14:00:51 UTC

Sometimes when people discover or extensively use something they are eager to share in contexts they think are relevant. There is an issue when those contexts become too broad.

3 times across 3 months is hardly astroturfing for big parquet territory.

fifilura

0 replies

4h36m

2024-05-27 13:51:55 UTC

FWIW I am the same. I tend to recommend BigQuery and AWS/Athena in various posts. Many times paired with Parquet.

But it is because it makes a lot of things much simpler, and that a lot of people have not realized that. Tooling is moving fast in this space, it is not 2004 anymore.

His arguments are still valid and 86 days is a pretty long time.

ddalex

1 replies

4h31m

2024-05-27 13:56:58 UTC

Why is .parquet better than protobuf?

sdenton4

0 replies

4h16m

2024-05-27 14:11:43 UTC

Parquet is columnar storage, which is much faster for querying. And typically for protobuf you deserialize each row, which has a performance cost - you need to deserialize the whole message, and can't get just the field you want.

So, of you want to query a giant collection of protobufs, you end up reading and deserializing every record. For parquet, you get much closer to only reading what you need.

mattbillenstein

15 replies

6h26m

2024-05-27 12:02:35 UTC

I can appreciate the vertical scaling solution, but to be honest, this is the wrong solution for almost all use cases - consumers of the data don't want awk, and even if they did, spooling over 6TB for every kinda of query without partitioning or column storage is gonna be slow on a single cpu - always.

I've generally liked BigQuery for this type of stuff - the console interface is good enough for ad-hoc stuff, you can connect a plethora of other tooling to it (Metabase, Tableau, etc). And if partitioned correctly, it shouldn't be too expensive - add in rollup tables if that becomes a problem.

__alexs

5 replies

6h9m

2024-05-27 12:19:29 UTC

A moderately powerful desktop processor has memory bandwidth of over 50TB/s so yeah it'll take a couple of minutes sure.

fijiaarone

3 replies

5h45m

2024-05-27 12:42:45 UTC

The slow part of using awk is waiting for the disk to spin over the magnetic head.

And most laptops have 4 CPU cores these days, and a multiprocess operating system, so you don’t have to wait for random access on a spinning plate to find every bit in order, you can simply have multiple awk commands running in parallel.

Awk is most certainly a better user interface than whatever custom BrandQL you have to use in a textarea in a browser served from localhost:randomport

Androider

1 replies

4h16m

2024-05-27 14:12:03 UTC

The slow part of using awk is waiting for the disk to spin over the magnetic head.

If we're talking about 6 TB of data:

- You can upgrade to 8 TB of storage on a 16-inch MacBook Pro for $2,200, and the lowest spec has 12 CPU cores. With up to 400 GB/s of memory bandwidth, it's truly a case of "your big data problem easily fits on my laptop".

- Contemporary motherboards have 4 to 5 M.2 slots, so you could today build a 12 TB RAID 5 setup of 4 TB Samsung 990 PRO NVMe drives for ~ 4 x $326 = $1,304. Probably in a year or two there will be 8 TB NVMe's readily available.

Flash memory is cheap in 2024!

bewaretheirs

0 replies

1h16m

2024-05-27 17:12:36 UTC

You can go further.

There are relatively cheap adapter boards which let you stick 4 M.2 drives in a single PCIe x16 slot; you can usually configure a x16 slot to be bifurcated (quadfurcated) as 4 x (x4).

To pick a motherboard at quasi-random:

Tyan HX S8050. Two M.2 on the motherboard.

20 M.2 drives in quadfurcated adapter cards in the 5 PCIe x16 slots

And you can connect another 6 NVMe x4 devices to the MCIO ports.

You might also be able to hook up another 2 to the SFF-8643 connectors.

This gives you a grand total of 28-30 x4 NVME devices on one not particularly exotic motherboard, using most of the 128 regular PCIe lanes available from the CPU socket.

hnfong

0 replies

3h11m

2024-05-27 15:17:26 UTC

I haven't been using spinning disks for perf critical tasks for a looong time... but if I recall correctly, using multiple processes to access the data is usually counter-productive since the disk has to keep repositioning its read heads to serve the different processes reading from different positions.

Ideally if the data is laid out optimally on the spinning disk, a single process reading the data would result in a mostly-sequential read with much less time wasted on read head repositioning seeks.

In the odd case where the HDD throughput is greater than a single-threaded CPU processing for whatever reason (eg. you're using a slow language and complicated processing logic?), you can use one optimized process to just read the raw data, and distribute the CPU processing to some other worker pool.

dahart

0 replies

2h50m

2024-05-27 15:38:11 UTC

Running awk on an in-memory CSV will come nowhere even close to the memory bandwidth your machine is capable of.

Stranger43

2 replies

4h19m

2024-05-27 14:09:12 UTC

And here we see this strange thing that data science people does in forgetting that 6TB is small change for any SQL server worth it's salt.

Just dump it into Oracle, postgre, mssql, or mysql and be amazed by the kind of things you can do with 30year old data analysis technology on an modern computer.

apwell23

1 replies

3h24m

2024-05-27 15:04:08 UTC

you wouldn't have been a 'winner' per OP. real answer is loading it on their phones not on sqlserver or whatever.

Stranger43

0 replies

2h38m

2024-05-27 15:49:54 UTC

To be honest OP is kind of making the same mistake in assuming that the only real alternatives is "new data science products" and old school scripting exists as valuable tools.

The extend people goes to to not recognize how much the people creating the SQL language and the relational database engines we now take for granted actually knew what they were doing, are a bit of an mystery to me.

The right answer to any query that can be defined in SQL is pretty much always an SQL engine even if it's just sqlite running on an laptop. But somehow people seems to keep comming up with reasons not to use SQL.

pyrale

1 replies

4h37m

2024-05-27 13:50:49 UTC

Once you understand that 6tb fits on a hard drive, you can just as well put it in a run-of-the-mill pg instance, which metabase will reference just as easily. Hell, metabase is fine with even a csv file...

crowcroft

0 replies

3h34m

2024-05-27 14:53:54 UTC

I worked in a large company that had a remote desktop instance with 256gb ram running a PG instance that analysts would log in to to do analysis. I used to think it was a joke of setup for such a large company.

I later moved to a company with a fairly sophisticated setup with Databricks. While Databricks offered some QoL improvements, it didn't magically make all my queries run quickly, and it didn't allow me anything that I couldn't have done on the remote desktop setup.

kjkjadksj

1 replies

5h14m

2024-05-27 13:14:28 UTC

Hes hiring data scientists not building a service though. This might realistically be a one off analysis for those 6tb. At which point you are happy your data scientists has returned statistical information instead of spending another week making sure the pipeline works if someone puts a greek character in a field.

data-ottawa

0 replies

1h24m

2024-05-27 17:03:50 UTC

Even if I'm doing a one off, depending on the task it can be easier/faster/more reliable to load 6TiB into a big query table than waiting hours for some task to complete and fiddling with parallelism and memory management.

It's a couple hundred bucks a month and $36 to query the entire dataset, after partitioning thats not terrible.

ryguyrg

0 replies

3h1m

2024-05-27 15:27:29 UTC

you can scale vertically with a much better tech than awk.

enter duckdb with columnar vectorized execution and full SQL support. :-)

disclaimer: i work with the author at motherduck and we make a data warehouse powered by duckdb

fifilura

0 replies

5h46m

2024-05-27 12:42:35 UTC

I agree with this. BigQuery or AWS s3/Athena.

You shouldn't have to set up a cluster for data jobs these days.

And it kind of points out the reason for going with a data scientist with the toolset he has in mind instead of optimizing for a commandline/embedded programmer.

The tools will evolve in the direction of the data scientist, while the embedded approach is a dead end in lots of ways.

You may have outsmarted some of your candidates, but you would have hired a person not suited for the job long term.

boppo1

13 replies

6h57m

2024-05-27 11:31:40 UTC

You have 6 TiB of ram?

qaq

2 replies

6h26m

2024-05-27 12:01:46 UTC

You can have 8TB RAM in a 2U box for under 100K. grab a couple and it will save you millions a year compared to over-engineered bigdata setup.

apwell23

1 replies

5h34m

2024-05-27 12:53:52 UTC

Bigquery and snowflake are software. They come with a sql engine, data governance, integration with your ldap, auditing. Loading data into snowflake isn't overegineering. What you described is over-engineering.

No business is passing 6tb data around on their laptops.

qaq

0 replies

3h37m

2024-05-27 14:51:10 UTC

So is ClickHouse your point being ? Please point out what a server being able to have 8TB of RAM has to do with laptops.

ninkendo

1 replies

6h53m

2024-05-27 11:35:42 UTC

You don’t need that much ram to use mmap(2)

marginalia_nu

0 replies

6h7m

2024-05-27 12:20:44 UTC

To be fair, mmap doesn't put your data in RAM, it presents it as though it was in RAM and has the OS deal with whether or not it actually is.

david_allison

1 replies

6h19m

2024-05-27 12:09:27 UTC

https://yourdatafitsinram.net/

compressedgas

0 replies

6h6m

2024-05-27 12:22:14 UTC

Was posted as https://news.ycombinator.com/item?id=9581862 in 2015

cess11

1 replies

6h49m

2024-05-27 11:38:50 UTC

The "(multiple times)" part probably means batching or streaming.

But yeah, they might have that much RAM. At a rather small company I was at we had a third of it in the virtualisation cluster. I routinely put customer databases in the hundreds of gigabytes into RAM to do bug triage and fixing.

kmarc

0 replies

6h39m

2024-05-27 11:49:20 UTC

Indeed, what I meant to say is that you can load it in multiple batches. However, now thinking, I did play around with servers of TiBs of memory :-)

vitus

0 replies

6h37m

2024-05-27 11:51:19 UTC

If you're one of the public clouds targeting SAP use cases, you probably have some machines with 12TB [0, 1, 2].

[0] https://aws.amazon.com/blogs/aws/now-available-amazon-ec2-hi...

[1] https://cloud.google.com/blog/products/sap-google-cloud/anno...

[2] https://azure.microsoft.com/en-us/updates/azure-mv2-series-v...

lizknope

0 replies

3h39m

2024-05-27 14:49:28 UTC

I personally don't but our computer cluster at work as around 50,000 CPU cores. I can request specific configurations through LSF and there are at least 100 machines with over 4TB RAM and that was 3 years ago. By now there are probably machines with more than that. Those machines are usually reserved for specific tasks that I don't do but if I really needed it I could get approval.

chx

0 replies

6h21m

2024-05-27 12:07:30 UTC

If my business depended on it? I can click a few buttons and have a 8TiB Supermicro server on my doorstep in a few days if I wanted to colo that. EC2 High Memory instances offer 3, 6, 9, 12, 18, and 24 TiB of memory in an instance if that's the kind of service you want. Azure Mv2 also does 2850 - 11400GiB.

So yes, if need to be, I have 6 TiB of RAM.

bluedino

0 replies

5h0m

2024-05-27 13:27:48 UTC

We are decomming our 5-year old 4TB systems this year and could have been ordered with more

the_real_cher

10 replies

6h3m

2024-05-27 12:25:13 UTC

How would six terabytes fit into memory?

It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.

I'm not really a data scientist and I've never worked on data that size so I'm probably wrong.

coldtea

5 replies

5h54m

2024-05-27 12:33:56 UTC

How would six terabytes fit into memory?

What device do you have in mind? I've seen places use 2TB RAM servers, and that was years ago, and it isn't even that expensive (can get those for about $5K or so).

Currently HP allows "up to 48 DIMM slots which support up to 6 TB for 2933 MT/s DDR4 HPE SmartMemory".

Close enough to fit the OS, the userland, and 6 TiB of data with some light compression.

It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.

Why would you have "disorganized data"? Or "multiple processes" for that matter? The OP mentions processing the data with something as simple as awk scripts.

the_real_cher

2 replies

4h58m

2024-05-27 13:29:53 UTC

I mean if you're doing data science the data is not always organized and of course you would want multi-processing.

1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.

coldtea

1 replies

4h21m

2024-05-27 14:07:39 UTC

I mean if you're doing data science the data is not always organized and of course you would want multi-processing

Not necessarily - I might not want it or need it. It's a few TB, it can be on a fast HD, on an even faster SSD, or even in memory. I can crunch them quite fast even with basic linear scripts/tools.

And organized could just mean some massaging or just having them in csv format.

This is already the same rushed notions about "needing this" and "must have that" that the OP describes people jumping to, that leads them to suggest huge setups, distributed processing, multi-machine infrastructure, for use cases and data sizes that could fit on a single server with redundancy and be done it.

DHH has often written about this for their Basecamp needs (scalling vertically where others scale horizontally having worked for them for most of their operation), there's also this classic post: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.

Not that specialized, I've work with server deployments (HP) with 1, 1.5 and 2TB RAM (and > 100 cores), it's trivial to get.

And 5 or even 30 grand would still be cheaper (and more effective and simpler) than the "big data" setups some of those candidates have in mind.

the_real_cher

0 replies

3h2m

2024-05-27 15:25:44 UTC

Yeah I agree about over engineering.

Im just trying to understand the parent to my original comment.

How would running awk for analysis on 6TB of data work quickly and efficiently?

They say it would go into memory but its not clear to me how that would work as would still have paging and thrashing issues if the data didnt have often used sections of the data.

am I overthinking it and they were they just referring to buying a big ass Ram machine?

fijiaarone

1 replies

5h36m

2024-05-27 12:51:53 UTC

“How would six terabytes fit into memory?”

A better question would be:

Why would anyone stream 6 terabytes of data over the internet?

In 2010 the answer was: because we can’t fit that much data in a single computer, and we can’t get accounting or security to approve a $10k purchase order to build a local cluster, so we need to pay Amazon the same amount every month to give our ever expanding DevOps team something to do with all their billable hours.

That may not be the case anymore, but our devops team is bigger than ever, and they still need something to do with their time.

the_real_cher

0 replies

4h48m

2024-05-27 13:40:32 UTC

Well yeah streaming to the cloud to work around budget issues is a while nother convo haha.

allanbreyes

1 replies

5h49m

2024-05-27 12:38:46 UTC

There are machines that can fit that and more: https://yourdatafitsinram.net/

I'm not advocating that this is generally a good or bad idea, or even economical, but it's possible.

the_real_cher

0 replies

4h59m

2024-05-27 13:29:26 UTC

I'm trying to understand what the person I'm replying to had in mind when they said fit six terabytes in memory and search with awk.

is this what they were referring to just by a big ass Ram machine?

jandrewrogers

0 replies

2h29m

2024-05-27 15:59:06 UTC

6 TB does not fit in memory. However, with a good storage engine and fast storage this easily fits within the parameters of workloads that have memory-like performance. The main caveat is that if you are letting the kernel swap that for you then you are going to have a bad day, it needs to be done in user space to get that performance which constrains your choices.

capitol_

0 replies

4h24m

2024-05-27 14:03:53 UTC

It would easy fit in ram: https://yourdatafitsinram.net/

randomtoast

9 replies

5h40m

2024-05-27 12:47:47 UTC

Now, you have to consider the cost it takes for you whole team to learn how to use AWK instead of SQL. Then you do these TCO calculations and revert back to the BigQuery solution.

tomrod

4 replies

5h30m

2024-05-27 12:57:45 UTC

About $20/month for chatgpt or similar copilot, which really they should reach for independently anyhow.

randomtoast

2 replies

5h0m

2024-05-27 13:28:29 UTC

And since the data scientist cannot verify the very complex AWK output that should be 100% compatible with his SQL query, he relies on the GPT output for business-critical analysis.

tomrod

1 replies

4h24m

2024-05-27 14:04:16 UTC

Only if your testing frameworks are inadequate. But I belive you could be missing or mistaken on how code generation successfully integrates into a developer and data scientist's work flow.

Why not take a few days to get familiar with AWK, a skill which will last a lifetime? Like SQL, it really isn't so bad.

randomtoast

0 replies

4h10m

2024-05-27 14:18:31 UTC

It is easier to write complex queries in SQL instead of AWK. I know both AWK and SQL, and I find SQL much easier for complex data analysis, including JOINS, subqueries, window functions, etc. Of course, your mileage may vary, but I think most data scientists will be much more comfortable with SQL.

elicksaur

0 replies

2h50m

2024-05-27 15:38:36 UTC

Many people have noted how when using LLMs for things like this, the person’s ultimate knowledge of the topic is less than it would’ve otherwise been.

This effect then forces the person to be reliant on the LLM for answering all questions, and they’ll be less capable of figuring out more complex issues in the topic.

$20/mth is a siren’s call to introduce such a dependency to critical systems.

kjkjadksj

1 replies

5h12m

2024-05-27 13:15:50 UTC

For someone who is comfortable with sql we are talking minutes to hours to figure out awk well enough to see how its used or use it.

noisy_boy

0 replies

2h50m

2024-05-27 15:38:17 UTC

It is not only about whether people can figure it out awk. It is also about how supportable the solution is. SQL provides many features specifically to support complex querying and is much more accessible to most people - you can't reasonably expect your business analysts to do complex analysis using awk.

Not only that, it provides a useful separation from the storage format so you can use it to query a flat file exposed as table using Apache Drill or a file on s3 exposed by Athena or data in an actual table stored in a database and so on. The flexibility is terrific.

clwg

0 replies

5h15m

2024-05-27 13:13:26 UTC

Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then, based on something like a queue, cron, or inotify, other tasks occur, such as processing and database logging. You still end up at the same place, and this approach works really well with tools like jq when the raw data is in jsonl format.

The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.

RodgerTheGreat

0 replies

4h50m

2024-05-27 13:38:37 UTC

With the exception of regexes- which any programmer or data analyst ought to develop some familiarity with anyway- you can describe the entirety of AWK on a few sheets of paper. It's a versatile, performant, and enduring data-handling tool that is already installed on all your servers. You would be hard-pressed to find a better investment in technical training.

geraldwhen

5 replies

7h2m

2024-05-27 11:26:28 UTC

I ask a similar question on screens. Almost no one gives a good answer. They describe elaborate architectures for data that fits in memory, handily.

mcny

3 replies

6h42m

2024-05-27 11:45:47 UTC

I think that’s the way we were taught in college / grad school. If the premise of the class is relational databases, the professor says, for the purpose of this course, assume the data does not fit in memory. Additionally, assume that some normalization is necessary and a hard requirement.

Problem is most students don’t listen to the first part “for the purpose of this course”. The professor does not elaborate because that is beyond the scope of the course.

kmarc

1 replies

6h37m

2024-05-27 11:50:56 UTC

FWIW if they were juniors, I would've continued the interview and direct them with further questions, and observer their flow of thinking to decide if they are good candidates to pursue further.

But no, this particular person had been working professionally for decades (in fact, he was much older than me).

geraldwhen

0 replies

57m

2024-05-27 17:31:39 UTC

Yeah. I don’t even bother asking juniors this. At that level I expect that training will be part of the job, so it’s not a useful screener.

acomjean

0 replies

4h44m

2024-05-27 13:44:38 UTC

I took a Hadoop class. We learned hadoop and were told by the instructor we probably wouldn’t’t need it, and learned some other Java processing techniques (streams etc)

Joel_Mckay

0 replies

6h23m

2024-05-27 12:05:34 UTC

People can always find excuses to boot candidates.

I would just back-track from a shipped product date, and try to guess who we needed to get there... given the scope of requirements.

Generally, process people from a commercially "institutionalized" role are useless for solving unknown challenges. They will leave something like an SAP, C#, or MatLab steaming pile right in the middle of the IT ecosystem.

One could check out Aerospike rather than try to write their own version (the dynamic scaling capabilities are very economical once setup right.)

Best of luck, =3

chx

5 replies

6h20m

2024-05-27 12:08:25 UTC

https://x.com/garybernhardt/status/600783770925420546 (Gary Bernhardt of WAT fame):

Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

This is from 2015...

crowcroft

3 replies

3h44m

2024-05-27 14:43:57 UTC

I wonder if it's fair to revise this to 'your data set fits on NVME drives' these days. Astonishing how fast and how much storage you can get these days.

xethos

1 replies

2h16m

2024-05-27 16:12:20 UTC

Based on a very brief search: Samsung's fastest NVME drives [0] could maybe keep up with the slowest DDR2 [1]. DDR5 is several orders of magnitude faster than both [2]. Maybe in a decade you can hit 2008 speeds, but I wouldn't consider updating the phrase before then (and probably not after, either).

[0] https://www.tomshardware.com/reviews/samsung-980-m2-nvme-ssd...

[1] https://www.tomshardware.com/reviews/ram-speed-tests,1807-3....

[2] https://en.wikipedia.org/wiki/DDR5_SDRAM

dralley

0 replies

1h30m

2024-05-27 16:57:57 UTC

The statement was "fits on", not "matches the speed of".

fbdab103

0 replies

2h21m

2024-05-27 16:07:26 UTC

You can always check available ram: https://yourdatafitsinram.net/

RandomCitizen12

0 replies

4h22m

2024-05-27 14:06:33 UTC

https://yourdatafitsinram.net/

apwell23

5 replies

5h38m

2024-05-27 12:49:53 UTC

What kind of business just has a static set of 6TiB data that people are loading on their laptops.

You tricked candidates with your nonsensical scenario. Hate smartass interviewers like this that are trying some gotcha to feel smug about themselves.

Most candidates don't feel comfortable telling ppl 'just load on your laptops' even if they think thats sensible. They want to present a 'professional solution', esp when you tricked them with the word 'stack'. which is how most of them prbly perceived your trick question.

This comment is so infuriating to me. Why be assholes to each other when world is already full of them.

tomrod

1 replies

4h38m

2024-05-27 13:50:01 UTC

I disagree with your take. Your surly rejoinder aside, the parent commenter identifies an area where senior level knowledge and process appropriately assess a problem. Not every job interview is satisfying checklist of prior experience or training, but rather assessing how well that skillset will fit the needed domain.

In my view, it's an appropriate question.

apwell23

0 replies

4h19m

2024-05-27 14:09:24 UTC

What did you gather as 'needed domain' from that comment. 'needed domain' is often implicit, its not a blank slate. candidates assume all sorts of 'needed domain' even before the interview starts, if i am interviewing at bank I wouldn't suggest 'load it on your laptops' as my 'stack'.

OP even mentioned that it his favorite 'tricky question' . It would def trick me because they used the word 'stack' which has specific meaning in the industry. There are even websites dedicated to 'stack's https://stackshare.io/instacart/instacart

yxwvut

0 replies

4h21m

2024-05-27 14:07:10 UTC

Well put. Whoever asked this question is undoubtedly a nightmare to work with. Your data is the engine that drives your business and its margin improvements, so why hamstring yourself with a 'clever' cost saving but ultimately unwieldy solution that makes it harder to draw insight (or build models/pipelines) from?

Penny wise and pound foolish, plus a dash of NIH syndrome. When you're the only company doing something a particular way (and you're not Amazon-scale), you're probably not as clever as you think.

pizzafeelsright

0 replies

1h44m

2024-05-27 16:44:25 UTC

Big data companies or those that work with lots of data.

The largest dataset I worked with was about 60TB

While that didn't fit in ram most people would just load the sample data into the cluster when I told them it would be faster to load 5% locally and work off that.

marcosdumay

0 replies

3h57m

2024-05-27 14:31:20 UTC

What kind of business just has a static set of 6TiB data that people are loading on their laptops.

Most business have static sets of data that people load on their PCs. (Why do you assume laptops?)

The only weird part of that question is that 6TiB is so big it's not realistic.

sfilipco

3 replies

6h47m

2024-05-27 11:41:21 UTC

I agree that keeping data local is great and should be the first option when possible. It works great on 10GB or even 100GB, but after that starts to matter what you optimize for because you start seeing execution bottlenecks.

To mitigate these bottlenecks you get fancy hardware (e.g oracle appliance) or you scale out (and get TCO/performance gains from separating storage and compute - which is how Snowflake sold 3x cheaper compared to appliances when they came out).

I believe that Trino on HDFS would be able to finish faster than awk on 6 enterprise disks for 6TB data.

In conclusion I would say that we should keep data local if possible but 6TB is getting into the realm where Big Data tech starts to be useful if you do it a lot.

nottorp

1 replies

6h7m

2024-05-27 12:21:11 UTC

I agree that keeping data local is great and should be the first option when possible. It works great on 10GB or even 100GB, but after that starts to matter what you optimize for because you start seeing execution bottlenecks.

The point of the article is 99.99% of businesses never pass even the 10 Gb point though.

sfilipco

0 replies

5h32m

2024-05-27 12:55:44 UTC

I agree with the theme of the article. My reply was to parent comment which has a 6 TB working set.

hectormalot

0 replies

2024-05-27 18:22:21 UTC

I wouldn't underestimate how much a modern machine with a bunch of RAM and SSDs can do vs HDFS. This post[1] is now 10 years old and has find + awk running an analysis in 12 seconds (at speed roughly equal to his hard drive) vs Hadoop taking 26 minutes. I've had similar experiences with much bigger datasets at work (think years of per-second manufacturing data across 10ks of sensors).

I get that that post is only on 3.5GB, but, consumer SSDs are now much faster at 7.5GB/s vs 270MB/s HDD back when the article was written. Even with only mildly optimised solutions, people are churning through the 1 billion rows (±12GB) challenge in seconds as well. And, if you have the data in memory (not impossible) your bottlenecks won't even be reading speed.

[1]: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

rqtwteye

3 replies

3h33m

2024-05-27 14:55:34 UTC

Plenty of people get offended if you tell them that their data isn’t really “big data”. A few years ago I had a discussion with one of my directors about a system IT had built for us with Hadoop, API gateways, multiple developers and hundreds of thousands of yearly cost. I told him that at our scale (now and any foreseeable future) I could easily run the whole thing on a USB drive attached to his laptop and a few python scripts. He looked really annoyed and I was never involved again with this project.

I think it’s part of the BS cycle that’s prevalent in companies. You can’t admit that you are doing something simple.

noisy_boy

2 replies

3h1m

2024-05-27 15:27:34 UTC

In most non-tech companies, it comes down to the motive of the manager and in most cases it is expansion of reporting line and grabbing as much budget as possible. Using "simple" solutions runs counter to this central motivation.

eloisant

0 replies

47m

2024-05-27 17:41:42 UTC

- the manager wants expansion

- the developers want to get experience in a fancy stack to build up their resume

Everyone benefits from the collective hallucination

disqard

0 replies

2h40m

2024-05-27 15:48:15 UTC

This is also true of tech companies. Witness how the "GenAI" hammer is being used right now at MS, Google, Meta, etc.

filleokus

2 replies

5h30m

2024-05-27 12:58:21 UTC

I think I've written about it here before, but I imported ≈1 TB of logs into DuckDB (which compressed it to fit in RAM of my laptop) and was done with my analysis before the data science team had even ingested everything into their spark cluster.

(On the other hand, I wouldn't really want the average business analyst walking around with all our customer data on their laptops all the time. And by the time you have a proper ACL system with audit logs and some nice way to share analyses that updates in real time as new data is ingested, the Big Data Solution™ probably have a lower TCO...)

riku_iki

0 replies

2h54m

2024-05-27 15:34:27 UTC

you probably didn't do joins for example on your dataset, because DuckDB is OOMing on them if they don't fit memory.

marcosdumay

0 replies

4h47m

2024-05-27 13:41:18 UTC

And by the time you have ... the Big Data Solution™ probably have a lower TCO...

I doubt it. The common Big Data Solutions manage to have a very high TCO, where the least relevant share is spent on hardware and software. Most of its cost comes from reliability engineering and UI issues (because managing that "proper ACL" that doesn't fit your business is a hell of a problem that nobody will get right).

thunky

1 replies

5h27m

2024-05-27 13:00:51 UTC

requirements of "6 TiB of data"

How could anyone answer this without knowing how the data is to be used (query patterns, concurrent readers, writes/updates, latency, etc)?

Awk may be right for some scenarios, but without specifics it can't be a correct answer.

marginalia_nu

0 replies

4h4m

2024-05-27 14:24:05 UTC

Those are very appropriate follow up questions I think. If someone tasks you to deal with 6 TiB of data, it is very appropriate to ask enough questions until you can provide a good solution, far better than to assume the questions are unknowable and blindly architect for all use cases.

throwaway_20357

1 replies

5h33m

2024-05-27 12:54:52 UTC

It depends on what you want to do with the data. It can be easier to just stick nicely-compressed columnar Parquets in S3 (and run arbitrarily complex SQL on them using Athena or Presto) than to try to achieve the same with shell-scripting on CSVs.

fock

0 replies

5h6m

2024-05-27 13:22:32 UTC

how exactly is this solution easier than putting the very Parquet files on a classic filesystem. Why does the easy solution require an amazon-subscription?

rgrieselhuber

1 replies

3h59m

2024-05-27 14:28:43 UTC

This is a great test / question. More generally, it tests knowledge with basic linux tooling and mindset as well as experience level with data sizes. 6TiB really isn't that much data these days, depending on context and storage format, etc. of course.

deepsquirrelnet

0 replies

3h23m

2024-05-27 15:04:47 UTC

It could be a great question if you clarify the goals. As it stands it’s “here’s a problem, but secretly I have hidden constraints in my head you must guess correctly”.

The OPs desired solution could have been found from probably some of those other candidates if asked “here is the challenge, solve in most McGuyver way possible”. Because if you change the second part, the correct answer changes.

“Here is a challenge, solve in the most accurate, verifiable way possible”

“Here is a challenge, solve in a way that enables collaboration”

“Here is a challenge, 6TiB but always changing”

^ These are data science questions much more than the question he was asking. The answer in this case is that you’re not actually looking for a data scientist.

marginalia_nu

1 replies

6h12m

2024-05-27 12:16:00 UTC

Problem is possibly that most people with that sort of hands-on intuition for data don't see themselves as data scientists and wouldn't apply for such a position.

It's a specialist role, and most people with the skills you seek are generalists.

deepsquirrelnet

0 replies

3h38m

2024-05-27 14:50:27 UTC

Yeah it’s not really what you should be hiring a data scientist to do. I’m of the opinion that if you don’t have a data engineer, you probably don’t need a data scientist. And not knowing who you need for a job causes a lot of confusion in interviews.

wslh

0 replies

6h31m

2024-05-27 11:57:17 UTC

In my context 99% of the problem is the ETL, nothing to do with complex technology. I see people stuck when they need to get this from different sources in different technologies and/or APIs.

wg0

0 replies

3h53m

2024-05-27 14:35:03 UTC

I have lived through the hype of Big data it was a time of HDFS+HTable I guess and Hapoop etc.

One can't go wrong with DuckDB+SQLite+Open/Elasticsearch either with 6 to 8 even 10 TB of data.

[0]. https://duckdb.org/

torginus

0 replies

2h36m

2024-05-27 15:52:21 UTC

It's astonishing how shit the cloud is compared to boring-ass pedestrian technology.

For example, just logging stuff into a large text file is so much easier, performant and searchable that using AWS CloudWatch, presumably written by some of the smartest programmers who ever lived.

On another note I was once asked to create a big data-ish object DB, and me, knowing nothing about the domain, and a bit of benchmarking, decided to just use zstd-compressed json streams with a separate index in an sql table. I'm sure any professional would recoil at it in horror, but it could do literally gigabytes/sec retrieval or deserialization on consumer grade hardware.

tonetegeatinst

0 replies

20m

2024-05-27 18:08:13 UTC

I'm not even in data science, but I am a slight data hoarder. And heck even I'd just say throw that data on a drive and have a backup in the cloud and on a cold hard drive.

rr808

0 replies

5h52m

2024-05-27 12:36:35 UTC

If you look at the article the data space is more commonly 10GB which matches my experience. For these sizes definitely simple tools are enough.

michaelcampbell

0 replies

3h42m

2024-05-27 14:45:57 UTC

My smartphone cannot store 1TiB. <shrug>

lizknope

0 replies

3h34m

2024-05-27 14:54:27 UTC

I'm on some reddit tech forums and people will say "I need help storing a huge amount of data!" and people start offering replies for servers that store petabytes.

My question is always "How much data do you actually have?" Many times you they reply with 500GB or 2TB. I tell that that isn't much data when you can get 1TB micro SD card the size of a fingernail or a 24TB hard drive.

My feeling is that if you really need to store petabytes of data that you aren't going to ask how to do it on reddit. If you need to store petabytes you will have an IT team and substantial budget and vendors that can figure it out.

kbolino

0 replies

5h13m

2024-05-27 13:14:57 UTC

Even if a 6 terabyte CSV file does fit in RAM, the only thing you should do with it is convert it to another format (even if that's just the in-memory representation of some program). CSV stops working well at billions of records. There is no way to find an arbitrary record because records are lines and lines are not fixed-size. You can sort it one way and use binary search to find something in it in semi-reasonable time but re-sorting it a different way will take hours. You also can't insert into it while preserving the sort without rewriting half the file on average. You don't need Hadoop for 6 TB but, assuming this is live data that changes and needs regular analysis, you do need something that actually works at that size.

jrm4

0 replies

4h22m

2024-05-27 14:05:56 UTC

This feels representative of so many of our problems in tech, overengineering, over-"producting," over-proprietary-ing, etc.

Deep centralization at the expense of simplicity and true redundancy; like renting a laser cutter when you need a boxcutter, a pair of scissors, and the occasional toenail clipper.

jandrewrogers

0 replies

2h34m

2024-05-27 15:54:27 UTC

As a point of reference, I routinely do fast-twitch analytics on tens of TB on a single, fractional VM. Getting the data in is essentially wire speed. You won't do that on Spark or similar but in the analytics world people consistently underestimate what their hardware is capable of by something like two orders of magnitude.

That said, most open source tools have terrible performance and efficiency on large, fast hardware. This contributes to the intuition that you need to throw hardware at the problem even for relatively small problems.

In 2024, "big data" doesn't really start until you are in the petabyte range.

itronitron

0 replies

45m

2024-05-27 17:43:26 UTC

> "6 TiB of data"

is not somewhat detailed requirements, as it depends quite a bit on the nature of the data.

hotstickyballs

0 replies

4h34m

2024-05-27 13:54:03 UTC

And how many data scientists are familiar with using awk scripts? If you’re the only one then you’ll have failed at scaling the data science team.

hipadev23

0 replies

3h13m

2024-05-27 15:15:25 UTC

Huh? How are you proposing loading a 6TB CSV into memory multiple times? And then processing with awk, which generally streams one a line at a time.

Obviously we can get boxes with multiple terabytes of RAM for $50-200/hr on-demand but nobody is doing that and then also using awk. They’re loading the data into clickhouse or duckdb (at which point the ram requirement is probably 64-128GB)

I feel like this is an anecdotal story that has mixed up sizes and tools for dramatic effect.

dfgdfg34545456

0 replies

3h35m

2024-05-27 14:52:54 UTC

The problem with your question is that they are there to show off their knowledge. I failed a tech interview once, question was build a web page/back end/db that allows people to order let's say widgets, that will scale huge. I went the simpleton answer route, all you need is Rails, a redis cache and an AWS provisioned relational DB, solve the big problems later if you get there sort of thing. Turns out they wanted to hear all about microservices and sharding.

dahart

0 replies

3h1m

2024-05-27 15:27:19 UTC

Wait, how would you split 6 TiB across 6 phones, how would you handle the queries? How long will the data live, do you need to handle schema changes, and how? And what is the cost of a machine with 15 or 20 TiB of RAM (you said it fits in memory multiple times, right?) - isn’t the drive cost irrelevant here? How many requests per second did you specify? Isn’t that possibly way more important than data size? Awk on 6 TiB, even in memory, isn’t very fast. You might need some indexing, which suddenly pushes your memory requirement above 6 TiB, no? Do you need migrations or backups or redundancy? Those could increase your data size by multiples. I’d expect a question that specified a small data size to be asking me to estimate the real data size, which could easily be 100 TiB or more.

citizenpaul

0 replies

1h21m

2024-05-27 17:07:38 UTC

The funny thing is that is exactly the place I want to work at. I've only found one company so far and the owner sold during the pandemic. So far my experience is that amount of companies/people that want what you describe is incredibly low.

I wrote a comment on here the other day that some place I was trying to do work for was using $11k USD a month on a BigQuery DB that had 375MB of source data. My advice was basically you need to hire a data scientist that knows what they are doing. They were not interested and would rather just band-aid the situation for a "cheap" employee. Despite the fact their GCP bill could pay for a skilled employee.

As I've seen it for the last year job hunting most places don't want good people. They want replaceable people.

buremba

0 replies

1h58m

2024-05-27 16:30:15 UTC

I can’t really think of a product with the requirement of max 6TiB data. If the data is big as TiB, most products have 100x TiB rather than a few ones.

bee_rider

0 replies

5h39m

2024-05-27 12:48:49 UTC

There’d still have to be some further questions, right? I guess if you store it on the interview group’s cellphones you’ll have to plan on what to do if somebody leaves or the interview room is hit by a meteor, if you plan to store it in ram on a server you’ll need some plan for power outages.

KronisLV

0 replies

3h23m

2024-05-27 15:05:40 UTC

The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a $199 enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it.

If it's not a very write heavy workload but you'd still want to be able to look things up, wouldn't something like SQLite be a good choice, up to 281 TB: https://www.sqlite.org/limits.html

It even has basic JSON support, if you're up against some freeform JSON and not all of your data neatly fits into a schema: https://sqlite.org/json1.html

A step up from that would be PostgreSQL running in a container: giving you the support for all sorts of workloads, more advanced extensions for pretty much anything you might ever want to do, from geospatial data with PostGIS, to something like pgvector, timescaledb etc., while still having a plethora of drivers and still not making your drown in complexity and having no issues with a few dozen/hundred TB of data.

Either of those would be something that most people on the market know, neither will make anyone want to pull their hair out and they'll give you the benefit of both quick data writes/retrieval, as well as querying. Not that everything needs or can even work with a relational database, but it's still an okay tool to reach for past trivial file storage needs. Plus, you have to build a bit less of whatever functionality you might need around the data you store, in addition to there even being nice options for transparent compression.

EdwardDiego

0 replies

4h48m

2024-05-27 13:40:20 UTC

If you were hiring me for a data engineering role and asked me how to store and query 6 TiB, I'd say you don't need my skills, you've probably got a Postgres person already.

7thaccount

0 replies

5h3m

2024-05-27 13:24:54 UTC

I am a big fan of these simplistic solutions. In my own area, it was incredibly frustrating as what we needed was a database with a smaller subset of the most recent information from our main long-term storage database for back end users to do important one-off analysis with. This should've been fairly cheap, but of course the IT director architect guy wanted to pad his resume and turn it all into multi-million project with 100 bells and whistles that nobody wanted.

6510

0 replies

3h57m

2024-05-27 14:31:24 UTC

I dont know anything but when doing that I always end up next Thursday having the same with 4TB and the next with 17 at which point I regret picking a solution that fit so exactly.

blagie

41 replies

8h20m

2024-05-27 10:07:46 UTC

Overall, I agree with much of this post, but there are several caveats:

1) Mongo is a bad point of reference.

The one lesson I've learned is that there is nothing Mongo does which postgresql doesn't do better. Big data solutions aren't nosql / mongo, but usually things like columnar databases, map/reduce, Cassandra, etc.

2) Plan for success

95% of businesses never become unicorns, but that's the goal for most (for the 5% which do). If you don't plan for it, you won't make it. The reason to architect for scalability when you have 5 customers is so if that exponential growth cycle hits, you can capitalize on it.

That's not just architecture. To have any chance of becoming a unicorn, every part of the business needs to be planned for now and for later: How do we make this practical / sustainable today? How do we make sure it can grow later when we have millions of customers? A lot of this can be left as scaffolding (we'll swap in [X], but for now, we'll do [Y]).

But the key lessons are correct:

- Most data isn't big. I can fit data about every person in the world on a $100 Chromebook. (8 billion people * 8 bits of data = 8GB)

- Most data is rarely queried, and most queries are tiny. The first step in most big data jobs I've done is taking terabytes of data and shrinking it down to the GB, MB, or oven KB-scale data I need. Caveat: I have no algorithm for predicting what I'll need in the future.

- Cost of data is increasing with regulatory.

davedx

10 replies

7h38m

2024-05-27 10:50:08 UTC

2) Plan for success 95% of businesses never become unicorns, but that's the goal for most (for the 5% which do). If you don't plan for it, you won't make it.

That's exactly what every architecture astronaut everywhere says. In my experience it's completely untrue, and actually "planning for success" more often than not causes huge drags on productivity, and even more important for startups, on agility. Because people never just make plans, they usually implement too.

Plan for the next 3 months and you'll be much more agile and productive. Your startup will never become a unicorn if you can't execute.

newaccount74

1 replies

7h19m

2024-05-27 11:09:17 UTC

The biggest problem with planning for scale is that engineers often have no idea what problems they will actually run into when they scale and they build useless shit that slows them down and doesn't help later at all.

I've come to the conclusion that the only strategy that works reliably is to build something that solves problems you have NOW rather than trying to predict the future.

fuzzy2

0 replies

7h9m

2024-05-27 11:19:38 UTC

Exactly this. Not only would they not know the tech challenges, they also wouldn’t know the business/domain challenges.

CuriouslyC

1 replies

6h7m

2024-05-27 12:21:33 UTC

There's writing code to handle every eventuality, and there's considering 3-4 places you MIGHT pivot and making sure you aren't making those pivots harder than they need to be.

blagie

0 replies

4h55m

2024-05-27 13:33:25 UTC

This is exactly what I try to do and what I've seen successful systems do.

Laying out adjacent markets, potential pivots, likely product features, etc. is a weekend-long exercise. That can help define both where the architecture needs to be flexible, and just as importantly, *where it does not*.

Over-engineering happens when you plan / architect for things which are unlikely to happen.

smrtinsert

0 replies

7h8m

2024-05-27 11:20:01 UTC

The success planners almost always seem to be the same ones pushing everyone to "not overengineer". Uhhhh..

littlestymaar

0 replies

4h23m

2024-05-27 14:05:07 UTC

This. If you plan for the time you'll be a unicorn, you will never get anything done in the first place, let alone being a unicorn. When you plan for the next 3 month, then hopefully in three month you're still here to plan for the next quarter again.

blagie

0 replies

4h58m

2024-05-27 13:29:51 UTC

That's exactly what every architecture astronaut everywhere says. In my experience it's completely untrue, and actually "planning for success" more often than not causes huge drags on productivity, and even more important for startups, on agility. Because people never just make plans, they usually implement too.

That's not my experience at all.

Architecture != implementation

Architecture astronauts will try to solve the world's problems in v0. That's very different from having an architectural vision and building a subset of it to solve problems for the next 3 months. Let me illustrate:

* Agile Idiot: We'll stick it all in PostgreSQL, however it fits, and meet our 3-month milestone. [Everything crashes-and-burns on success]

* Architecture Astronaut: We'll stick it all in a high-performance KVS [Business goes under before v0 is shipped]

* Success: We have one table which will grow to petabytes if we reach scale. We'll stick it all in postgresql for now, but maintain a clean KVS abstraction for that one table. If we hit success, we'll migrate to [insert high-performance KVS]. All the other stuff will stay in postgresql.

The trick is to have a pathway to success while meeting short-term milestones. That's not just software architecture. That's business strategy (clean beachhead, large ultimate market), and every other piece of designing a successful startup. There should be a detailed 3-month plan, a long-term vision, and a rough set of connecting steps.

abetusk

0 replies

4h34m

2024-05-27 13:54:22 UTC

Another way to say that is that "planning for success" is prematurely optimizing for scale.

Scaling up will bring its own challenges, with many of them difficult to foresee.

Spooky23

0 replies

5h51m

2024-05-27 12:37:33 UTC

The exception is when you have people with skills in particular tools.

The suggestion upthread to use awk is awesome if you’re a bunch of Linux grey beards.

But if you have access to people with particular skills or domain knowledge… spending extra cash on silly infrastructure is (within reason) way cheaper than having that employee be less productive.

MOARDONGZPLZ

0 replies

7h24m

2024-05-27 11:03:59 UTC

In my experience the drag caused from the thinking to plan for scalability early has been so much greater than the effort to rearchitect things when and if the company becomes a unicorn that one is significantly more likely to become a unicorn if they simply focus on execution and very fast iteration and save the scalability until it’s actually needed (and they can hire a team of whomever to effect this change with their newly minted unicorn cachet).

notachatbot1234

9 replies

7h44m

2024-05-27 10:44:39 UTC

- Most data isn't big. I can fit data about every person in the world on a $100 Chromebook. (8 billion people * 8 bits of data = 8GB)

Nitpick but I cannot help myself: 8 bits are not even enough for a unique integer ID per person, that would require 8 bytes per person and then we are at 60GB already.

I agree with pretty much anything else you said, just this stood out as wrong and Duty Calls.

iraqmtpizza

6 replies

7h39m

2024-05-27 10:49:35 UTC

meh. memory address is the ID

L-four

5 replies

7h13m

2024-05-27 11:15:35 UTC

Airline booking numbers used to just be the sector number of your booking record on the mainframes HDD.

devsda

1 replies

3h55m

2024-05-27 14:33:17 UTC

This is such a simple scheme.

I wonder how they dealt with common storage issues like backups and disks having bad sectors.

giantrobot

0 replies

1h28m

2024-05-27 17:00:14 UTC

They're likely record based formatting rather than file based. At the high level the code is just asking for a record number from a data set. The data set is managed including redundancy/ECC by the hardware of that storage device.

switch007

0 replies

7h2m

2024-05-27 11:26:02 UTC

My jaw just hit the floor. What a fascinating fact!

rrr_oh_man

0 replies

6h45m

2024-05-27 11:42:47 UTC

That’s why they were constantly recycled?

mauriciolange

0 replies

6h45m

2024-05-27 11:43:20 UTC

source?

amenhotep

1 replies

7h13m

2024-05-27 11:15:02 UTC

Sure it is. You just need a one to one function from person to [0, eight billion]. Use that as your array index and you're golden. 8 GB is overkill, really, you could pack some boolean datum like "is over 18" into bits within the bytes and store your database in a single gigabyte.

Writing your mapping function would be tricky! But definitely theoretically possible.

blagie

0 replies

4h52m

2024-05-27 13:36:23 UTC

I'm old enough to have built systems with similar techniques. We don't do that much anymore since we don't need to, but it's not rocket science.

We had spell checkers before computers had enough memory to fit all words. They'd probabilistically find almost all incorrect words (but not suggest corrections). It worked fine.

brtkdotse

7 replies

7h56m

2024-05-27 10:32:30 UTC

95% of businesses never become unicorns, but that's the goal for most

Is it really the general case or is it just a HN echo chamber meme?

My pet peeve is that patterns used by companies that in theory could become global unicorns are mimicked by companies where 5000 paying customers would mean an immense success

threeseed

2 replies

7h25m

2024-05-27 11:02:50 UTC

HN is the worst echo chamber around.

Obsessed with this "you must use PostgreSQL for every use case" nonsense.

And that anyone who actually has unique data needs is simply doing it for their resume or are over-engineering.

paulryanrogers

1 replies

7h3m

2024-05-27 11:25:14 UTC

Obsessed with this "you must use PostgreSQL for every use case" nonsense.

Pg fans are certainly here asking "why not PG?". Yet so are fans of other DBs; like DuckDB, CouchDB, SQLite, etc.

internet101010

0 replies

1h30m

2024-05-27 16:58:15 UTC

I don't see so much DuckDB and CouchDB proselytizing but the SQLite force always out strong. I tend to divide the Postgres vs. SQLite decision on if the data in question is self-contained. Like am I pulling data from elsewhere (Postgres) or am I creating data within the application that is only used for the functionality of said application (SQLite).

blagie

1 replies

7h41m

2024-05-27 10:47:27 UTC

It's neither.

Lifestyle companies are fine, if that's what you're aiming for. I know plenty of people who run or work at ≈1-30 person companies with no intention to grow.

However, if you're going for high-growth, you need to plan for success. I've seen many potential unicorns stopped by simple lack of planning early on. Despite all the pivots which happen, if you haven't outlined a clear path from 1-3 people in a metaphorical garage to reaching $1B, it almost never happens, and sometimes for stupid reasons.

If your goal is 5000 paying customers at $100 per year and $500k in annual revenues, that can lead to a very decent life. However, it's an entire different ballgame: (1) Don't take in investment (2) You probably can't hire more than one person (3) You need a plan for break-even revenue before you need to quit your job / run out of savings. (4) You need much greater than the 1-in-10 odds of success.

And it's very possible (and probably not even hard) to start a sustainable 1-5 person business with >>50% odds of success, especially late career:

- Find a niche you're aware of from your job

- Do ballpark numbers on revenues. These should land in the $500k-$10M range. Less, and you won't sustain. More, and there will be too much competition.

- Do it better than the (likely incompetent or non-existent) people doing it now

- Use your network of industry contacts to sell it

That's not a big enough market you need to worry about a lot of competition, competitors with VC funding, etc. Especially ones with tall moats do well -- pick some unique skillset, technology, or market access, for example.

However, IF you've e.g. taken in VC funding, then you do need to plan for growth, and part of that is planning for the small odds your customer base (and ergo, your data) does grow.

IneffablePigeon

0 replies

5h41m

2024-05-27 12:47:18 UTC

If you’re in b2b 5000 customers can be a lot more revenue than that. 10-100x, depending hugely on industry and product.

davedx

0 replies

7h37m

2024-05-27 10:51:28 UTC

It's definitely an echo chamber. Most companies definitely do not want to become "unicorns" - most SME's around the world don't even know what a "unicorn" is, let alone be in an industry/sector where it's possible.

Does a mining company want to become a "unicorn"?

A fish and chip shop?

Even within tech there is an extremely large number of companies whose goals are to steadily increase profits and return them to shareholders. 37 Signals is the posterchild there.

Maybe if you're a VC funded startup then yeah.

babel_

0 replies

6h58m

2024-05-27 11:29:49 UTC

Many startups seem to aim for this, naturally it's difficult to put actual numbers to this, and I'm sure many pursue multiple aims in the hope one of them sticks. Since unicorns are really just describing private valuation, really it's the same as saying many aim to get stupendously wealthy. Can't put a number on that, but you can at least see it's a hope for many, though "goal" is probably making it seem like they've got actually achievable plans for it... That, at least, I'm not so convinced of.

Startups are, however, atypical from new businesses, ergo the unicorn myth, meaning we see many attempts to follow such a path that likely stands in the way of many new businesses from actually achieving the more real goals of, well, being a business, succeeding in their venture to produce whatever it is and reach their customers.

I describe it as a unicorn "myth" as it very much behaves in such a way, and is misinterpreted similarly to many myths we tell ourselves. Unicorns are rare and successful because they had the right mixture of novel business and the security of investment or buyouts. Startups purportedly are about new ways of doing business, however the reality is only a handful really explore such (e.g. if it's SaaS, it's probably not a startup), meaning the others are just regular businesses with known paths ahead (including, of course, following in the footsteps of prior startups, which really is self-refuting).

With that in mind, many of the "real" unicorns are realistically just highly valued new businesses (that got lucky and had fallbacks), as they are often not actually developing new approaches to business, whereas the mythical unicorns that startups want to be are half-baked ideas of how they'll achieve that valuation and wealth without much idea of how they do business (or that it can be fluid, matching their nebulous conception of it), just that "it'll come", especially with "growth".

There is no nominative determinism, and all that, so businesses may call themselves startups all they like, but if they follow the patterns of startups without the massive safety nets of support and circumstance many of the real unicorns had, then a failure to develop out the business proper means they do indeed suffer themselves by not appreciating 5000 paying customers and instead aim for "world domination", as it were, or acquisition (which they typically don't "survive" from, as an actual business venture). The studies have shown this really does contribute to the failure rate and instability of so-called startups, effectively due to not cutting it as businesses, far above the expected norm of new businesses...

So that pet peeve really is indicative of a much more profound issue that, indeed, seems to be a bit of an echo chamber blind spot with HN.

After all, if it ought to have worked all the time, reality would look very different from today. Just saying how many don't become unicorns (let alone the failure rate) doesn't address the dissonance from then concluding "but this time will be different". It also doesn't address the idea that you don't need to become a "unicorn", and maybe shouldn't want to either... but that's a line of thinking counter to the echo chamber, so I won't belabour it here.

threeseed

2 replies

7h35m

2024-05-27 10:53:21 UTC

> nothing Mongo does which postgresql doesn't do better

a) It has a built-in and supported horizontal scalability / HA solution.

b) For some use cases e.g. star schemas it has significantly better performance.

> Big data solutions aren't nosql

Almost all big data storage solutions are NoSQL.

ozkatz

0 replies

5h58m

2024-05-27 12:29:52 UTC

Almost all big data storage solutions are NoSQL.

I think it's important to distinguish between OLAP AND OLTP.

For OLAP use cases (which is what this post is mostly about) it's almost 100% SQL. The biggest players being Databricks, Snowflake and BigQuery. Other tools may include AWS's tools (Glue, Athena), Trino, ClickHouse, etc.

I bet there's a <1% market for "NoSQL" tools such as MongoDB's "Atlas Data Lake" and probably a bunch of MapReduce jobs still being used in production, but these are the exception, not the rule.

For OLTP "big data", I'm assuming we're talking about "scale-out" distributed databases which are either SQL (e.g. cockroachdb, vitess, etc) SQL-like (Casandra's CQL, Elasticsearch's non-ANSI SQL, Influx' InfluxQL) or a purpose-built language/API (Redis, MongoDB).

I wouldn't say OLTP is "almost all" NoSQL, but definitely a larger proportion compared to OLAP.

blagie

0 replies

4h48m

2024-05-27 13:40:13 UTC

Almost all big data storage solutions are NoSQL.

Most I've seen aren't. NoSQL means non-relational database. Most big data solutions I've seen will not use a database at all. An example is hadoop.

Once you have a database, SQL makes a lot of sense. There are big data SQL solutions, mostly in the form of columnar read-optimized databases.

On the above, a little bit of relational can make a huge performance difference, in the form of, for example, a big table with compact data with indexes into small data tables. That can be algorithmically a lot more performant than the same thing without relations.

boxed

2 replies

6h49m

2024-05-27 11:39:11 UTC

I see people planning for success to the point of guaranteeing failure, much more than people who suddenly must try to handle success in panic.

It's a second system syndrome + survivor bias thing I think: people who had to clean up the mess of a good MVP complaining about what wasn't done before. But the companies that DID do that planning and architecting before did not survive to be complained about.

CuriouslyC

1 replies

5h53m

2024-05-27 12:35:30 UTC

It's not either or. There are best practices that can be followed regardless with no time cost up front, and there is taking some time to think about how your product might evolve (which you really should be doing anyhow) then making choices with your software that don't make the evolution process harder than it needs to be.

Layers of abstraction make code harder to reason about and work with, so it's a lose lose when trying to iterate quickly, but there's also the idea of architectural "mise en place" vs "just dump shit where it's most convenient right now and don't worry about later" which will result near immediate productivity losses due to system incoherence and disorganization.

boxed

0 replies

5h14m

2024-05-27 13:14:17 UTC

I'm a big fan of "optimize for deletion" (aka leaf-heavy) code. It's good for reasoning when the system is big, and it's good for growing a code base.

It's a bit annoying how the design of Django templates works against this by not allowing free functions...

underwater

1 replies

7h38m

2024-05-27 10:50:11 UTC

To have any chance of becoming a unicorn, every part of the business needs to be planned for now and for later

I think that in practice that’s counterproductive. A startup has a limited runway. If your engineers are spending your money on something that doesn’t pay off for years then they’re increasing the chance you’ll fail before it matters.

blagie

0 replies

4h51m

2024-05-27 13:37:25 UTC

You're confusing planning with implementation.

Planning is a weekend, or at most a few weeks.

zemo

0 replies

5h40m

2024-05-27 12:48:06 UTC

The reason to architect for scalability when you have 5 customers is so if that exponential growth cycle hits, you can capitalize on it.

If you have a product gaining that much traction, it’s usually because of some compound effect based on the existence and needs of its userbase. If on the way up you stumble to add new users, the userbase that’s already there is unlikely to go back to the Old Thing or go somewhere else (because these events are actually rare). For a good while using Twitter meant seeing the fail whale every day. Most people didn’t just up and leave, and nothing else popped up that could scale better that people moved to. Making a product that experiences exponential growth in that way is pretty rare, and struggling to scale those cases and having a period of availability degradation is common. What products hit an exponential growth situation failed because they couldn’t scale?

nemo44x

0 replies

5h50m

2024-05-27 12:38:19 UTC

Mongo allows a developer to burn down a backlog faster than anything else. That’s why it’s so popular. The language drivers interface with the database which just says yes. And whatever happens later is someone else’s problem. Although it’s a far more stable thing today.

Toine

0 replies

6h50m

2024-05-27 11:37:51 UTC

To have any chance of becoming a unicorn, every part of the business needs to be planned for now and for later

Sources ?

OJFord

0 replies

7h38m

2024-05-27 10:49:53 UTC

95% of businesses never become unicorns, but that's the goal for most (for the 5% which do).

I think you're missing quite a few 9s!

ahartmetz

7 replies

8h34m

2024-05-27 09:54:36 UTC

I guess that hype cycle ended at the plateau of being dead. A not uncommon outcome in this incredibly fashion-driven industry.

silvestrov

6 replies

8h7m

2024-05-27 10:21:16 UTC

It has just been rebranded as AI.

AI also use all the data, just with a magick neural network to figure out what it all means.

Cyberdog

2 replies

7h13m

2024-05-27 11:15:40 UTC

Assuming you're serious for a moment, I don't think AI is really a practical tool for working with big data.

- The "hallucination" factor means every result an AI tells you about big data is suspect. I'm sure some of you who really understand AI more than the average person can "um akshually" me on this and tell me how it's possible to configure ChatGPT to absolutely be honest 100% of the time but given the current state of what I've seen from general-purpose AI tools, I just can't trust it. In many ways this is worse than MongoDB just dropping data since at least Mongo won't make up conclusions about data that's not there.

- At the end of the day - and I think we're going to be seeing this happen a lot in the future with other workflows as well - you're using this heavy, low-performance general-purpose tool to solve a problem which can be solved much more performatively by using tools which have been designed from the beginning to handle data management and analysis. The reason traditional SQL RDBMSes have endured and aren't going anywhere soon is partially because they've proven to be a very good compromise between general functionality and performance for the task of managing various types of data. AI is nowhere near as good of a balance for this task in almost all cases.

All that being said, the same way Electron has proven to be a popular tool for writing widely-used desktop and mobile applications, performance and UI concerns be damned all the way to hell, I'm sure we'll be seeing AI-powered "big data" analysis tools very soon if they're not out there already, and they will suck but people will use them anyway to everyone's detriment.

vitus

0 replies

6h13m

2024-05-27 12:15:31 UTC

The "hallucination" factor means every result an AI tells you about big data is suspect.

AI / ML means more than just LLM chat output, even if that's the current hype cycle of the last couple of years. ML can be used to build a perfectly serviceable classifier, or predictor, or outlier detector.

It suffers from the lack of explainability that's always plagued AI / ML, especially as you start looking at deeper neural networks where you're more and more heavily reliant on their ability to approximate arbitrary functions as you add more layers.

you're using this heavy, low-performance general-purpose tool to solve a problem which can be solved much more performatively by using tools which have been designed from the beginning to handle data management and analysis

You are not wrong here, but one challenge is that sometimes even your domain experts do not know how to solve the problem, and applying traditional statistical methods without understanding the space is a great way of identifying spurious correlations. (To be fair, this applies in equal measure to ML methods.)

silvestrov

0 replies

6h29m

2024-05-27 11:59:17 UTC

A comment from the old post: https://news.ycombinator.com/item?id=34696065

I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

I think AI is used for the same purpose in companies: signal to the world that the company is using the latest tech and internally for supporting existing political beliefs.

So same job. Hallucination is not a problem here as the AI conclusions are not used when they don't align to existing political beliefs.

quonn

1 replies

7h55m

2024-05-27 10:33:11 UTC

The overlap in terms of the used technologies, the required skills, the actual products and the target market is minimal. AI is not mostly Hadoop, it's not MapReduce, the hardware is different, the software is different, the skillset is very different and a chatbot or image generator is very different from a batch job producing an answer to a query.

renegade-otter

0 replies

5h26m

2024-05-27 13:02:10 UTC

But the underlying problem is the same - companies that use Big Data tech are clueless about data management. You can use unicorns - it's not going to do anything. "Garbage in, garbage out" is a timeless principle.

WesolyKubeczek

0 replies

7h26m

2024-05-27 11:01:54 UTC

Given how often it hallucinates, it should be rebranded to "high data".

zurfer

6 replies

8h1m

2024-05-27 10:26:44 UTC

This is not fully correct.

Originally big data was defined by 3 dimensions:

- Volume (mostly what the author talks about) [solved]

- Velocity, how fast data is processed etc [solved, but expensive]

- Variety [not solved]

Big Data today is not: I don't have enough storage or compute.

It is: I don't have enough cognitive capacity to integrate and make sense of it.

maayank

4 replies

7h15m

2024-05-27 11:12:47 UTC

What do you mean by 'variety'?

zurfer

0 replies

5h55m

2024-05-27 12:33:02 UTC

the other comments get it.

It means that data comes in a ton of different shapes with poorly described schemas (technically and semantically).

From the typical CSV export out of an ERP system to a proprietary message format from your own custom embedded device software.

threeseed

0 replies

6h55m

2024-05-27 11:33:32 UTC

Data that isn't internally managed database exports.

One of the reasons big data systems took off was because enterprises had exports out of third party systems that they didn't want to model since they didn't own it. As well as a bunch of unstructured data e.g. floor plans, images, logs, telemetry etc.

nairboon

0 replies

1h3m

2024-05-27 17:25:11 UTC

it doesn't fit the relational model, e.g. you have some tables, but also tons of different types of images, video, sounds, raw text, etc.

boesboes

0 replies

7h4m

2024-05-27 11:23:51 UTC

not op, but I think they mean the data is complex, heterogeneous and noisy. You won't be able to extract meaning trivially from it, you need something to find the (hidden) meaning in the data.

So AI currently, probably ;)

vishnugupta

0 replies

4h52m

2024-05-27 13:36:20 UTC

I first heard of this 3 V's in a Michael Stonebreaker's talk[1]. For the uninitiated he's a legend in DBMS space, Turing award winner[2].

Highly recommend this and related talks by him, most of them are in YouTube.

[1] https://www.youtube.com/watch?v=KRcecxdGxvQ

[2] https://amturing.acm.org/award_winners/stonebraker_1172121.c...

vegabook

4 replies

8h8m

2024-05-27 10:20:30 UTC

my experience is that while data keeps growing at an exponential rate, its information content does not. In finance at least, you can easily get 100 million data points per series per day if you want everything, and you might be dealing with thousands of series. That sample rate, and the number of series, is usually 99.99% redundant, because the eigenvalues drop off almost to zero very quickly after about 10 dimensions, and often far fewer. There's very little reason to store petabytes of ticks that you will never query. It's much more reasonable in many cases to do brutal (and yes, lossy) dimensionality reduction _at ingests time_, store the first few principal components + outliers, and monitor eigenvalue stability (in case some new, previously negligable, factor, starts increasing in importance). It results in a much smaller dataset that is tractable and in many cases revelatory, because it's actually usable.

CoastalCoder

2 replies

5h56m

2024-05-27 12:32:40 UTC

Could you point to something explaining that eigenvalue / dimensions topic?

It sounds interesting, but it's totally new to me.

mk67

0 replies

3h43m

2024-05-27 14:44:59 UTC

https://en.wikipedia.org/wiki/Principal_component_analysis

beng-nl

0 replies

3h36m

2024-05-27 14:52:03 UTC

Nog OP, but i think they are referring to the fact that you can use PCA (principal component analysis) on a matrix of datapoints to approximate it. Works out of the box in scikit-learn.

You can do (lossy) compression on rows of vectors (treated like a matrix) by taking the top N eigenvectors (largest N eigenvalues) and using them to approximate the original matrix with increasing accuracy (as N grows) by some simple linear operations. If the numbers are highly correlated, you can get a huge amount of compression with minor loss this way.

Personally I like to use it to visualize linear separability of a high dimensioned set of vectors by taking a 2-component PCA and plotting them as x/y values.

bartart

0 replies

2h1m

2024-05-27 16:27:33 UTC

That’s very interesting, so thank you — how do you handle if the eigenvectors change over time?

lokimedes

3 replies

6h59m

2024-05-27 11:29:20 UTC

I was a researcher at the Large Hadron Collider around the time “Big Data” became a thing. We had one of the use cases where analyzing all the data made sense, since it boiled down to frequentist statistics, the more data, the better. Yet even with a global network of supercomputers at our disposal, we funnily figured out that fast local storage was better than waiting for huge jobs to finish. So, surprise, surprise, every single grad student managed somehow to boil the relevant data for her analysis down to exactly 1-5 TB, without much loss in analysis flexibility. There must be like a law of convenience here, that rivals Amdahl’s scaling law.

msl09

0 replies

5h59m

2024-05-27 12:28:57 UTC

I think that your law of convenience is spot on. One thing that got by talking with commercial systems devs is that they are always under pressure by their clients to make their systems as cheap as possible, reducing the database stored and the size of the computations is one great way to minimize the client's monthly bill.

marcosdumay

0 replies

3h44m

2024-05-27 14:44:12 UTC

Let me try one:

"If you can't do your statistical analysis in 1 to 5 TB of data, your methodology is flawed"

This is probably more about human limitations than math. There's a clear ceiling in how much flexibility we can use. That will also change with easier ways to run new kinds of analysis, but it increases with the logarithm of the amount of things we want to do.

civilized

0 replies

5h43m

2024-05-27 12:44:49 UTC

I think there is a law of convenience, and it also explains why many technologies improve at a consistent exponential rate. People are very good at finding convenient ways to make something a little better each year, but every idea takes some minimal time to execute.

gbin

3 replies

7h58m

2024-05-27 10:30:39 UTC

IMHO the main driver for big data was company founders egos. Of course your company will explode and will be a planet scale success!! We need to design for scale! This is really a tragic mistake while your product only needs one SQLite DB until you reach series C.... All the energy should be focused on the product, not its scale yet.

threeseed

0 replies

7h30m

2024-05-27 10:58:03 UTC

No. Big data was driven by people who had big data problems.

It started with Hadoop which was inspired by what existed at Google and became popular in enterprises all around the world who wanted a cheaper/better way to deal with their data than Oracle.

Spark came about as a solution to the complexity of Hive/Pig etc. And then once companies were able to build reliable data pipelines we started to see AI being able to be layered on top.

jandrewrogers

0 replies

2h9m

2024-05-27 16:19:34 UTC

It depends on the kind of data you work with. Many kinds of important data models -- geospatial, sensing, telemetry, et al -- can hit petabyte volumes at "hello world".

Data models generated by intentional human action e.g. clicking a link, sending a message, buying something, etc are universally small. There is a limit on the number of humans and the number of intentional events they can generate per second regardless of data model.

Data models generated by machines, on the other hand, can be several orders of magnitude higher velocity and higher volume, and the data model size is unbounded. These are often some of the most interesting and under-utilized data models that exist because they can get at many facts about the world that are not obtainable from the intentional human data models.

antupis

0 replies

7h49m

2024-05-27 10:39:06 UTC

Well generally yes although there are a couple of exceptions like IoT and GIS stuff where is very common to see 10TB+ datasets.

dventimi

3 replies

7h10m

2024-05-27 11:18:36 UTC

Question for the Big Data folks: where do sampling and statistics fit into this, if at all? Unless you're summing to the penny, why would you ever need to aggregate a large volume of data (the population) rather than a small volume of data (a sample)? I'm not saying there isn't a reason. I just don't know what it is. Any thoughts from people who have genuine experience in this realm?

gregw2

0 replies

4h18m

2024-05-27 14:10:22 UTC

Good question. I am not an expert but here’s my take from my time in this space.

Big data folks typically do sampling and such, but that doesn’t eliminate the need for a big data environment where such sampling can occur. Just as a compiler can’t predict every branch that could happen at compile time (sorry VLIW!) and thus CPUs need dynamic branch predictors, so too a sampling function can’t be predicted in advance of an actual dataset.

In a large dataset there are many ways the sample may not represent the whole. The real world is complex. You sample away that complexity at your peril. You will often find you want to go back to the original raw dataset.

Second, in a large organization, sampling alone presumes you are only focused on org-level outcomes. But in a large org there may be individuals who care about the non-aggregated data relevant to their small domain. There can be thousands of such individuals. You do sample the whole but you also have to equip people at each level of abstraction to do the same. The cardinality of your data will in some way reflect the cardinality of your organization and you can’t just sample that away.

disgruntledphd2

0 replies

7h1m

2024-05-27 11:27:18 UTC

Sampling is almost always heavily used here, because it's Ace. However, if you need to produce row level predictions then you can't sample as you by definition need the role level data.

However you can aggregate user level info into just the features you need which will get you a looooonnnnggggg way.

banku_brougham

0 replies

4h8m

2024-05-27 14:19:49 UTC

There is a problem on website data, where new features are only touching a subset of customers and you need results for every single one.

You wont be partitioned for this case, but the compute you need is just filtering out this set.

But sampling wont get what you want especially if you are doing QC at the business team level about whether the CX is behaving as expected.

stakhanov

2 replies

7h0m

2024-05-27 11:27:55 UTC

The funny thing about "big data" was that it came with a perverse incentive to avoid even the most basic and obvious optimizations on the software level, because the hardware requirement was how you proved how badass you were.

Like: "Look, boss, I can compute all those averages for that report on just my laptop, by ingesting a SAMPLE of the data, rather than making those computations across the WHOLE dataset".

Boss: "What do you mean 'sample'? I just don't know what you're trying to imply with your mathmo engineeringy gobbledigook! Me having spent those millions on nothing can clearly not be it, right?"

bpodgursky

0 replies

2h18m

2024-05-27 16:10:14 UTC

This is a pretty snarky outside view and just not actually true (I spent the first part of my career trying to reduce compute spend as a data engineer).

It was extremely difficult to get > 64gb on a machine for a very long time, and implementation complexity gets hard FAST when you have a hard cap.

And it's EXTREMELY disruptive to have a process that fails every 1/50 times, when data is slightly too large, because your team will be juggling dozens of these routine crons, and if each of them breaks regularly, you do nothing but dumb oncall trying to trim bits off of each strong.

No, Hadoop and MapReduce were not hyperefficient, but it was OK if you write it correctly, and having something that ran reliably is WAY more valuable than boutique bit-optimized C++ crap that nobody trusts or can maintain and fails every thursday with insane segfaults.

(nowdays, just use Snowflake. but it was a reasonable tool for the time).

Spooky23

0 replies

5h47m

2024-05-27 12:41:12 UTC

It came with a few cohorts of Xooglers cashing their options out.

The amount of salesman hype and chatter about big data, followed by the dick measuring contests about whose data was big enough to be worthy was intense for awhile.

deadbabe

2 replies

6h1m

2024-05-27 12:27:16 UTC

Something similar will happen with generative AI someday.

AI scientists will propose all sorts of elaborate complex solutions to problems using LLMs, and the dismissive responsive will be “Your problem is solvable with a couple if statements.”

Most people just don’t have problems that require AI.

doubloon

1 replies

4h56m

2024-05-27 13:32:35 UTC

Thats one of the first things Andrew Ng said in his old ML course

deadbabe

0 replies

3h55m

2024-05-27 14:33:12 UTC

This is why I personally just can’t find motivation to even pay attention to most AI developments. It’s a toy, it does some neat things, but there’s no problem I’ve heard of or encountered where LLM style AI was the only tool for the job, or even the best tool. The main use seems to be content creation and manipulation at scale, which the vast majority of companies simply don’t have to deal with.

Similarly, a lot of companies talk about how they have tons of data, but there’s never any real application or game changing insight from it. Just a couple neat tricks and product managers patting themselves on the back.

Setting up a good database is probably the peak of a typical company’s tech journey.

teleforce

1 replies

7h26m

2024-05-27 11:02:11 UTC

Not dead, it's just having it's winter time not unlike AI winter and once it has its similar "chatbot" moment, all will be well.

My take on the killer application is the climate change for example earthquakes monitoring. For a case study China has just finished building world's largest earthquake monitoring system with the cost of around USD1 Billion across the country with 15K stations [1]. Somehow at the moment is just monitoring existing earthquakes. But let's say there is a big data analytics technique can reliably predicts impending earthquake within a few days, that can probably safe many people and China now still hold the records of the largest mortality and casualty numbers due to earthquakes. Is it probable, the answer is a positive yes based on our work and initial results it's already practical but in order to do that we need integration with comprehensive in-situ IoT networks with regular and frequent data sampling similar to that of China.

Secondly, China also has the largest radio astronomy telescopes and these telescopes together with other radio telescopes collaborate in real-time through e-VLBI to form a virtual giant radio telescopes as big as the earth to monitor distance stars and galaxy. This is how the black hole got its first image but at the time due to logistics one of the telescope remote disks cannot be shipped to the main processing centers in US [2]. At that moment they are not using real-time e-VLBI onky VLBI, and it tooks them several months just to get the complete sets of the black holes observation data. With e-VLBI everything is real-time and with automatic processing it will be hours instead of month. These radio telescopes can also be used for other purposes like monitoring climate change in addition to imaging black holes, their data is astronomical pardon the pun [3].

[1] Chinese Nationwide Earthquake Early Warning System and Its Performance in the 2022 Lushan M6.1 Earthquake:

https://www.mdpi.com/2072-4292/14/17/4269

[2] How Scientists Captured the First Image of a Black Hole:

https://www.jpl.nasa.gov/edu/news/2019/4/19/how-scientists-c...

[3] Alarmed by Climate Change, Astronomers Train Their Sights on Earth:

https://www.nytimes.com/2024/05/14/science/astronomy-climate...

Shrezzing

0 replies

7h9m

2024-05-27 11:19:37 UTC

I think these examples still loosely fits the author's argument:

There are some cases where big data is very useful. The number of situations where it is useful is limited

Even though there are some great use-cases, the overwhelming majority organisations, institutions, and projects will never have a "let's query ten petabytes" scenario that forces them away from platforms like Postgres.

Most datasets, even at very large companies, fit comfortably into RAM on a server - which is now cost-effective, even in the dozens of terabytes.

fijiaarone

1 replies

5h55m

2024-05-27 12:33:15 UTC

The problem with big data is that people don’t have data, they have useless noise.

For lack of data, they generate random bytes collected on every mouse movement on every page, and every packet that moves through their network. It doesn’t tell them anything because the only information that means anything is who clicks that one button on their checkout page after filling out the form with their information or that one request that breaches their system.

That’s why big data is synonymous with meaningless charts on pointless dashboards sold to marketing and security managers who never look at them anyway

It’s like tracking the wind and humidity and temperature and barometer data every tenth of a second every square meter.

It won’t help you predict the weather any better than stepping outside and looking at the sky a couple times a day.

thfuran

0 replies

5h32m

2024-05-27 12:55:44 UTC

It absolutely would. You can't build a useful model off of occasionally looking outside.

clkao

1 replies

8h47m

2024-05-27 09:40:43 UTC

previous discussion: https://news.ycombinator.com/item?id=34694926

spicyusername

0 replies

5h59m

2024-05-27 12:29:06 UTC

I find it interesting that this comment section and that comment section seem to focus on different things, despite being triggered by the same input.

LightFog

1 replies

6h47m

2024-05-27 11:41:26 UTC

When working in a research lab we used to have people boast that their analysis was so big it ‘brought down the cluster’ - which outed them pretty quickly to the people who knew what they were doing.

kjkjadksj

0 replies

5h9m

2024-05-27 13:19:29 UTC

Must have been abusing the head node if they did that

xiaodai

0 replies

3h28m

2024-05-27 15:00:17 UTC

Big memory is eating big data.

unholyguy001

0 replies

3h30m

2024-05-27 14:57:51 UTC

One of the problems with the article is BigQuery becomes astronomically expensive at mid to high volumes. So there is a strong incentive to keep the data in BigQuery manageable or to even move off it as data volumes get higher.

Also larger enterprises don’t even use gcp all that much to begin with

tored

0 replies

3h30m

2024-05-27 14:58:42 UTC

Former prime minister of Sweden Stefan Löfven got wind about Big data buzzword back in 2015 and used it in one of his speeches, that this is the future, however he used the Swedish translation of Big data, stordata. That generated some funny quips about where is lilldata.

https://www.aftonbladet.se/senastenytt/ttnyheter/inrikes/a/8...

tobilg

0 replies

7h48m

2024-05-27 10:40:31 UTC

I witness the overengineering regarding "big" data tools and pipelines since many years... For a lot of use cases, data warehouses and data lakes are only in the gigabytes or single-digit terabytes range, thus their architecture could be much more simplified, e.g. running DuckDB on a decent EC2 instance.

In my experience, doing this will yield the query results faster than some other systems even starting the query execution (yes, I'm looking at you Athena)...

I even think that a lot of queries can be run from a browser nowadays, that's why I created https://sql-workbench.com/ with the help of DuckDB WASM (https://github.com/duckdb/duckdb-wasm) and perspective.js (https://github.com/finos/perspective).

surfingdino

0 replies

5h2m

2024-05-27 13:26:07 UTC

In my experience the only time I worked on a big data project was the public Twitter firehose. The team built an amazing pipeline and it did actually deal with masses of data. Any other team I've been on were delusional and kept building expensive and overcomplicated solutions that could be replaced with a single Postgres instance. The most overcomplicated system I've seen could not process 24hrs-worth of data in under 24 hours... I was happy to move on when an opportunity presented itself.

siliconc0w

0 replies

2h35m

2024-05-27 15:53:32 UTC

Storing data in object and querying from compute caching what you can basically scales until your queries are too expensive for a single node.

schindlabua

0 replies

19m

2024-05-27 18:09:15 UTC

Used to work at a company that produced 20 gigs of analytics every day which is probably the biggest data I'll ever work on. My junior project was writing some data crunching jobs that did aggregations batched and in real time, and store the result in parquet blobs in azure.

My boss was smart enough to have stakeholder meetings where they regularly discussed what to keep and what to throw away, and with some smart algorithms we were able to compress all that data down into like 200MB per day.

We loaded the last 2 months into an sql server and the last 2 years further aggregated into another, and the whole company used the data in excel to do queries on it in reasonable time.

The big data is rotting away on tape storage in case they ever need it in the future.

My boss got a lot of stuff right and I learned a lot, though I only realized that in hindsight. Dude was a bad manager but he knew his data.

rr808

0 replies

4h36m

2024-05-27 13:52:34 UTC

You're never going to get a fang job with an approach like that. And basically most developers are working towards that goal.

ricardo81

0 replies

4h29m

2024-05-27 13:58:50 UTC

Is there a solid definition of big data nowadays?

It seems to conflate somewhat with SV companies completely dismantling privacy concerns and hoovering up as much data as possible. Lots of scenarios I'm sure. I'm just thinking of FAANG in the general case.

renegade-otter

0 replies

5h29m

2024-05-27 12:58:47 UTC

Big Data is not dead - it has been reborn as AI, which is essentially Big Data 2.0.

And just in the same fashion, there was massive hype around Big Data 1.0. From 2013: https://hbr.org/2013/12/you-may-not-need-big-data-after-all

Everyone has so much data that they must use AI in order to tame it. The reality is, however, is that most of their data is crap and all over the place, and no amount of Big Data 1.0 or 2.0 is ever going to fix it.

oli5679

0 replies

7h8m

2024-05-27 11:19:49 UTC

BigQuery has a generous 1TB/month free tier, and $6/tb afterwards. If you have small data, it's a pragmatic option, just make sure to use partitioning and sensible query patterns to limit the number of full-data scans, as you approach 'medium data' region.

There are some larger data-sizes, and query patterns, where either BigQuery Capacity compute pricing, or another vendor like Snowflake, becomes more economical.

https://cloud.google.com/bigquery/pricing

    BigQuery offers a choice of two compute pricing models for running queries:

    On-demand pricing (per TiB). With this pricing model, you are charged for the number of bytes processed by each query. The first 1 TiB of query data processed per month is free.

    Queries (on-demand) - $6.25 per TiB - The first 1 TiB per month is free.

nottorp

0 replies

6h9m

2024-05-27 12:19:07 UTC

How many workloads need more than 24TB of RAM or 445 CPU cores?

<cough> Electron?

nextworddev

0 replies

4h19m

2024-05-27 14:09:24 UTC

Big data is there so that it can justify Databricks and Snowflake valuations /s

ml-anon

0 replies

7h8m

2024-05-27 11:20:21 UTC

Big data was always a buzzword that was weirdly coopted by database people in a way which makes 0 sense. Of course there are vanishingly small number of use cases where we need fast random access to any possible record to look up.

However what ML systems and in particular LLMs rely on having access to millions (if not billions or trillions) of examples. The underlying infra of which is based on some of these tools.

Big Data isn't dead, just this weird idea that the tools and usecases around querying databases has been finally recognised as being mostly useless to most people. It is and always has been about training ML models.

mavili

0 replies

2h50m

2024-05-27 15:38:05 UTC

Big Data hasn't been about storage, I thought it was always about processing. Guy obviously knows his stuff but I got the impression he stressed more about storage and how that's cheap and easy these days. When he does mention processing/computing, he mentions that most of the time people end up only querying recent data (ie small chunk of actual data they hold) but that bears the question: is querying only small chunk of data what businesses need, or are they doing it because querying the whole dataset is just not manageable? In other words, if processing all data at once was as easy as querying the most recent X percent, would most businesses still choose to only query the small chunk? I think there lies the answer whether Big Data (processing) is needed or not.

markus_zhang

0 replies

3h49m

2024-05-27 14:38:58 UTC

I think one problem that arises from practical work is: Databases seem to be biased towards either transactional (including fetching single records) or aggregational workload, but in reality both are used extensively. This also brings difficulty in data modelling, when we DEs are mostly thinking about aggregating data while our users also want to investigate single records.

Actually, now that I think about it, we should have two products for the users, one let them to query single records as fast as possible without hitting the production OLTP transactional database, even from really big data (find one record from PB level data in seconds), one to power the dashboards that ONLY show aggregation. Is Lakehouse a solution? I have never used it.

maartet

0 replies

3h32m

2024-05-27 14:55:51 UTC

Reminds me of this gem from the previous decade: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html (Don't use Hadoop - your data isn't that big)

Which, of course, was discussed on HN back then: https://news.ycombinator.com/item?id=6398650

lukev

0 replies

3h48m

2024-05-27 14:40:16 UTC

This is a good post, but it's somewhat myopically focused on typical "business" data.

The most interesting applications for "big data" are all (IMO) in the scientific computing space. Yeah, your e-commerce business probably won't ever need "big data" but load up a couple genomics research sets and you sure will.

idunnoman1222

0 replies

2h29m

2024-05-27 15:58:58 UTC

I love how the way he talks about this. His paycheque from big-data is what is dead. His service offering was always bullshit.

iamleppert

0 replies

5h42m

2024-05-27 12:46:40 UTC

What he means to say, is the grift is dead. All the best fish have been fished in that data lake (pun intended), leaving most waiting on the line to truly catch an appealing mackerel. Most of the big data people I know have moved on to more lucrative grifts like crypto and (more recently) AI.

There are still bags to be made if you can scare up a CTO or one of his lieutenants working for a small to medium size Luddite company. Add in storage on the blockchain and a talking AI parrot if you want some extra gristle in your grift.

Long live the tech grifters!

gigatexal

0 replies

5h10m

2024-05-27 13:17:56 UTC

DuckDb is nothing short of amazing. The only thing is when the dataset is bigger than system ram it falls apart. Spilling to disk is still broken.

geertj

0 replies

7h30m

2024-05-27 10:57:44 UTC

I agree with the article that most data sets comfortably fit into a single traditional DB system. But I don’t think that implies that big data is dead. To me big data is about storing data in a columnar storage format with a weak schema, and using a query system based on partitioning and predicate push down instead of indexes. This allows the data to be used in an ad-hoc way by data science or other engineers to answer questions you did not have when you designed the system. Most setups would be relatively small, but could be made to scale relatively well using this architecture.

fredstar

0 replies

2h4m

2024-05-27 16:24:01 UTC

I am in a software services company for more than 15 years. And to be honest, a lot of these big topics have always been some kind of sales talk or door opener. You write a white paper, nominate an 'expert' in your team and use these things in conversations with clients. Sure some trends are way more real and useful then others. But for me the article hits the nail on its head.

estheryo

0 replies

8h14m

2024-05-27 10:14:31 UTC

“MOST PEOPLE DON’T HAVE THAT MUCH DATA” That's really true

donatj

0 replies

5h26m

2024-05-27 13:02:12 UTC

The article only touches on it for a moment but GDPR killed big data. The vast majority of the data that any regular business would have and could be considered big almost certainly contained PII in one form or another. It became too much of a liability to keep that around.

With GDPR, we went from keeping everything by default unless a customer explicitly requested it gone to deleting it all automatically after a certain number of days after their license expires. This makes opaque data lakes completely untenable.

Don't get me wrong, this is all a net positive. The customers data is physically removed and they don't have to worry about future leaks or malicious uses, and we get a more efficient database. The only people really fussed were the sales team trying to lure people back with promises that they could pick right back up where they left off.

debarshri

0 replies

6h52m

2024-05-27 11:35:51 UTC

My first job was doing hadoop stuff around ~2011. I think one of the biggest motivators for big data or rather hadoop adoption was that it was open source. Back then most of the data warehousing platforms dominated by oracle, netezza, EMC, teradata etc. were super expensive on per GB basis. It was followed by lot of success stories about how facebook save $$$ or you could use google's mapreduce for free etc.

Everyone was talking about data being the new "oil".

Enterprise could basically deploy petabyte scale warehouse run HBase or Hive on top of it and build makeshift data-warehouses. It was also when the cloud was emerging, people started creating EMR clusters and deploy workloads there.

I think it was solution looking for problem. And the problem existed only for a handful of companies.

I think somehow, how cloud providers abstracted lot of these tools and databases, gave a better service and we kind of forgot about hadoop et al.

corentin88

0 replies

6h41m

2024-05-27 11:47:07 UTC

No mention of Firebase? That might explain the slow decrease of MongoDB.

coldtea

0 replies

6h2m

2024-05-27 12:25:57 UTC

Selling "Big data" tooling and consulting was a nice money making scheme for while it lasted.

cheptsov

0 replies

4h8m

2024-05-27 14:19:43 UTC

A good clickbait title. One should credit the author for that.

As to the topic, IMO, there is a contradiction. The only way to handle big data is to divide it into chunks that aren’t expensive to query. In that sense, no data is "big" as long as it’s handled properly.

Also, about big data being only a problem for 1 percent of companies: it's a ridiculous argument implying that big data was supposed to be a problem for everyone.

I personally don’t see the point behind the article, with all due respect to the author.

I also see many awk experts here who have never been in charge of building enterprise data pipelines.

breckognize

0 replies

3h9m

2024-05-27 15:19:16 UTC

99.9%+ of data sets fit on an SSD, and 99%+ fit in memory. [1]

This is the thesis for https://rowzero.io. We provide a real spreadsheet interface on top of these data sets, which gives you all the richness and interactivity that affords.

In the rare cases you need more than that, you can hire engineers. The rest of the time, a spreadsheet is all you need.

[1] I made these up.

aorloff

0 replies

42m

2024-05-27 17:46:36 UTC

What is dead is the notion that some ever expanding data lake is a mine full of gems and not a cost center.

angarg12

0 replies

3h22m

2024-05-27 15:06:41 UTC

Previous years I would have completely agreed with this post. Nowadays with the AI and ML craze I'm not so sure. I've seen plenty of companies using using vast amounts of data to train ML models to incorporate to their products. Definitely more data that can be handle by a traditional DB, and well into Big Data territory.

This isn't a value judgement about whether that's a good idea, just an observation from talking with many tech companies doing ML. This definitely feels like a bubble that will burst in due time, but for now ML is turbocharging Big Data.

WesolyKubeczek

0 replies

6h37m

2024-05-27 11:51:32 UTC

I remember going on one of "big data" conferences back in 2015, when it was the buzzword of the day.

The talks were all concentrated around topics like: ingesting and writing the data as quickly as possible, sharding data for the benefit of ingesting data, and centralizing IoT data from around the whole world.

Back then I had questions which were shrugged off — back in the day it seemed to me — as extremely naïve, as if they signified that I was not the "in" crowd somehow for asking them. The questions were:

1) Doesn't optimizing highly for key-value access mean you need that you need to anticipate, predict, and implement ALL of the future access patterns? What if you need to change your queries a year in? The most concrete answer I got was that of course a good architect needs to know and design for all possible ways the data will be queried! I was amazed at either the level of prowess of the said architects — such predictive powers that I couldn't ever dream of attaining! — or the level of self-delusion, as the cynic in me put it.

2) How can it be faster if you keep shoving intermediate processing elements into your pipeline? It's not like you just mindlessly keep adding queues upon queues. That had never been answered. The processing speeds of high-speed pipelines may be impressive, but if some stupid awk over CSV can do it just as quickly on commodity hardware, something _must_ be wrong.

Shrezzing

0 replies

7h28m

2024-05-27 11:00:42 UTC

This is a quite good allegory for the way AI is currently discussed (perhaps the outcome will be different this time round). Particularly the scary slide[1] with the up-and-to-the-right graph, which is used in a near identical fashion today to show an apparently inevitable march of progress in the AI space due to scaling laws.

[2]https://motherduck.com/_next/image/?url=https%3A%2F%2Fweb-as...

RyanHamilton

0 replies

5h37m

2024-05-27 12:51:02 UTC

For 10 years he sold companies on Big Data they didn't need and he only just realised most people don't have big data. Now he's switched to small data tools we should use/buy that. Is it harsh to say either a) 10 Years = He isn't good at his job b) Jordan would sell whatever he get's paid to.

GuB-42

0 replies

6h3m

2024-05-27 12:25:42 UTC

AI is the new "big data". In fact, AI as it is done today is nothing without at least terabytes of data.

What the article talks about is more like a particular type of database architecture (often collectively called "NoSQL") that was a fad a few years ago, and as all fads, it went down. It doesn't mean having lots of data is useless, or that NoSQL is useless, just that it is not the solution to every problem. And also that there is a reason why regular SQL databases have been in use since the 70s: except in specific situation most people don't encounter, they just work.

BiboAlpha

0 replies

3h3m

2024-05-27 15:24:49 UTC

One of the major challenges with Big Data is that it often fails to deliver real business value, instead offering only misleading promises.