
ROOT: analyzing petabytes of data scientifically

captainmuon
18 replies
1d4h

A blast from the past, I used to work in particle physics and used ROOT a lot. I had a love/hate relationship with it. On the one hand, it had a lot of technical debt and idiosyncrasies. On the other hand, there are a bunch of things that are easier in ROOT than in more "modern" options like matplotlib. For example, anything that has to do with histograms. Or highly structured data (where your 'columns' contain objects with fields). Or just plotting functions (without having to allocate arrays for the x and y values). I also like the very straightforward object-oriented API. It feels like old-school C++ or Java, as opposed to pandas/matplotlib, which have a lot of method chaining, abuse of [] syntax, and other magic. It is not elegant, and quite verbose, but that is probably a good thing when doing a scientific analysis.
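To illustrate what I mean, here is a minimal PyROOT sketch (toy numbers, nothing experiment-specific; assumes a ROOT build with the Python bindings enabled). The histogram is filled in a loop and the function is plotted straight from an expression, with no x/y arrays anywhere:

    import ROOT

    # Histograms are first-class objects: fill them in a loop, no arrays needed
    h = ROOT.TH1F("h", "toy mass;m [GeV];entries", 100, 0.0, 200.0)
    for _ in range(10000):
        h.Fill(ROOT.gRandom.Gaus(91.0, 5.0))

    c1 = ROOT.TCanvas("c1")
    h.Draw()
    c1.SaveAs("hist.png")

    # Functions plot directly from an expression; ROOT samples it for you
    f = ROOT.TF1("f", "sin(x)/x", 0.1, 10.0)
    c2 = ROOT.TCanvas("c2")
    f.Draw()
    c2.SaveAs("func.png")

With matplotlib you would typically build a numpy array first and hand it to plt.hist, and sample the function yourself with linspace.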

I left about 5 years ago, when ROOT was in the middle of a big transition. They had already ripped out the old CINT interpreter and moved to a Clang-based interpreter (Cling), and as far as I know you can now run your analyses in Jupyter (in C++ or Python). I heard the code quality has improved a lot, too.

ilrwbwrkhv
8 replies
1d4h

I wonder if Haskell would also be a good fit for writing something like this.

shrimp_emoji
5 replies
1d4h

No.

mynameisvlad
3 replies
1d2h

This is a technical community. You really have to do better than a one word dismissal without any reasoning.

In other words, why do you think it’s not a good fit?

sfpotter
1 replies
20h47m

I think the response gets right to the point!

Using something like Haskell for ROOT is ridiculous for a lot of obvious reasons. A simple and dismissive "no" invites the cautious reader to discover them on their own rather than waste time engaging in a protracted debate. Maybe it's better to reject the idea out of hand and spend our time elsewhere.

mynameisvlad
0 replies
3h18m

That's just not how technical discussions work. Not everyone knows what you know, and the point of this community is to share knowledge, not gatekeep it behind some "discovering it yourself" bullshit. The fastest thing to do is not dismissing it with no explanation but rather explaining, for all the readers, why that is the case. Because if one person doesn't know, I can guarantee there are plenty out there who are just as interested to know. And it's a waste of everyone's time to have each person independently come to the same conclusion when it's apparently easily explainable.

You're free to not do any of that, of course, but be prepared to defend the fact that you'd rather shallowly dismiss something than engage in discussion.

dekhn
0 replies
18h40m

There are a number of reasons for this. The first is that the quant physics community has never really adopted functional programming. It's not particularly intuitive to scientists, who typically want to express their computation the way they think about it, something that C, C++, and Fortran are all long established at supporting. The second is that much of physics depends on old libraries written over the last 30-40 years, and it's easiest to use them from the language the library is written in, or one with a highly similar interface (for example, Python is close enough to C++ that many foreign function interfaces are literally just direct wrappers). The third is that types (other than simple scalars, arrays, and trees/graphs) have never been a high priority in quant physics. The fourth is that undergrad education outside CS rarely teaches students Haskell, while most undergrads in a quant field graduate knowing some amount of Python.

It's much more likely the physics community would adopt Julia, or maybe Rust, and even that has been pretty slow.

(nothing I said above should be construed as taking a position about the suitability of any specific language or lack thereof for doing scientific computing. I have opinions, but I am attempting to explain the reason factually with a minimum of bias)

hackable_sand
0 replies
1d2h

Could it though?

tikhonj
0 replies
1d2h

Haskell would be great for designing the interface of a library like this, but not for implementing it. It would definitely not look like "old-school C++ or Java" but, well, that's the whole point :P

I haven't used ROOT so I don't know how well it would work to write bindings for it in Haskell; it can be hard to provide a good interface to an implementation that was designed for a totally different style of use. Possible, just difficult.

goy
0 replies
1d2h

I think having Haskell bindings to it would be quite valuable. For the implementation of the core structures, though, it's better to stick to C++ to max out performance and keep finer control over resource usage. Haskell isn't particularly good at that.

EDIT: there's one at https://hackage.haskell.org/package/HROOT

casualscience
3 replies
23h13m

The best thing about ROOT was how it handled data loading. TTrees, with their column-based slicing on disk, are such a good idea. Ever since I graduated and moved into industry, I've been looking for something that works the same way.
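For anyone who hasn't used it, the columnar access pattern looks roughly like this through uproot (hypothetical file, tree, and branch names); only the branches you ask for get read from disk:

    import uproot

    # Hypothetical file/tree/branch names; only the requested branches are read
    with uproot.open("events.root") as f:
        tree = f["Events"]
        cols = tree.arrays(["pt", "eta"], library="np")

    print(cols["pt"][:5])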

moelf
1 replies
21h25m

Apache Arrow and Parquet both work this way. Even HDF5 in column mode isn't completely bad.

TTree is being succeeded by RNTuple, which is basically CERN's take on Apache Arrow; the two are incredibly similar.
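The equivalent on the industry side, sketched with pyarrow (hypothetical file and column names); as with TTree branches, only the listed columns are deserialized from disk:

    import pyarrow.parquet as pq

    # Column pruning: only "pt" and "eta" are read and decoded
    table = pq.read_table("events.parquet", columns=["pt", "eta"])
    df = table.to_pandas()
    print(df.head())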

amelius
0 replies
20h13m

Is this a kind of lazy loading?

dekhn
0 replies
18h46m

I was hosting one of the leads of ROOT at Google and we got to talking about ROOT. I mentioned sstables and columnio and he said "oh, yeah, we've been doing that for years".

BiteCode_dev
2 replies
1d3h

Honestly, now with ChatGPT, matplotlib's terrible API is less of a problem.

typon
0 replies
23h28m

This is a great example of why the age of truly terrible software is going to be ushered in as LLMs get better.

When the cost of complexity of interacting with an API is paid by the LLM, optimizing this particular part of software design (also one of the hardest to get right) will be less fashionable.

OutOfHere
0 replies
1d2h

That's true, but still, there are things you just can't do in matplotlib that you can do better in other GPT-aware packages like plotly.

ephimetheus
0 replies
23h36m

We all have a love/hate relationship with it. It’s a bit like Stockholm syndrome.

cozzyd
0 replies
23h13m

Because matplotlib is not so histogram-focused (I guess because the kids these days have plenty of RAM), people always show these abominable scatter plots that have so many points on top of each other that they're useless. Yuck.
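The fix is cheap, too. A 2D histogram shows the density that a saturated scatter plot hides; here is a rough matplotlib sketch with toy data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy data: a million correlated points, hopeless as a scatter plot
    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000_000)
    y = x + rng.normal(scale=0.5, size=1_000_000)

    # Bin the points instead of overplotting them
    plt.hist2d(x, y, bins=200, cmap="viridis")
    plt.colorbar(label="counts")
    plt.savefig("density.png")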

mjtlittle
11 replies
1d9h

Didn't know there was a .cern TLD

sneak
8 replies
1d9h

Yes, the root zone is terribly polluted now. Unfortunately there’s no way to unring that bell, people depend on a lot of these new domains now.

It was a huge mistake, borne out of greed and recklessness.

https://en.wikipedia.org/wiki/ICANN#Notable_events

jesprenj
3 replies
1d9h

I guess ICANN needs to get money somehow.

lambdaxyzw
2 replies
1d9h

Why can't it just get funding from the government?

rnhmjoj
0 replies
1d8h

Aren't they already getting an outrageous amount of money for essentially supervising a txt file?

j16sdiz
0 replies
18h42m

Which government?

Biganon
1 replies
1d9h

I fail to see the problem with those new TLDs.

oefrha
0 replies
1d8h

Certain gTLDs have been borderline scams. The most infamous one might be .sucks, an extortion scheme charging an annual protection fee of $$$, complete with the pre-registration process when you could buy <yourtrademark>.sucks for $$$$ before it’s snatched up by your enemies.

They also screwed up some old URL/email parsers/sniffers that hardcoded TLDs. Largely the fault of bad assumptions to begin with.

Other than the above, I don't see much of a problem. Whatever problems people like to point out about gTLDs already existed with numerous sketchy ccTLDs, like .io. Guess what, the latest hotness .ai is also one of those.

9dev
1 replies
1d8h

I still wonder why we need that arbitrary restriction anyway?

8organicbits
0 replies
1d8h

If we allowed all possible TLDs, then we'd need a default organization to administer them. The current setup requires an organization to control each TLD, which allows us to grant control to countries or large organizations. The web should be decentralized, which means TLD ownership should be spread across multiple organizations. More TLDs with more distinct owners is a better situation than one default.

ragebol
0 replies
1d9h

Handy if they host conferences, for people worried about too many TLDs perhaps.

https://con.cern is not yet used, so...

SiempreViernes
0 replies
1d9h

Yeah... according to wikipedia they've had it since 2014, but even now a lot of their pages are on .ch

elashri
8 replies
1d5h

There are not many reasons why new analyses should default to using ROOT instead of more user-friendly and sane options like uproot [1]. Maybe some people have a legacy workflow, or their experiments have many custom patches on top of ROOT (common practice) for other things, but for physics analysis you might just be torturing yourself.

Also I really like their 404 page [2]. And no it is not about room 404 :)

[1] https://github.com/scikit-hep/uproot5

[2] https://root.cern/404/

moelf
7 replies
1d5h

One common criticism of uproot is that it's not flexible when per-row computation gets complicated, because for-loops in Python are too slow. For that one can either use Numba (when it works), or, here's the shameless plug, use Julia: https://github.com/JuliaHEP/UnROOT.jl

Past HN discussion on Julia for particle physics: https://news.ycombinator.com/item?id=38512793
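To make the Numba route concrete, a minimal sketch with toy kinematics (nothing read from a real file); the per-event double loop is written plainly and JIT-compiled:

    import math
    import numpy as np
    import numba

    @numba.njit
    def best_pair_mass(pt, eta, phi):
        # Toy kernel: largest pair invariant mass in the massless approximation,
        # written as the plain double loop you would naturally reach for
        best = 0.0
        for i in range(pt.shape[0]):
            for j in range(i + 1, pt.shape[0]):
                m2 = 2.0 * pt[i] * pt[j] * (math.cosh(eta[i] - eta[j])
                                            - math.cos(phi[i] - phi[j]))
                best = max(best, m2)
        return math.sqrt(best)

    pt = np.array([45.0, 30.0, 12.0])
    eta = np.array([0.1, -1.2, 2.0])
    phi = np.array([0.3, 2.9, -1.5])
    print(best_pair_mass(pt, eta, phi))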

szvsw
4 replies
1d3h

A great alternative to Numba for accelerated Python is Taichi. It's trivial to convert a regular Python program into a Taichi kernel, and then it can target CUDA (and a variety of other options) as the backend. No need to worry about block/grid/thread allocation etc. At the same time, it's super deep, with great support for data classes, custom memory layouts for complexly nested classes, autograd, and so on. I'm a huge fan: it makes writing code that runs on the GPU and integrates with your Python libraries an absolute breeze. Super powerful. By far the best tool in the accelerated Python toolbox IMO.
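Roughly what that looks like, as a toy sketch (the exact API details may differ between Taichi versions):

    import taichi as ti

    ti.init(arch=ti.gpu)  # falls back to CPU if no supported GPU is found

    n = 1_000_000
    y = ti.field(dtype=ti.f32, shape=n)

    @ti.kernel
    def compute():
        # The outermost loop in a Taichi kernel is parallelized automatically;
        # no explicit block/grid/thread bookkeeping
        for i in range(n):
            y[i] = ti.sin(0.001 * i) * ti.exp(-1e-6 * i)

    compute()
    print(y.to_numpy().sum())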

OutOfHere
3 replies
1d2h

Negative, as Taichi doesn't even support Python 3.12, and it's unclear if it ever will. Why would I limit myself to an old version of Python?

almostgotcaught
0 replies
1d1h

> they made a lame excuse that Pytorch didn't support 3.12

how is this a lame excuse?

> but it fails on a bunch of PyTorch-related tests. We then figured out that PyTorch does not have Python 3.12 support

they have a dep that was blocking them from upgrading. you would have them do what? push pytorch to upgrade?

> Later, even when Pytorch added support for 3.12, nothing changed (so far) in Taichi.

my friend, that "Later" is feb/march of this year, i.e. 2-3 months ago. exactly how fast would you like this open source project to service your needs? not to mention there is a PR up for the bump.

I stand by my original comment.

elashri
1 replies
1d4h

That's true, and Julia might be a solution, but I don't see the adoption happening anytime soon.

But this particular problem (per-row computation) has different options to tackle it now in the HEP Python ecosystem. One approach is to leverage array programming with NumPy to vectorize operations as much as possible. By operating on entire arrays rather than looping over individual elements, significant speedups can often be achieved.

Another possibility is to use a library like Awkward Array, which is designed to work with nested, variable-sized data structures. Awkward Array integrates well with uproot and provides a powerful and flexible framework for performing fast computations on, e.g., jagged arrays.
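As a small illustration of the jagged-array style (toy data, hypothetical quantities), per-event cuts and reductions happen without any Python loop:

    import awkward as ak

    # A variable number of muon pTs per event
    pts = ak.Array([[45.0, 30.0], [], [12.0, 8.0, 5.0]])

    mask = ak.num(pts) >= 2              # events with at least two muons
    leading = ak.max(pts[mask], axis=1)  # leading pT in those events
    print(leading.tolist())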

moelf
0 replies
1d4h

Uproot already returns you Awkward arrays, so both things you mentioned are different ways of saying the same thing. The irreducible complexity of data analysis is there no matter how you do it, and "one-vector-at-a-time" sometimes feels like shoehorning (other terms people have come up with include vector-style mental gymnastics).

For the record, vector-style programming is great when it works; I mean, Julia even has dedicated syntax for broadcasting. I'm saying that when the irreducible complexity arrives, you don't want to NOT be able to just write a for-loop.

Just a recent example, a double-for loop looks like this in Awkward array: https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark/blo... -- the result looks "neat" as in a piece of art.

SiempreViernes
7 replies
1d9h

Ah, ROOT... every day I am thankful I don't have to use a version older than 6.

YakBizzarro
4 replies
1d9h

ROOT was one of the reasons I decided not to study particle physics.

oefrha
1 replies
1d9h

You don’t have to. I worked on data analysis (mostly cleaning and correction) for CMS (one of the two main experiments at LHC) for a while and didn’t have to touch it. Disclaimer: I was a high energy theorist, but did the aforementioned experimental work early in my PhD for funding.

aoanla
0 replies
1d8h

I mean, most of the researchers I know at least use PyRoot (or the Julia equivalent) as much as possible, rather than actually interacting with Root itself. Which probably saves their sanity...

tempay
0 replies
1d5h

These days you can mostly avoid it. The Python HEP ecosystem is now pretty advanced so you can even read ROOT files without needing root itself. See:

https://scikit-hep.org/

brnt
0 replies
1d8h

I did my master's and PhD around the time numpy/scipy got competitive for a lot of analysis (for me, a complete replacement), but the Python bindings for ROOT weren't there yet, or were in beta. ROOT-the-data-format remained the main output of Geant4, however, so I set up a tiny Python wrapper around a ROOT script that would dump any .root contents and load them up as a numpy file.

My plots looked a lot nicer ;)

twixfel
1 replies
1d9h

I'm still waiting for the interface-breaking, let's-finally-make-root-good, version 7, which I think I first heard about in 2016 or so... true vapourware.

leohonexus
5 replies
1d9h

Very cool to see large-scale software projects used for scientific discoveries.

Another example: Gravitational waves were found with GStreamer at LIGO: https://lscsoft.docs.ligo.org/gstlal/

semi-extrinsic
0 replies
22h45m

They even have a "gstlal-ugly" package!

aulin
1 replies
1d8h

Well, these are two very different examples. One, ROOT, is a powerful data analysis framework that, as powerful as it is, never became general and easy enough to use to get out of the HEP world.

The other one, gstreamer, is a beautifully designed platform with an architecture so nice it can be easily abstracted and reused in completely different scenarios, even ones that probably never occurred to the authors.

im3w1l
0 replies
1d5h

Gstreamer must have been a winamp clone right?

hkwerf
0 replies
1d8h

Here it's more the other way around. CERN needs a data analysis framework, so CERN develops, maintains and publishes it for other users.

That being said, I don't know whether it's actually a good idea for someone external to use it. My experience may be a little outdated, but it's quite clunky and dated. The big advantage of using it for CERN or particle physics stuff is that it's basically a standard, so it's easy to collaborate internally.

codecalec
4 replies
1d6h

ROOT is definitely the backbone of a ton of work done in experimental particle physics, but it is also the nightmare of new graduate students. It's effectively ingrained in particle physics and I don't expect that to change anytime soon.

elashri
3 replies
1d5h

It is not that bad now, with PyROOT (ROOT's Python interface) and uproot being options that are easy to learn for new graduate students. The problem is the legacy code which they usually have to maintain as part of their experiment's service work.

ephimetheus
2 replies
23h30m

I can't count the number of times a beginner did some stuff in PyROOT that was horrifically slow, and just implementing the exact same algorithm in C++ was two orders of magnitude faster.

Unless you're using RDataFrame, or it's just histogram plotting, be very careful with PyROOT.
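For reference, the RDataFrame pattern looks roughly like this (hypothetical tree and branch names); the Filter/Define strings are JIT-compiled as C++, so the event loop itself never runs in Python:

    import ROOT

    df = ROOT.RDataFrame("Events", "data.root")
    h = (df.Filter("nMuon >= 2")
           .Define("pt_lead", "Muon_pt[0]")
           .Histo1D(("pt_lead", "leading muon pT;pT [GeV];events",
                     100, 0.0, 200.0), "pt_lead"))

    c = ROOT.TCanvas("c")
    h.Draw()  # triggers the event loop in compiled code
    c.SaveAs("pt_lead.png")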

SiempreViernes
1 replies
21h39m

You should be using RDataFrame though, or awkward + dask.

ephimetheus
0 replies
21h29m

+1 for RDataFrame for what it can do. Just be prepared to bail to C++ and for loops when you exceed what it can do without major headaches.

scheme271
3 replies
1d9h

ROOT, providing the C++ repl that no one asked for.

pjmlp
0 replies
1d4h

Before ROOT, there were Energize C++ and Visual Age for C++ v4.0; however, both were too expensive and resource-demanding for early-1990s workstations.

There are also a couple of C++ live environments in the game industry.

fooker
0 replies
1d6h

The researchers behind this contributed it to mainline Clang as clang-repl.

Jeaye
0 replies
23h57m

I definitely asked for it. I'm using Cling for JIT compiling my native Clojure dialect: https://github.com/jank-lang/jank

Without Cling, this sort of thing wouldn't be feasible in C++. Not in the way which Clojure dialects work. The runtime is a library and the generated code is just using that library.

sbinet
3 replies
23h44m

IMHO, ROOT 3-5 was too many things, with a lot of poorly designed APIs and, most importantly, a lack of separation between ROOT-the-library and ROOT-the-program (lots of globals and assumptions that ROOT-the-program is how people should use it). ROOT 6 started to correct some of these things, but it takes time (and IMHO they are buying too much into LLVM and Clang, increasing the build times even more and worsening the hackability of ROOT as a project).

Also, for the longest time, the I/O format wasn't very well documented, with only 1 implementation.

Now, thanks to groot [1], uproot (which was developed building on the work from groot) and others (freehep, openscientist, ...), it's possible to read/write ROOT data w/o bringing in the whole TWorld. Interoperability. For data, that's very much paramount in my book: some hope of being able to read back that unique data 20, 30, ... years down the line.

[1] https://go-hep.org/x/hep/groot (I am the main dev behind go-hep)

ephimetheus
2 replies
23h34m

uproot to this day doesn’t properly implement reading TEfficiency, I believe, which is a bummer, to be honest.

ephimetheus
0 replies
22h51m

Yeah I think it has to do with the memberwise splitting. https://github.com/scikit-hep/uproot5/issues/38

I understand this has not been a priority so far.

It kinda works if you open a magic file with a specific on-disk representation which bypasses this, but that’s not a solution at all.

usgroup
2 replies
1d

I struggle to see why one may want to use an interactive analysis toolkit via C++. Could anyone who has used ROOT enlighten me on this? I understand why you may write it in C++, but why would you want to invoke it with C++ for this sort of work?

konstantinua00
0 replies
21h15m

If you can work in a fast language, why not?

Comments here have already mentioned a couple of horror stories of people, accidentally or through inexperience, doing a lot of work on top of the framework. If you can avoid that by simply not being slow, why not?

ephimetheus
0 replies
23h32m

All of our other code is C++. The data reconstruction framework writing ROOT files, the analysis frameworks doing stat analysis. The event data model is implemented in C++.

It has its rough edges, but you do get a lot of good synergy out of this setup for sure.

dailykoder
2 replies
1d9h

Debugging CERN ROOT scripts and ROOT-based programs in Eclipse IDE (30 Oct 2021)

Oh gosh. The nightmares. Which just goes to show that you can build extraordinary stuff in horrible environments.

BSDobelix
1 replies
1d9h

I don't understand, is this about Eclipse?

amadio
0 replies
21h9m

It was a nice guest post on the website about eclipse, but most people just use gdb. It is now possible to step through ROOT macros with gdb by exporting CLING_DEBUG=1. See https://indico.jlab.org/event/459/contributions/11563/

bobek
2 replies
1d6h

Aaah, this brings memories of late-night debugging sessions of code written by brilliant physicists without a computer science background ;)

xtracto
0 replies
16h34m

Hehe. I worked at an online lending website around 2013 with a group of particle physicists hired to build risk prediction models. They used ROOT for the modeling and built some interface through Ruby... From the software engineering POV it was an abomination, but the statistics POV was pretty neat.

This was way before the Python ecosystem gained traction. And R ML packages were also just starting.

andrepd
0 replies
1d

Ahh I can imagine the 2000 lines-long main() :)

SilverSlash
2 replies
1d9h

Let me guess, it only runs on an IBN 5100?

div72
0 replies
1d7h

Only for the optional "read time travel and world domination plans" module.

wolfspider
1 replies
1d

The part of ROOT I use is Cling, the C++ interpreter, along with Xeus in a Jupyter notebook. I decided one night to test the fastest n-body program from benchmarkgames, comparing Xeus and Python 3. With Xeus I get 15.58 seconds; running the fastest Python code with the Python 3 kernel, both on Binder using the same instance, I get 5 minutes. Output is exactly the same for both runs. Even with an overhead tax of ~300% for running dynamic C++ in this program, Cling is very quick. SIMD and vectorization were not used, just purely the code from benchmarkgames. I use Cling primarily as a quick stand-in JIT for languages that compile to C++.

Jeaye
0 replies
23h59m

I'm using Cling for JIT compiling my native Clojure dialect: https://github.com/jank-lang/jank

Trying to bring C++ into the Clojure world and Clojure/interactive programming into the C++ world.

rubicks
1 replies
1d

What I remember about ROOT Cint is that it was an absolute nightmare to work with, mostly because it couldn't do STL containers very well. It was a weird time to do language interop for physicists.

frumiousirc
0 replies
5h47m

Oh yes, I remember the CINT times, but then I also remember PAW and KUMAC.

Modern ROOT of course replaces CINT with Cling and STL containers are well supported.

lnauta
1 replies
1d6h

Have they released v7 yet? When I started my PhD they announced it, and I looked forward to the consistency it would introduce between certain parts of the software (some mismatches really don't make sense and are clearly organic), and now I'm already 2 years past my graduation.

npalli
0 replies
1d6h

v6.32

qa-wolf-bates
0 replies
1d2h

I think that this article is very interesting

nousernamed
0 replies
1d2h

the amount of times I googled 'taxis' with predictable results

koolala
0 replies
1d2h

can they release a quantized 1bit version? i dont think anyones pc can science this