
ROOT: analyzing petabytes of data scientifically

captainmuon
18 replies
1d4h

A blast from the past, I used to work in particle physics and used ROOT a lot. I had a love/hate relationship with it. On the one hand, it had a lot of technical debt and idiosyncrasies. On the other hand, there are a bunch of things that are easier in ROOT than in more "modern" options like matplotlib. For example, anything that has to do with histograms. Or highly structured data (where your 'columns' contain objects with fields). Or just plotting functions (without having to allocate arrays for the x and y values). I also like the very straightforward object-oriented API. It feels like old-school C++ or Java, as opposed to pandas/matplotlib, which have a lot of method chaining, abuse of [] syntax, and other magic. It is not elegant, and quite verbose, but that is probably a good thing when doing a scientific analysis.
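To illustrate what I mean, here is a minimal PyROOT sketch (toy numbers, nothing experiment-specific; assumes a ROOT build with the Python bindings enabled). The histogram is filled in a loop and the function is plotted straight from an expression, with no x/y arrays anywhere:

    import ROOT

    # Histograms are first-class objects: fill them in a loop, no arrays needed
    h = ROOT.TH1F("h", "toy mass;m [GeV];entries", 100, 0.0, 200.0)
    for _ in range(10000):
        h.Fill(ROOT.gRandom.Gaus(91.0, 5.0))

    c1 = ROOT.TCanvas("c1")
    h.Draw()
    c1.SaveAs("hist.png")

    # Functions plot directly from an expression; ROOT samples it for you
    f = ROOT.TF1("f", "sin(x)/x", 0.1, 10.0)
    c2 = ROOT.TCanvas("c2")
    f.Draw()
    c2.SaveAs("func.png")

With matplotlib you would typically build a numpy array first and hand it to plt.hist, and sample the function yourself with linspace.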

I left about 5 years ago, when ROOT was in the middle of a big transition. They had already ripped out the old CINT interpreter and moved to a Clang-based interpreter (Cling), and as far as I know you can now run your analyses in Jupyter (in C++ or Python). I heard the code quality has improved a lot, too.

ilrwbwrkhv
8 replies
1d4h

I wonder if Haskell would also be a good fit for writing something like this.

shrimp_emoji
5 replies
1d4h

No.

mynameisvlad
3 replies
1d2h

This is a technical community. You really have to do better than a one word dismissal without any reasoning.

In other words, why do you think it’s not a good fit?

sfpotter
1 replies
20h47m

I think the response gets right to the point!

Using something like Haskell for ROOT is ridiculous for a lot of obvious reasons. A simple and dismissive "no" invites the cautious reader to discover them on their own rather than waste time engaging in a protracted debate. Maybe it's better to reject the idea out of hand and spend our time elsewhere.

mynameisvlad
0 replies
3h18m

That's just not how technical discussions work. Not everyone knows what you know, and the point of this community is to share knowledge, not gatekeep it behind some "discovering it yourself" bullshit. The fastest thing to do is not dismissing it with no explanation but rather explaining, for all the readers, why that is the case. Because if one person doesn't know, I can guarantee there are plenty out there who are just as interested to know. And it's a waste of everyone's time to have each person independently come to the same conclusion when it's apparently easily explainable.

You're free to not do any of that, of course, but be prepared to defend the fact that you'd rather shallowly dismiss something than engage in discussion.

dekhn
0 replies
18h40m

There are a number of reasons for this. The first is that the quant physics community has never really adopted functional programming. It's not particularly intuitive to scientists, who typically want to express their computation the way they think about it, something that C, C++, and Fortran are all long established at supporting. The second is that much of physics depends on old libraries written over the last 30-40 years, and it's easiest to use them from the language the library is written in, or one with a highly similar interface (for example, Python is close enough to C++ that many foreign function interfaces are literally just direct wrappers). The third is that types (other than simple scalars, arrays, and trees/graphs) have never been a high priority in quant physics. The fourth is that undergrad education outside CS rarely teaches students Haskell, while most undergrads in a quant field graduate knowing some amount of Python.

It's much more likely the physics community would adopt Julia, or maybe Rust, and even that has been pretty slow.

(nothing I said above should be construed as taking a position about the suitability of any specific language or lack thereof for doing scientific computing. I have opinions, but I am attempting to explain the reason factually with a minimum of bias)

hackable_sand
0 replies
1d2h

Could it though?

tikhonj
0 replies
1d2h

Haskell would be great for designing the interface of a library like this, but not for implementing it. It would definitely not look like "old-school C++ or Java" but, well, that's the whole point :P

I haven't used ROOT so I don't know how well it would work to write bindings for it in Haskell; it can be hard to provide a good interface to an implementation that was designed for a totally different style of use. Possible, just difficult.

goy
0 replies
1d2h

I think having Haskell bindings to it would be quite valuable. For the implementation of the core structures, though, it's better to stick to C++ to max out performance and keep finer control over resource usage. Haskell isn't particularly good at that.

EDIT: there's one at https://hackage.haskell.org/package/HROOT

casualscience
3 replies
23h13m

The best thing about ROOT was how it handled data loading. TTrees, with their column-based slicing on disk, are such a good idea. Ever since I graduated and moved into industry, I've been looking for something that works the same way.
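For anyone who hasn't used it, the columnar access pattern looks roughly like this through uproot (hypothetical file, tree, and branch names); only the branches you ask for get read from disk:

    import uproot

    # Hypothetical file/tree/branch names; only the requested branches are read
    with uproot.open("events.root") as f:
        tree = f["Events"]
        cols = tree.arrays(["pt", "eta"], library="np")

    print(cols["pt"][:5])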

moelf
1 replies
21h25m

Apache Arrow and Parquet both work this way. Even HDF5 in column mode isn't completely bad.

TTree is being succeeded by RNTuple, which is basically CERN's take on Apache Arrow; the two are incredibly similar.
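The equivalent on the industry side, sketched with pyarrow (hypothetical file and column names); as with TTree branches, only the listed columns are deserialized from disk:

    import pyarrow.parquet as pq

    # Column pruning: only "pt" and "eta" are read and decoded
    table = pq.read_table("events.parquet", columns=["pt", "eta"])
    df = table.to_pandas()
    print(df.head())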

amelius
0 replies
20h13m

Is this a kind of lazy loading?

dekhn
0 replies
18h46m

I was hosting one of the leads of ROOT at Google and we got to talking about ROOT. I mentioned sstables and columnio and he said "oh, yeah, we've been doing that for years".

BiteCode_dev
2 replies
1d3h

Honestly, now with ChatGPT, matplotlib's terrible API is less of a problem.

typon
0 replies
23h28m

This is a great example of why the age of truly terrible software is going to be ushered in as LLMs get better.

When the cost of complexity of interacting with an API is paid by the LLM, optimizing this particular part of software design (also one of the hardest to get right) will be less fashionable.

OutOfHere
0 replies
1d2h

That's true, but still, there are things you just can't do in matplotlib that you can do better in other GPT-aware packages like plotly.

ephimetheus
0 replies
23h36m

We all have a love/hate relationship with it. It’s a bit like Stockholm syndrome.

cozzyd
0 replies
23h13m

Because matplotlib is not so histogram-focused (I guess because the kids these days have plenty of RAM), people always show these abominable scatter plots that have so many points on top of each other that they're useless. Yuck.
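The fix is cheap, too. A 2D histogram shows the density that a saturated scatter plot hides; here is a rough matplotlib sketch with toy data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy data: a million correlated points, hopeless as a scatter plot
    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000_000)
    y = x + rng.normal(scale=0.5, size=1_000_000)

    # Bin the points instead of overplotting them
    plt.hist2d(x, y, bins=200, cmap="viridis")
    plt.colorbar(label="counts")
    plt.savefig("density.png")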

mjtlittle
11 replies
1d9h

Didn't know there was a .cern TLD

sneak
8 replies
1d9h

Yes, the root zone is terribly polluted now. Unfortunately there’s no way to unring that bell, people depend on a lot of these new domains now.

It was a huge mistake, borne out of greed and recklessness.

https://en.wikipedia.org/wiki/ICANN#Notable_events

jesprenj
3 replies
1d9h

I guess ICANN needs to get money somehow.

lambdaxyzw
2 replies
1d9h

Why can't it just get funding from the government?

rnhmjoj
0 replies
1d8h

Aren't they already getting an outrageous amount of money for essentially supervising a txt file?

j16sdiz
0 replies
18h42m

Which government?

Biganon
1 replies
1d9h

I fail to see the problem with those new TLDs.

oefrha
0 replies
1d8h

Certain gTLDs have been borderline scams. The most infamous one might be .sucks, an extortion scheme charging an annual protection fee of $$$, complete with the pre-registration process when you could buy <yourtrademark>.sucks for $$$$ before it’s snatched up by your enemies.

They also screwed up some old URL/email parsers/sniffers that hardcoded TLDs. Largely the fault of bad assumptions to begin with.

Other than the above, I don't see much of a problem. Whatever problems people like to point out about gTLDs already existed with numerous sketchy ccTLDs, like .io. Guess what, the latest hotness .ai is also one of those.

9dev
1 replies
1d8h

I still wonder why we need that arbitrary restriction anyway?

8organicbits
0 replies
1d8h

If we allowed all possible TLDs, then we'd need a default organization to administer them. The current setup requires an organization to control each TLD, which allows us to grant control to countries or large organizations. The web should be decentralized, which means TLD ownership should be spread across multiple organizations. More TLDs with more distinct owners is a better situation than one default.

ragebol
0 replies
1d9h

Handy if they host conferences, for people worried about too many TLDs perhaps.

https://con.cern is not yet used, so...

SiempreViernes
0 replies
1d9h

Yeah... according to wikipedia they've had it since 2014, but even now a lot of their pages are on .ch

elashri
8 replies
1d5h

There are not many reasons why new analyses should default to using ROOT instead of more user-friendly and sane options like uproot [1]. Maybe some people have a legacy workflow, or their experiments have many custom patches on top of ROOT (common practice) for other things, but for physics analysis you might just be torturing yourself.

Also I really like their 404 page [2]. And no it is not about room 404 :)

[1] https://github.com/scikit-hep/uproot5

[2] https://root.cern/404/

moelf
7 replies
1d5h

One common criticism of uproot is that it's not flexible when per-row computation gets complicated, because for-loops in Python are too slow. For that one can either use Numba (when it works), or, here's the shameless plug, use Julia: https://github.com/JuliaHEP/UnROOT.jl

Past HN discussion on Julia for particle physics: https://news.ycombinator.com/item?id=38512793
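To make the Numba route concrete, a minimal sketch with toy kinematics (nothing read from a real file); the per-event double loop is written plainly and JIT-compiled:

    import math
    import numpy as np
    import numba

    @numba.njit
    def best_pair_mass(pt, eta, phi):
        # Toy kernel: largest pair invariant mass in the massless approximation,
        # written as the plain double loop you would naturally reach for
        best = 0.0
        for i in range(pt.shape[0]):
            for j in range(i + 1, pt.shape[0]):
                m2 = 2.0 * pt[i] * pt[j] * (math.cosh(eta[i] - eta[j])
                                            - math.cos(phi[i] - phi[j]))
                best = max(best, m2)
        return math.sqrt(best)

    pt = np.array([45.0, 30.0, 12.0])
    eta = np.array([0.1, -1.2, 2.0])
    phi = np.array([0.3, 2.9, -1.5])
    print(best_pair_mass(pt, eta, phi))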

szvsw
4 replies
1d3h

A great alternative to Numba for accelerated Python is Taichi. It's trivial to convert a regular Python program into a Taichi kernel, and then it can target CUDA (and a variety of other options) as the backend. No need to worry about block/grid/thread allocation etc. At the same time, it's super deep, with great support for data classes, custom memory layouts for complexly nested classes, autograd, and so on. I'm a huge fan: it makes writing code that runs on the GPU and integrates with your Python libraries an absolute breeze. Super powerful. By far the best tool in the accelerated Python toolbox IMO.
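Roughly what that looks like, as a toy sketch (the exact API details may differ between Taichi versions):

    import taichi as ti

    ti.init(arch=ti.gpu)  # falls back to CPU if no supported GPU is found

    n = 1_000_000
    y = ti.field(dtype=ti.f32, shape=n)

    @ti.kernel
    def compute():
        # The outermost loop in a Taichi kernel is parallelized automatically;
        # no explicit block/grid/thread bookkeeping
        for i in range(n):
            y[i] = ti.sin(0.001 * i) * ti.exp(-1e-6 * i)

    compute()
    print(y.to_numpy().sum())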

OutOfHere
3 replies
1d2h

Negative, as Taichi doesn't even support Python 3.12, and it's unclear if it ever will. Why would I limit myself to an old version of Python?

almostgotcaught
0 replies
1d1h

> they made a lame excuse that Pytorch didn't support 3.12

how is this a lame excuse?

> but it fails on a bunch of PyTorch-related tests. We then figured out that PyTorch does not have Python 3.12 support

they have a dep that was blocking them from upgrading. you would have them do what? push pytorch to upgrade?

> Later, even when Pytorch added support for 3.12, nothing changed (so far) in Taichi.

my friend, that "Later" is feb/march of this year, i.e. 2-3 months ago. exactly how fast would you like this open source project to service your needs? not to mention there is a PR up for the bump.

I stand by my original comment.

elashri
1 replies
1d4h

That's true, and Julia might be a solution, but I don't see the adoption happening anytime soon.

But this particular problem (per-row computation) has different options to tackle it now in the HEP Python ecosystem. One approach is to leverage array programming with NumPy to vectorize operations as much as possible. By operating on entire arrays rather than looping over individual elements, significant speedups can often be achieved.

Another possibility is to use a library like Awkward Array, which is designed to work with nested, variable-sized data structures. Awkward Array integrates well with uproot and provides a powerful and flexible framework for performing fast computations on, e.g., jagged arrays.
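As a small illustration of the jagged-array style (toy data, hypothetical quantities), per-event cuts and reductions happen without any Python loop:

    import awkward as ak

    # A variable number of muon pTs per event
    pts = ak.Array([[45.0, 30.0], [], [12.0, 8.0, 5.0]])

    mask = ak.num(pts) >= 2              # events with at least two muons
    leading = ak.max(pts[mask], axis=1)  # leading pT in those events
    print(leading.tolist())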

moelf
0 replies
1d4h

Uproot already returns you Awkward arrays, so both things you mentioned are different ways of saying the same thing. The irreducible complexity of data analysis is there no matter how you do it, and "one-vector-at-a-time" sometimes feels like shoehorning (other terms people have come up with include vector-style mental gymnastics).

For the record, vector-style programming is great when it works; I mean, Julia even has dedicated syntax for broadcasting. I'm saying that when the irreducible complexity arrives, you don't want to NOT be able to just write a for-loop.

Just a recent example, a double-for loop looks like this in Awkward array: https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark/blo... -- the result looks "neat" as in a piece of art.

SiempreViernes
7 replies
1d9h

Ah, ROOT... every day I am thankful I don't have to use a version older than 6.

YakBizzarro
4 replies
1d9h

ROOT was one of the reasons I decided not to study particle physics.

oefrha
1 replies
1d9h

You don’t have to. I worked on data analysis (mostly cleaning and correction) for CMS (one of the two main experiments at LHC) for a while and didn’t have to touch it. Disclaimer: I was a high energy theorist, but did the aforementioned experimental work early in my PhD for funding.

aoanla
0 replies
1d8h

I mean, most of the researchers I know at least use PyRoot (or the Julia equivalent) as much as possible, rather than actually interacting with Root itself. Which probably saves their sanity...

tempay
0 replies
1d5h

These days you can mostly avoid it. The Python HEP ecosystem is now pretty advanced so you can even read ROOT files without needing root itself. See:

https://scikit-hep.org/

brnt
0 replies
1d8h

I did my master's and PhD around the time numpy/scipy got competitive for a lot of analysis (for me, a complete replacement), but the Python bindings for ROOT weren't there yet, or were in beta. ROOT-the-data-format remained the main output of Geant4, however, so I set up a tiny Python wrapper around a ROOT script that would dump any .root contents and load them up as a numpy file.

My plots looked a lot nicer ;)

twixfel
1 replies
1d9h

I'm still waiting for the interface-breaking, let's-finally-make-root-good, version 7, which I think I first heard about in 2016 or so... true vapourware.

leohonexus
5 replies
1d9h

Very cool to see large-scale software projects used for scientific discoveries.

Another example: Gravitational waves were found with GStreamer at LIGO: https://lscsoft.docs.ligo.org/gstlal/

semi-extrinsic
0 replies
22h45m

They even have a "gstlal-ugly" package!

aulin
1 replies
1d8h

Well, these are two very different examples. One, ROOT, is a powerful data analysis framework that, as powerful as it is, never became general and easy enough to use to get out of the HEP world.

The other one, gstreamer, is a beautifully designed platform with an architecture so nice it can be easily abstracted and reused in completely different scenarios, even ones that probably never occurred to the authors.

im3w1l
0 replies
1d5h

Gstreamer must have been a winamp clone right?

hkwerf
0 replies
1d8h

Here it's more the other way around. CERN needs a data analysis framework, so CERN develops, maintains and publishes it for other users.

That being said, I don't know whether it's actually a good idea for someone external to use it. My experience may be a little outdated, but it's quite clunky and dated. The big advantage of using it for CERN or particle physics stuff is that it's basically a standard, so it's easy to collaborate internally.

codecalec
4 replies
1d6h

ROOT is definitely the backbone of a ton of work done in experimental particle physics, but it is also the nightmare of new graduate students. It's effectively ingrained in particle physics and I don't expect that to change anytime soon.

elashri
3 replies
1d5h

It is not that bad now, with PyROOT (ROOT's Python interface) and uproot being options that are easy to learn for new graduate students. The problem is the legacy code which they usually have to maintain as part of their experiment's service work.

ephimetheus
2 replies
23h30m

I can't count the number of times a beginner did some stuff in PyROOT that was horrifically slow, and just implementing the exact same algorithm in C++ was two orders of magnitude faster.

Unless you're using RDataFrame, or it's just histogram plotting, be very careful with PyROOT.
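For reference, the RDataFrame pattern looks roughly like this (hypothetical tree and branch names); the Filter/Define strings are JIT-compiled as C++, so the event loop itself never runs in Python:

    import ROOT

    df = ROOT.RDataFrame("Events", "data.root")
    h = (df.Filter("nMuon >= 2")
           .Define("pt_lead", "Muon_pt[0]")
           .Histo1D(("pt_lead", "leading muon pT;pT [GeV];events",
                     100, 0.0, 200.0), "pt_lead"))

    c = ROOT.TCanvas("c")
    h.Draw()  # triggers the event loop in compiled code
    c.SaveAs("pt_lead.png")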

SiempreViernes
1 replies
21h39m

You should be using RDataFrame though, or awkward + dask.

ephimetheus
0 replies
21h29m

+1 for RDataFrame for what it can do. Just be prepared to bail to C++ and for loops when you exceed what it can do without major headaches.

scheme271
3 replies
1d9h

ROOT, providing the C++ repl that no one asked for.

pjmlp
0 replies
1d4h

Before ROOT, there were Energize C++ and Visual Age for C++ v4.0; however, both were too expensive and resource-demanding for early-1990s workstations.

There are also a couple of C++ live environments in the game industry.

fooker
0 replies
1d6h

The researchers behind this contributed it to mainline Clang as clang-repl.

Jeaye
0 replies
23h57m

I definitely asked for it. I'm using Cling for JIT compiling my native Clojure dialect: https://github.com/jank-lang/jank

Without Cling, this sort of thing wouldn't be feasible in C++. Not in the way which Clojure dialects work. The runtime is a library and the generated code is just using that library.

sbinet
3 replies
23h44m

IMHO, ROOT 3-5 was too many things, with a lot of poorly designed APIs and, most importantly, a lack of separation between ROOT-the-library and ROOT-the-program (lots of globals and assumptions that ROOT-the-program is how people should use it). ROOT 6 started to correct some of these things, but it takes time (and IMHO they are buying too much into LLVM and Clang, increasing the build times even more and worsening the hackability of ROOT as a project).

Also, for the longest time, the I/O format wasn't very well documented, with only 1 implementation.

Now, thanks to groot [1], uproot (which was developed building on the work from groot) and others (freehep, openscientist, ...), it's possible to read/write ROOT data w/o bringing in the whole TWorld. Interoperability. For data, that's very much paramount in my book: some hope of being able to read back that unique data 20, 30, ... years down the line.

[1] https://go-hep.org/x/hep/groot (I am the main dev behind go-hep)

ephimetheus
2 replies
23h34m

uproot to this day doesn’t properly implement reading TEfficiency, I believe, which is a bummer, to be honest.

ephimetheus
0 replies
22h51m

Yeah I think it has to do with the memberwise splitting. https://github.com/scikit-hep/uproot5/issues/38

I understand this has not been a priority so far.

It kinda works if you open a magic file with a specific on-disk representation which bypasses this, but that’s not a solution at all.

usgroup
2 replies
1d

I struggle to see why one may want to use an interactive analysis toolkit via C++. Could anyone who has used ROOT enlighten me on this? I understand why you may write it in C++, but why would you want to invoke it with C++ for this sort of work?

konstantinua00
0 replies
21h15m

If you can work in a fast language, why not?

Comments here have already mentioned a couple of horror stories of people, accidentally or through inexperience, doing a lot of work on top of the framework. If you can avoid that by simply not being slow, why not?

ephimetheus
0 replies
23h32m

All of our other code is C++. The data reconstruction framework writing ROOT files, the analysis frameworks doing stat analysis. The event data model is implemented in C++.

It has its rough edges, but you do get a lot of good synergy out of this setup for sure.

dailykoder
2 replies
1d9h

Debugging CERN ROOT scripts and ROOT-based programs in Eclipse IDE (30 Oct 2021)

Oh gosh. The nightmares. Which just goes to show that you can build extraordinary stuff in horrible environments.

BSDobelix
1 replies
1d9h

I don't understand, is this about Eclipse?

amadio
0 replies
21h9m

It was a nice guest post on the website about eclipse, but most people just use gdb. It is now possible to step through ROOT macros with gdb by exporting CLING_DEBUG=1. See https://indico.jlab.org/event/459/contributions/11563/

bobek
2 replies
1d6h

Aaah, this brings memories of late-night debugging sessions of code written by brilliant physicists without a computer science background ;)

xtracto
0 replies
16h34m

Hehe. I worked at an online lending website around 2013 with a group of particle physicists hired to build risk prediction models. They used ROOT for the modeling and built some interface through Ruby... From the software engineering POV it was an abomination, but the statistics POV was pretty neat.

This was way before the Python ecosystem gained traction. And R ML packages were also just starting.

andrepd
0 replies
1d

Ahh I can imagine the 2000 lines-long main() :)

SilverSlash
2 replies
1d9h

Let me guess, it only runs on an IBN 5100?

div72
0 replies
1d7h

Only for the optional "read time travel and world domination plans" module.

wolfspider
1 replies
1d

The part of ROOT I use is Cling, the C++ interpreter, along with Xeus in a Jupyter notebook. I decided one night to test the fastest n-body program from benchmarkgames, comparing Xeus and Python 3. With Xeus I get 15.58 seconds; running the fastest Python code with the Python 3 kernel, both on Binder using the same instance, I get 5 minutes. Output is exactly the same for both runs. Even with an overhead tax of ~300% for running dynamic C++ in this program, Cling is very quick. SIMD and vectorization were not used, just purely the code from benchmarkgames. I use Cling primarily as a quick stand-in JIT for languages that compile to C++.

Jeaye
0 replies
23h59m

I'm using Cling for JIT compiling my native Clojure dialect: https://github.com/jank-lang/jank

Trying to bring C++ into the Clojure world and Clojure/interactive programming into the C++ world.

rubicks
1 replies
1d

What I remember about ROOT Cint is that it was an absolute nightmare to work with, mostly because it couldn't do STL containers very well. It was a weird time to do language interop for physicists.

frumiousirc
0 replies
5h47m

Oh yes, I remember the CINT times, but then I also remember PAW and KUMAC.

Modern ROOT of course replaces CINT with Cling and STL containers are well supported.

lnauta
1 replies
1d6h

Have they released v7 yet? When I started my PhD they announced it, and I looked forward to the consistency it would introduce between certain parts of the software (some mismatches really don't make sense and are clearly organic), and now I'm already 2 years past my graduation.

npalli
0 replies
1d6h

v6.32

qa-wolf-bates
0 replies
1d2h

I think that this article is very interesting

nousernamed
0 replies
1d2h

the amount of times I googled 'taxis' with predictable results

koolala
0 replies
1d2h

can they release a quantized 1bit version? i dont think anyones pc can science this