
Polars

j1elo
161 replies
18h28m

It's crystal clear that this page has been written for people who already know what they're looking at; the first line of the first paragraph, far from describing the tool, jumps straight to its qualities: "Polars is written from the ground up with performance in mind"

And the rest follows the same line.

Could anyone ELI5 what this is and what needs it's a good solution for?

EDIT: So, an alternative implementation of the pandas DataFrame. Google gave me [0], which explains:

The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.

[0]: https://realpython.com/pandas-dataframe/
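To make that concrete, here's a minimal pandas sketch (the data is invented):

    import pandas as pd

    # A DataFrame is a labeled two-dimensional table whose columns
    # can hold different types.
    df = pd.DataFrame({
        "city": ["Oslo", "Lima", "Pune"],
        "temp_c": [4.0, 19.5, 31.2],
    })
    print(df[df["temp_c"] > 10])  # filter rows, much like a SQL WHERE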

DylanDmitri
48 replies
18h25m

It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.

dkga
16 replies
18h6m

Actually, pandas is not the original open source data frame library; perhaps it's only the original in Python. There is a very rich tradition in R of data.frames, which includes the unjustly neglected data.table.

p4ul
9 replies
17h44m

Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.

One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.

ayhanfuat
6 replies
17h40m

The difference is dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc. These kinds of things don't really make sense in DB tables (they are generally not supported, and you have to jump through hoops to do similar things in DBs).

p4ul
2 replies
17h35m

Yes, that's totally fair; dataframes are more flexible in that sense.

p4ul
1 replies
15h25m

Oh, and another important difference is memory layout. Dataframe implementations mostly (or all) use a column-major format, whereas most conventional SQL implementations use a row-major format, I believe.

theLiminator
0 replies
11h6m

I think most OLTP databases are row-oriented, whilst most OLAP databases are column-oriented.

kkoncevicius
2 replies
10h3m

The difference is dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc.

I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.

ayhanfuat
1 replies
8h34m

Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.

They still have their advantages: row/column labels, NaN handling, etc. These are not operations I am speculating about, by the way. I am most familiar with pandas, and the dataframe there has transpose and dot product operations, and almost all column operations have a row counterpart (e.g. you can sum(axis=0) or sum(axis=1)).
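For concreteness, a small sketch of those pandas operations (values invented):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
    col_sums = df.sum(axis=0)  # aggregate down each column
    row_sums = df.sum(axis=1)  # aggregate across each row
    flipped = df.T             # transpose, keeping row/column labels
    product = df.dot(df.T)     # matrix-style dot product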

kkoncevicius
0 replies
7h18m

Oh, based on the comment you replied to I thought this was about R. In R matrices can handle NaNs and NAs, have column and row labels, have dot products and much more.

xwowsersx
1 replies
16h8m

I feel like the predecessor of R should be Q!

p4ul
0 replies
15h32m

The way that I've heard the story, S was short for "statistics", and R was chosen because the authors were _R_obert [Gentleman] and _R_oss [Ihaka].

Statisticians are funny!

7thaccount
3 replies
17h58m

Yeah. I think Wes McKinney liked the data frames in R but preferred Python as a programming language. I've heard somewhere that he also got a lot of inspiration from APL.

Cacti
2 replies
13h42m

R is literally designed to do statistics, with first-class language support for many specialized tasks in statistics and closely related fields.

Python is literally designed to be easy to program with in general.

Well, it turns out when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this and they are very loud about it.

But it shouldn’t really surprise anyone. That is literally how those languages are designed.

Most of the R devs I know who are like this are just butthurt that they're paid less, and they refuse to switch because they're obstinate, or because they're a little scared they're being left behind. The first group is all over the place, but the second group tends to skew older, of course.

mjhay
0 replies
4h47m

R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.

R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).

disgruntledphd2
0 replies
7h28m

Most of the R devs I know who are like this are just butthurt that they're paid less, and they refuse to switch because they're obstinate, or because they're a little scared they're being left behind. The first group is all over the place, but the second group tends to skew older, of course.

Look, I started with R and use mostly Python these days, but this is not really a fair take.

R is (still) much, much, much better for analytics and graphing (the only decent plotting library in python is a ggplot clone). The big change (and why Python ended up winning) is that integrating R with other tools (like web stuff, for example) is harder than just using Python.

pandas (for instance) is like an unholy clone of the worst features from both R and Python. Polars is pretty rocking, though (mostly because it clones from Spark/dplyr/LINQ).

It's another example of Python being the second best language for everything winning out in the marketplace.

That being said, if I was starting a data-focused company and needed to pick a language, I'd almost certainly build all the DS-focused stuff in R, as it would be many, many times quicker, as long as I didn't need to hire too many people.

bonadrag
0 replies
5h16m

which includes the unjustly neglected data.table

So so true.

I was working on an ad hoc project that needed a quick result by the end of the day. I had to pull in a series of parquet files and do some quick and dirty analysis. My first reflex was to use Python with pandas: quick and easy. But Python could not handle the datasets; they were too large. I decided to give R and data.table a go, and it went smoothly. I am usually a Python user, but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.

Cacti
0 replies
14h21m

My friend, you cannot make people like R. We all know about and study data.table, so it's not neglected; we just don't use that implementation.

Mainly because R sucks for anything that isn’t statistics.

maliker
10 replies
18h11m

Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...

hyperpl
5 replies
14h58m

Polars has an OLAP query engine, so without a significant overhaul I highly doubt pandas will come close to Polars in performance for many general-case workloads.

dash2
4 replies
8h13m

This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?

disgruntledphd2
3 replies
7h26m

Polars can use lazy processing, where it collects all of the operations together and creates a graph of what needs to happen, while pandas executes everything eagerly, as soon as the code is called.

Spark tends to do this, and it makes complete sense for distributed setups, but apparently it's still faster locally too.
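A minimal sketch of that lazy style in Polars (file name invented; older Polars versions spell group_by as groupby):

    import polars as pl

    result = (
        pl.scan_csv("sales.csv")       # lazy: nothing is read yet
        .filter(pl.col("amount") > 0)  # just recorded into the plan
        .group_by("region")
        .agg(pl.col("amount").sum())
        .collect()                     # plan is optimized, then run
    )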

mjhay
2 replies
3h30m

Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.

disgruntledphd2
1 replies
2h52m

Yeah, totally, I can see that. I think Polars is the first library to do this locally, which is surprising if it has so many advantages.

mjhay
0 replies
2h25m

It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.

jasonjmcghee
1 replies
15h45m

Not according to DuckDB benchmarks. Not even close.

https://duckdblabs.github.io/db-benchmark/

keithalewis
0 replies
5h17m

Ouch! It is going to take a lot of work to get Polars this fast. If ever.

vietvu
0 replies
11h18m

Not with the eager API.

thejosh
0 replies
18h5m

Memory and CPU usage is still really high though.

bee_rider
8 replies
18h20m

Ah, like polar bears are a much more aggressive implementation of the idea behind panda bears? That’s a pretty funny name if so.

Icathian
7 replies
18h18m

Yeah. The name always makes me chuckle

debo_
5 replies
18h14m

I don't know, I think the name is kind of polar-izing

/pun

p4ul
4 replies
16h51m

Oh, I'm not sure. I'd say it's bear-ly polarizing.

I'm so sorry.

kevindamm
3 replies
16h21m

Depends on your frame of mind.

xwowsersx
2 replies
16h6m

This thread is turning into pandamonium

p4ul
1 replies
16h2m

I'm worried it's going to get grizzly.

debo_
0 replies
14h29m

Thanks folks, you all made my day. "Frame of mind" was my favorite. I'm surprised I didn't think of some of these, I must be getting... Rust-y

sesm
0 replies
4h34m

Next re-implementation will be called grizzl.ys, hand-written in Y86 assembly.

gmfawcett
5 replies
18h16m

Pandas is the original open source data frame library

...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.

p4ul
4 replies
16h13m

That's interesting! I didn't realize there had been prior dataframe libraries in Python!

Out of curiosity, what was/were the previous libraries?

melagonster
3 replies
11h1m

It's a built-in data structure, with built-in functions, in R.

p4ul
2 replies
4h27m

Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.

It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.

melagonster
1 replies
2h18m

OK, I guess I misunderstood both of your comments. ´_>`

gmfawcett
0 replies
42m

Sorry :) Pandas is the undisputed king. But there were multiple bindings from Python into R available in the early 2000s. Some, like rpy and rpy2, are still around; others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.

ayhanfuat
4 replies
18h20m

Not the *original*, but probably the most commonly used.

froh
2 replies
18h15m

Yup, I first met data frames in R; pandas is the Python answer to R, isn't it?

tomrod
1 replies
15h17m

If I understand correctly, pandas' original scope was indexed in-memory data frames for use in high-frequency trading, using the numpy library under the hood. At the time it was written you had JPMC's Athena, GS's platform, and several HFT internal systems (written in C++, friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.

froh
0 replies
8h14m

Indeed, it's both: it was created for financial analytics, and it provides R dataframe features to Python. Thanks for making me detour into its history.

drbaba
0 replies
18h16m

Yeah, I believe Pandas was inspired by similar functionality in R.

anigbrowl
44 replies
17h50m

Yes, it's an annoying trait of many tech products. Of course it's natural to want to speak to your target audience (in this case, data scientists who like Pandas but find it annoyingly slow/inflexible), but it's quite alienating to newbies who might otherwise become your most enthusiastic customers.

I am the target audience for Polars and have been meaning to try it for several months, but I keep procrastinating because I feel residual loyalty to Pandas: Wes McKinney (its creator) took the time to write a helpful book about the most common analytical tools: https://wesmckinney.com/book/

jeroenjanssens
15 replies
10h13m

Ritchie Vink (the creator of Polars) deliberately decided not to write a book so that he (and his team) can focus full time on Polars itself.

Thijs Nieuwdorp and I are currently working on the O'Reilly book "Python Polars: The Definitive Guide" [1]. It'll be a while before it gets released, but the Early Release version on the O'Reilly platform gets updated regularly. We also post draft chapters on the Polars Discord server [2].

The Discord server is also a great place to ask questions and interact with the Polars team.

[1] More information about the book: https://jeroenjanssens.com/pp/

[2] Polars Discord server: https://discord.gg/fngBqDry

baq
11 replies
10h7m

Slightly offtopic: it's a tragedy that projects like this use Discord as the primary discussion forum. It's like Slack in that knowledge goes there to die.

Fuzzwah
5 replies
9h33m

I often see this comment, and every time I think: but having people come to the information AND the community is better for the project.

baq
3 replies
9h16m

Short term perhaps, but long term having a non-indexed community is inconvenient for newcomers.

zknow
0 replies
3h36m

There are projects you can use to index Discord servers; unfortunately, a lot of communities just don't use them.

thijsn
0 replies
8h33m

That's why Ritchie is very active on, and often refers to, Stackoverflow as well! Exactly to document frequent questions, instead of losing them to chat history.

amne
0 replies
6h16m

Microsoft Copilot can summarize discussions. With some orchestration it could even extract questions and answers from past discussions and structure them in a Stack Overflow-like format.

source: we use this feature in beta as part of the enterprise copilot license to summarize Teams calls. Yes, it listens to us talking and spits out bullet points from our discussions at the end of the call. It's so good it feels like magic sometimes.

Note on Copilot: any capable model could probably do it; I just said Copilot because it does it today.

throwmeback
0 replies
5h37m

By "community", do you mean all the people who make an account just to ask a question on the project's Discord, only ever open it to check if someone answered, and then never use Discord again?

jeroenjanssens
3 replies
8h43m

Luckily our book will also be available in hard copy so you can digest all that hard-won knowledge in an offline manner :)

tonyedgecombe
2 replies
8h21m

I'll wait until chatGPT can regurgitate it.

throwmeback
0 replies
5h36m

losing all nuance by virtue of getting dopamine quicker? count me in!

Xunjin
0 replies
6h59m

I do understand your comment is snarky humor; however, do buy a copy if you want to support them. It's neither cheap nor easy to make a book.

theLiminator
0 replies
9h19m

Yeah, one thing that helps a bit is that they encourage you to post your questions to Stack Overflow, and they'll answer them there.

lr1970
0 replies
5h48m

There is a free Polars user guide [0] as part of the Polars project. It was known as "polars-book" before it was moved in-tree [1].

[0] https://docs.pola.rs/user-guide/

[1] https://github.com/pola-rs/polars/tree/main/docs/user-guide

esel2k
0 replies
6h18m

Is there a more basic book to recommend for junior people, covering dataframe/storage solutions for ML applications? Thank you

bomewish
0 replies
6h50m

Any plans to try fine-tuning an LLM specialised in Polars? That would really be the killer feature to get major adoption, IMO.

godelski
15 replies
14h33m

it's quite alienating to newbies who might otherwise become your most enthusiastic customers.

Newbies are your best target audience too! They aren't already ingrained in another system that they'd have to unlearn; they're starting from yours. If a newbie can't get through your docs, you need to improve your docs. But it's strange to me how mature Polars is and that the docs are still this bad. It makes it feel like it isn't in active/continued development. Polars is certainly a great piece of software, but that doesn't mean much if you can't get people to use it. And the better your docs, the quicker you turn noobs into wizards. The quicker you do that, the quicker you offload support onto your newfound wizards.

theLiminator
10 replies
14h28m

But it's strange to me how mature Polars is and that the docs are still this bad.

Interesting; I've personally found them quite good, and compared to DataFusion or DuckDB they're dramatically better. I agree pandas has better docs, but one of the strengths of Polars is that I often don't need the docs, because lots of careful thought went into designing a minimal and elegant API, not to mention that they actually care about subtle quirks like making autocomplete, type hinting, etc. work well.

godelski
9 replies
14h22m

Sounds like we might be coming from different perspectives. I honestly don't use any DF libraries often, and really only pandas. I used to use pandas a fair amount, but that was years ago, and now I only have to reach for it a few times a year. So maybe the docs are good for people who already have deeper experience. Just the fact that you have used DataFusion and DuckDB illustrates that you're more skilled in this domain than I am, because I haven't used those, haha.

But I do think making good docs is quite hard. You usually have multiple audiences that you might not even be aware of, which makes keeping an open ear for them one of the most important things to do. It's easy to get trapped thinking you've got your audience covered when you're actually closing the door to many more groups (unintentionally). It's also just easy to be focused on the "real" work and not think about docs.

sanderjd
8 replies
12h38m

What, specifically, is bad about the docs? This whole thread is people who just looked at the home page, saw that it is "DataFrames", but didn't know what that means and came here to complain. Nobody has said anything about issues with the docs for someone who understands what a data frame is (or spent like two minutes looking that up) but is struggling to figure out how to use this library specifically.

anjanb
2 replies
6h44m

I'm a dataframes noob. I saw this post and the performance claims attracted me. I went to ChatGPT to understand what dataframes were about. Then on Udemy, I searched for a Polars course. The course had prerequisites: a bit about Jupyter notebooks and pandas. So I went through a few modules of a pandas course, and now I'm going through a Polars course. Altogether, I spent about 2-3 hours setting up the environment and learning what this is all about.

A little bit of context would have helped to attract a lot more noobs.

sanderjd
0 replies
2h13m

Your first paragraph makes perfect sense! I was nodding along. But then your concluding sentence was a bit of a record scratch for me. This all worked as intended! You knew what the project was about - "data frames" - and what might make it attractive to you - the performance claims - and then you went and followed exactly the right path to get the context you needed to understand what's going on with it. It's a big topic that you were able to spin up on to a basic level in 2-3 hours, by pulling on strings starting at this landing page. This is a very successful outcome.

I'd also recommend this book: https://wesmckinney.com/book/. It's not about polars, but you'd be able to transfer its ideas to polars easily once you read it.

_dain_
0 replies
5h12m

"How To Be A Pandas Expert"[1] is a good primer on dataframes. There's a certain mental model you need to use dataframes effectively but it's not apparent from reading the official docs. The video makes it explicit: dataframes are about like-indexed one-dimensional data, and every dataframe operation can be understood in terms of what it does to the index.

[1] https://www.youtube.com/watch?v=oazUQPrs8nw
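A tiny illustration of that index-centric model (values invented): operations align on labels, not positions.

    import pandas as pd

    s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
    s2 = pd.Series([10, 20], index=["c", "a"])
    # Addition aligns on the index labels, not on position:
    print(s1 + s2)  # a -> 21.0, b -> NaN (no match), c -> 13.0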

leksak
1 replies
10h31m

I can't speak for the Python side of the Polars docs but coming from Python and Pandas to Rust and Polars hasn't always been easy. To be fair, that isn't just about docs but also finding articles or Stack Overflow answers for people doing similar things.

sanderjd
0 replies
2h0m

That certainly makes sense!

godelski
1 replies
11h0m

I think your experience is probably making it difficult to understand the noob side of things. For me, I've struggled with simply slicing up a dataframe. And as I specified, these aren't tools I use a lot, so the "who understands what a data frame is" probably doesn't apply to me very well and we certainly don't need the pejorative nature suggesting that it is trivially understood or something I should know through divine intervention. I'm sure it's not difficult, but it can take time for things to click.

Hell, I can do pretty complex integrals and derivatives, and so much of that seems trivial to me now, but I did struggle when learning it. Don't shame people for not already knowing things when they are explicitly trying to learn; shame the people who think they know and refuse to learn. There's no reason not to be nice.

Having done a lot of teaching, I have a note: don't expect noobs to be able to articulate their problems well. They're noobs. They have the capacity to complain, but it takes expertise to clarify that complaint and turn it into a critique. I get that this is frustrating, but being nice turns noobs into experts, and often friends too.

sanderjd
0 replies
2h18m

I really think this is a misunderstanding of the purpose of different kinds of documentation. The documentation of a new tool for a mature technique is just not the primary place to focus on writing a beginners' tutorial / course on using that technique. Certainly, "the more the merrier" is a good mantra for documentation, so if they do add such material, all the better. But it is very sensible for it to not be the focus. The focus should be, "how can you use this specific iteration of a tool for this technique to do the things you already know how to do".

Nobody is suggesting that you should be an expert on data frames "through divine intervention". But the place to expect to learn about those things is the many articles, tutorials, courses, and books on the subject, not the website of one specific new tool in the space.

If you're really interested in learning about this, a fairly canonical place to start would be "Python for Data Analysis"[0] by Wes McKinney, the creator of pandas and one of the creators of the arrow in-memory columnar data format that most of these projects build atop now.

This is a (multiple-) book length topic, not a project landing page length topic.

0: https://wesmckinney.com/book/

thecodedmessage
0 replies
10h25m

The Rust docs are for some reason much worse than the Python docs, or at least that used to be the case

petre
2 replies
6h0m

The docs are okay, but the feature set is lacking compared to pandas, which is understandable since this is at version 0.2. I was exploring if it's possible to use this, but we need diff aggregation which it doesn't have, so it's a no go right now.

cmdlineluser
1 replies
5h33m

Do you mean something like `.agg(pl.col("foo").diff())`?

Or is diff aggregation its own thing? (I tried searching for the term, but didn't find much.)

petre
0 replies
5h10m

Nevermind, it has it but it's under Computation in polars.Series.diff and I was looking under Aggregation. This is great.

For instance, you've got a time series with an odometer value, and you want the delta from the previous sample to compute each trip.
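Something like this, presumably (a sketch; the column names are invented):

    import polars as pl

    df = pl.DataFrame({"odometer": [1000, 1012, 1012, 1050]})
    # Delta from the previous sample; the first row has no
    # predecessor, so it comes out null:
    df = df.with_columns(pl.col("odometer").diff().alias("trip_delta"))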

sanderjd
0 replies
12h42m

"Newbies" to data science are indeed a good target audience, before they are already attached to pandas. But this doesn't imply they know nothing. It's very unlikely that someone both 1. has a need to do the kind of data analysis that polars is good at, and 2. has never heard of the "data frame" concept.

lnxg33k1
5 replies
13h20m

It’s only annoying because it’s on Hacker News, because what are the odds of landing on it if you don’t know what it is and don’t have a need for it?

thecodedmessage
3 replies
10h27m

I mean, pretty high. What if your boss just tells you to learn polars, and you don’t know why? Saying what something is, is just good communication, and can help clarify for people who are confused.

lnxg33k1
0 replies
9h52m

I guess, in the remote event that you're told to learn a new skill you know nothing about, you go to the pola.rs website, see "DataFrames for the new era", and start gathering documentation from there about what a DataFrame is. The website clearly shows what it is; it's your duty to understand it. I would argue that if you knew what DataFrames were, you'd be saying "Why is it stating something so basic instead of just showing me the good stuff?"

I, for example, hate websites that try to serve newbies. Newbies have a lot of content available if they're interested; not all of the web needs to serve them.

blowski
0 replies
10h20m

These workplaces where bosses tell employees to learn unheard-of tools with zero context sound terrible.

azangru
0 replies
10h18m

What if your boss just tells you to learn polars, and you don’t know why? Saying what something is, is just good communication

Shouldn't the good communication happen when the boss tells you to learn polars? Like, why are you telling me this, boss; what is it that you need done?

vaylian
0 replies
9h53m

It's annoying because a single leading sentence would be enough to explain a product. Some of the words (for example "Data Frame") in that sentence can be links to other pages if that's necessary. It's a small change but it makes a huge difference.

tomrod
2 replies
15h39m

Wes has also worked hard to address a lot of the missteps of pandas, such as through pyarrow, which may prove even more impactful than pandas has been to date.

Polars is also a wonderful project!

adolph
1 replies
15h31m

Polars is also based on McKinney’s Arrow project.

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

https://github.com/pola-rs/polars/blob/main/README.md

codyvoda
0 replies
4h10m

Wes also literally created another Python dataframe project, Ibis, to overcome many of the issues with pandas

https://ibis-project.org

Most data engines and dataframe tools these days use Apache Arrow; it's a bit orthogonal.

davedx
1 replies
9h59m

I'm a data engineering newbie and I found it very clear, and it gave me an enthusiastic feeling (not an "alienating" feeling).

This whole thread just comes across as unmitigated pedantry to me.

dns_snek
0 replies
9h31m

Presumably you were introduced to the concept of DataFrames and how they're used through some other source, because the Polars landing page doesn't even bother to mention that it's used for data analysis, and the documentation simply assumes you're already familiar with the core concepts.

Compare that to Pandas which starts with the basics, "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." It then leads you to "Getting started" guide which features "Intro to pandas" that explains the core concepts.

traceroute66
0 replies
5h24m

Yes, it's an annoying trait of many tech products.

Sadly it's not only tech products, but also things like security disclosures.

It always follows the same pattern:

    - Spend $X time coding/researching something.
    - Spend $not_enough_time documenting it.
    - Spend $far_too_much_time thinking about / "engaging with the community" in deciding on a cute name, fancy logo and cool looking website.

drbaba
8 replies
18h18m

Above that it says “DataFrames for a new era” hidden in their graphics. I believe it’s a competitor to the Python library “Pandas”, which makes it easy to do complex transformations on tabular data in Python.

CobrastanJorji
7 replies
18h3m

It seems like it's a disease endemic to data products. Everybody, from the big cloud providers to the small data products, builds something whose selling point is "I'm the same as Apache X but better." But if you don't know what Apache X is, you have to go read up on that, and its website might say "I'm the same as Whatever Else but better," and you have to go read up on that. I don't want to figure out what a product does by walking a "like X but better" chain and applying diffs in my head. Just tell me what it does!

I get that these are general purpose tools with a lot of use cases, but some real quick examples of "this is a good use case" and "this is a bad use case, maybe prefer SQL/nosql/quasisql/hadoop/a CSV file and sed" would be really helpful, please.

ryandrake
3 replies
17h7m

I run into the same problem. I don't know what Pandas are (besides the bears) and at some point up the "it's like X" chain, I guess you have to stop and admit you're just not the target user of this tech product.

selcuka
2 replies
15h35m

I guess you have to stop and admit you're just not the target user of this tech product.

On the other hand, how can you become a target user if you don't know that a product category exists?

sanderjd
0 replies
13h51m

This project is a solution to a particular kind of problem. The way you become a target user of that solution is by first having the problem it's a solution to.

If you have the problem "I want to analyze a bunch of tabular data", you'll start researching and asking around about it, and you'll quickly discover a few things: 1. people do this with (usually columnar / "OLAP") sql query interfaces, 2. people usually end up augmenting that with some in memory analyses in a general purpose programming environment, 3. people often choose R or python for this, 4. both of those languages lean heavily on a concept they both call "data frames", 5. in python, this is most commonly done using the pandas library, which is pervasive in the python data science / engineering world.

Once you've gotten to that point, you'll be primed for new solutions to the new problems you now have, one of which is that pandas is old and pretty creaky and does some things in awkward and suboptimal ways that can be greatly improved upon with new iterations of the concept, like polars.

But if you don't have these problems, then the solution won't make much sense.

esafak
0 replies
15h13m

That's on you. If you want to become a data engineer and data scientist -- the two software positions most likely to use polars -- get learning. Or don't: learn it when you need it.

sanderjd
0 replies
14h6m

I dunno, I get the criticism, but also, every field assumes a large amount of "lingua franca" in order to avoid documenting foundational things over and over again.

Programming language documentation doesn't all start with "programming languages are used to direct computers to do things"; it is assumed the target audience knows that. Database documentation similarly doesn't start out with discussing what it means to store and access data and why you'd want to do that.

It's always hard to know where to draw this line, and the early iterations of a new idea really do need to put more time into describing what they are from first principles.

I remember this from the early days of "NoSQL" databases. They spilled lots of ink on what they even were trying to do and why.

But in my view this isn't one of those times. I think "DataFrames" are well within a "lingua franca" that is reasonable to expect the audience of this kind of tool to understand. This is not an early iteration of a concept that is not widely familiar, it is an iteration of an old, mature, and foundational concept with essentially universal penetration in the field where it is relevant.

Having said all that, I came across this "what is mysql" documentation[0] which does explain what a relational database is for. It's not the main entry point to the docs, but yeah, sure, it's useful to put that somewhere!

0: https://dev.mysql.com/doc/refman/8.0/en/what-is-mysql.html

makapuf
0 replies
11h53m

See also: Is it pokemon or big data https://pixelastic.github.io/pokemonorbigdata/

esafak
0 replies
15h12m

If you don't know what the comparison product is either then you are not the target customer. This is a library for analyzing and transforming (mostly numerical) data in memory. Data scientists use it.

the__alchemist
7 replies
17h8m

Dataframes in Python are a wrapper around 2D numpy arrays, with labels and various accessors. Operations on them are orders of magnitude slower than using the underlying arrays.

YetAnotherNick
5 replies
17h4m

I don't know where this myth originated, but I have seen it in multiple places. Even if you just think about it: 2D numpy arrays can't have different types for different columns.

FrustratedMonky
2 replies
15h42m

"myth originated"

It's in the documentation.

I just learned pandas recently, and I would have said this same thing. Not because I read through the numpy code, but because I read the documentation.

Is it wrong? Can't a user pick up a new tool and trust some documentation without reading through hundreds of libraries built on libraries?

When was the last time someone traced out every dependency so they could confidently call something a "myth"?

YetAnotherNick
1 replies
14h38m

Where is that written in the pandas documentation? A pandas dataframe is stored as a list of 1D numpy arrays, not a single 2D array.

ayhanfuat
0 replies
8h1m

The columns with common dtypes are grouped together in something called "blocks" and inside those blocks are 2D numpy arrays. It is probably not in the documentation because it is seen as implementation detail but you can see the block manager's structure in this article (https://dkharazi.github.io/blog/blockmanager/) or in this talk (https://thomasjpfan.github.io/scipy-2020-lightning-talk-pand...).
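A quick way to see the consequence (a sketch): force a mixed-dtype frame into a single array, and the dtype degrades to object.

    import pandas as pd

    df = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})
    print(df.dtypes)      # n: int64, s: object -- separate blocks inside
    print(df.to_numpy())  # one 2D array: dtype collapses to object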

theLiminator
0 replies
16h45m

Actually, numpy has something called a structured array that is pretty much what you described.

billyjmc
0 replies
16h31m

Well, if you use structured arrays or record arrays, you can do this (more or less).

https://numpy.org/doc/stable/user/basics.rec.html
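For example, a minimal structured-array sketch:

    import numpy as np

    # One array whose fields have heterogeneous dtypes:
    arr = np.array(
        [("Alice", 30), ("Bob", 25)],
        dtype=[("name", "U10"), ("age", "i4")],
    )
    print(arr["age"].mean())  # access a "column" by field name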

twelvechairs
0 replies
15h4m

There's a very good point here, but I don't think it's made clear.

If your data fits into numpy arrays or structured arrays (mainly if it is in numeric types), numpy is designed for this and will likely be much faster than pandas/polars (though I've also heard pandas can be faster on very large tables).

Pandas and Polars are designed for ease of use on heterogeneous data. They also include a Python 'object' data type, which numpy very much does not. They are also designed more like a database (e.g. with 'join' operations). This allows you to work directly with imported data that numpy won't accept, after which pandas uses numpy for the underlying operations.

So I think the point is: if you are running into speed issues in pandas/Polars, you may find that the time-critical operations can be done more efficiently in numpy (and this would be a much bigger gain than moving from pandas to Polars).
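In practice that often means pulling the numeric column out of the frame for the hot path, something like this (a sketch; names invented):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.random.rand(1_000_000)})
    x = df["x"].to_numpy()     # usually a cheap view of the column
    result = np.sqrt(x).sum()  # pure-numpy hot path, no pandas overhead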

godelski
6 replies
14h38m

I try to use Polars each time I have to do some analysis where dataframes help, so basically any time I'd reach for pandas, which isn't too often. So each time it's fairly "new". This makes it hard for me to believe that everyone saying "pandas but faster" has actually used Polars, because I can often write pandas from memory.

There's enough subtle and breaking changes that it is a bit frustrating. I really think Polars would be much more popular if the learning curve wasn't so high. It wouldn't be so high if there were just good docs. I'm also confused why there's a split between "User Guide" and "Docs".

To all devs:

Your docs are incredibly important! They are not an afterthought. And dear god, don't treat them as an afterthought and then tell people opening issues to RTFM. It's totally okay to point people in the right direction without hostility. It even takes less energy! It's okay to have a bad day and apologize later too, you'll even get more respect! Your docs are just as important as your code, even if you don't agree with me, they are to everyone but you. Besides technical debt there is also design debt. If you're getting the same questions over and over, you probably have poor design or you've miscommunicated somewhere. You're not expected to be a pro at everything you do and that's okay, we're all learning.

This isn't about Polars, but I'm sure I'm not the only one to have experienced main-character devs. It makes me (and presumably others) not want to open issues on __any__ project, not just bad projects, and that's bad for the whole community (including me, because users find mistakes, and we know there are two types of software: those with bugs and those that no one uses). Stupid people are just wizards in training, and you don't get more wizards without noobs.

bbkane
3 replies
13h13m

As another data point, I switched to Polars because I found it much more intuitive than pandas; I couldn't remember how to do much in pandas in the rare (maybe twice a year) times I want to do data analysis. In contrast, Polars has a (to me, anyway) wonderfully consistent API that reminds me a lot of SQL.

sanderjd
0 replies
12h33m

Yep, been using pandas for years now, still have no real mental model for it and constantly have to experiment or chat with our AI overlords to figure out how to use it. But SQL and polars make sense to me.

orhmeh09
0 replies
12h29m

I also use Pandas very infrequently (it has been years). Usually, for data analysis, I'm reaching for R + tidyverse/data.table/arrow. I have found Python and Pandas to be inelegant and verbose for these tasks.

As of last week, I have a need to process tabular data in Python. I started working with polars on Friday, and I have an analysis running across 16 nodes today. I find it very intuitive.

godelski
0 replies
10h57m

Maybe that's it, because I don't really use SQL much and reach for pandas at about the same rate. Maybe that's the difference? But I come from a physics background, and I don't know many physicists and mathematicians who are well versed in SQL. But I do know plenty who use pandas and Python, so there are definitely a lot of people like me. Also, I could be dumb. Totally willing to accept that, lol.

killjoywashere
1 replies
14h32m

The story of Polars seems to be shaping up a bit like the story of Python 3000: everything probably could have been done in a slow series of migrations, but the BDFL was at their limit and had to start fresh. So it takes 10 years for the community to catch up. In the meantime, there will be a lot of heartache.

theLiminator
0 replies
14h24m

I honestly don't believe something like polars could've evolved out of pandas.

It's a complete paradigm shift.

There's honestly not much that could've been shared at the point Polars was conceived. Maybe now there's a little more (due to the Arrow backend), but probably still very little.

bee_rider
6 replies
18h22m

Just once I’d like to see “this library was written to fulfill head-in-the clouds demands by management that we have some implementation, without regards to quality.”

sanderjd
3 replies
13h31m

That has absolutely no relation to this project. What in the world are you talking about?

ritchie46
1 replies
12h25m

Trust me. It does. ;)

sanderjd
0 replies
12h12m

What do you mean? What "management" was this created to fulfill the demands of?

bee_rider
0 replies
3h9m

I’m just responding to

"Polars is written from the ground up with performance in mind"

It is a common thing to see, I thought it would be funny to imagine the opposite.

minimaxir
1 replies
18h18m

For posterity, Polars started as a hobby project in 2020: https://news.ycombinator.com/item?id=23768227

As a hobby project I tried to build a DataFrame library in Rust. I got excited about the Apache Arrow project and wondered if this would succeed.

After two months of development it is faster than pandas for groupby's and left and inner joins. I still got some ideas for the join algorithms. Eventually I'd also want to add a query planner for lazy evaluation.

bee_rider
0 replies
16h35m

Definitely not intended as a slight toward this project, just (what I thought was) a funny thought about that expression.

nnevatie
4 replies
13h28m

Noticed exactly the same - there's no description of the library whatsoever on the landing page. It is implied that it is a DataFrame library, whatever that means.

sanderjd
3 replies
12h27m

Maybe this is sort of like the opposite of how scam emails are purposefully scammy, so that only people who can't recognize scams will fall for them. Only people who know what "a DataFrame library" is - which is an enormous number of people, since this is probably the most broadly known concept in data science / engineering - will keep reading this, and they are the target audience.

nnevatie
2 replies
8h41m

which is an enormous number of people

While that may be, I think it would make sense to describe the project in a succinct way on the first page a visitor lands on.

sanderjd
0 replies
2h29m

It is described in a succinct way. "DataFrames" is that description. It's the very first text on the page. It's really the same as having the word "database" be the first text on the landing page of a new database project. If you don't know what the word "database" means, the landing page for a new database project is really not the place to expect to learn about that. The "data frame" concept is not quite as old or broad as the concept of "databases", but it's really not that far off. It's decades old, and is about as close to a universal concept for data work as it's possible to get.

nhinck3
0 replies
5h41m

But you're not the audience? There is very little to gain by tailoring the introduction to people who aren't the audience.

You don't go to a car parts manufacturer expecting an explanation of what an intercooler is.

madeofpalk
4 replies
17h46m

I was going to say - it always feels so humbling seeing pages like this. "DataFrames for the new era" okay… maybe I know what data frames are? "Multi-threaded query engine" ahh, so it’s like a database. A graph comparing it to things called pandas, modin, and vaex - I have no clue what any of these are either! I guess this really isn’t for me.

It’s a shame because I like to read about new tech or project and try and learn more, even if I don’t understand it completely. But there’s just nothing here for me.

This must be what normal people go through when I talk about my lowly web development work…

theLiminator
3 replies
16h46m

It's pretty much just an alternative to SQL that's a lot easier and more natural to use for more hardcore data analysis.

You can much more easily compose the operations you want to run.

Just think of it as an API for manipulating tabular data stored somewhere (often parquet files, though they can query many different data sources).
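For instance, something like SELECT region, SUM(amount) FROM sales GROUP BY region maps fairly directly (a sketch; names invented):

    import polars as pl

    out = (
        pl.scan_parquet("sales.parquet")  # lazily read a parquet source
        .group_by("region")
        .agg(pl.col("amount").sum().alias("total"))
        .collect()
    )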

_dain_
1 replies
5h16m

Dataframes and SQL have overlapping functionality, but I wouldn't say that dataframes are an "alternative" to SQL. The tradeoffs are very different. You don't have to worry about minimizing disk reads or think about concurrency issues like transactions or locks, because a dataframe is just an in-memory data structure like a list or a dict, rather than a database. Dataframes also aren't really about relational algebra like SQL is.

theLiminator
0 replies
45m

Have you tried Polars? I agree that if pandas is all you've tried, it's pretty far from an alternative frontend for query engines. But Polars maps pretty cleanly to SQL and can be optimized into a query plan much like SQL. I should've made it clear that I'm talking about an alternative to SQL used in an OLAP context, not for OLTP.

youainti
0 replies
16h6m

Data tables also tend to be a standard ingestion format for statistical tools in many cases.

Joeboy
4 replies
5h40m

I'm currently getting dragged into "data" stuff, and I get the impression it's a parallel universe, with its own background and culture. A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".

Anyway, probably the interesting things about Polars are that it's like pandas, but uses a more efficient Rust "backend" built on Arrow, a memory format co-created by the main developer of pandas (although I think that part's also in pandas now), and something like a "query planner" that makes combining operations more efficient. Typically doing things in Polars is much more efficient than pandas, to the extent that things that previously required complicated infrastructure can often be done on a single machine. It's a very friendly competition.

As far as I can tell everybody loves it and it'll probably supplant pandas over time.

Galanwe
2 replies
5h21m

As far as I can tell everybody loves it and it'll probably supplant pandas over time.

I've been using pandas heavily, every day, for something like 8 years now. I also contributed to it and wrote numpy extensions. That is to say, I'm fairly familiar with the pandas/numpy ecosystem, its strengths and weaknesses.

Polars is a breath of fresh air. The API of pandas is a mess:

* overuse of polymorphic parameters and return types (functions that accept lists, ndarrays, dataframes, series, or scalars) and return differently shaped dataframes or series.

* lots of indirection behind layers of trampoline functions that hide the real defaults behind undocumented "=None" default values.

* abuse of half-baked immutable APIs, favoring a "copy everything" style of coding, mixed with half-supported, should-have-been-deprecated in-place variants.

* lots and lots of regressions at every new release, of the worst kind ("oh yeah we changed the behavior of function X when there is more than Y NaNs over the window")

* Very hard to actually know what is delegated to numpy, what is Cython/pandas, and what is pure python/pandas.

* Overall, the beast seems to have won against its masters, and the maintainers seem lost as to what to fix versus what to keep backward compatible.

Polars fixes a lot of these issues, but it has some shortcomings as well. Mainly I found that:

* the API is definitely more consistent, but also more rigid than pandas'. Some things can be very verbose to write. It will take some years for nicer, simpler "shortcuts" and patterns to emerge.

* The main issue IMHO is Polars' handling of cross-sectional (axis=1) computations. Polars is _very_ time-series (axis=0) oriented, and most cross-sectional computations require transposing the data frame, which is very slow. Pandas has a lot of dedicated axis=1 implementations that avoid a full transposition.
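For what it's worth, a sketch of the two styles (newer Polars versions do expose a few dedicated horizontal operations that avoid the transpose for simple cases):

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    row_sums_pd = pdf.sum(axis=1)  # cross-sectional, no explicit transpose

    pldf = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
    row_sums_pl = pldf.select(pl.sum_horizontal("a", "b"))
    # Anything without a horizontal variant needs pldf.transpose() first.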

ritchie46
1 replies
5h2m

Many axis=1 operations in pandas do a transpose under the hood, mind you. Axis=1 belongs in matrices, not in heterogeneous data. They are a performance footgun. We make the transpose explicit.

Galanwe
0 replies
3h34m

Many axis=1 operations in pandas do a transpose under the hood, mind you

Sure, but many others are natively axis=1-aware and avoid full transposition.

Axis=1 belongs in matrices, not in heterogeneous data.

I'm not sure I understand what that means. Care to elaborate?

They are a performance footgun.

You don't get to only solve the problems that are efficient to solve...

We make the transpose explicit.

Yes, but when you do mixed time-series / cross-sectional computations, you cannot always untangle both dimensions and transpose once. Sometimes your computation intrinsically interleaves the cross-sectional and time-series parts. In those cases, which happen a lot in financial computations, explicitly transposing the full frame is very slow.

btbuildem
0 replies
3h12m

All domains seem to have this kind of in-group shorthand, regardless of the scale of the community.

pama
2 replies
12h53m

In fairness, the title of the page is “Dataframes for the new Era”. The “Get Started” link below the title links to a document that points to the GitHub page, which explains what the library is about to people with data analysis backgrounds: https://github.com/pola-rs/polars

TylerE
1 replies
12h28m

But annoyingly, not the <title>, thus the useless HN headline.

andrewflnr
0 replies
11h22m

I wish HN had secondary taglines we could use to talk about the actual content or relevance of an article apart from its headline.

aterp
2 replies
14h50m

Had the exact same thought seeing this. Too many of these websites are missing a simple tl;dr of what the thing actually is. Great, it's fast, but fast at what??

sanderjd
1 replies
13h32m

It has that simple tldr, it's the very first word, "DataFrames". Everyone in this thread just doesn't know what that means, and that's fine, I get that, but seriously, that's the simple summary. Data frames aren't an obscure or esoteric concept in the data analysis space; quite the opposite.

sajforbes
0 replies
1h13m

Hard agree. People post links to websites with technical descriptions and little basic info all the time, and this is the first time I'm seeing a thread of people complaining about it. If I'm interested in something I see, I start Googling terms; I don't expect a specification for software in a specific field to cater to my beginner-level knowledge.

notatoad
1 replies
12h53m

I think something like dataframes suffers from having a name that isn't obscure enough. You read "dataframes" and think those are two words you know, so you should understand what it is.

If they'd called them flurzles, you wouldn't feel like you should understand them when they're not something you work with.

WesolyKubeczek
0 replies
12h20m

For me, “data frames” are forever associated with MPEG

mekster
1 replies
8h54m

How come some submissions don't even describe what they're about, giving nothing but the name? It's really puzzling how everyone is meant to know what something is from its name alone.

pixelpoet
0 replies
5h59m

I've mentioned this before and got downvoted because of course everyone is a web dev and knows what xyz random framework (name and version number in the title, nothing else) is.

esafak
1 replies
15h17m

Marketing is a skill that needs to be learned. You have to put yourself in the shoes of a person who knows nothing about your product. This does not come naturally to the engineers who make these products and are used to talking to other specialists like themselves.

sanderjd
0 replies
13h40m

This is true in general but I'm not sure it's what's going on here.

Marketing is also very concerned with understanding who your target audience(s) are and speaking their language.

I think talking about "DataFrames" is exactly that; the target audience of this project knows what that means. What they are interested in is "ok but who cares about data frames? I've been using pandas for like fifteen years", so what you want to tell them is why this is an improvement, how it would help them. Dumbing it down to spend a bunch of space describing what data frames are would just be a distraction. You'd probably lose the target audience before you ever got to the actual benefits of the project.

davedx
1 replies
10h1m

I don't use dataframes in my day job but have dabbled in them enough that I found this website pretty easy to digest.

You'd really have to be a complete data engineering newbie to not understand it I think?

I mean, where do you draw the line? You wouldn't expect a software tool like this to explain what it is in language my grandma would understand, I don't think?

drbaba
0 replies
9h35m

You'd really have to be a complete data engineering newbie to not understand it I think?

I do occasionally use Pandas in my day job, but I honestly think very few programmers that could have use for a data frame library would describe themselves as a “data engineer” at all.

In my case, for example, I’m just a physicist - I don’t work with machine learning, big data, or in the software industry at all. I just use Pandas + Seaborn to process the results of numerical simulations and physical experiments similarly to how someone else might use Excel. Works great.

Cacti
1 replies
14h38m

I hate this doc style that has become so popular lately. They get so wrapped up in selling you their story that they forget to tell you basic shit. Like what it is. Or how to install it.

The PMs literally simplified things so much that they simplified the product right out of the docs.

orhmeh09
0 replies
12h40m

It is right there on the page, set to Python by default:

Quick install > Polars is written from the ground up making it easy to install. Select your programming language and get started!

vietvu
0 replies
11h16m

It's fine to me. Tech UI is bad and weird, but it's not like you'd gain 5x the customers with better UX.

jmspring
0 replies
15h25m

You're right that the page is written for those who know what they are looking for, which is just fine. If you are getting started in DS/ML/etc. and have used numpy, pandas, etc., Polars is useful in some cases. A simple one: it loads dataframes faster than pandas (from experience with a team I help).

I haven't played with it enough to know all its benefits, but yes, it's the next logical step if you are in the space using the above-mentioned libraries; it's something one will find.

_dain_
0 replies
18h27m

pandas dataframes but faster

Ultimatt
0 replies
10h16m

Right... but the title before the first line reads "DataFrames for the new era". If you don't know what a data frame is then, yes, it's for people who already know that.

SalmoShalazar
0 replies
14h12m

It’s not written for you and that’s fine. This is a library targeted at a very specific subset of people and you’re not in it.

maliker
23 replies
18h28m

Biggest advantage I found when I evaluated it was that the API was much more consistent and understandable than the pandas one. Which is probably a given; they've learned from watching 20 major versions of pandas get released. However, since it's much rarer, Copilot had trouble writing Polars code. So I'm sticking with pandas and Copilot for now. An interesting barrier to new libraries in general that I hadn't noticed until I tried this.

epolanski
7 replies
17h46m

You're the first person I've ever encountered who publicly states a preference for a library because of its Copilot support.

Not making a judgement, just finding it interesting.

Anyway, for what it's worth, Copilot learns fast in your repos, very fast.

I use an extremely custom stack made of TS-Plus, a TypeScript fork that not even the author himself uses or recommends, and Copilot churns out very good TS-Plus code.

So don't underestimate how good Copilot can get at the boilerplate stage once it's seen a few examples.

digdugdirk
6 replies
16h56m

Umm... Could you please link to a resource so someone can parse what your last two paragraphs mean?

That sounds really interesting and valuable, I just have no idea where to start.

Tigress8780
1 replies
15h50m

Given examples, Copilot can generate code for extremely rare languages or data structures. For example, it worked fine when I was writing for an obscure scripting language found in a railway simulation game.

swyx
0 replies
10h41m

To further elaborate: Copilot automatically grabs the 20 most recent files with the same extension to get code examples. You don't have to do anything special to make this happen; it just improves quietly over time.

janoelze
0 replies
15h45m

epolanski
0 replies
9h8m

As other users said, copilot learns from the rest of your files too.

Thus it works even for relatively obscure stuff like this https://github.com/ts-plus

earthling8118
0 replies
12h21m

It was pretty straightforward to read. Maybe take another look? You can use a search engine to find this TS-Plus thing they talk about.

bko
0 replies
16h29m

I think he means he uses an obscure programming language and Copilot still gives him functioning code if he provides a few examples. I'm not sure if Copilot is context-aware enough to feed it an entire codebase, but maybe you can point GPT at the documentation.

humbleharbinger
5 replies
18h15m

I had a similar experience using danfo.js, another data frame library in JS. Copilot straight up hallucinated functionality and method names.

Not a big deal because I just read the docs but it was annoying that I couldn't have copilot just spit out what I need.

im_down_w_otp
4 replies
17h34m

It's really interesting to see these two posts. I can now imagine how AI tools could actually inhibit innovation in many domains simply because they're optimized for things that are already entrenched, and new entrants won't be in the training data. That further inhibits adoption compared to existing things, and thus further inhibits enough growth to make it into model updates.

fpgaminer
1 replies
17h18m

How is that different from humans who prefer tools they know to tools they don't?

FridgeSeal
0 replies
16h41m

Because it’s like willfully choosing the more painful and difficult tool that occasionally stabs you in the hand, because you’re now used to being stabbed in the hand.

Continuing to choose it in the face of - in their own words - a better option is a bit mind-boggling to me.

xpe
0 replies
17h28m

It is a healthy mindset to see this phenomenon as "interesting". I can get there when I dial up my mindfulness, but my default mode here is rather judgy; as in "please ppl! pick the better tool as evaluated over a 4+ hour timeframe (after you've got some muscle memory for the API) instead of a 15 minute evaluation".

Forgive me for ranting here, but have people forgotten how to bootstrap their own knowledge about a new library? Taking notes isn't hard. Making a personal cheat-sheet isn't hard. I say all this AND I use LLMs very frequently to help with technical work. But I'm mindful about the tradeoffs. I will not let the tool steer me down a path that isn't suitable.

I'm actually hopeful: there is an unexpected competitive advantage to people who are willing to embrace a little discomfort and take advantage of one's neuroplasticity.

mncharity
0 replies
14h56m

I can now imagine where AI tools actually inhibit innovation [...] new entrants won’t be in the training data

I still imagine the opposite impact... Welcome to no-moats-lang.io! So, you've created yet another new programming language over the holidays? You have a sandbox and LSP server up, and are wondering what to do next? Our open-source LLMs are easily tuned for your wonderful language! They will help you rapidly create excellent documentation, translators from related popular languages, do bulk translation of "batteries" so your soon-to-be-hordes of users can be quickly productive, and create both server and on-prem ChatOverflowPilotBots! Instant support for new language versions, and automatic code update! "LLM's are dynamite for barriers to entry!" - Some LLM Somewhere Probably.

Once upon a time, a tar file with a compiler was MVP for a language. But with little hope of broad adoption. And year by year, user minimum expectations have grown dauntingly - towards extensive infrastructure, docs, code, community. Now even FAMG struggle to support "Help me do common-thing in current-version?". Looking ahead, not only do LLMs seemingly help drop the cost of those current expectations to something a tiny team might manage, but also help drop some of the historical barriers to rapid broad adoption - "Waiting for the datascience and webdev books? ... Week after next."

We might finally be escaping decades of language evolution ecosystem dysfunction... just as programming might be moving on from them? :/

xpe
4 replies
17h32m

You recognize the API is more consistent and understandable, but you want to stay with Pandas only because Copilot makes it easier? Please, (a) for your own sake and (b) for the sake of open source innovation, use the tool that you admit is better.

About me: I've used and disliked the Pandas API for a long time. I'm very proactive about continual improvement in my learning, tooling, mindset, and skills.

chemicalnovae
3 replies
17h15m

Please, (a) for your own sake and (b) for the sake of open source innovation, use the tool that you admit is better.

This is...such a strange take. To follow your logic to an extreme, everyone should use a very small handful of languages that are the "best" in their domain with ne'er a care for their personal comfort or preference.

for your own sake

They're sticking with Pandas exactly for their own sake since they like being able to use Copilot.

for the sake of open source innovation

Ohh by all means let's all be constantly relearning and rehashing just to follow the latest and greatest in open source innovation this week.

Tools are designed to be _used_ and if you like using a tool _and_ it does the job you require of it that seems just fine to me; especially if you're also taking the time to evaluate what else is out there occasionally.

earthling8118
1 replies
12h23m

Is it really that strange of a take? To use the best tool available for a job. That doesn't sound strange at all.

Doubly so if it involves Copilot. There's no way to get training data without people writing it. This sounds like a direct application of a greedy algorithm: trading long-term success for short-term gain. That's not the ideal way to live.

xpe
0 replies
2h23m

Yes, the greedy algorithm metaphor is an interesting connection!

I also like thinking about this as a feedback loop (as explained by systems dynamics), since it provides nice concepts for how systems change over time.

xpe
0 replies
2h27m

I'll attempt to clarify my rationale:

(a) For one's own sake, please pick the better tool as evaluated over a suitable timeframe (perhaps an hour or two, after you've got some familiarity and muscle memory for the API) instead of only a brief evaluation (e.g. only 15 minutes).

(b) Better open source tools (defined however you want), which benefit us all, get better uptake when people think beyond merely the short-term.

The essence of my argument is "think beyond the short-term". Hardly controversial.

Don't miss the context: LLMs are giving people even more excuses for short-term thinking. Humans are terribly tempted by short-sighted "victories".

naiv
0 replies
10h38m

The Polars lib changes rapidly. I am not using Copilot but achieved very good results with ChatGPT by setting system instructions to let it know that e.g. `with_column` was replaced with `with_columns`, and adding the updated doc information to the system instructions.

bradhilton
0 replies
17h21m

I use polars, but I've also run into this problem with copilot.

__mharrison__
0 replies
17h22m

Copilot support is basically non-existent for Polars. It does a decent job of writing basic pandas... (but could do a lot better).

BadHumans
0 replies
17h15m

Copilot support is a chicken and egg problem. It needs to train on others code but if people don't write Polars code without Copilot then Copilot will not get better at writing Polars code.

mmaunder
17 replies
18h37m

Anyone got any real-world comparison with Pandas? Like an orders of magnitude wow moment?

radus
12 replies
18h15m

It allowed me to take some code that reads in a bunch of data and performs a few rounds of pretty standard operations (groupby, filtering, calculating means/stdevs) and go from ~1 minute per dataset in pandas to ~1 second in polars (yes, I tried the Arrow backend for pandas too). This was after spending some time profiling the pandas code and fixing up the slowest parts as best as I could. The translation was pretty straightforward. The output of this pipeline code was a few different dataframes (each to be inserted into a separate table), each output from a function. I was able to migrate one function at a time after asserting that the outputs of the two functions were identical and that all relevant tests passed (I used `to_pandas()` where needed).

I'm not sure how much faster I could go, since ~1 second/dataset allowed me to answer some questions that I had that required scanning values for a few parameters. The biggest wins for me were in grouping and merging operations.

I'm a complete convert now. The API is simpler and more obvious IMO, and the ability to compose expressions (`polars.Expr`) is awesome. The performance benefits are nice and what motivated me in the first place, but I'm more swayed by the aforementioned benefits.
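To make the kind of translation concrete, here is a minimal sketch of the groupby/aggregation pattern in both libraries, with hypothetical column names, plus the migrate-and-assert strategy described above (assumes pyarrow is installed for `to_pandas()`):

    import pandas as pd
    import polars as pl

    # pandas: group, then aggregate mean/std per group
    pdf = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
    out_pd = pdf.groupby("key")["x"].agg(["mean", "std"]).reset_index()

    # polars: the equivalent group_by/agg, which parallelizes across cores
    pldf = pl.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
    out_pl = pldf.group_by("key").agg(
        pl.col("x").mean().alias("mean"),
        pl.col("x").std().alias("std"),
    )

    # migrate one function at a time, asserting the outputs stay identical
    pd.testing.assert_frame_equal(
        out_pd.sort_values("key").reset_index(drop=True),
        out_pl.sort("key").to_pandas(),
    )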

theLiminator
6 replies
16h36m

Running on a very high core count server? Polars is definitely faster in single-threaded applications, but not 60x faster unless the work isn't comparable. Are you reading from parquet and only operating on some columns? That could also be it.

But yeah, polars is awesome, I'm all in on it.

radus
3 replies
15h46m

I'm not including parsing time, both pandas and polars versions started from an in-memory data structure parsed from two XML files (low GB range). This is on my workstation with a single Xeon 4210 (10 cores, 20 threads @ 2.20-3.20Ghz).

Perhaps I can focus on a subset of this processing and write this up since it seems like there's at least some interest in real examples. As pointed out in a reply to a sibling comment, I don't guarantee that my starting code is the best that pandas can do -- to be honest, the runtime of the original code did not line up with my intuition of how long these operations should take. Maybe someone will school me but either way switching to polars was a relatively easy win that came with other benefits and feels right to me in a way that pandas never did.

mmaunder
2 replies
15h23m

Is polars not parallelizing some ops on the GPU?

theLiminator
1 replies
14h35m

It has zero GPU support for now.

lmeyerov
0 replies
12h8m

Important point.

Nowadays, we write a pure pandas version, and when the data needs to be 100X bigger and faster, change almost nothing and have it run on the GPU via cudf, a GPU runtime that fully follows the pandas API. Most recently, we ported GFQL (Cypher graph queries on dataframes) to GPU execution over the holiday weekend and it already beats most Cypher implementations. Think billions of edges traversed per second on a cheap 5-year-old GPU.

We're planning the bigger than memory & multi node versions next, for both CPU + GPU, and while cudf leans towards dask_cudf, plans are still TBD. Polars, Ray, and Dask all have sweet spots here.

maronato
1 replies
12h31m

According to GitHub, 90% of Pandas’ codebase is written in Python, which probably means there’s a lot of language overhead during operations compared to the rust code in polars.

That, plus parallelism, probably explains the performance difference. If anything, 60x sounds conservative to me.

theLiminator
0 replies
12h3m

I think with parallelism that difference is realistic, though definitely not in single-core performance; most of pandas is implemented in numpy, which should be pretty fast.

mmaunder
4 replies
17h48m

Bloody hell!! Thanks, that's exactly the kind of comment I was hoping to see. Sounds like a bit of an Apache --> Nginx moment for dataframes. Super cool!!

radus
3 replies
16h49m

To add some balance:

- I can't rule out that a pandas wizard couldn't have achieved the same speed-up in pandas

- polars code was slightly more verbose. For example, when calculating columns based on other columns in the same chain: in pandas, each new column can be defined as a kwarg in a single call to `assign`, whereas in polars, columns that depend on others must be defined in their own calls to `with_columns`

- handling of categoricals in polars seemed a little underbaked, though my main complaint, that categories cannot be pre-defined, seems to have been recently addressed: https://github.com/pola-rs/polars/issues/10705

- polars is not yet 1.0, breaking changes will happen

billyjmc
2 replies
15h48m

Regarding your second point, you can use the walrus operator to retain the results of a computation within a single `.with_columns()` call. See https://stackoverflow.com/a/77609494

Edited to add: also, if you’re using a lazy dataframe, you can just naively do the same operation twice (once to store it in a named column and once again in the subsequent computation), and Polars will use common subexpression elimination (CSE) to prevent recomputing the result. You can verify this is true using the `.explain()` method of a lazy dataframe operation containing the `.with_columns()` call.
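For reference, a minimal sketch of the walrus trick with hypothetical column names: since `pl.Expr` objects compose, you can bind the expression to a Python variable inside the call and reuse it, instead of reaching for a second `with_columns`:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3]})

    out = df.with_columns(
        (b := pl.col("a") * 2).alias("b"),  # bind the expression with :=
        (b + 1).alias("c"),                 # reuses the expression, not a materialized column
    )

On a lazy frame, `.explain()` on the same pipeline should show the shared subexpression computed only once.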

radus
1 replies
15h40m

That's awesome, thanks for sharing! Though tbh I'm not likely to use it.. it's a bit too magical - though still a delicious hack.

billyjmc
0 replies
15h34m

I just edited my comment above to add more info about common subexpression elimination. It’s magic that happens behind your back on lazy dataframes. Polars is great!

minimaxir
1 replies
18h32m

Real-world performance is complicated since data science covers a lot of use cases.

If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.

Per this (old) benchmark, there are differences once you get into 10 million rows/500MB+ territory: https://h2oai.github.io/db-benchmark/

ayhanfuat
0 replies
18h29m

DuckDB is publishing updates to the H20.ai benchmark: https://duckdb.org/2023/11/03/db-benchmark-update.html

sanderjd
0 replies
13h0m

Yep, for some stuff, like anything requiring loading / scanning through all the data (without doing anything requiring a custom lambda), I've seen orders of magnitude improvement to latency, and also (more importantly, imo) much lower memory usage.

alecst
0 replies
18h28m

In my experience, the biggest difference is in the API. Besides the fact that it's usually faster than pandas, it also feels faster to write and easier to read.

frogamel
15 replies
15h6m

A few months ago I tried migrating a large pandas codebase to polars. I'm not much of a fan of doing analytics/data pipelining in Python - a complex transformation takes me 2-5x as long in pandas compared to Julia or R (using dataframes.jl & dplyr).

Unfortunately polars was not it. Too many bugs on standard operations, unreliable interoperability with pandas (which is an issue since so many libraries require pandas dataframes as inputs), the API is also very verbose for a modern dataframe library, though it's still better than pandas.

Hopefully these will get resolved out over time but for now I had the best luck using duckdb on top of pandas, it is as fast as polars but more stable/better interoperability.

Eventually I hope the Python dataframe ecosystem gets to the same point as R's, where you have an analytics-oriented dataframe library with an intuitive API (dplyr) that can be easily used alongside a high-performance dataframe library (data.table).

aexl
6 replies
13h55m

Nice, you have experience with data frames in R, Python and Julia! Which one of those do you like the most? I know that the ecosystems aren't really comparable, but from your experience, which one is the best to work with for core operations, etc.?

sanderjd
2 replies
13h13m

I'm not the person you replied to, but I have experience with all of these. My background is computer science / software engineering, incorporating data analysis tools a few years into my career, rather than starting with a data analysis focus and figuring out tools to help me with that. In my experience, this seems to lead to different conclusions than the other way around.

tldr: Julia is my favorite.

I could never click with R. It is true that data.table and dplyr and ggplot are well done and I think we owe a debt of gratitude to the community that created them. But the language itself is ... not good. But that's just, like, my opinion!

Pandas I also have really never clicked with. But I like python a lot more than R, and pandas basically works. For what it's worth, the polars api style is more my thing. But most of the data scientists I work with prefer the pandas style, :shrug:.

But I really like this part of Julia. It feels more "native" to Julia than pandas does to python. More like data.table in R, but embedded in a, IMO, even better language than python. The only issue is that Julia itself remains immature in a number of ways and who knows whether it will ever overcome that. But I hope it does!

nerdponx
1 replies
12h10m

I sympathize with anyone who doesn't like R. Even as a statistics/math DSL it's really wonky.

But it's a lot more fun when you realize that it's an homoiconic array language with true lazily-evaluated F-exprs (not Rebol/Tcl strings).

sanderjd
0 replies
11h48m

I realized that (not in so many words...) pretty quickly and do not like it at all :)

nerdponx
2 replies
13h32m

Not OP but R data.table + dplyr is an unbeatable combo for data processing. I handily worked with 1bn record time series data on a 2015 MBP.

The rest of the tidyverse stuff is OK (like forcats), but the overall ecosystem is a little weird. The focus on "tidy" data itself is nice up to a point, but sometimes you want to just move data around in imperative style without trying to figure out which "tidy verb" to use, or trying to learn yet another symbol interpolation / macro / nonstandard eval system, because they seem to have a new one every time I look.

Pandas is a real workhorse overall. Data.table is like a fast sports car with a very complicated engine, and Pandas is like a work van. It's a little of everything and not particularly excellent at anything and that's ok. Also its index/multiindex system is unique and powerful. But data.table always smoked it for single-process in-memory performance.

Until DuckDB and Polars, there was no Python equivalent of data.table at all. They're great when you want high performance, native Arrow (read: Parquet) support, and/or an interface that feels more like a programming library than a data processing tool. If you're coming from a programming background, or if you need to do some data processing or analytics inside of production system, those might be good choices. The Polars API will also feel very familiar to users of Spark SQL.

For geospatial data, Pandas is by far superior to all options due to GeoPandas and now SpatialPandas. There is an alpha-stage GeoPolars library but I have no idea who's working on it or how productive they will be.

If you had to learn one and only one, Pandas might still be the best option. Python is a much better general-purpose language than R, as much as I love R. And Pandas is probably the most flexible option. Its index system is idiosyncratic among its peers, but it's quite powerful once you get used to using it, and it enables some interesting performance optimization opportunities that help it scale up to data sets it otherwise wouldn't be able to handle. Pandas also has pretty good support for time series data, e.g. aggregating on monthly intervals. Pandas also has the most extensibility/customizability, with support for things like custom array back ends and custom data types. And its plotting methods can help make Matplotlib less verbose.

I've never gotten past "hello world" with Julia, not for lack of interest, but mostly for lack of time and need. I would be interested to hear about that comparison as well.

sanderjd
0 replies
13h9m

Ha I like your description of pandas as a work van. I totally have that same feel for it. It's great because it works, not because it's great :)

hpcjoe
0 replies
3h36m

At a previous job, I regularly worked with dfs of millions to hundreds of millions of rows, and many columns. It was not uncommon for the objects I was working with to use 100+ GB of RAM. I coded initially in Python, but moved to Julia when the performance issues became too painful (10+ minute operations in Python that took < 10s in Julia).

DataFrames.jl, DataFramesMeta.jl, and the rest of the ecosystem are outstanding. Very similar to pandas, and much ... much faster. If you are dealing with small (obviously subjective as to the definition of small) dfs of around 1000-10000 rows, sticking with pandas and python is fine. If you are dealing with large amounts of real world time series data, with missing values, with a need for data cleanup as well as analytics, it is very hard to beat Julia.

FWIW, I'm amazed by DuckDB, and have played with it. The DuckDB Julia connector gives you the best of both worlds. I don't need DuckDB at the moment (though I can see this changing), and use Julia for my large-scale analytics. Python's regex support is fairly crappy, so my data extraction is done using Perl. Python is left for small scripts that don't need to process lots of information and can fit within a single terminal window (due to its semantic whitespace handicap).

nerdponx
5 replies
13h49m

I got annoyed at the verbosity as well. Pandas is fairly verbose compared to eg data.table, but Polars really feels more like using "an API" than "a data manipulation tool".

I probably wouldn't use it for EDA or research, but I have started to use it in certain production scripts for the better performance.

R dplyr + data.table is still my favorite data manipulation experience. I just wish we had something like Matplotlib in R: ggplot is too high level, base graphics are too low level. Also Scikit-Learn is much more modular than Caret, which I don't really miss using.

dash2
1 replies
8h8m

Have you tried the "grid" graphics package in R? It's the basis for ggplot. It's a bit of an unsung hero, the documentation is not great, but I think it is a very solid library.

nerdponx
0 replies
4h20m

Is it usable on its own? I only ever interacted with it in trying to hack around something I didn't like in ggplot, and it didn't seem like something I could use "by hand". In hindsight it does sound a lot like what MPL does. I can take a look!

blt
1 replies
10h50m

yeah, I haven't used Polars but from skimming the docs it looks kind of enterprisey. I don't want to type `df.select(pl.col("a"))` instead of `df["a"]`.

theLiminator
0 replies
9h4m

Latter also works.

billyzs
0 replies
11h3m

I just wish we had something like Matplotlib in R

plotly could be worth a try; I use its Python bindings and much prefer it to matplotlib, but I don't know much about the quality of its R API

theLiminator
0 replies
14h18m

Maybe give ibis with the duckdb backend a try, though personally I quite like polars. The devs are pretty quick to respond to issues overall.

snthpy
0 replies
11h6m

I am very curious to know how you feel about PRQL (prql-lang.org)? It aims to give you the ergonomics of dplyr wherever you use SQL (by compiling to SQL).

IMHO this gives you the DX of dplyr / Polars / Pandas combined with the power and universality of SQL because you can still execute your queries on any SQL compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.

I'd love to hear your thoughts, either in a Discussion on Github (https://github.com/PRQL/prql/discussions) or on our Discord (https://discord.com/invite/XWxbCrWr)!

Disclaimer: I'm a PRQL contributor.

serjester
10 replies
18h10m

Used pandas for years and it always felt like rolling a ball uphill - just look at doing something as simple as a join (don't forget to reset the index).

Polars feels better than pandas in every way (faster + multi-core, less memory, more intuitive API). The library is still relatively young which has its downsides but in my opinion, at minimum, it deserves to be considered on any new project.

Easily being able to leverage the Rust ecosystem is also awesome - I sped up some geospatial code 100X by writing my own plugin to parallelize a function.

nerdponx
5 replies
13h54m

just look at doing something as simple as a join (don't forget to reset the index)

It's slightly ironic that you mention this, because I always thought the biggest problem with Pandas was its documentation. Case in point: did you know there's a way to join data frames without using the index? It's called "merge" rather than "join".

Pandas was originally very heavily inspired by R terminology and usage patterns, where the term "merge" to mean "join" was already commonplace. If I didn't already know R when I started learning Pandas (~2015), I don't think I'd have been able to pick it up quickly at all.
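For anyone who hasn't hit this, a quick sketch of the merge-vs-join distinction mentioned above, with hypothetical frames:

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
    right = pd.DataFrame({"id": [1, 2], "y": [10, 20]})

    # merge: joins on columns, no index gymnastics required
    merged = left.merge(right, on="id", how="inner")

    # join: joins on the index by default, hence the set_index/reset_index dance
    joined = left.set_index("id").join(right.set_index("id")).reset_index()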

lvncelot
2 replies
10h19m

For me, Pandas fits in neatly with Matplotlib in the niche category of "R-inspired Python libraries that are somewhat counter-intuitive due to said R-inspiration"

tda
1 replies
9h19m

Matplotlib is MATLAB inspired, but otherwise your point stands.

lvncelot
0 replies
3h49m

Right, brainfart

djhn
0 replies
12h44m

I had to check the R documentation for merge in disbelief, because it didn't ring a bell. Between data.table's [ syntax and dplyr joins I can't remember the last time I've used merge!

baq
0 replies
10h1m

I always thought the biggest problem with Pandas was its documentation. Case in point: did you know there's a way to join data frames without using the index? It's called "merge" rather than "join".

chatgpt (even the free tier) solved that problem for me. I ask it what I want in sql terms (or just plain english) and it tells me the pandas spell invocation. It even started to make sense after a few kLOC...

snthpy
3 replies
11h19m

I am very curious to know how you feel about PRQL (prql-lang.org) ? IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL because you can still execute your queries on any SQL compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.

The join syntax and semantics are one of the trickiest parts and have been under discussion again recently. It's actually one of the key parts of any data transformation platform and is foundational to Relational Algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter or reduce, but joins require care to preserve monadic composition (see for example the design of SelectMany in LINQ or flatmap in the List Monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131... That issue is closed but I would love to hear any comments, and you are welcome to open a new issue referencing that comment or simply tagging me (@snth).

Disclaimer: I'm a PRQL contributor.

theLiminator
1 replies
9h9m

Do you compile to substrait or to SQL strings?

snthpy
0 replies
6h54m

SQL strings as the final output but there are two intermediate representations that can be serialized to JSON.

There's an open issue for Substrait but I don't think that anyone's started any work on that yet.

serjester
0 replies
2h27m

First time I’ve heard of it but seems very cool. My background is data science though so being able to use DS libraries or even apply a python function is why I find myself in Pandas / Polars. This seems very powerful for a data engineer.

I also think it’s awesome you guys have a duckdb integration - maybe I’ll try it out.

nomilk
8 replies
17h36m

Is the sole appeal of polars (vs, say, pandas) its execution speed?

I've found being able to express ideas clearly in code (to aid comprehension now and in the future) to be much more important than shaving off a few seconds of run time.

For this reason I think speed alone is not a strong sell point, except specifically in cases where execution times really matter.

Analogous somewhat to how ruby/rails might be a 'slow' language/framework (e.g. 600ms when another framework might be 200ms) but multiples faster in facilitating the expression of complex ideas through code, which tends to be the far bigger problem in most software projects.

maronato
2 replies
12h43m

Complex operations on very large datasets can take multiple minutes in pandas. Polars is supposed to reduce that to a few seconds.

Since a lot of people use pandas to explore and experiment with datasets, having your workflow speed limited to a few operations an hour is hard to defend. That’s where the value proposition of polars and similar solutions lie IMO.

nomilk
1 replies
12h26m

operations on very large datasets can take multiple minutes in pandas. Polars is supposed to reduce that to a few seconds.

Good example; I can see how efficiency would matter for workflows like that.

I work with dataframes in the 10's-100's millions of rows (mostly in tidyverse, but also pandas and base python and R), and find most data wrangling operations are close to instant on modern laptops. Plotting is another story (not sure if polars helps there).

So the case for efficiency is weak at the 10-100 million row dataframe size (unless doing some intense computations), but gains strength as the size of the dataframe grows.

Would be a fun aside to test all these frameworks side by side with some 1m/10m/100m/1bn row joins, filters, summary calcs, maps etc to get some concrete data on where efficiency starts to become noticeable and starts to matter. I think at sub 100m rows it probably doesn't. Not for the kinds of operations I do anyway.

dash2
0 replies
7h23m

I'd be interested to know what proportion of users of dataframes are working at different orders of magnitude.

Most of my life I've had databases of like 1000. Now I have a big one of about 500K! So for me, speed is almost a non-issue. But that is my specific field.

singhrac
0 replies
17h32m

They're similarly expressive, and there are some areas where Polars (taking ideas from years of Pandas development) is ahead, for example with nullable dtypes.

That being said, there are many, many times when I would be willing to rewrite code to make it faster or more memory efficient. Just yesterday I rewrote a method from Pandas to Polars to take advantage of that.

Time might be cheap (while developing) but memory is expensive. Similarly, if you're writing data science jobs for production, you care about both.

sanderjd
0 replies
13h5m

Personally I prefer its API as well, but that seems to be a more controversial opinion than its often-huge performance wins.

nerdponx
0 replies
13h27m

The advantages are: raw performance, an optimizing query engine, streaming/out-of-memory processing support, backed by Arrow so you can load data with zero copy. Some people also prefer the API, which is very similar to that of Spark SQL and might feel more comfortable and consistent to people with professional programming backgrounds.

antonvs
0 replies
5h3m

There are lots of cases where execution times really matter. We use polars for exactly that reason.

DanielVZ
0 replies
17h26m

It strives for a more consistent Dataframe API too. It’s quite subjective but I prefer it

rvz
6 replies
18h8m

Meh. Another VC 'open-source' company.

Polars will remain MIT-licensed and the company will sponsor and accelerate the open-source development of Polars.

I bet that they will change to a custom business source available license sooner or later.

From the about page: [0]

We successfully closed a seed round of approximately 4M$, which was lead by Bain Capital Ventures.

Waiting for the day that a separate closed-source version of polars gets built / forked privately and improvements do not make it back into the source.

[0] https://pola.rs/posts/company-announcement/

Havoc
3 replies
18h4m

will remain MIT-licensed

Seems like fairly strong assurance to me

hatmatrix
1 replies
17h29m

Doesn't that imply the opposite? They can take the code base and do anything they want with it.

mjochim
0 replies
7h44m

Since they are the owners, they can do what they want with it no matter the license. The license only says what others who get a copy of the software can do with it.

ayhanfuat
0 replies
17h51m

Yes. And it doesn't really make sense to make the library closed source. The common business model for these kinds of tools is to provide consultation, a deployment platform, or a distributed version.

sanderjd
0 replies
12h47m

I'm so sick of this debate. I really don't get it. You all want less useful software to be made by fewer software developers being paid less for their work? What's the game plan here?

mmaunder
0 replies
17h42m

Yeah so this is AI's MovableType moment - maybe. Mojo - a will-be-open-sourced-soon-we-promise fast python alternative targeting AI - is in the same boat. Saying they'll be open source, releasing stuff, but VC backed with the ROI impetus.

Quick history lesson: MT owned blogging, got VC backed, changed the license because capitalism, and everyone fled to WordPress, which powers over 40% of the Web today. I think looking at the way Matt has run WP, and the level of sensitivity he's had to have towards OSS/GPL in order to maintain the ecosystem and community, is instructive when it comes to the viability of an OSS AI VC-backed C corp.

kelseyfrog
6 replies
18h29m

My data science team evaluated Polars and came back with a mixed bag of results. If there was any performance-critical section, then we would consider employing it, but otherwise it was a marginal negative given the overhead of replacing Pandas across dozens of projects.

gcarvalho
1 replies
17h37m

I think that's the right call. IMHO now is the time to experiment with it, not to replace pandas where it's already working.

The API is still seeing some (expected) breaking changes, and could become a maintenance burden across multiple projects. But the API already feels more consistent, and overall seems to be going in the right direction.

asqueella
0 replies
16h8m

You made me curious to look up the recent breaking releases https://github.com/pola-rs/polars/releases/tag/py-0.19.0 https://github.com/pola-rs/polars/releases/tag/py-0.20.0 And their policy about it: https://docs.pola.rs/development/versioning/#deprecation-war...

Looks like you ought to set aside some time to do updates each quarter, but I do wonder how much breakage there can be in practice; most changes seem pretty niche.

esafak
1 replies
15h6m
nerdponx
0 replies
13h26m

Do you know if this project is collaborating with the people behind the "standard data frame API"? https://data-apis.org/dataframe-api/draft/index.html

theLiminator
0 replies
16h40m

Definitely don't rewrite all the code, I think it's worth adopting or evaluating in new code though. Especially with the very cheap pandas interop, you can zero-copy to pandas pretty easily if you use the arrow backend.

choppaface
0 replies
17h34m

does your team use pandasql or mainly the direct pandas api? curious

dang
5 replies
18h19m

Related:

Detailed Comparison Between Polars, DuckDB, Pandas, Modin, Ponder, Fugue, Daft - https://news.ycombinator.com/item?id=37087279 - Aug 2023 (1 comment)

Polars: Company Formation Announcement - https://news.ycombinator.com/item?id=36984611 - Aug 2023 (52 comments)

Replacing Pandas with Polars - https://news.ycombinator.com/item?id=34452526 - Jan 2023 (82 comments)

Fast DataFrames for Ruby - https://news.ycombinator.com/item?id=34423221 - Jan 2023 (25 comments)

Modern Polars: A comparison of the Polars and Pandas dataframe libraries - https://news.ycombinator.com/item?id=34275818 - Jan 2023 (62 comments)

Rust polars 0.26 is released - https://news.ycombinator.com/item?id=34092566 - Dec 2022 (1 comment)

Polars: Fast DataFrame library for Rust and Python - https://news.ycombinator.com/item?id=29584698 - Dec 2021 (124 comments)

Polars: Rust DataFrames Based on Apache Arrow - https://news.ycombinator.com/item?id=23768227 - July 2020 (1 comment)

dangoodmanUT
4 replies
16h17m

so you took my original username

mkl
2 replies
12h19m

From 2014-03-29, https://news.ycombinator.com/item?id=7494093:

A couple of personal points that I may as well insert here. The account I'm now using, dang, was used briefly in 2010 by someone who didn't leave an email address. I feel bad for not being able to ask them if they still want it, so I'm considering this an indefinite loan. If you're the original owner, please email me and I'll give it back to you. The reason I want the name dang is that (a) it's a version of my real name and (b) it's something you say when you make a mistake.

You're probably too late.

Terretta
0 replies
5h42m

You're probably too late.

From the 2014 thread, it was a (four years and) three months offer:

“I was thinking that because it's just software, we could restore the old account to its exact prior state. But the attention that accrues to a moderator account would make that impossible. So I guess I'll keep the offer open for three months, after which we can make some other arrangement.”

KomoD
0 replies
10h21m

So I guess I'll keep the offer open for three months, after which we can make some other arrangement.

Like a decade too late it seems

RockyMcNuts
0 replies
3h22m

not sure if this is a joke but hey you should check out this new slowpoke meme, it's kind of cool

https://imgflip.com/memegenerator/Slowpoke

appplication
5 replies
18h33m

Been watching polars for a while. Looks promising, would love to test it against spark some day. So much of our infra is on spark though, so we're probably just locked in forever.

esafak
4 replies
15h3m

Spark feels overwrought, but polars does not scale out yet.

https://github.com/pola-rs/polars/issues/5621

appplication
2 replies
14h55m

Interesting, that is a huge nonstarter for us then. I always thought of polars as “pandas but distributed” but I see I have been incorrect in that assessment.

theLiminator
1 replies
14h8m

You might want to consider seeing how far vertical scaling can bring you.

https://motherduck.com/blog/big-data-is-dead/

Is a good read

appplication
0 replies
12h56m

That is a good read, and I see most of those points for probably even most midsize companies. I think in this case we are data 1%-ers. We generate and process terabytes of data every day, so the need for horizontal scaling is real.

sanderjd
0 replies
12h52m

Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista

imgabe
4 replies
12h34m

There must be a corollary to Greenspun's Tenth Rule (https://en.wikipedia.org/wiki/Greenspun's_tenth_rule) that any sufficiently complicated data analysis library contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQL.

I use Pandas from time to time and I'll probably try this out, but I always find myself wishing I'd just started with shoving whatever data I'm working with into Postgres.

It's not like I'm some database expert either, I'm far more comfortable with Python, but the facilities for selecting, sorting, filtering, joining, etc tabular data are just way better in SQL.

theLiminator
0 replies
9h5m

Imo lazy dataframe syntax is a far superior frontend to a query engine. Polars also has SQL support, but really the frontend isn't generally where bugs come from but instead come from the query engine.

Postgres would be an order of magnitude slower than OLAP query engines for the types of queries that people do with them.

snthpy
0 replies
11h45m

I recommend you look at DuckDB and the duckdb-prql extension.

DuckDB allows you to work on your Polars and Pandas (and any data on Arrow format) directly using SQL without needing any data copying or duplication.

The duckdb-prql extension allows you to use PRQL (prql-lang.org), which gives you all the power and universality of SQL with the ergonomics and DX of Polars or Pandas (IMHO).

Disclaimer: I'm a PRQL contributor.

sanderjd
0 replies
12h19m

You could do that, but it would likely both perform significantly worse (if you're doing "analytical" kinds of queries) and be a lot less flexible and expressive.

But you may want to look into DuckDB, which has a sql implementation that is not ad hoc, bug ridden, slow, or incomplete (though I honestly don't know about the formality of its specification). And it is compatible with polars :)

codyvoda
0 replies
5h28m

the creator of pandas created Ibis, which has a Postgres backend, for reasons like this: https://ibis-project.org/backends/postgresql

this is a better approach to Python dataframes and SQL

__mharrison__
4 replies
17h11m

I'm about to publish Effective Pandas 2 (waiting for the pandas 2.2 release) and am getting my next book, Effective Polars, reviewed.

Happy to assist and answer any questions I can.

dr_kiszonka
3 replies
15h16m

Before I ask my questions, here is an idea for your pandas book, if you haven't covered it already. The support of basic operations, like `round`, depends on the underlying data type (regular float vs. numpy's float16, float32, float64). Some np floats get rounded while others simply get ignored (they will not get rounded). It took me many hours to figure this out and fix the resulting bugs. Maybe others would appreciate some information about this and similar gotchas.

Regarding polars, would you have time to answer these questions?

1) How are polars supported by popular data science packages, e.g., for plotting?

2) I know it is a bit silly: is there a way to get around typing `pl.col`, etc. all the time?

3) Besides `tidypolars`, are there any reasonable packages that add support for dplyr-style pipes or operation chaining?

rodonn
0 replies
2h26m

I haven't run into any friction for (1), since worst case you just call `.to_pandas()` at the end of your pipeline before you start plotting. For any plotting APIs that rely on the direct column vectors, no conversion to pandas is required.

rodonn
0 replies
2h27m

For (2) I like to do:

1. `from polars import col`, which at least shortens each use by a few characters.

2. For columns that I use very frequently, define a "constant" for them, e.g. `ID = col("id")`.

This lets you do df.groupby(ID) instead of df.groupby(pl.col("id")). Another advantage of defining these column "constants" is that it makes it much easier to refactor when renaming all usages of the column (without needing to check whether each string "id" is being used as a column name vs. something else).
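Put together, a sketch of both shortcuts with hypothetical column names (using the current `group_by` spelling):

    import polars as pl
    from polars import col

    ID = col("id")  # a reusable "constant" for a frequently used column

    df = pl.DataFrame({"id": [1, 1, 2], "v": [3, 4, 5]})
    out = df.group_by(ID).agg(col("v").sum())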

__mharrison__
0 replies
12h38m

1 - Polars has support for plotting in the most recent release. Using Matplotlib w/ Polars data tends to work too. Otherwise, you can drop into Pandas to get support where it may be missing.

2 - If you just want the columns, there is no need to make an expression; just pass the string. There are shortcuts like pl.sum(col) instead of pl.col(col).sum(). I guess you could shortcut, c = pl.col, if you really hate typing it...

3 - Polars supports chaining and encourages it out of the box. Anything specific you are looking for?

spenczar5
3 replies
15h50m

Polars is cool, but man, I really have come to think that dataframes are disastrous for software. The mess of internal state and the confusion of writing functions that take "df" and manipulate it - it's all so hard to clean up once you're deep in the mess.

Quivr (https://github.com/spenczar/quivr) is an alternative approach that has been working for me. Maybe types are good!

theLiminator
0 replies
14h10m

Polars is a lot better than pandas at maintaining valid state.

Because you ideally describe everything in terms of lazy operations, it internally keeps track of all your data types at every step until materialization and execution of the query plan. Because of that, you're not going to have the same kind of data type issues you might have in pandas. There are also libraries built on pydantic for polars dataframe validation (see patito, though it's not mature yet).
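A small illustration of that: a LazyFrame resolves the output schema of the whole pipeline before anything executes, so dtype mistakes surface early. A minimal sketch:

    import polars as pl

    lf = pl.LazyFrame({"a": [1, 2, 3]})

    # nothing has been computed yet, but the planned dtypes are already known
    plan = lf.with_columns((pl.col("a") * 1.5).alias("b"))
    print(plan.schema)  # {'a': Int64, 'b': Float64}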

radus
0 replies
15h32m

Totally agree with the critique, though it bears mentioning that one way that polars differentiates itself from pandas is the expressions API. See [1] for an example.

1: https://kevinheavey.github.io/modern-polars/method_chaining....

efxhoy
0 replies
7h34m

Yes, the dataframe is very tricky to get right, function-signature-wise. I used to write a lot of pandas-heavy software and converged on writing most functions to take a dataframe or series and, when possible, just return a series, then put that series back into the df at the function call site. Handing dfs off to functions that mutate them gets gnarly very quickly.

perilunar
3 replies
15h49m

No idea what this is about, but that pie chart is a crime against data.

siddboots
2 replies
15h45m

It resembles a pie chart, but if you think of it instead as like a stop watch, then it makes perfect sense.

perilunar
1 replies
15h29m

It would make more sense if each time was an arc instead of a segment. And what does one revolution represent?

throwaway167
0 replies
15h10m

Amount of time to eat a pie while waiting for your numerical library of choice.

Conclusion: Polars gives you indigestion.

xwowsersx
2 replies
16h10m

One comment and a question: firstly, this site is very nice and works well on mobile. Secondly, I wasn't able to find an explanation of how using Polars in Python works. Is it using pyo3 or is this something totally different? Re the performance highlighted in that pie chart, does that hold when using Polars from Python?

theLiminator
1 replies
14h15m

It does for the vast majority of things. If anything, due to very aggressive compiler settings, polars in python can be faster.

It does use pyo3. Though basically anytime you use a python lambda (the equivalent of pandas apply) on a row basis, it will be limited by python speed/the GIL.

xwowsersx
0 replies
13h58m

Thanks!

wenc
2 replies
18h20m

I don’t use Polars directly, but instead I use it as a materialization format in my DuckDB workflows.

duckdb.query(sql).pl() is much faster than duckdb.query(sql).df(). It's zero-copy to Polars and happens instantaneously, while Pandas takes quite a while if the DataFrame is big. And you can manipulate it like a Pandas DataFrame (albeit with slightly different syntax).

It’s greater for working with big datasets.

sanderjd
0 replies
13h8m

Very cool!

grej
0 replies
14h6m

This is a really nice insight that I wasn't aware of. Many thanks.

recursive4
2 replies
16h47m

I recently reached the limits of Pandas running on my 2020 16gb M1. Counting the number of times an element appears in a 1.7B row DataFrame using `df.groupby().size()` would consistently exceed available memory.

Rust Polars is able to handle this using Lazy DataFrames / Streaming without issue.
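In Python Polars the equivalent looks roughly like this (hypothetical file names); the streaming engine processes the scan in chunks instead of materializing all 1.7B rows:

    import polars as pl

    counts = (
        pl.scan_parquet("records/*.parquet")  # lazy: nothing is loaded yet
        .group_by("element")
        .agg(pl.len().alias("n"))             # rows per group; pl.count() on older versions
        .collect(streaming=True)              # execute out-of-core, in chunks
    )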

sweezyjeezy
1 replies
16h44m

FWIW I think df.column.value_counts() is better to use here in pandas.

recursive4
0 replies
2h57m

It unfortunately also exceeded available memory.

A basic approach which worked was sequentially loading each df from the filesystem, iterating through record hashes, and incrementing a counter; however the runtime was an order of magnitude greater than my final implementation in Polars.

jamesblonde
2 replies
11h13m

There is a new open source columnar lakehouse gaining traction with low cost data processing and storage, with Polars as part of it:

1. Querying and Data processing in Polars or DuckDB

2. Metadata (for transactions, time-travel) table formats (Iceberg, Hudi, Delta)

3. Storage with Parquet on S3.

With Polars/DuckDB, you can process up to 100+ GB of data at a time on a single VM, and turn it off when you're not using it. It's a lower-overhead, lower time-to-value stack than Spark. And even if you have TBs of data, as long as you only process ~100GB at a time, it works fine.
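As a rough sketch of steps 1 and 3 of that stack (hypothetical bucket path; assumes cloud credentials are configured in the environment):

    import polars as pl

    # lazily scan Parquet on S3; only the needed columns/row groups are read
    lf = pl.scan_parquet("s3://my-bucket/warehouse/events/*.parquet")

    daily = (
        lf.group_by("day")
        .agg(pl.col("amount").sum())
        .collect()
    )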

tsss
0 replies
6h2m

Why the 100GB hard limit? If you can stream from disk you should be able to process infinite data sets if the operations allow it. If you need to do stuff like deduplication, then it will depend on how much RAM you have available.

DEDLINE
0 replies
1h27m

Know of any papers / articles on this stack?

ekianjo
2 replies
12h10m

and here is a Polars library if you use R: https://github.com/pola-rs/r-polars

rodonn
0 replies
2h24m

If you prefer dplyr syntax you can use https://tidypolars.etiennebacher.com/

_Wintermute
0 replies
7h7m

The R syntax is not kind on the eyes. I know that's a shallow dismissal, but I think it would really start to irritate me if I had to read that all day.

didip
2 replies
17h33m

What is this thing? Does it aspire to be a Spark replacement?

sanderjd
0 replies
12h49m

Why are there so many low-brow dismissal comments here? Read the docs if you don't know what it is. If you don't know what they mean by "data frames", either move on because this just isn't for you, or do the tiniest possible amount of research into what that means. It's a very common concept that I'm sure every single person who frequents this site is perfectly capable of understanding on their own.

esafak
0 replies
15h2m

pandas replacement; it runs on a single machine.

benrutter
2 replies
8h2m

I'm really excited about Polars and its speed performance is super impressive, buuutt... it annoys me to see vaex, modin and dask all compared on the same benchmarks.

For anyone who doesn't use those libraries, they are all targeted towards out-of-core data processing (i.e. computing across multiple machines because your data is too big). Comparing them to a single core data frame library is just silly, and they will obviously be slower because they necessarily come with a lot of overhead. It just wouldn't make sense to use polars in the same context as those libraries, so seeing them presented in benchmarks as if they are equivalents is a little silly.

And on top of that, duckdb, which you might use in the same context as polars and is faster than polars in a lot of contexts, isn't included in the benchmarks.

The software engineering behind polars is amazing work and there's no need to have misleading benchmarks like this.

oreilles
0 replies
5h32m

I don't know about the others but you can use Dask on a single machine, and it's also the easiest way to use Dask. It allows parallelizing operations by splitting dataframes into partitions that get processed in individual cores on your machine. Performance boost over pandas can be 2x with zero config, and I've seen up to 5x on certain operations.

codyvoda
0 replies
5h43m

Ibis, a Python dataframe library created by the creator of pandas, uses DuckDB as the default backend and generally beats Polars on these benchmarks (with exceptions on some queries)

timenova
1 replies
18h13m

The Explorer library [0] in Elixir uses Polars underneath it.

[0] https://github.com/elixir-explorer/explorer

sanderjd
0 replies
12h57m

Wow cool! I had no idea it had made its way into so many libraries across so many languages already! Very impressive for such a new project!

ryanmonroe
1 replies
18h5m

DuckDB maintains a performance benchmark of open source database-like tools, including Polars and Pandas

https://duckdblabs.github.io/db-benchmark/

nerdponx
0 replies
13h30m

I'd like to see Kdb in here.

qwertox
1 replies
9h56m

Any recommendations on which binary format to use when I want to store all the DataFrames to disk in order to load them at a later point? My data as an indented JSON file takes up around 800 MB.

congoe
0 replies
7h59m

Parquet works well as it natively supports Apache Arrow, the underlying data structure.
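A minimal sketch of the round trip (hypothetical file name); Parquet is columnar, compressed, and preserves dtypes, so it is usually far smaller than indented JSON:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    df.write_parquet("data.parquet")        # compressed columnar file on disk

    same = pl.read_parquet("data.parquet")  # eager reload
    lazy = pl.scan_parquet("data.parquet")  # or scan lazily for big files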

mmastrac
1 replies
18h29m

When we shipped Jupyter support in Deno, `nodejs-polars` was one of the cornerstone libraries for data science we supported.

https://blog.jupyter.org/bringing-modern-javascript-to-the-j...

I'm not personally a Data Science guy, but considering how early the JS/Jupyter ecosystem is, it was surprisingly quick to get pola.rs-based analysis up and running in TypeScript.

The JS bindings certainly need a bit of love, but hopefully now that it's more accessible we'll see some iteration on it.

swyx
0 replies
10h42m

TIL. always wanted pandas in JS!

midnight_shaman
1 replies
18h21m

I am using polars (the rust library) in production. I am mostly satisfied so far.

Excellent work!

p4ul
0 replies
14h22m

Same! And it's been an absolute delight!

impulser_
1 replies
15h52m

Building high-performance, multi-language and multi-platform libraries is probably the best use case for Rust because of the memory safety advantages it has over C and C++. Polars is a great example of this. It is currently being used in Rust, Python, JavaScript, Elixir, R and Ruby.

LunaSea
0 replies
11h49m

The number of memory leaks I got while diving into Polars tells a different story.

hn1986
1 replies
15h44m

Thoughts on using Polars in production environments? It's not even 1.0 yet. Where will it be 3 years from now?

Note: I do like Polars but have not used it in work setting.

n8henrie
0 replies
4h56m
daemonk
1 replies
18h2m

How well does it work with sklearn/scipy/pytorch?

ayhanfuat
0 replies
17h46m

sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.

brap
1 replies
5h41m

Newbie question:

When would you use dataframes (e.g Pandas, Polars...) and when would you use tensors (Pytorch, TF...)?

Are the usecases completely distinct or is there overlap?

codyvoda
0 replies
5h24m

fairly distinct, there is of course some overlap. you could technically do (mostly?) everything with tensors that you could with dataframes, but generally dataframes are for analyzing and transforming data for ETL/analytic workloads. tensors are how machine learning models understand data, i.e. before training a neural network (or LLM) at some point text is converted to numbers in tensors

you still transform data in tensors, but generally that's one-hot encoding or transposing or other transformation done right before model training. before that, you might use a dataframe to cleanup strings, aggregate timeseries data, etc.

hope that makes sense. so yes there's some overlap, but generally they're distinct toolsets that would be used together for an end-to-end ML project

actionfromafar
1 replies
18h37m

Hey, what do you peeps use this for? Instead of json.loads() ? :-D

jerrygenser
0 replies
18h36m

More like for in-memory transformations you would do in pandas

ImageXav
1 replies
10h55m

Like with many such projects, it's very helpful if you use DataFrames in isolation, but it lacks support from the wider scientific ecosystem. I've found that using polars will often break other common data science packages such as scikit-learn. This unfortunately often makes it impractical in the wild.

theLiminator
0 replies
9h2m

Just convert it to pandas/numpy at the edges?

wojciem
0 replies
17h41m

Have been using pandas for years, and statements about polars definitely seem appealing. Especially around performance, where the apply function in pandas (to iterate over rows and derive a new column) can easily be 100x slower than vanilla Python. Similarly with some pandas API culprits, like having to reset the index after joins and other transformations.

theogravity
0 replies
17h24m

If I didn't just take courses at DataCamp for Python and data science I wouldn't have known this is a replacement for the pandas library.

swaraj
0 replies
10h52m

Always happy to see new stuff on the block, but hard to leave pandas and python ecosystem for this

Not sure where this fits into any workflow tbh; with sufficiently large datasets, you will inevitably need spark (which has the same API as pandas)

softwaredoug
0 replies
6h4m

Are there resources for converting a Pandas extension array (i.e. a custom data type) to Polars? I have a column type that's a searchable full-text index; I'd like to have it support both if possible.

smcleod
0 replies
13h54m

Why is everyone's marketing team underlining a random word in their slogans these days? Is the most important takeaway not to be forgotten from this product the word "era"?

sidcool
0 replies
10h5m

Apart from being in memory, any other advantages of data frames over a Postgres table with indexes?

ryukoposting
0 replies
4h21m

I rarely use Pandas (just haven't come across it much in my work) but I almost want to come up with a side project just so I can fiddle with this. It's making some big promises.

By the way, that pie chart on the home page is a crime.

munro
0 replies
7h5m

I have been using Polars on and off for the past few years,

but now I've been using it 99% exclusively for the past 2–3 months. I would say it's ready for prime time! I don't get any @TODO seg faults anymore lol

I don't know if I notice the speed difference over Pandas most of the time, but I do find the way of expressing transformations way more intuitive than Pandas; it's more similar to SQL.

mmaunder
0 replies
17h47m

For those familiar with what Pandas and Polars is, I wanted to draw attention to this comment in this thread. Wow!

https://news.ycombinator.com/item?id=38920417

kavalg
0 replies
3h46m

I am curious how this compares to nvidia's rapids.ai / cudf.

jh_zab
0 replies
12h6m

I have ported a few internal libraries to polars from pandas and had great results.

I never liked pandas much due to its annoying API, indices and single-threaded implementation (we usually get at least a 10x performance boost, and for me that also means improvements in productivity). Also, pandas never handled NULLs properly. This should now work with the pyarrow backend in pandas, but we can't read our parquet files generated by PySpark. With polars it mostly just works, but we use pyarrow to read/write parquet.

Overall I can recommend it; conversion from/to pandas DataFrames was never an issue either.

hresvelgr
0 replies
12h22m

I used this at work about a year ago to build a statistics platform and it was damn good, chewing through gigabytes of data quickly and with no hassle. Having never worked with such a library before I found myself asking my colleague who'd worked with Pandas a lot of questions which he was able to answer easily due to the overlap.

My only critique of the Rust crate is that it's not as well documented as the Python API, and it required a lot more unwrapping and error handling than other Rust crates, which was quite tedious.

hansvm
0 replies
12h17m

Crazy coincidence, on a whim I learned a bit of polars this morning to prep for a data science interview. All I really wanted was "SQL but cleanly FFI mixable with python," and the API flowed nicely. Every time I said a certain function or syntax should exist, the dir() or help() info confirmed it existed. It was fantastic. Pandas was a nightmare trying to remember how to slice and dice everything, and this was a breath of fresh air.

gigatexal
0 replies
12h12m

I like the api for this much better than pandas but ymmv

fullofdev
0 replies
16h56m

What a domain!

elbear
0 replies
6h34m

Romanians chuckling at the domain name

drivers99
0 replies
17h29m

paralellism

should be parallelism

anonu
0 replies
15h9m

We've been using polars in production for over a year as a replacement to pandas. It's been a good experience: smaller memory footprint, way faster and just more pleasant in general to code. Package is being developed quickly so things get deprecated quickly but I'm not one to complain.

Lyngbakr
0 replies
18h13m

Last year, we began testing Polars with TypeScript because I'd heard great things about it for Python in terms of performance and usability. Unfortunately, I ran into bugs that first morning that stopped us in our tracks. I'll definitely give it another shot once they've ironed out the kinks as it looks very promising, but it wasn't ready for prime time for us.

F-W-M
0 replies
10h32m

Is there a good book/overview on how dataframes and OLAP query engines work under the hood?