It's crystal clear that this page was written for people who already know what they're looking at; the first line of the first paragraph, far from describing the tool, touts its qualities: "Polars is written from the ground up with performance in mind"
And the rest follows the same line.
Could anyone ELI5 what this is and what needs it's a good solution for?
EDIT: So an alternative implementation of Pandas DataFrame. Google gave me [0] which explains:
The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.
DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.
It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.
Actually pandas is not the original open source data frame library, perhaps only in Python. There is a very rich tradition in R on data.frames, which includes the unjustly neglected data.table.
Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.
One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.
The difference is that dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc. These kinds of things don't really make sense in DB tables (and they are generally not supported; you have to jump through hoops to do similar things in DBs).
Yes, that's totally fair; dataframes are more flexible in that sense.
Oh, and another important difference is memory layout. The dataframe implementations mostly (or all) use column-major format. Whereas most conventional SQL implementations use row-major format, I believe.
I think most OLTP databases are row oriented whilst most OLAP are column.
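The layout difference is easy to demonstrate with numpy, which supports both orders (a minimal sketch; the array contents are arbitrary):

```python
import numpy as np

data = np.arange(6).reshape(2, 3)

row_major = np.ascontiguousarray(data)  # C order: each row is contiguous in memory
col_major = np.asfortranarray(data)     # F order: each column is contiguous in memory

# Scanning a whole column touches contiguous memory only in F (column-major)
# order, which is why analytical (OLAP) engines favor columnar layouts.
col_sum = int(col_major[:, 0].sum())  # 0 + 3 = 3
```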
I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.
They still have their advantages with row/column labels, NaN handling etc. These are not operations I am speculating about by the way. I am most familiar with pandas and the dataframe there has transpose, dot product operations and almost all column operations have their correspondence in rows (i.e. you either sum(axis=0) or sum(axis=1)).
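For the curious, the pandas row/column operations mentioned above look roughly like this (a minimal sketch with made-up data; the matrix-style operations are only meaningful here because the frame is all-numeric):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

col_sums = df.sum(axis=0)    # sum down each column -> a: 3, b: 7
row_sums = df.sum(axis=1)    # sum across each row  -> r1: 4, r2: 6
transposed = df.T            # labels swap roles: columns become the index
product = df.dot(df.T)       # matrix product, aligned on labels
```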
Oh, based on the comment you replied to I thought this was about R. In R matrices can handle NaNs and NAs, have column and row labels, have dot products and much more.
I feel like the predecessor of R should be Q!
The way that I've heard the story, S was short for "statistics", and R was chosen because the authors were _R_obert [Gentleman] and _R_oss [Ihaka].
Statisticians are funny!
Yeah. I think Wes McKinney liked the data frames in R, but preferred the programming language of Python. I've heard somewhere that he also got a lot of inspiration from APL.
R is literally designed to do statistics and has first class support and language feature support for many specialized tasks in statistics and closely related fields.
Python is literally designed to be easy to program with in general.
Well, it turns out when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this and they are very loud about it.
But it shouldn’t really surprise anyone. That is literally how those languages are designed.
Most of the R devs I know like this are just butthurt they're paid less and refuse to switch because they're obstinate, or they're a little scared they're being left behind. The first group is all over the place, but the second group tends to skew older, of course.
R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.
R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).
Look, I started with R and use mostly Python these days, but this is not really a fair take.
R is (still) much, much, much better for analytics and graphing (the only decent plotting library in python is a ggplot clone). The big change (and why Python ended up winning) is that integrating R with other tools (like web stuff, for example) is harder than just using Python.
pandas (for instance) is like an unholy clone of the worst features from both R and Python. Polars is pretty rocking, though (mostly because it borrows from Spark/dplyr/LINQ).
It's another example of Python being the second best language for everything winning out in the marketplace.
That being said, if I was starting a data focused company and needed to pick a language, I'd almost certainly build all the DS focused stuff in R as it would be many many times quicker, as long as I didn't need to hire too many people.
So so true.
I was working on an ad hoc project that needed a quick result by the end of the day. I had to pull in a series of parquet files and do some quick and dirty analysis. My first reflex was to use Python with pandas, quick and easy. Python could not handle the datasets; they were too large. I decided to give R and data.table a go and it went smoothly. I am usually a Python user, but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.
My friend. You cannot make people like R. We all know about and study data.table, so it’s not neglected, we just don’t use that implementation.
Mainly because R sucks for anything that isn’t statistics.
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.
[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...
Polars has an OLAP query engine so without any significant pandas overhaul, I highly doubt it will come close to polars in performance for many general case workloads.
This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?
Polars can use lazy processing, where it collects all of the operations together and creates a graph of what needs to happen, while pandas executes each operation eagerly, as soon as the code runs.
Spark does this too, and it makes complete sense for distributed setups, but apparently it's also faster locally.
Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.
yeah, totally, I can see that. I think that polars is the first library to do this locally, which is surprising if it has so many advantages.
It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.
Not according to DuckDB benchmarks. Not even close.
https://duckdblabs.github.io/db-benchmark/
Ouch! It is going to take a lot of work to get Polars this fast. If ever.
Not with eager API.
Memory and CPU usage is still really high though.
Ah, like polar bears are a much more aggressive implementation of the idea behind panda bears? That’s a pretty funny name if so.
Yeah. The name always makes me chuckle
I don't know, I think the name is kind of polar-izing
/pun
Oh, I'm not sure. I'd say it's bear-ly polarizing.
I'm so sorry.
Depends on your frame of mind.
This thread is turning into pandamonium
I'm worried it's going to get grizzly.
Thanks folks, you all made my day. "Frame of mind" was my favorite. I'm surprised I didn't think of some of these, I must be getting... Rust-y
Next re-implementation will be called grizzl.ys, hand-written in Y86 assembly.
...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.
That's interesting! I didn't realize there had been prior dataframe libraries in Python!
Out of curiosity, what was/were the previous libraries?
it is a built-in data structure (and function) in R.
Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.
It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.
ok, guess I misunderstood both of your comments. ´_>`
Sorry :) Pandas is undisputed king. But there were multiple bindings from Python into R available in the early 2000's. Some like rpy and rpy2 are still around, others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.
Not *original* but probably most commonly used.
yup, I first met data frames in R, and pandas is the Python answer to R, isn't it?
If I understand correctly, Pandas' original scope was indexed in-memory data frames for use in high frequency trading, using the numpy library under the hood. At the time it was written you had JPMC's Athena, GS's platform, and several HFT internal systems (C++, my friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.
indeed it's both: it was created for financial analytics, and it provides R dataframe features to Python. thanks for making me detour into the history of it.
Yeah, I believe Pandas was inspired by similar functionality in R.
Yes, it's annoying negative feature of many tech products. Of course it's natural to want to speak to your target audience (in this case, data scientists who like Pandas but find it annoyingly slow/inflexible), but it's quite alienating to newbies who might otherwise become your most enthusiastic customers.
I am the target audience for Polars and have been meaning to try it for several months, but I keep procrastinating because I feel residual loyalty to Pandas: Wes McKinney (its creator) took the time to write a helpful book about the most common analytical tools: https://wesmckinney.com/book/
Ritchie Vink (the creator of Polars) deliberately decided not to write a book so that he (and his team) can focus full time on Polars itself.
Thijs Nieuwdorp and I are currently working on the O'Reilly book "Python Polars: The Definitive Guide" [1]. It'll be a while before it gets released, but the Early Release version on the O'Reilly platform gets updated regularly. We also post draft chapters on the Polars Discord server [2].
The Discord server is also a great place to ask questions and interact with the Polars team.
[1] More information about the book: https://jeroenjanssens.com/pp/
[2] Polars Discord server: https://discord.gg/fngBqDry
Slightly offtopic: it's a tragedy that projects like this use discord as the primary discussion forum. It's like slack in that knowledge goes to die there.
I often see this comment, and every time I think: having people come to the information AND the community is better for the project.
Short term perhaps, but long term having a non-indexed community is inconvenient for newcomers.
there are projects that you can use to index Discord servers; unfortunately a lot of communities just don't use them.
That's why Ritchie is very active on, and often refers to, Stackoverflow as well! Exactly to document frequent questions, instead of losing them to chat history.
Microsoft Copilot can summarize discussions. With some orchestration it could even extract question+answer pairs from past discussions and structure them in a Stack Overflow-like format.
source: we use this feature in beta as part of the enterprise copilot license to summarize Teams calls. Yes, it listens to us talking and spits out bullet points from our discussions at the end of the call. It's so good it feels like magic sometimes.
note on copilot: any capable model could probably do it. I just said copilot because it does it today.
by community do you mean all the people who make an account just to ask a question on the project's discord, only ever open it to check if someone answered and then never use discord again?
Luckily our book will also be available in hard copy so you can digest all that hard-won knowledge in an offline manner :)
I'll wait until chatGPT can regurgitate it.
losing all nuance by virtue of getting dopamine quicker? count me in!
I do understand the humor in your snarky comment; still, do buy a copy if you want to support them, as it's neither cheap nor easy to make a book.
Yeah, one thing that helps a bit is that they encourage you to post your questions to Stack Overflow, where they'll answer them.
There is a free Polars user guide [0] as part of the Polars project. It was known as "polars-book" before it was moved in-tree [1].
[0] https://docs.pola.rs/user-guide/
[1] https://github.com/pola-rs/polars/tree/main/docs/user-guide
Is there an even more basic book you'd recommend for more junior people, covering dataframes / storage solutions for ML applications? Thank you
Any plans to try fine-tuning an LLM specialised in Polars? That would really be the killer feature for major adoption, IMO.
Newbies are your best target audience too! They aren't already ingrained in a system and have to learn a new framework. They are starting from yours. If a newbie can't get through your docs, you need to improve your docs. But it's strange to me how mature Polars is and that the docs are still this bad. It makes it feel like it isn't in active/continued development. Polars is certainly a great piece of software, but that doesn't mean much if you can't get people to use it. And the better your docs, the quicker you turn noobs into wizards. The quicker you do that, the quicker you offload support onto your newfound wizards.
Interesting, I've personally found them quite good, and compared to datafusion or duckdb they're dramatically better. I agree pandas has better docs, but one of the strengths of polars is that I often don't need the docs, because lots of careful thought went into designing a minimal and elegant API; not to mention they actually care about subtle quirks like making autocomplete, type hinting, etc. work well.
Sounds like we might be coming from different perspectives. I honestly don't use any DF libraries often, and really only Pandas. I used to use pandas a fair amount, but that was years ago, and now I only reach for it a few times a year. So maybe the docs are good for people who already have deeper experience. I think the fact that you've used datafusion and duckdb illustrates that you're more skilled in this domain than I am; I haven't used those, haha.
But I do think making good docs is quite hard. You usually have multiple audiences that you might not even be aware of, which makes keeping an open ear for them one of the most important things to do. It's easy to get trapped thinking you've got your audience while actually (unintentionally) closing the door on many more groups. It's also just easy to focus on the "real" work and not think about docs.
What, specifically, is bad about the docs? This whole thread is people who just looked at the home page, saw that it is "DataFrames", but didn't know what that means and came here to complain. Nobody has said anything about issues with the docs for someone who understands what a data frame is (or spent like two minutes looking that up) but is struggling to figure out how to use this library specifically.
I'm a dataframes noob. I saw this post and the performance claims attracted me. I went to chatGPT to understand what dataframes were about. Then on udemy, I searched for a polar course. A course required pre-requisites : a bit about jupyter notebooks and pandas. Then I went through a few modules of a pandas course. Now, I'm going through a polars course. Altogether, I spent about 2-3 hours to setup the environment and know what this is all about.
A little bit of context would have helped attract a lot more noobs.
Your first paragraph makes perfect sense! I was nodding along. But then your concluding sentence was a bit of a record scratch for me. This all worked as intended! You knew what the project was about - "data frames" - and what might make it attractive to you - the performance claims - and then you went and followed exactly the right path to get the context you needed to understand what's going on with it. It's a big topic that you were able to spin up on to a basic level in 2-3 hours, by pulling on strings starting at this landing page. This is a very successful outcome.
I'd also recommend this book: https://wesmckinney.com/book/. It's not about polars, but you'd be able to transfer its ideas to polars easily once you read it.
"How To Be A Pandas Expert"[1] is a good primer on dataframes. There's a certain mental model you need to use dataframes effectively but it's not apparent from reading the official docs. The video makes it explicit: dataframes are about like-indexed one-dimensional data, and every dataframe operation can be understood in terms of what it does to the index.
[1] https://www.youtube.com/watch?v=oazUQPrs8nw
I can't speak for the Python side of the Polars docs but coming from Python and Pandas to Rust and Polars hasn't always been easy. To be fair, that isn't just about docs but also finding articles or Stack Overflow answers for people doing similar things.
That certainly makes sense!
I think your experience is probably making it difficult to understand the noob side of things. For me, I've struggled with simply slicing up a dataframe. And as I said, these aren't tools I use a lot, so "who understands what a data frame is" probably doesn't apply to me very well, and we certainly don't need the pejorative tone suggesting that it is trivially understood or something I should know through divine intervention. I'm sure it's not difficult, but it can take time for things to click.
Hell, I can do pretty complex integrals and derivatives and now so much of that seems trivial to me now but I did struggle when learning it. Don't shame people for not already knowing things when they are explicitly trying to learn things. Shame the people that think they know and refuse to learn. There's no reason to not be nice.
Having done a lot of teaching, I have a note: don't expect noobs to be able to articulate their problems well. They're noobs. They have the capacity to complain, but it takes expertise to clarify that complaint and turn it into a critique. I get that this is frustrating, but being nice turns noobs into experts, and often friends too.
I really think this is a misunderstanding of the purpose of different kinds of documentation. The documentation of a new tool for a mature technique is just not the primary place to focus on writing a beginners' tutorial / course on using that technique. Certainly, "the more the merrier" is a good mantra for documentation, so if they do add such material, all the better. But it is very sensible for it to not be the focus. The focus should be, "how can you use this specific iteration of a tool for this technique to do the things you already know how to do".
Nobody is suggesting that you should be an expert on data frames "through divine intervention". But the place to expect to learn about those things is the many articles, tutorials, courses, and books on the subject, not the website of one specific new tool in the space.
If you're really interested in learning about this, a fairly canonical place to start would be "Python for Data Analysis"[0] by Wes McKinney, the creator of pandas and one of the creators of the arrow in-memory columnar data format that most of these projects build atop now.
This is a (multiple-) book length topic, not a project landing page length topic.
0: https://wesmckinney.com/book/
The Rust docs are for some reason much worse than the Python docs, or at least that used to be the case
The docs are okay, but the feature set is lacking compared to pandas, which is understandable since this is at version 0.2. I was exploring if it's possible to use this, but we need diff aggregation which it doesn't have, so it's a no go right now.
Do you mean something like `.agg(pl.col("foo").diff())`?
Or is diff aggregation its own thing? (I tried searching for the term, but didn't find much.)
Nevermind, it has it but it's under Computation in polars.Series.diff and I was looking under Aggregation. This is great.
For instance, you've got a time series with an odometer value and you want the delta from the previous sample to compute each trip's distance.
"Newbies" to data science are indeed a good target audience, before they are already attached to pandas. But this doesn't imply they know nothing. It's very unlikely that someone both 1. has a need to do the kind of data analysis that polars is good at, and 2. has never heard of the "data frame" concept.
It’s annoying only because it’s on Hacker News; what are the odds of landing on it if you don’t know what it is and don’t have a need for it?
I mean, pretty high. What if your boss just tells you to learn polars, and you don’t know why? Saying what something is, is just good communication, and can help clarify for people who are confused.
Guess in the remote event that you're told to learn a new skill you know nothing about, you go to the pola.rs website, see "DataFrames for the new era", and start reading the documentation there about what a DataFrame is. The website clearly shows what it is; it's your job to understand it. I'd argue that if you knew what DataFrames are, you'd be saying "Why is it stating something so basic instead of just showing me the good stuff?"
I, for example, hate websites that try to serve newbies; there's plenty of content for newbies if they're interested. Not all of the web needs to serve them.
These workplaces where bosses tell employees to learn unheard-of tools with zero context sound terrible.
Shouldn't the good communication happen when the boss tells you to learn polars? Like, why are you telling me this, boss; what is it that you need done?
It's annoying because a single leading sentence would be enough to explain a product. Some of the words (for example "Data Frame") in that sentence can be links to other pages if that's necessary. It's a small change but it makes a huge difference.
Wes has also worked hard to improve a lot of the missteps of pandas, such as through pyarrow, which may prove even more impactful than pandas has been to date.
Polars is also a wonderful project!
Polars is also based on McKinney’s Arrow project.
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
https://github.com/pola-rs/polars/blob/main/README.md
Wes also literally created another Python dataframe project, Ibis, to overcome many of the issues with pandas
https://ibis-project.org
most data engines and dataframe tools these days use Apache Arrow; it's a bit orthogonal
I'm a data engineering newbie and I found it very clear, and it gave me an enthusiastic feeling (not an "alienating" feeling).
This whole thread just comes across as unmitigated pedantry to me.
Presumably you were introduced to the concept of DataFrames and how they're used through some other source, because Polars' landing page doesn't even bother to mention that it's used for data analysis, and the documentation simply assumes you're already familiar with the core concepts.
Compare that to Pandas which starts with the basics, "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." It then leads you to "Getting started" guide which features "Intro to pandas" that explains the core concepts.
Sadly it's not only tech products, but also things like security disclosures.
It always follows the same pattern:
Above that it says “DataFrames for a new era” hidden in their graphics. I believe it’s a competitor to the Python library “Pandas”, which makes it easy to do complex transformations on tabular data in Python.
It seems like it's a disease endemic to data products. Everybody, the big cloud providers and the small data products, build something whose selling point is "I'm the same as Apache X but better." But if you don't know what Apache X is, you have to go read up on that, and its website might say "I'm the same as Whatever Else but better," and you have to go read up on that. I don't want to figure out what a product does by walking a "like X but better" chain and applying diffs in my head. Just tell me what it does!
I get that these are general purpose tools with a lot of use cases, but some real quick examples of "this is a good use case" and "this is a bad use case, maybe prefer SQL/nosql/quasisql/hadoop/a CSV file and sed" would be really helpful, please.
I run into the same problem. I don't know what Pandas are (besides the bears) and at some point up the "it's like X" chain, I guess you have to stop and admit you're just not the target user of this tech product.
On the other hand, how can you become a target user if you don't know that a product category exists?
This project is a solution to a particular kind of problem. The way you become a target user of that solution is by first having the problem it's a solution to.
If you have the problem "I want to analyze a bunch of tabular data", you'll start researching and asking around about it, and you'll quickly discover a few things: 1. people do this with (usually columnar / "OLAP") sql query interfaces, 2. people usually end up augmenting that with some in memory analyses in a general purpose programming environment, 3. people often choose R or python for this, 4. both of those languages lean heavily on a concept they both call "data frames", 5. in python, this is most commonly done using the pandas library, which is pervasive in the python data science / engineering world.
Once you've gotten to that point, you'll be primed for new solutions to the new problems you now have, one of which is that pandas is old and pretty creaky and does some things in awkward and suboptimal ways that can be greatly improved upon with new iterations of the concept, like polars.
But if you don't have these problems, then the solution won't make much sense.
That's on you. If you want to become a data engineer and data scientist -- the two software positions most likely to use polars -- get learning. Or don't: learn it when you need it.
I dunno, I get the criticism, but also, every field assumes a large amount of "lingua franca" in order to avoid documenting foundational things over and over again.
Programming language documentation doesn't all start with "programming languages are used to direct computers to do things"; it is assumed the target audience knows that. Database documentation similarly doesn't start out with discussing what it means to store and access data and why you'd want to do that.
It's always hard to know where to draw this line, and the early iterations of a new idea really do need to put more time into describing what they are from first principles.
I remember this from the early days of "NoSQL" databases. They spilled lots of ink on what they even were trying to do and why.
But in my view this isn't one of those times. I think "DataFrames" are well within a "lingua franca" that is reasonable to expect the audience of this kind of tool to understand. This is not an early iteration of a concept that is not widely familiar, it is an iteration of an old, mature, and foundational concept with essentially universal penetration in the field where it is relevant.
Having said all that, I came across this "what is mysql" documentation[0] which does explain what a relational database is for. It's not the main entry point to the docs, but yeah, sure, it's useful to put that somewhere!
0: https://dev.mysql.com/doc/refman/8.0/en/what-is-mysql.html
See also: Is it pokemon or big data https://pixelastic.github.io/pokemonorbigdata/
If you don't know what the comparison product is either then you are not the target customer. This is a library for analyzing and transforming (mostly numerical) data in memory. Data scientists use it.
Dataframes in Python are a wrapper around 2D numpy arrays, that have labels and various accessors. Operations on them are OOM slower than using the underlying arrays.
I don't know where this myth originated, but I have seen it in multiple places. Even if you just think about it: 2D numpy arrays can't have different types for different columns.
"myth originated"
It's in the documentation.
I just learned pandas recently, and I would have said this same thing. Not because I read through the numpy code, but because I read the documentation.
Is it wrong? Can't a user pick up a new tool and trust some documentation without reading through hundreds of libraries built on libraries?
When was the last time someone traced out every dependency so they could confidently call something a "myth"?
Where is it written in the pandas documentation? A pandas dataframe is stored as a list of 1D numpy arrays, not a single 2D array.
The columns with common dtypes are grouped together in something called "blocks", and inside those blocks are 2D numpy arrays. It is probably not in the documentation because it is seen as an implementation detail, but you can see the block manager's structure in this article (https://dkharazi.github.io/blog/blockmanager/) or in this talk (https://thomasjpfan.github.io/scipy-2020-lightning-talk-pand...).
Actually, numpy has something called a structured array that is pretty much what you described.
Well, if you use structured arrays or record arrays, you can do this (more or less).
https://numpy.org/doc/stable/user/basics.rec.html
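A minimal sketch of a structured array holding mixed column types (field names and data are made up):

```python
import numpy as np

# One dtype per field lets a single numpy array hold mixed column types.
people = np.array(
    [("alice", 30, 55.0), ("bob", 25, 80.5)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)

names = people["name"]              # column access by field name
older = people[people["age"] > 26]  # boolean filtering still works
```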
There's a very good point here but I don't think its made clear.
If your data fits into numpy arrays or structured arrays (mainly if it is in numeric types), numpy is designed for this and will likely be much faster than pandas/polars (though I've also heard pandas can be faster on very large tables).
Pandas and Polars are designed for ease of use on heterogeneous data. They also include a Python 'object' data type, which numpy very much does not. They are also designed more like a database (e.g. 'join' operations). This allows you to work directly with imported data that numpy won't accept, after which Pandas uses numpy for the underlying operations.
So I think the point is if you are running into speed issues in Pandas/Polars, you may find that the time-critical operations could be things that are more efficiently done in numpy (and this would be a much bigger gain than moving from Pandas to Polars)
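A sketch of that pattern, pulling the underlying numpy arrays out of a pandas frame for the numeric hot path (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# For a purely numeric time-critical step, drop down to the raw ndarrays
# and let numpy do the work without per-column pandas overhead.
x = df["x"].to_numpy()
y = df["y"].to_numpy()
dot = float(np.dot(x, y))  # 1*4 + 2*5 + 3*6 = 32.0
```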
I try to use polars each time I do some analysis where dataframes help. So basically any time I'd reach for pandas, which isn't too often; so each time it's fairly "new". This makes it hard for me to believe everyone saying "Pandas but faster" has actually used Polars, because I can often write Pandas from memory.
There's enough subtle and breaking changes that it is a bit frustrating. I really think Polars would be much more popular if the learning curve wasn't so high. It wouldn't be so high if there were just good docs. I'm also confused why there's a split between "User Guide" and "Docs".
To all devs:
Your docs are incredibly important! They are not an afterthought. And dear god, don't treat them as an afterthought and then tell people opening issues to RTFM. It's totally okay to point people in the right direction without hostility. It even takes less energy! It's okay to have a bad day and apologize later too, you'll even get more respect! Your docs are just as important as your code, even if you don't agree with me, they are to everyone but you. Besides technical debt there is also design debt. If you're getting the same questions over and over, you probably have poor design or you've miscommunicated somewhere. You're not expected to be a pro at everything you do and that's okay, we're all learning.
This isn't about polars, but I'm sure I'm not the only one to experience main character devs. It makes me (and presumably others) not want to open issues on __any__ project, not just bad projects, and that's bad for the whole community (including me, because users find mistakes. And we know there's 2 types of software: those with bugs and those that no one uses). Stupid people are just wizards in training and you don't get more wizards without noobs.
As another data point, I switched to Polars because I found it much more intuitive than pandas - I couldn't remember how to do much in pandas on the rare occasions (maybe twice a year) when I want to do data analysis. In contrast, Polars has a (to me anyway) wonderfully consistent API that reminds me a lot of SQL.
Yep, been using pandas for years now, still have no real mental model for it and constantly have to experiment or chat with our AI overlords to figure out how to use it. But SQL and polars make sense to me.
I also use Pandas very infrequently (it has been years). Usually, for data analysis, I'm reaching for R + tidyverse/data.table/arrow. I have found Python and Pandas to be inelegant and verbose for these tasks.
As of last week, I have a need to process tabular data in Python. I started working with polars on Friday, and I have an analysis running across 16 nodes today. I find it very intuitive.
Maybe that's it, because I don't really use SQL much. Reaching for pandas about the same rate. Maybe that's the difference? But I come from a physics background and I don't know many physicists and mathematicians that are well versed in SQL. But I do know plenty that use pandas and python. So there's definitely a lot of people like me. Also I could be dumb. Totally willing to accept that lol.
The story of Polars seems to be shaping up a bit like the story of Python 3000: everything probably could have been done in a slow series of migrations, but the BDFL was at their limit and had to start fresh. So it takes 10 years for the community to catch up. In the meantime, there will be a lot of heartache.
I honestly don't believe something like polars could've evolved out of pandas.
It's a complete paradigm shift.
Honestly, there's not much that could have been shared at the point Polars was conceived. Maybe now there's a little more (due to the Arrow backend), but probably still very little.
Just once I’d like to see "this library was written to fulfill head-in-the-clouds demands by management that we have some implementation, without regard to quality."
That has absolutely no relation to this project. What in the world are you talking about?
Trust me. It does. ;)
What do you mean? What "management" was this created to fulfill the demands of?
I’m just responding to
It is a common thing to see, I thought it would be funny to imagine the opposite.
For posterity, polars was a hobby product that started in 2020: https://news.ycombinator.com/item?id=23768227
Definitely not intended as a slight toward this project, just (what I thought was) a funny thought about that expression.
Noticed exactly the same - there's no description of the library whatsoever on the landing page. It is implied that it is a DataFrame library, whatever that means.
Maybe this is sort of like the opposite of how scam emails are purposefully scammy, so that only people who can't recognize scams will fall for them. Only people who know what "a DataFrame library" is - which is an enormous number of people, since this is probably the most broadly known concept in data science / engineering - will keep reading this, and they are the target audience.
While that may be, I think it would make sense to describe the project in a succinct way on the first page a visitor lands.
It is described in a succinct way. "DataFrames" is that description. It's the very first text on the page. It's really the same as having the word "database" be the first text on the landing page of a new database project. If you don't know what the word "database" means, the landing page for a new database project is really not the place to expect to learn about that. The "data frame" concept is not quite as old or broad as the concept of "databases", but it's really not that far off. It's decades old, and is about as close to a universal concept for data work as it's possible to get.
But you're not the audience? There is very little to gain by tailoring the introduction to people who aren't the audience.
You don't go to a car parts manufacturer's site expecting an explanation of what an intercooler is.
I was going to say - it always feels so humbling seeing pages like this. "DataFrames for the new era" okay… maybe I know what data frames are? "Multi-threaded query engine" ahh, so it’s like a database. A graph comparing it to things called pandas, modin, and vaex - I have no clue what any of these are either! I guess this really isn’t for me.
It’s a shame because I like to read about new tech or project and try and learn more, even if I don’t understand it completely. But there’s just nothing here for me.
This must be what normal people go through when I talk about my lowly web development work…
It's pretty much just an alternative to SQL that's a lot easier/natural to use for more hardcore data analysis.
You can much more easily compose the operations you want to run.
Just think of it as an API for manipulating tabular data stored somewhere (often parquet files, though they can query many different data sources).
Dataframes and SQL have overlapping functionality, but I wouldn't say that dataframes are an "alternative" to SQL. The tradeoffs are very different. You don't have to worry about minimizing disk reads or think about concurrency issues like transactions or locks, because a dataframe is just an in-memory data structure like a list or a dict, rather than a database. Dataframes also aren't really about relational algebra like SQL is.
Have you tried Polars? I agree that if pandas is all you've tried, it's pretty far from an alternative frontend for query engines. But if you've tried Polars, it maps pretty cleanly to SQL and can be optimized into a query plan much like SQL. I should've made it clear that I'm talking about an alternative to SQL in an OLAP context, not OLTP.
Data tables tend to also be a standard ingestion format for statistical tools in many cases.
I'm currently getting dragged into "data" stuff, and I get the impression it's a parallel universe, with its own background and culture. A lot of stuff is like "connect to your Antelope or Meringue instances with the usability of Nincompoop and the performance of ARSE2".
Anyway, probably the interesting things about Polars are that it's like pandas, but implemented in Rust on top of the Arrow columnar memory format (pandas can now use Arrow-backed data too), with something like a query planner that makes combining operations more efficient. Typically doing things in Polars is much more efficient than in pandas, to the extent that things that previously required complicated infrastructure can often be done on a single machine. It's a very friendly competition - though Polars wasn't created by the main pandas developer; it's a separate project by Ritchie Vink, while Arrow was co-created by pandas' original author.
As far as I can tell everybody loves it and it'll probably supplant pandas over time.
I've been using pandas heavily, every day, for something like 8 years now. I've also contributed to it, as well as written numpy extensions. That is to say, I'm fairly familiar with the pandas/numpy ecosystem and its strengths and weaknesses.
Polars is a breath of fresh air. The API of pandas is a mess:
* overuse of polymorphic parameters and return types: functions that accept lists, ndarrays, DataFrames, Series, or scalars, and return differently shaped DataFrames or Series.
* lots of indirection behind layers of trampoline functions that hide the real default values behind undocumented "=None" defaults.
* abuse of half-baked immutable APIs, favoring a "copy everything" style of coding, mixed with half-supported, should-have-been-deprecated in-place variants.
* lots and lots of regressions at every new release, of the worst kind ("oh yeah, we changed the behavior of function X when there are more than Y NaNs over the window").
* very hard to actually know what is delegated to numpy, what is Cython pandas, and what is pure-Python pandas.
* overall, the beast seems to have won against its masters, and the maintainers seem lost as to what to fix versus what to keep backward compatible.
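A concrete instance of the polymorphic-return point, as a small sketch (toy data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Nearly identical-looking indexing returns different types:
print(type(df["a"]))    # pandas Series
print(type(df[["a"]]))  # pandas DataFrame

# Aggregations also change type as you go:
print(type(df.sum()))        # Series (one value per column)
print(type(df.sum().sum()))  # numpy scalar
```

Code consuming these results has to know which shape it got, which is exactly where the subtle bugs come from.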
Polars fixes a lot of these issues, but it has some shortcomings as well. Mainly I found that:
* the API is definitely more consistent, but also more rigid than pandas. Some things can be very verbose to write. It will take some years for nicer simpler "shortcuts" and patterns to emerge.
* The main issue IMHO is Polars' handling of cross-sectional (axis=1) computations. Polars is _very_ time-series (axis=0) oriented, and most cross-sectional computations require transposing the data frame, which is very slow. Pandas has a lot of dedicated axis=1 implementations that avoid a full transposition.
Many axis=1 operations in pandas do a transpose under the hood, mind you. Axis=1 belongs in matrices, not in heterogeneous data. They are a performance footgun. We make the transpose explicit.
Sure, but many others are natively axis=1-aware and avoid full transposition.
I'm not sure I understand what that means. Care to elaborate?
You don't get to only solve the problems that are efficient to solve...
Yes, but when you do mixed time-series / cross-sectional computations, you cannot always untangle both dimensions and transpose once. Sometimes your computation intrinsically interleaves cross-sectional and time-series steps. In these cases, which happen a lot in financial computations, explicitly doing a full transpose is very slow.
All domains seem to have this kind of in-group shorthand, regardless of scale of the community.
In fairness, the title of the page is “Dataframes for the new Era”. The “Get Started” link below the title links to a document that points to the GitHub page, which explains what the library is about to people with data analysis backgrounds: https://github.com/pola-rs/polars
But annoyingly, not the <title>, thus the useless HN headline.
I wish HN had secondary taglines we could use to talk about the actual content or relevance of an article apart from its headline.
Had the exact same thought seeing this. Too many of these websites are missing a simple tldr of what the thing actually is. Great, it's fast, but fast at what?
It has that simple tldr, it's the very first word, "DataFrames". Everyone in this thread just doesn't know what that means, and that's fine, I get that, but seriously, that's the simple summary. Data frames aren't an obscure or esoteric concept in the data analysis space; quite the opposite.
Hard agree. People post links to websites with technical descriptions and little basic info all the time, and this is the first time I'm seeing a thread of people complaining about it. If I'm interested in something I see, I start Googling terms; I don't expect a specification for software in a specific field to cater to my beginner-level knowledge.
I think something like dataframes suffers from having a name that isn't obscure enough. You read "dataframes" and think those are two words you know, so you should understand what it is.
If they'd called them flurzles you wouldn't feel like you should understand if it's not something you work with.
For me, “data frames” are forever associated with MPEG
How come some submissions don't even describe what the thing is, beyond just its name? It's really puzzling that everyone is meant to know what it is from the name alone.
I've mentioned this before and got downvoted because of course everyone is a web dev and knows what xyz random framework (name and version number in the title, nothing else) is.
Marketing is a skill that needs to be learned. You have to put yourself in the shoes of a person who knows nothing about your product. This does not come naturally to the engineers who make these products and are used to talking to other specialists like themselves.
This is true in general but I'm not sure it's what's going on here.
Marketing is also very concerned with understanding who your target audience(s) are and speaking their language.
I think talking about "DataFrames" is exactly that; the target audience of this project knows what that means. What they are interested in is "ok but who cares about data frames? I've been using pandas for like fifteen years", so what you want to tell them is why this is an improvement, how it would help them. Dumbing it down to spend a bunch of space describing what data frames are would just be a distraction. You'd probably lose the target audience before you ever got to the actual benefits of the project.
I don't use dataframes in my day job but have dabbled in them enough that I found this website pretty easy to digest.
You'd really have to be a complete data engineering newbie to not understand it I think?
I mean, where do you draw the line? You wouldn't expect a software tool like this to explain what it is in language my grandma would understand, I don't think?
I do occasionally use Pandas in my day job, but I honestly think very few programmers that could have use for a data frame library would describe themselves as a “data engineer” at all.
In my case, for example, I’m just a physicist - I don’t work with machine learning, big data, or in the software industry at all. I just use Pandas + Seaborn to process the results of numerical simulations and physical experiments similarly to how someone else might use Excel. Works great.
I hate this doc style that has become so popular lately. They get so wrapped up in selling you their story that they forget to tell you basic shit. Like what it is. Or how to install it.
The PMs literally simplified things so much they simplified the product right out of the docs.
It is right there on the page, set to Python by default:
It's fine to me. Tech UIs are often bad and weird, but it's not as if better UX would gain you 5x the customers.
You were right that the page is written for those who know what they are looking for, which is just fine. If you are getting started in DS/ML/etc. and you have used numpy, pandas, etc., Polars is useful in some cases. A simple one: it loads dataframes faster than pandas (from experience with a team I help).
I haven't played with it enough to know all its benefits, but yes, it's the next logical step if you are in this space using the above-mentioned libraries; it's something one will find.
pandas dataframes but faster
Right... but the title before the first line reads "DataFrames for the new era". If you don't know what a data frame is then, yes, it's for people who already know that.
It’s not written for you and that’s fine. This is a library targeted at a very specific subset of people and you’re not in it.