
NumPy 2.0

dahart
22 replies
19h4m

The thing I want most is a more sane and more memorable way to compose non-element-wise operations. There are so many different ways to build views and multiply arrays that I can’t remember them and never know which to use, and have to relearn them every time I use numpy… broadcasting, padding, repeating, slicing, stacking, transposing, outers, inners, dots of all sorts, and half the stack overflow answers lead to the most confusing pickaxe of all: einsum. Am I alone? I love numpy, but every time I reach for it I somehow get stuck for hours on what ought to be really simple indexing problems.

salamo
4 replies
17h51m

When I started out I was basically stumbling around for code that worked. Things got a lot easier for me once I sat down and actually understood broadcasting.

The rules are: 1) scalars always broadcast, 2) if one array has fewer dimensions, left-pad its shape with 1s, and 3) starting from the right, check dimension compatibility, where compatible means the dimensions are equal or one of them is 1. Example: np.ones((2,3,1)) * np.ones((1,4)) = np.ones((2,3,4))
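
The same example in code, plus the failure case (a quick sketch of rules 2 and 3):

  import numpy as np

  a = np.ones((2, 3, 1))
  b = np.ones((1, 4))     # left-padded to (1, 1, 4) for the comparison
  c = a * b               # right to left: 1 vs 4, 3 vs 1, 2 vs 1 -> all compatible
  print(c.shape)          # (2, 3, 4)

  # np.ones((2, 3)) * b   # would raise: trailing dims 3 vs 4 are incompatible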

Once your dimensions are correct, it's a lot easier to reason your way through a problem, similar to how basic dimensional analysis in physics can verify your answer makes some sense.

(I would disable broadcasting if I could, since it has caused way too many silent bugs in my experience. JAX can, but I don't feel like learning another library to do this.)

Once I understood broadcasting, it was a lot easier to practice vectorizing basic algorithms.

o11c
2 replies
14h19m

Explicit internal broadcasting (by adding an axis with an explicit `indefinite` size instead of `1`) would be so much simpler to reason about.

Unfortunately there is far too much existing code and python is not type-safe.

bramblerose
1 replies
11h31m

You can do this with `np.newaxis` - in the NumPy course I wrote as a TA, we required students to always be explicit about the axes (also in e.g. sums). It would be nice if you could disable implicit broadcasting, but as you mention that would break so much code.
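
A small sketch of what I mean by being explicit (the variable names are just for illustration):

  import numpy as np

  weights = np.ones(4)        # shape (4,)
  data = np.ones((3, 4))      # shape (3, 4)

  # implicit: relies on right-aligned broadcasting
  scaled_implicit = data * weights

  # explicit: insert the length-1 axis yourself
  scaled_explicit = data * weights[np.newaxis, :]  # weights now has shape (1, 4)

  # and being explicit about the axis in reductions too
  totals = scaled_explicit.sum(axis=1)             # shape (3,)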

o11c
0 replies
3h43m

`np.newaxis` explicitly adds a `1` dimension, not an `indefinite` one.

nerdponx
0 replies
16h2m

The broadcasting doc is surprisingly readable and easy to follow. And the rules are surprisingly simple. The diagrams and examples are excellent. https://numpy.org/doc/stable/user/basics.broadcasting.html

After taking the time to work through that doc and ponder some real-world examples, I went from being very confused by broadcasting to employing intermediate broadcasting techniques in a matter of weeks. Writing out your array dimensions in the same style of their examples (either in a text file or on a notepad) is the key technique IMO:

  Image  (3d array): 256 x 256 x 3
  Scale  (1d array):             3
  Result (3d array): 256 x 256 x 3
And of course with practice you can do it in your head.
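
The same worked example in code (made-up scale values):

  import numpy as np

  image = np.zeros((256, 256, 3))    # Image  (3d array): 256 x 256 x 3
  scale = np.array([0.5, 1.0, 2.0])  # Scale  (1d array):             3
  result = image * scale             # Result (3d array): 256 x 256 x 3
  print(result.shape)                # (256, 256, 3)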

nl
2 replies
13h25m

ChatGPT is really good at this. Using it to solve numpy and matplotlib problems is worth the cost of the subscription.

esafak
1 replies
12h30m

What a waste of electricity. All because numpy and pandas did not get their APIs right. My hot take.

CamperBob2
0 replies
2h15m

I'd say you're both right, based on my limited experience. I can't seem to do much in numpy or pandas without resorting to ChatGPT or a notebook-based assistant. I assume that need will go away with additional experience, but maybe that's being overoptimistic.

I wish ChatGPT had been around when I learned C. It would sure have saved the programmers in my neighboring offices a lot of grief.

mFixman
2 replies
8h28m

Most of the bugs I've hit in numpy programs came from a variable with a different ndim than expected being broadcast implicitly.

Implicit type casting is considered a mistake in most programming languages; if I were to redesign numpy from scratch I would make all broadcasting explicit.

My solution to these problems is asserting an array's shape often. Does anybody know if there's a tool like mypy or valgrind, but one that checks mismatched array shapes rather than types or memory leaks?
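
What I mean by asserting shapes is nothing fancier than this (a minimal sketch; the function and dimension names are just an example):

  import numpy as np

  def normalize(batch: np.ndarray) -> np.ndarray:
      # cheap runtime check instead of a silent, wrong broadcast
      assert batch.ndim == 2, f"expected (n, features), got shape {batch.shape}"
      mean = batch.mean(axis=0, keepdims=True)   # (1, features)
      std = batch.std(axis=0, keepdims=True)     # (1, features)
      return (batch - mean) / std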

cozzyd
1 replies
17m

numpy's API was carefully designed to match matlab, where most of its early users came from.

mFixman
0 replies
12m

Then we are lucky they didn't decide to start their arrays at 1 :-)

notrealyme123
0 replies
10h7m

+1. I was at your talk presenting it at a retreat. Happy user.

akasakahakada
2 replies
18h56m

To be honest, einsum is the easiest one. You get fine control over which axis gets matmul'd with which. But I wish it could do more than matmul.
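
For example, a rough sketch (shapes picked arbitrarily):

  import numpy as np

  a = np.random.rand(2, 3, 4)
  b = np.random.rand(2, 4, 5)

  # batched matmul: contract the last axis of a with the middle axis of b
  c = np.einsum('bij,bjk->bik', a, b)   # shape (2, 3, 5)

  # same contraction, but with the result axes reordered
  d = np.einsum('bij,bjk->bki', a, b)   # shape (2, 5, 3)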

The others are just messy shit. Like you got np.abs but no arr.abs, np.unique but no arr.unique. But now you have arr.mean.

Sometimes the argument is named index, sometimes indices; sometimes it accepts a list or tuple, sometimes only a tuple.

samsartor
1 replies
18h26m

https://github.com/mcabbott/Tullio.jl is my favorite idea for extending einsum notation to more stuff. Really hope numpy/torch get something comparable!

enkursigilo
0 replies
8h35m

Yeah, Tullio.jl is a great package. Basically, it is just a macro for generating efficient for loops.

I guess it might be hard to achieve a similar feature in Python without metaprogramming.

hansvm
1 replies
13h54m

It gets more comfortable over time, but I remember feeling that way for the first year or three. My wishlist now is for most of numpy to just be a really great einsum implementation, along with a few analogous operations for the rest of the map-reduces numpy accelerates.

I've been writing my own low-level numeric routines lately, so I'm not up-to-date on the latest news, but there have been a few ideas floating around over the last few years about naming your axes and defining operations in terms of those names [0,1,2,3]. That sort of thing looks promising to me, and one of those projects might be a better conceptual fit for you.

[0] https://nlp.seas.harvard.edu/NamedTensor

[1] https://pypi.org/project/named-arrays/

[2] https://pytorch.org/docs/stable/named_tensor.html

[3] https://docs.xarray.dev/en/stable/

mgunyho
0 replies
10h26m

I want to second the idea of broadcasting based on named axes/dimensions. I think it's a logical next step in the evolution of the array programming paradigm.

I particularly recommend checking out xarray. It has made my numpy-ish code like 90% shorter and it makes it trivial to juggle six+ dimensional arrays. If your data is on a grid (not shaped like a table/dataframe), I see no downsides to using xarray instead of bare numpy.
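
A small sketch of what that looks like (the dimension names and values are just an example):

  import numpy as np
  import xarray as xr

  temps = xr.DataArray(np.random.rand(3, 4), dims=("station", "time"))
  weights = xr.DataArray(np.array([0.1, 0.2, 0.3]), dims=("station",))

  # broadcasting aligns by dimension name, not by axis position
  weighted = temps * weights              # still (station, time)
  mean_over_time = weighted.mean(dim="time")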

SubiculumCode
1 replies
17h25m

Is this the dplyr use case in R? Is there a dplyPython for NumPy?

nerdponx
0 replies
16h35m

You mean the magrittr pipe operator %>%, and the built-in |> operator added in R 4.1?

Pandas has a .pipe(fn) method, but without the lazy evaluation to enable the R symbol capturing magic, the syntax is pretty clunky and not particularly useful. The closest approximation is method chaining, which at least is more consistently available in Pandas than in Numpy.

If you're talking about Dplyr "verbs" then no, there's nothing quite like that in Python, but it's much less necessary in Pandas or Polars than in R, because the set of standard tools for working with data frames in the bear libraries is much richer than in the R standard library.
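
A rough sketch of the chaining style, with .pipe() slotting a plain function into the chain (column names made up):

  import pandas as pd

  def add_total(df: pd.DataFrame) -> pd.DataFrame:
      return df.assign(total=df["price"] * df["qty"])

  df = pd.DataFrame({"price": [1.0, 2.5], "qty": [3, 4]})

  result = (
      df
      .pipe(add_total)          # plain function dropped into the chain
      .query("total > 5")
      .sort_values("total")
  )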

begueradj
0 replies
9h36m

What do you mean by: "more memorable way to ..." ?

fbdab103
17 replies
21h43m

Any notable highlights for a consumer of Numpy who rarely interfaces directly with it? Most of my work is pandas+scipy, occasionally dropping down into a specific numpy algorithm when required.

I am much more of an "upgrade when there is an X.1 release" kind of guy, so hats off to those who will bravely be testing this version on my behalf.

amelius
8 replies
20h44m

One new interesting feature, though, is the support for string routines

Sounds almost like they're building a language inside a language.

ssahoo
6 replies
20h20m

No. Native Python string ops suck in performance. String support is absolutely interesting and will enable abstractions for many NLP and LLM use cases without writing native C extensions.

ayhanfuat
4 replies
19h6m

Native Python string ops suck in performance.

That’s not true? Python's string implementation is very optimized; it probably has performance similar to C.

mirashii
1 replies
14h18m

It is absolutely true that there is a massive amount of room for performance improvement in Python strings, and that performance is generally subpar due to implementation decisions/restrictions.

Strings are immutable, so there's no efficient truncation, concatenation, or modification of any kind; you're always reallocating.

There's no native support for a view of a string, so operations like iterating over windows or ranges have to allocate, or throw away all the string abstractions.

By nature of how the interpreter stores objects, strings are always going to have an extra level of indirection compared to what you can do in a language like C.

Python strings have multiple potential underlying representations, and thus have some overhead for managing and dealing with those representations without exposing the details to user code.

Too
0 replies
13h34m

There is a built-in memoryview, but it only works on bytes and other objects supporting the buffer protocol, not on strings.

bvrmn
0 replies
9h43m

For numpy applications you always have to box a value to get a new Python string. That's quite far from fast.

topper-123
0 replies
19h29m

Yeah, operating on strings has historically been a major weak point of Numpy's. I'm looking forward to seeing benchmarks for the new implementation.

nerdponx
0 replies
16h4m

It's already very much a DSL, and has been for the decade-ish that I've used it.

They're not building a language. They're carefully adding a newly-in-demand feature to a mature, already-built language.

brcmthrowaway
1 replies
19h22m

Does numpy use GPU?

ahurmazda
0 replies
19h7m

No.

You may want to check out cupy

https://cupy.dev/

ahurmazda
1 replies
19h5m

This one will be rough :|

arange’s start argument is positional-only
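
If I'm reading that right, it presumably means this kind of call stops working:

  import numpy as np

  np.arange(0, 10, 2)                  # fine
  np.arange(start=0, stop=10, step=2)  # presumably a TypeError under 2.0
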
nerdponx
1 replies
15h54m

As a more or less daily user, I was surprised at how not-breaking the 2.0 changes will be for 90% of Numpy users. Unless their dependencies/environments break, I expect that casual users won't even notice the upgrade.

Even the new string dtype I expect would go unnoticed by half of users or more, because they won't be using it (because Numpy historically only had fixed-length strings and generally poor support for them) and so won't even think to try it. Pandas meanwhile has had a proper string dtype for a while, so anyone interested in doing serious work on strings in data frames / arrays would presumably be using Pandas anyway.
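
For the curious, it's opt-in; something like this is how you'd reach it (assuming I have the spelling right):

  import numpy as np

  # opting in to the new variable-width string dtype
  arr = np.array(["numpy", "2.0", "strings"], dtype=np.dtypes.StringDType())
  print(arr.dtype)       # StringDType()
  print(arr == "numpy")  # [ True False False]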

Most of the breaking changes are in long-deprecated oddball functions that I literally have never seen used in the wild, and in the internal parts that will be a headache for library developers.

The only change that a casual user might actually notice is the change in repr(np.float64(3.0)), from "3.0" to "np.float64(3.0)".

cozzyd
0 replies
14h19m

I suspect the C ABI break to be the biggest issue, though maybe fewer packages than I imagine compile against the numpy C ABI...

notatoad
13 replies
19h8m

it feels like the first major release in 18 years which introduces lots of breaking changes should just be a fork rather than a version.

let me do `pip install numpy2` and not have to worry about whether or not some other library in my project requires numpy<2.

vessenes
6 replies
17h0m

From a consumer (developer consumer) point of view, I hear you.

From a project point of view, there are some pretty strong contra-indicators in the last 20 years of language development that make this plan suspect, or at least pretty scary — both Perl and Python had extremely rocky transitions around major versions; Perl’s ultimately failing and Python’s ultimately taking like 10 years. At least. I think the last time I needed Python 2 for something was a few months ago, and before that it had been a year or so. I’ve never needed Perl 6, but if I did I would be forced to read a lot of history while I downloaded and figured out which, if any, Perl 5 modules I’m looking for got ported.

I’d imagine the numpy devs probably don’t have the resources to support what would certainly become two competing forks that each have communities with their own needs.

sestep
2 replies
15h5m

Could you elaborate further on this? Is it not the case that they'll still need to support numpy 1.x for at least the near-term future? My understanding was the parent comment was specifically talking about the technical difficulty of multiple versions for the same Python package, not the social problem of project and community management across breaking changes.

vessenes
0 replies
12h37m

Yes, agreed they’ll need to support 1.x for (probably) quite a while, depending on API and interface changes between major versions.

My point, or at least the point I had in mind, was that the social and technical go together in a lot of subtle and sometimes surprising ways; in this case, I’d bet the idea of a second package name a) is a bad one because it’s likely to create differing community expectations about whether or not it’s okay to keep using the 1.0 package, and b) would let people feel okay not upgrading for a while / demanding security and bug fix point releases on 1.x longer than if the package itself just updates its major version.

rovr138
0 replies
13h31m

For the near-term future, they can use the latest 1.x version.

librasteve
2 replies
10h21m

I think it’s fair to say that perl6 was an “extremely rocky transition”; ultimately it was renamed raku to reflect this and to avoid camping on the perl5 version numbering.

raku has good package compatibility via Inline::Perl5 and Inline::Python, and FFI to languages like Rust and Zig.

Among the many downsides of the transition, one upside is that raku is a clean sheet of paper and has some interesting new work, for example in LLM support.

I have started work on a new raku module called Dan::Polars and would welcome contributions from Numpy/Pandas folks with a vision of how to improve the APIs and abstractions … it’s a good place to make a real contribution, help make something new and better and get to grips with some raku and some rust.

just connect via https://github.com/librasteve/raku-Dan-Polars if you are interested and would like to know more

vessenes
1 replies
4h0m

Thanks for this — Perl 4 was my first serious programming language, and every five or so years I come back and check out what’s up. Seems like a quick raku tour could be fun.

One huge pain point for me in perl 5 was just how incredibly slow CPAN was compared to `go import`, like two orders of magnitude slower. I remember putting up with those times in the ‘90s because package management was a kind of miracle over FTP sites, but it’s a big ask in today’s world.

What’s raku’s story here, out of curiosity?

librasteve
0 replies
41m

(I suggest you start with https://raku.guide (like the Rust Book) and also take a look at https://docs.raku.org/language/5to6-nutshell)

the raku package manager - zef comes bundled with the rakudo compiler - I use https://rakubrew.org

https://raku.land is a directory of raku packages

I would say that zef is very good (it avoids the frustrations of Python package managers like pip and conda). Like perl before it, raku was designed with packages and installers in mind, with a concern for a healthy ecosystem.

For example, all modules (via the META6.json payload descriptor) carry versioning, and the module version descriptor is a built-in language type https://docs.raku.org/type/Version that does stuff like this:

  say v1.0.1 ~~ v1.*.1;   # OUTPUT: «True␤»
and this

  zef install Dan::Pandas:ver<4.2.3>:auth<github:jane>:api<1>
and this

  use Dan::Pandas:ver<4.3.2+>:auth<github:jane>:api<1>;
(of course, authors must authenticate to upload modules)

make3
2 replies
18h59m

Knowing how careful the NumPy devs are, this was likely a very well-pondered decision, and all of these deprecations have likely been announced for a long time. Seeing knee-jerk reactions like this is annoying.

kaashif
1 replies
18h3m

Do you have a link to the discussion where this was very well pondered? I can't find anything, but I'm very interested in that kind of discussion.

theamk
0 replies
1h38m

That was my first reaction as well, but apparently, as far as Python goes, numpy 2 code is fully compatible with numpy 1 code [0], with the exception of a single function, "byte_bounds" (which sounds super rare, so I doubt it'd be a problem).

So at least the migration path for python modules is clear: upgrade to be numpy 2 compatible, wait for critical mass, start adding numpy 2 features. Sounds way better than python2 -> python3 migration, for example.

However, the fact that I had to look at a 3rd-party page to find this out is IMHO a big documentation problem. It should be plastered on all announcements and on the documentation and migration pages: "there is a common subset of python code for numpy 1 and 2, so you can upgrade now, no need to wait for full adoption".

[0] https://docs.astral.sh/ruff/rules/numpy2-deprecation/

patrick451
0 replies
13h55m

A project with both numpy arrays and numpy2 arrays getting mixed together sounds like a disaster to me.

aphexairlines
0 replies
11h27m

If your requirements.in referenced numpy before this release, then doesn't your requirements.txt already reference a specific 1.x version?

chmaynard
1 replies
20h41m

This is a draft. No release notes yet.

wartijn_
0 replies
18h25m

The GitHub release seems to have the final notes; at least the placeholder text has been replaced:

It is the result of 11 months of development since the last feature release and is the work of 212 contributors spread over 1078 pull requests

instead of:

It is the result of X months of development since the last feature release by Y contributors

https://github.com/numpy/numpy/releases/tag/v2.0.0

darepublic
1 replies
2h8m

I would love for numpy to be ported as a typescript project, personally, so I can do ML in TS. The python ecosystem feels a bit insane to me (more so than the js one). Venv helps but is still inferior to a half-decent npm project imo. I feel there is no strict reason why this migration couldn't happen, only the inertia that makes it unlikely.

elialbert
0 replies
1h18m

who's stopping you

RandomBK
1 replies
13h57m

I'm starting to see some packages break due to not pinning 1.x in their dependencies. `pip install numpy==1.*` is a quick and hacky way to work around those issues until the ecosystem catches up.

globular-toast
0 replies
10h43m

"numpy~=1.0”

Is this not common knowledge? Also, pip install? Or do you mean some requirements file?

tpoacher
0 replies
6h22m

I wish numpy pushed their structured arrays (and thereby also improvements to their interface) more aggressively.

Most people are simply unaware of them, which is why we get stuff like pandas on top of everything.
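
For anyone who hasn't seen them, a minimal sketch (field names and values made up):

  import numpy as np

  # a structured array: named, typed fields without reaching for pandas
  people = np.array(
      [("alice", 30, 55.0), ("bob", 25, 70.5)],
      dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
  )
  print(people["age"].mean())                   # 27.5
  print(people[people["weight"] > 60]["name"])  # ['bob']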

fertrevino
0 replies
8h45m

So apparently this is what broke my CI job since it was indirectly installed. One of the downsides of using loose version locking with requirements.txt rather than something like poetry I guess.

ayhanfuat
0 replies
18h48m

The default integer type on Windows is now int64 rather than int32, matching the behavior on other platforms

This was a footgun due to C long being int32 in win64. Glad that they changed it.
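
A sketch of the old behaviour, assuming 64-bit Windows on NumPy 1.x:

  import numpy as np

  idx = np.arange(3)
  print(idx.dtype)
  # NumPy 1.x on 64-bit Windows: int32 (the C long), so large products could wrap silently
  # NumPy 2.0 on every platform: int64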

antonoo
0 replies
12h47m

X months of work by Y contributors?

Makes it look like they pressed publish before filling in their template, or is this on purpose?

Kalanos
0 replies
7h10m

What are the implications of the new stringdtype? If I remember correctly, string performance was a big part of the pandas switch to arrow.