That's an interesting project. As someone who relies heavily on collaboration with people using Jupyter Notebook, I find the most annoying parts of reproducing their work are the environment and the hidden state of Jupyter notebooks.
This seems to address the second problem directly. It does so, however, by sacrificing flexibility: I might need to change a cell just to test something new (without affecting the other cells), but that's a trade-off if you focus on reproducibility.
I know that requirements.txt is the standard solution to the other problem, but generating and using it is annoying. The command pip freeze lists all the packages in a bloated way (there are better ways), and I have always hoped to find a notebook system that integrates this information natively, with a way to embed it into a notebook in a form I can share with other people. Unfortunately, I don't see support for anything like that in any of the available solutions (at least to my knowledge).
Yes, packages are certainly the second half of reproducibility. A solution for reproducible environments is on our roadmap (https://marimo-team.notion.site/The-marimo-roadmap-e5460b9f2...), but we haven't quite figured it out yet.
It's a bit challenging because Python has so many different solutions for package management. If you have any ideas we'd love to hear them.
People always complain about pip and Python packaging, but it's never been an issue for me. I create a requirements.base.txt that has the pinned versions of the packages I directly want installed. I then install from that base file and freeze the fully resolved environment into requirements.txt.
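A minimal sketch of those two steps (the file names come from this comment; the exact flags and redirection are my assumption):

    # install the hand-pinned direct dependencies
    pip install -r requirements.base.txt

    # lock the fully resolved environment, sub-dependencies included
    pip freeze > requirements.txt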
Install is then simply a single pip install against the frozen file. Updating / installing something new is a matter of adding to the base file and then refreezing.

There are several problems with this approach. Notably, you don't get information about platform-specific packages, and you don't get information about how those packages were installed (conda, mamba, etc.).
And it does not account for dependency version conflicts, which make life very hard.
I don't understand the platform thing. Is that something to do with running on Windows? Why wouldn't you just pip install? Why bring conda etc. into the mix?
If you have conflicts then you have to reconcile those at the point of initial install - pip deals with that for you. I've never had a situation in 15 years of using Python packages where there wasn't a working combination of versions.
These are genuine questions btw. I see these common complaints and wonder how I’ve not ever had issues with it.
I will try to summarize the complaints (mine, at least) as a few simple points:
1. pip freeze will miss packages not installed by pip (e.g., by conda).
2. It includes all installed packages, even ones not used in the project.
3. It just dumps all packages, their dependencies, and their sub-dependencies. Even without conflicts, if you change a package it is very hard to keep track of the dependencies and sub-dependencies that need to be removed; at some point your file will be a hot mess (see the toy example after this list).
4. If you install a platform-specific package version, that information is not tracked.
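A toy example for point 3 (the package names are real, the versions illustrative): you ask for one package, but the freeze output pins it plus every sub-dependency, with nothing recording which entries you actually asked for.

    pip install requests
    pip freeze
    # certifi==2024.2.2
    # charset-normalizer==3.3.2
    # idna==3.6
    # requests==2.31.0
    # urllib3==2.2.1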
Ok. I think that’s all handled by my workflow, but it does involve taking responsibility for requirements files.
If I want to install something, I pip install it and then add the explicit version to the base file. I can then freeze the current state to requirements.txt to lock in all the sub-dependencies.
It's a bit manual (though you only need a couple of CLI commands, sketched below) but it's simple and robust.
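In concrete terms, something like this (the package name and version are hypothetical):

    # try the new package out
    pip install requests

    # pin the version you ended up with as a direct dependency
    echo "requests==2.31.0" >> requirements.base.txt

    # re-freeze to lock in all sub-dependencies again
    pip freeze > requirements.txt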
I don't think that manual handling of requirements.txt is a robust process in a collaborative environment. It would be a waste of time and resources to handle it like that. And I don't know about your workflow, but it is obviously not standard, and it does not address the first and fourth points.
Haha. Ok. I think that’s where we’re just going to have to agree to disagree.
This is my workflow too. And it works fine. I think the disconnect here is that I grew up fighting dependencies when compiling other programs from source on Linux. I know how painful it can be and I've accepted the pain, so when I came to Python/venv I thought "This isn't so bad!"
But if someone is coming from data science rather than dev-ops, then no matter how much we say "all you have to do is...", the response will be "why do I have to do any of this?"
Can you name a package manager (any language) that handles #3 well?
How does it handle the problem?
Problems 1 and 2 can be solved by using a virtualenv/venv per project (see the sketch after this comment).
3 is solved by the workflow of manually adding direct requirements and not including their dependencies. It may not work for everyone; something like pipreqs might work for many people.
I do not understand why 4 is such a problem. Can you explain further?
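For points 1 and 2, the per-project environment looks roughly like this (standard venv commands, nothing project-specific assumed):

    # one virtual environment per project keeps pip's view scoped to it
    python -m venv .venv
    source .venv/bin/activate        # on Windows: .venv\Scripts\activate
    pip install -r requirements.txt
    pip freeze                       # now lists only this project's packages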
1/4: Ordinary `pip install` also works for binary/platform-specific wheels (e.g., numpy) and even non-Python utilities (e.g., shellcheck-py).
2/3: You need to track only the direct dependencies _manually_, but for reproducible deployments you need fixed versions for all dependencies. The latter is easy to generate _automatically_ (`pip freeze`, pip-tools, pipenv/poetry/etc.; a pip-tools sketch follows below).
Yes, there are more problems with Windows.
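Of those, pip-tools is probably the clearest illustration of the manual/automatic split; a minimal sketch (the contents of requirements.in are hypothetical):

    # requirements.in is the hand-maintained list of direct dependencies:
    #     numpy
    #     requests
    pip install pip-tools
    pip-compile requirements.in      # writes requirements.txt with every version pinned
    pip-sync requirements.txt        # makes the environment match the lock exactly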
Poetry handles all of this properly.
Just not PyTorch, apparently.
I regularly observe it stalling at the dependency resolution stage after changing the version requirements for one of the packages (or the Python version requirement).
I follow a similar approach -- top-level dependencies in pyproject.toml and then a pip freeze to get a reproducible set for applications. I know there are edge cases but this has worked really well for me for a decade without much churn in my process (other than migrating from setup.py to setup.cfg to pyproject.toml).
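A minimal sketch of that layout, assuming a PEP 621 pyproject.toml (the project name and dependencies are placeholders):

    # hand-maintained top-level dependencies live in pyproject.toml
    cat > pyproject.toml <<'EOF'
    [build-system]
    requires = ["setuptools>=61"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "my-analysis"
    version = "0.1.0"
    dependencies = ["numpy>=1.24", "pandas>=2.0"]
    EOF

    pip install .                    # resolve and install the direct dependencies
    pip freeze > requirements.txt    # lock the reproducible set for deployment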
After trying to migrate everything to pipenv and then getting burned, I went back to this and can't imagine I'll use another third-party packaging project (other than nix) for the foreseeable future.
The post you’re responding to said that there are many Python packaging options, not that they don’t work. Pip freeze works reasonably well for a lot of situations but that doesn’t necessarily mean it’s the best option for their notebook tool, especially if they want to attract users who are used to conda.
The link redirects without specifying which point in the list you are referring to, but I guess it is "Install missing packages from...". If so, I really wonder whether you mean supporting something like '!pip install numpy' as Jupyter does, or something else.
I don't think that is really a solution, not to mention that it raises the question: does it support running shell commands using '!' like Jupyter Notebook?
Oh, sorry for not being more clear. That's not the one. It's "Package management: make notebooks reproducible down to the packages they use": https://marimo-team.notion.site/840c475fd7ca4e3a8c6f20c86fce...
Does that align with what you're talking about?
That page has some scrawled brainstorming notes, but we haven't spent time designing a solution yet.
Thanks. That is precisely what I was talking about in my comment. It would solve the problem if we had something like that integrated natively. I understand that between pip, conda, mamba, and all the others it would be a hard problem to solve, but at least auto-generating requirements.txt would be easier. To be honest, though, the hard part is identifying packages and where they came from, not what to do with that information. Good luck with the development.
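On the "identifying packages" part, pipreqs (mentioned upthread) is one existing attempt: it scans a project's import statements instead of dumping the whole environment. A rough sketch of its use (the path is a placeholder):

    pip install pipreqs
    pipreqs /path/to/project         # writes requirements.txt from the imports actually used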
The third half is data, which only exists on your machine :P
And even if it's on some shared storage, it may have been generated by another unreproducible notebook or, worse, manually.
Nix is the only solution for reproducible environments that I would call rock-solid.
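For illustration, the lowest-friction entry point is an ad-hoc shell (a hedged sketch; truly reproducible setups also pin the nixpkgs revision, which this omits):

    # drop into a shell with Python and numpy provided by nixpkgs
    nix-shell -p python311 python311Packages.numpy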
It comes with costs, and the GPU-related stuff is especially tricky, e.g. https://www.canva.dev/blog/engineering/supporting-gpu-accele...