"Shamelessly stole the title from a hero of mine". Your Shamelessness is all fine. But at first I thought this is a post from Andrej Karpathy. He has one of the best personal brands out there on the internet, while personal brands can't be enforced, this confused me at first.
I spent three semesters in college learning RL only to be massively disappointed in the end after discovering that the latest and greatest RL techniques can’t even beat a simple heuristic in Tetris.
I modeled part of my company's business problem as a MAB problem and saved my company 10% on its biggest cost and, just as important, showcased an automated truth signal that helped us understand what was, and wasn't, working in several of our features. Like all tools, finding the right place to use RL concepts is a big deal. I think one thing that is often missed in a classroom setting is pushing more real-world examples of where powerful ideas can be used. Talking about optimal policies is great, but if you don't help people understand where those ideas can be applied then it is just a bunch of fun math (which is often a good enough reason on its own :)
For those not in the know, "MAB" is short for Multi-Armed Bandit [1], which is a decision-making framework that is often discussed in the broader context of reinforcement learning.
In my limited understanding, MAB problems are simpler than those tackled by Deep Reinforcement Learning (DRL), because typically there is no state involved in bandit problems. However, I have no idea about their scale in practical applications, and would love to know more about said business problem.
There are often times when you have n possible providers of service y, each with strengths and weaknesses. If you have some ultimate truth signal (like follow-on costs which are linked to quality, which was what I used) then you can model the providers as bandits and use something like UCB1 to choose which one to use. If you then apply this to every individual customer, what you end up doing is learning the optimal vendor for each customer, which gives you higher efficiency than had you picked just one 'best all around' vendor for all customers. So the pattern here is: if you have n_service_providers and n_customers and a value signal to optimize, then maybe MAB is the place to go for some possible quick gains. Of course, if you have a huge state space to explore instead of just n_service_providers, for instance you want to model combinations of choices, using something like an NN to learn the state-space value function is also a great way to go.
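A minimal sketch of that per-customer UCB1 pattern (the vendor names, reward scale, and bookkeeping here are all made up for illustration, not my actual system):

```python
import math
import random
from collections import defaultdict

# Hypothetical providers and a reward signal derived from the truth signal
# (e.g., lower follow-on cost -> higher reward). All names are illustrative.
PROVIDERS = ["vendor_a", "vendor_b", "vendor_c"]

# Per-customer UCB1 state: pull counts and running mean rewards per provider.
counts = defaultdict(lambda: {p: 0 for p in PROVIDERS})
means = defaultdict(lambda: {p: 0.0 for p in PROVIDERS})

def choose_provider(customer_id):
    """Pick a provider for this customer using the UCB1 rule."""
    c, m = counts[customer_id], means[customer_id]
    # Try every provider at least once before applying the UCB1 formula.
    for p in PROVIDERS:
        if c[p] == 0:
            return p
    total = sum(c.values())
    # UCB1: mean reward plus an exploration bonus that shrinks with pulls.
    return max(PROVIDERS, key=lambda p: m[p] + math.sqrt(2 * math.log(total) / c[p]))

def record_outcome(customer_id, provider, reward):
    """Update the running mean reward once the truth signal comes back."""
    c, m = counts[customer_id], means[customer_id]
    c[provider] += 1
    m[provider] += (reward - m[provider]) / c[provider]

# Toy usage: one simulated customer whose best provider is vendor_b.
true_quality = {"vendor_a": 0.4, "vendor_b": 0.7, "vendor_c": 0.5}
for _ in range(1000):
    p = choose_provider("customer_42")
    record_outcome("customer_42", p, 1.0 if random.random() < true_quality[p] else 0.0)
print(counts["customer_42"])  # vendor_b should dominate the pulls
```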
RL seems to be in this weird middle ground right now where nobody knows how to make it work all that well but almost everybody at the top levels of ML research agrees it's a vital component of further advances in AI.
RL can be massively disappointing, indeed. And I agree with you (and with the amazing post I already referenced [1]) that it is hard to get it to work at all. Sorry to hear you have been disappointed so much!
Nonetheless, I would personally recommend at least learning the basics and fundamentals of RL. Beyond supervised, unsupervised, and the more recent and well-deservedly hyped self-supervised learning (generative AI, LLMs, and so on), reinforcement learning models the learning problem in a very elegant way: an agent interacting with an environment and getting feedback. Which is, arguably, a very intuitive and natural way of modeling it. You could consider backward error correction / propagation as an implicit reward signal, but that would be a very limited view.
On a positive note, RL has very practical, successful applications today - even if in niche fields. For example, LLM fine-tuning techniques like RLHF successfully apply RL to modern AI systems, companies like Covariant are working on large robotics models which definitely use RL, and generally as a research field I believe (but I may be proven wrong!) there is so much more to explore. For example, check out Nvidia Eureka, which combines LLMs with RL [2]: pretty cool stuff IMHO!
Far from attempting to convince you of the strength and capabilities of DRL, I'm just recommending that folks not discard it right away and at least give the basics a chance, even just as an intellectual exercise :) Thanks again!
While trying to learn the latest in Deep Reinforcement Learning, I was able to take advantage of many excellent resources (see credits [1]), but I couldn't find one that provided the right balance between theory and practice for my personal learning style. So I decided to create something myself, and open-source it for the community, in case it might be useful to someone else.
None of that would have been possible without all the resources listed in [1], but I rewrote all algorithms in this series of Python notebooks from scratch, with a "pedagogical approach" in mind. It is a hands-on, step-by-step tutorial about Deep Reinforcement Learning techniques (up to ~2018/2019 SoTA), guiding you through theory and coding exercises on the most widely used algorithms (Q-Learning, DQN, SAC, PPO, etc.)
I shamelessly stole the title from a hero of mine, Andrej Karpathy, and his "Neural Networks: Zero To Hero" [2] work. I also meant to work on a series of YouTube videos, but haven't had the time yet. If this post gets any kind of interest, I might go back to it. Thank you.
P.S.: A friend of mine suggested I post here, so I followed their advice: this is my first post, and I hope it properly abides by the rules of the community.
[1] https://github.com/alessiodm/drl-zh/blob/main/00_Intro.ipynb [2] https://karpathy.ai/zero-to-hero.html
very cool, thanks for putting this together
It would be great to see a page dedicated to SoTA techniques & results
Thank you so much! And very good advice: I have an extremely brief and not very descriptive list in the "Next" notebook, initially intended for that. But it definitely falls short.
I may actually expand it into a second, "more advanced" series of notebooks, to explore model-based RL, curiosity, and other recent topics: even if not comprehensive, some hands-on basic coding exercises on those topics might be of interest nonetheless.
Does it rely heavily on python, or could someone use a different language to go through the material?
Yes, the material relies heavily on Python. I intentionally used popular open-source libraries (such as Gymnasium for RL environments, and PyTorch for deep learning) and Python itself given their popularity in the field, so that the content and learnings could be readily applicable to real-world projects.
The theory and algorithms per se are general: they can be re-implemented in any language, as long as there are comparable libraries to use. But the notebooks are primarily in Python, and the (attempted) "frictionless" learning experience would lose a bit if the setup is in a different language; it'll likely take a little more effort to follow along.
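For reference, this is the kind of Gymnasium interaction loop the notebooks build on (an illustrative snippet with a random policy on CartPole, not taken from the repo itself):

```python
import gymnasium as gym

# A random-policy rollout on CartPole, just to show the basic
# agent/environment interaction that the algorithms plug into.
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

for _ in range(200):
    action = env.action_space.sample()  # replace with a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```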
In case you want to expand to more chapters one day: there are lots of tutorials covering the simple things that have been verified to work, but when I'm struggling it's normally with something people barely ever mention - what to do when things go wrong. For example, your actions just consistently get stuck at the maximum. Or exploration doesn't kick in, regardless of how noisy you make the off-policy training. Or ...
I wish there were more practical resources for when you've got the basics usually working, but suddenly get issues nobody really talks about. (beyond "just tweak some stuff until it works" anyway)
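To be concrete, the kind of practical check I have in mind (a hypothetical sketch, not something from any particular resource) is just logging how saturated the policy's actions are:

```python
import numpy as np

# One simple diagnostic for the "actions stuck at the maximum" failure mode:
# periodically report how much of the action range the policy actually uses.
# `actions` is assumed to be a buffer of recent continuous actions in [-1, 1].
def action_saturation_report(actions, high=1.0, tol=0.01):
    actions = np.asarray(actions)
    saturated = np.mean(np.abs(actions) > high - tol)  # fraction pinned at a bound
    return {
        "mean": float(actions.mean()),
        "std": float(actions.std()),
        "fraction_saturated": float(saturated),
    }

# If fraction_saturated stays near 1.0 for long stretches, the policy, the
# exploration noise, or the action scaling is likely misconfigured.
print(action_saturation_report(np.random.uniform(-1, 1, size=(1000, 2))))
```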
Thanks a lot, and another great suggestion for improvement. I also found that the common advice is "tweak hyperparameters until you find the right combination". That can definitely help. But usually issues hide in different "corners": the problem space and its formulation, the algorithm itself (e.g., different random seeds alone can produce big variance in performance), and more.
As you mentioned, in real applications of DRL things tend to go wrong more often than right: "it doesn't work just yet" [1]. And my short tutorial definitely lacks in the areas of troubleshooting, tuning, and "productionisation". If I carve out time for an expansion, this will likely be at the top of the list. Thanks again.
Thanks for sharing [1], that was a great read. I'd be curious to see an updated version of that article, since it's about 6 years old now. For example, Boston Dynamics has transitioned from MPC to RL for controlling its Spot robots [2]. Davide Scaramuzza, whose team created autonomous FPV drones that beat expert human pilots, has also discussed how his team had to transition from MPC to RL [3].
[2]: https://bostondynamics.com/blog/starting-on-the-right-foot-w...
[3]: https://www.incontrolpodcast.com/1632769/13775734-ep15-david...
Thank you for the amazing links as well! You are right that the article [1] is 6 years old now, and indeed the field has evolved. But the algorithms and techniques I share in the GitHub repo are the "classic" ones (dating back then too), for which that post is still relevant - at least from an historical perspective.
You bring up a very good point though: more recent advancements and assessments should be linked and/or mentioned in the repo (e.g., in the resources and/or an appendix). I will try to do that sometime.
Is there anything like that, but for NLP?
There is an NLP section in Jeremy Howard's "Practical Deep Learning for Coders" course (free): https://course.fast.ai/Lessons/lesson4.html
The whole course is fantastic. I recommend it frequently to folks who want to start with DL basics and ramp up quickly to more advanced material.
There's the series this material references - "Neural Networks: Zero to Hero" - which has GPT-related parts.
I took the Deep Learning course [1] by deeplearning.ai in the past, and their resources were incredibly good IMHO. Hence, I would suggest taking a look at their NLP specialization [2].
+1000 to "Neural networks: zero to hero" already mentioned as well.
[1] https://www.deeplearning.ai/courses/deep-learning-specializa... [2] https://www.deeplearning.ai/courses/natural-language-process...
This looks really interesting! I tried exploring deep RL myself some time ago but could never get my agents to make any meaningful progress, and as someone with very little stats/ML background it was difficult to debug what was going wrong. Will try following this and seeing what happens!
I mean, resources like these are great, but RL in itself is quite dense and topic-heavy, so I'm not sure there is any way to reduce the inherent difficulty level; any beginner should be made aware of that. That's my primary gripe with ML topics (especially RL-related ones).
Thank you. It is true, the material does indeed assume some prior knowledge (which I mention in the introduction). In particular: being proficient in Python, or at least in one high-level programming language, being familiar with deep learning and neural networks, and - to get into the theory and mathematics (optional) - knowing basic calculus, algebra, statistics, and probability theory.
Nonetheless, especially for RL foundations, I found that a practical understanding of the algorithms at a basic level, writing them yourself, and "playing" with them and their results (especially in small toy settings like the grid world) provided the best way to start getting a basic intuition in the field. Hence, this resource :)
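As a flavor of what "playing in a toy setting" can look like, here is an illustrative tabular Q-learning sketch on a tiny corridor grid world (made up for this comment, not taken verbatim from the notebooks):

```python
import random
import numpy as np

# Tabular Q-learning on a 1-D corridor: start at cell 0, the goal is the
# rightmost cell, and the only reward is 1.0 for reaching the goal.
N_STATES, ACTIONS = 6, [0, 1]          # 0 = left, 1 = right
Q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # values should grow as states get closer to the goal
```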
Thank you very much! I'd be really interested to know if your agents will eventually make progress, and if these notebooks help - even if a tiny bit!
If you just want to see whether these algorithms can even work at all, feel free to jump into the `solution` folder, pick any algorithm you think could work, and just try it out there. If it does, then you can have all the fun of rewriting it from scratch :) Thanks again!
This looks great - maybe add a link to the youtube videos in the README?
Thank you so much! Unfortunately, that is a mistake in the README that I just noticed (thank you for pointing it out!) :( As I mentioned in the first post, I haven't gotten to make the YouTube videos yet. But it seems the community would indeed be interested.
I will try to get to them (and in the meantime fix the README, sorry about that!)
Awesome, I've been sort of stuck in the limbo of doing courses that taught me some theory but missing the hands-on knowledge I need to really use RL. This looks like exactly the type of course I'm looking for!
Thank you! I'll be curious to hear if / how these notebooks help and how your experience goes! Any feedback welcome!
A few years ago I made something similar. It doesn't go all the way to PPO, and has a different style.
I won't claim it is better or worse, but if anyone here is trying to learn, having the same information presented in multiple forms is always nice.
Thanks for making this!
Note: I was carefully reading along and well into the third notebook before I realized that the code sections marked "TODO" were actual exercises for the reader to implement! (And the tests which follow are for the reader to check their work.)
This is a clever approach. It just wasn't obvious to me from the outset.
(I thought the TODOs were just some fiddly details you didn't want distracting readers from the big picture. But in fact, those are the important parts.)
This is really nice, great idea. I am going to make a suggestion which I hope is helpful - I don't mean to be critical of this nice project.
After going through the MDP example, I have one comment on the way you introduce the non-deterministic transition function. In your example the non-determinism comes from the agent making "mistakes", it can mistakenly go left or right when trying to go up or down:
1) You could introduce the mistakes more clearly: it isn't really explained in the text that the agent makes mistakes, so the comment about mistakes in the transition() function is initially a bit confusing.
2) I think the way this introduces non-determinism could be more didactic if the non-determinism came from the environment, not the agent. For example, the agent might be moving on a rough surface, and moving its tracks/limbs/whatever might not always produce the intended outcome. As you present it, the transition is effectively a function from an action to a random action to a random state, whereas the definition is just a function from an action to a random state.
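To make the suggestion concrete, here is a hypothetical sketch where the randomness comes from a slippery surface in the environment (grid size and slip probability are invented for illustration):

```python
import random

# Hypothetical environment-driven non-determinism: the agent's chosen action
# is always the intended one, but a slippery floor occasionally slides the
# agent sideways instead.
GRID_W, GRID_H = 4, 3
SLIP_PROB = 0.2
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
SIDEWAYS = {"up": ["left", "right"], "down": ["left", "right"],
            "left": ["up", "down"], "right": ["up", "down"]}

def transition(state, action):
    """Stochastic P(s' | s, a): the randomness lives in the environment."""
    if random.random() < SLIP_PROB:
        action = random.choice(SIDEWAYS[action])  # slipped sideways
    dx, dy = MOVES[action]
    x, y = state
    # Clamp to the grid so the agent bumps into walls instead of leaving it.
    return (min(max(x + dx, 0), GRID_W - 1), min(max(y + dy, 0), GRID_H - 1))

print(transition((1, 1), "up"))  # usually (1, 0), occasionally (0, 1) or (2, 1)
```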
Maybe I can use this in my pygame game
Great resources! Thank you for making this.
I'm attaching here a DRL framework I made for music generation, similar to OpenAI Gym. If anyone wants to test the algorithms OP includes, you are welcome to use it. Issues and PRs are also welcome.
TL;DR: If more folks feel this way, please upvote this comment: I'll be happy to take down this post, change the title, and either re-post it or not - the GitHub repo is out there, and that should be more than enough. Sorry again for the confusion (I just upvoted it).
I am deeply sorry about the confusion. And the last thing I intended was to grab any attention away from Andrej, and / or being confused with him.
I tried to find a way to edit the post title, but I couldn't find one. Is there just a limited time window to do that? If you know how to do it, I'd be happy to edit it right away in case.
I didn't even think this post would get any attention at all - it is indeed my first post here, and I really did it just b/c I was happy to share the project in case anybody could use it to learn RL.
Didn't "Zero to Hero" come from Disney's Hercules movie before Karparthy used it?
Didn't know that, but now I have an excuse to go watch a movie :D
Throwing in my vote - I wasn't confused: I saw your GH link and a "Zero to Hero" course name on RL, and it seemed clear to me. "Zero to Hero" is a classic title for a first course, and it's nice that you gave props to Andrej too! Multiple people can and should make ML guides and reference each other. Thanks for putting in the time to share your learnings and make a fantastic resource out of it!
Thanks a lot. It makes me feel better to hear that the post doesn't come across as confusing or appropriative - I really didn't mean it that way, or to use the title as a trick for attention.
I didn't find it confusing at all. I think it's totally ok to re-use phrasing made famous by someone else - this is how language evolves after all.
Thank you, I appreciate it.
This is a great resource nonetheless. Even if you did use the name to get attention, how does it matter? I still see it as a net positive. Thanks for sharing this.
Thank you!