Last year I was fortunate enough to attend NeurIPS 2019. It was an amazing experience: I met lots of smart people and learned a ton. This post discusses my time at NeurIPS 2019.
This December, I was lucky enough to go to my first NeurIPS and present my work at the workshop Tackling Climate Change with AI. While it was exciting to present my first paper at such a big workshop (and to give a spotlight talk; mine starts at about the 33:30 mark), the real highlights were hearing about the amazing work being done by others in the field and networking with people. And I got to meet Richard Sutton, the godfather of Reinforcement Learning! (Richard is the guy in the blue button-down shirt in the photo.)
In no particular order, the talks I especially enjoyed are:
Please send me an email or leave a comment if you find some error or find a link relevant to the talks that you feel should be included.
I took some notes during each of the above talks. I’ll try to make sense of them below. My notes for some of the talks are sparse or end early. In those cases, I tried to fill in the blanks, but if you have the time and are interested enough, I strongly recommend watching the full talks, or at least looking through the slides.
The problem being addressed here is that robots need to be able to understand the world in a robust way in order to be able to act reliably. Learning a policy to directly map from states to actions works for simple tasks, but starts to break down when tasks become complex. We want to use neural rendering to model the state of the world implicitly.
The work done here extends the Generative Query Network (GQN) to:
They created an attention mechanism, called Epipolar Attention, to improve upon the original GQN. The talk concluded with the statement that: “Geometrically inspired neural network primitives improve implicit 3D understanding.”
A big theme in this talk was how to close the gap between RL’s horrible sample inefficiency and humans’ comparatively sample-efficient learning. Pieter pointed out early on that humans with 15 minutes of experience on an Atari game typically outperform DQN after 115 hours of training. Humans are able to use their past experience to quickly pick up new tasks. For example, if I am an athlete who plays hockey, I can probably use many of the skills I learned from hockey to quickly perform well at lacrosse. We’d like machine learning algorithms to use their past experience in the same way to quickly pick up new skills. Enter meta-RL:
Typically the agent is an RNN, so that there is some memory of performing past tasks, which will be leveraged to quickly pick up new tasks. Different activations in the RNN mean that the current policy is different from the last policy. And the meta-training goal can be optimized using an existing reinforcement learning algorithm.
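For reference, one common way to write the meta-training goal is sketched below. This formula is my own addition, not something copied from the slides, and all of the symbols are assumptions: θ stands for the RNN weights, p(M) for a distribution over training tasks, and τ for a trajectory that spans several episodes in the same task M, with the RNN hidden state carried across those episodes.

```latex
\max_{\theta} \; \mathbb{E}_{M \sim p(M)} \left[ \mathbb{E}_{\tau \sim \pi_{\theta},\, M} \left[ \sum_{t=0}^{T} r_t \right] \right]
```

Because the outer objective is just an expected sum of rewards, it can in principle be handed to any standard policy optimization algorithm, which matches the point made in the talk.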
A major limitation of training in simulation is that no single simulation will reliably and exactly recreate the physics and interactions that a robot will encounter in the real world. One way to get around this is via domain randomization. This technique randomizes many aspects of the simulation, such as the coefficient of friction and the colors and sizes of objects, and trains a policy across these randomized simulations. A slide in Pieter Abbeel’s presentation read: “If the RL model sees enough simulated variations, the real world may look like just the next simulator.” In this case, the model would be able to perform the task in the real world, since it would have learned to perform the task in a robust way that worked across all simulations. They used domain randomization for robot grasping by randomizing the structure of the objects in simulation and were able to show that a policy trained in simulation also worked in the real world.
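To make the idea concrete, here is a minimal toy sketch of domain randomization. Everything in it is an assumption for illustration: the `SimParams` fields, the parameter ranges, the one-line "simulator" in `rollout`, and the brute-force search over grip forces are all hypothetical and are not taken from the talk or from the grasping work it described.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float     # coefficient of friction between gripper and object
    object_mass: float  # object mass in kg


def sample_sim_params(rng: random.Random) -> SimParams:
    # Illustrative ranges only; a real setup would randomize many more properties.
    return SimParams(friction=rng.uniform(0.3, 1.0),
                     object_mass=rng.uniform(0.1, 0.5))


def rollout(grip_force: float, params: SimParams) -> float:
    """Toy simulator: the grasp holds if friction force is at least the object's weight."""
    holds = grip_force * params.friction >= params.object_mass * 9.81
    # Reward 1 for a successful grasp, minus a small penalty for squeezing hard.
    return (1.0 if holds else 0.0) - 0.01 * grip_force


def train_with_domain_randomization(candidates, n_sims=200, seed=0):
    """Pick the 'policy' (here a single grip force) with the best average return
    across many randomized simulations."""
    rng = random.Random(seed)
    sims = [sample_sim_params(rng) for _ in range(n_sims)]

    def average_return(force):
        return sum(rollout(force, p) for p in sims) / len(sims)

    return max(candidates, key=average_return)


best_force = train_with_domain_randomization([float(f) for f in range(1, 31)])
print("grip force chosen across randomized simulations:", best_force)
```

The point of the toy is only that the chosen grip force has to do well across the whole range of sampled frictions and masses, rather than being tuned to one particular simulator.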
In model-based RL (MBRL), a policy interacts with the real world and is then updated. Using these collected interactions, a learned simulator is trained to model the environment. Then, the policy is improved by interacting with the learned simulator (not with the real environment).
Overfitting happens in MBRL because policy optimization wants to exploit the regions of the learned simulator where there hasn’t been enough data collected for the learned simulator to accurately model the environment. This leads to massive failures.
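The basic MBRL loop described above can be sketched with a tiny toy problem. This is a hedged illustration under my own assumptions, not the setup from the talk: the one-dimensional dynamics, the hidden `TRUE_GAIN`, the linear policy `u = -k * x`, and all function names are hypothetical.

```python
import random

TRUE_GAIN = 0.5  # hidden from the agent; stands in for real-world physics


def real_env_step(x, u):
    """'Real world' dynamics: x' = x + TRUE_GAIN * u."""
    return x + TRUE_GAIN * u


def collect_real_data(k, rng, n=20):
    """Run the current policy u = -k * x (plus exploration noise) in the real environment."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        u = -k * x + rng.gauss(0.0, 0.1)
        data.append((x, u, real_env_step(x, u)))
    return data


def fit_simulator(data):
    """Learned simulator: least-squares estimate of g in x' = x + g * u."""
    num = sum(u * (x_next - x) for x, u, x_next in data)
    den = sum(u * u for _, u, _ in data)
    return num / den


def improve_policy_in_simulator(g_hat, candidates):
    """Choose the feedback gain whose predicted next state (starting at x = 1)
    is closest to zero, using only the learned simulator."""
    def predicted_cost(k):
        x_next = 1.0 + g_hat * (-k * 1.0)
        return x_next ** 2
    return min(candidates, key=predicted_cost)


def mbrl(iterations=3, seed=0):
    rng = random.Random(seed)
    k, data = 0.0, []
    candidates = [i / 10 for i in range(0, 41)]
    for _ in range(iterations):
        data += collect_real_data(k, rng)                   # 1. interact with the real world
        g_hat = fit_simulator(data)                         # 2. train the learned simulator
        k = improve_policy_in_simulator(g_hat, candidates)  # 3. improve the policy in the model
    return k, g_hat


policy_gain, learned_gain = mbrl()
print(policy_gain, learned_gain)
```

Note that this toy does not exhibit the overfitting problem: the learned simulator has exactly the right form, so it is accurate everywhere. The failure mode shows up when the model is only accurate near the collected data and the policy optimizer wanders outside that region.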
In an alternative approach, Model-Based Meta-Policy Optimization (MBMPO), interaction data is collected under an adaptive policy. Then, an ensemble of simulators is learned from the collected interaction data. Following this, meta-policy optimization is done over the ensemble, and a new meta-policy and new adaptive policies are obtained from the optimization update. In this way, the authors could meta-learn a policy that adapts quickly.
A couple of points from the talk:
The talk went on and covered Guided Meta-Policy Search but I do not have any good notes on that.
Igor prefaced this talk by saying that we’ve seen good AI progress in specific and well-defined tasks, but that we still don’t have the ability to move to complex and varied tasks. He proposes that multi-agent interaction can be a tool to move towards this because:
Such a system could be deployed in a self-supervised way for continuous learning.
If we mix:
In the hide-and-seek environment agents are given a team based reward so that collaboration is encouraged. The hiders are also given a preparatory phase where the seekers are frozen in place and cannot move. This preparation phase allows the hiders to construct their shelter.
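A team-based reward of this kind could look roughly like the sketch below. The specific values (+1 and -1, and zero reward during the preparation phase) are my recollection of the accompanying blog post rather than something in my notes, so treat them, and the function and argument names, as assumptions.

```python
def team_rewards(hider_visible, in_preparation_phase):
    """Return per-team rewards for one timestep.

    hider_visible: list of bools, one per hider, True if any seeker can see that hider.
    in_preparation_phase: True while the seekers are still frozen in place.
    """
    if in_preparation_phase:
        # Assumed: nobody is rewarded while the hiders build their shelter.
        return {"hiders": 0.0, "seekers": 0.0}
    # The whole hider team is penalized if even one hider is seen,
    # which is what encourages collaboration.
    hiders = -1.0 if any(hider_visible) else 1.0
    return {"hiders": hiders, "seekers": -hiders}


print(team_rewards([False, False], in_preparation_phase=False))
print(team_rewards([False, True], in_preparation_phase=False))
```

Because every hider receives the same reward, a hider that is already safe still has an incentive to help its teammates stay hidden.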
They saw many unexpected behaviors from the agents during experiments in this work.
In the blog post they discuss testing intrinsic motivation compared to multi-agent competition and show that their method outperforms an intrinsic motivation scheme. It looks like they only compare to count-based motivation, and while I’m not an expert in intrinsic motivation, it appears that there are other methods that might perform better.
Measuring progress in the hide-and-seek environment was a big challenge for them in this work, and they used “intelligence tests” to measure progress of the agents as they learned!
The second part of Igor’s talk focused on this topic and seemed to discuss some work relevant to the POLO paper.
So far in RL we’ve focused on learning habitual, reflexive behaviors. These are hard to generalize or to improvise with. We’d like to move more towards learning through some feedback-guided exploration.
Under this paradigm, we can:
Some surprising benefits of energy-based models are that they are robust, generalize well, and learn continuously.
This is as far as I got with my notes. Please see Igor’s talk for more!
Richard Sutton laid out the premises for his talk:
These are basic needs but obtaining all of them is unprecedented.
We need algorithms for constructing state features by learning non-linear recurrent state update functions.
Theorem: If the value function approximation is linear in the state features, value iteration loses nothing when an expectation model is used in place of a distribution model.
The value function should be linear in state features!
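The intuition behind the theorem above can be written in one line. This is my own sketch and the notation is assumed: x(s) is the feature vector of state s, w is the weight vector, and x̂(s, a) is the expected next feature vector that an expectation model would output.

```latex
\hat{v}(s, \mathbf{w}) = \mathbf{w}^{\top} \mathbf{x}(s)
\quad\Longrightarrow\quad
\mathbb{E}\left[ \hat{v}(S', \mathbf{w}) \mid s, a \right]
= \mathbf{w}^{\top}\, \mathbb{E}\left[ \mathbf{x}(S') \mid s, a \right]
= \mathbf{w}^{\top} \hat{\mathbf{x}}(s, a)
```

In words: with a linear value function, the expected value of the next state depends only on the expected next feature vector, so a model that predicts the full distribution over next states adds nothing to the backup.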
Currently, core RL learns:
We need to learn:
Subproblems can help solve the main problem by:
My notes at this point end. Sorry this section was kind of sparse, but please watch Sutton’s talk for more detail!
That’s the end of my notes here. Sorry the first half is sparse, and that it’s all short, but again, check out David’s talk for more.
I had an amazing time at the conference and saw some fascinating work. Luckily, I had a friend going with me who had been twice before and gave me a little bit of advice before we went. Mainly:
I’m happy to say that I took pretty much all of those pieces of advice. I was very careful to pace myself and see only what I really wanted to see, and I’m really glad I did. Even after doing this, I was still really tired every night. Also, being selective about time spent at the conference gave me slightly more opportunity to experience Vancouver (it’s a really cool city with good food and coffee).
I went to a couple of social events. This is one part of my NeurIPS I’d do differently next time. The only events I really went to were a New in Machine Learning meetup (which was lots of fun) and the Reinforcement Learning Social, which is where I met Richard Sutton. Next time, I’ll try to go to one or two of the company parties as well. The sponsor parties at NeurIPS are famous, and it seems for good reason. Some people that went to these parties posted photos of them on Twitter, and they looked very fun.
The Expo was a really cool part of the conference but it was also very overwhelming. There is an opportunity to engage with brilliant researchers at many prestigious companies, which is very exciting, but there are also just a lot of people crammed into the Expo room almost all the time. I enjoyed going and talking to folks at different companies (and getting free stuff!) and will do it again next time. One thing I did notice was that by the end of the Expo (Wednesday) all of the people manning the sponsor booths were tired and less congenial than earlier in the week. I can’t blame them - the Expo was crazy the whole time it was happening - but it is worth remembering to go to the Expo early in the week in the future.
Fortunately for me, I emailed two people I really wanted to meet and one of them agreed to have breakfast with me! The other didn’t, but I also waited until the very last minute (the day before the conference began) to reach out. Having breakfast with this person was an excellent experience, and they were friendly, expressed interest in my work, and let me pick their brain. I would definitely recommend doing this to anyone attending NeurIPS or any other large machine learning conference. The worst that can happen is they can say no to meeting you, but if they say yes then you could make a new friend and have a positive interaction with someone.
Trying to keep up with all of the new work is already like trying to drink from a fire hose. Trying to do it at NeurIPS is like trying to drink from 100 fire hoses. So, I’ve learned to be very selective about the work that I put effort into really understanding during the conference. I saw a tweet from Colin Raffel saying that you shouldn’t try to deeply understand more than one new paper a day during NeurIPS, and I think that’s great advice.
I’ve concluded that I agree with all of the people who say that the most important thing at NeurIPS isn’t all of the new work being presented (even though that is important), but rather the opportunity to interact with so many brilliant ML researchers in one place. One can go back and read papers on arXiv at any time, but only a couple of times a year do large conferences like this happen, and the networking opportunity here is just too massive to spend all of your time catching up on the new work. I made some new connections, but not as many as I could have, and next time I’ll work to more effectively split my time between seeing new work and connecting with all of the interesting people.
NeurIPS 2019 was an overwhelmingly positive experience for me and I’m extremely excited to go back. Everyone that I met was incredibly kind, smart, and friendly and I’m happy that I connected with some people.
One final tip: before you go to NeurIPS (or another big conference) take some Emergen-C or Airborne or something. I got sick towards the end of the conference and I probably could have avoided it by taking some vitamin C in advance.
Thank you for reading this post! I’m happy to answer any questions in the comments and if you have your own NeurIPS experience to share, it would be great to hear about it!
Happy New Year!