In 2016, DeepMind announced that it had cut Google’s data centre cooling bills by 30% by developing an artificial intelligence (AI) control system which learned over time how to better control the centre’s actuators and coolers. A 30% saving would have been a striking improvement in any large-scale, energy-hungry environment; given the existing sophistication of Google’s infrastructure, it was especially so.
DeepMind made its name with early breakthroughs in developing AI algorithms to play games. After achieving superhuman ability on a selection of old Atari games, its algorithms went on to defeat one of the world’s top Go players, Lee Sedol – a 21st-century echo of IBM’s chess computer Deep Blue beating Garry Kasparov in the 1990s, but in a game of orders-of-magnitude greater complexity.
DeepMind’s successes have come from reviving a branch of AI which, since its inception in the 1950s, had become a bit of a backwater. Reinforcement Learning (RL) has its roots in dynamic programming and Richard Bellman’s famous update equation for sequential decision-making, mentioned in our previous blog. It concerns the design of agents – algorithms or entities which perform actions – which learn to make better decisions over time via feedback from their environment. The approach takes many of its cues from neuroscience and from insights into how animals and humans learn from experience - think Pavlov and his dog.
Practical RL stalled for decades because the functions a sophisticated agent needs to learn proved too complex for early computers to handle. These “functions” are mathematical models which encourage an agent to act in a certain way - for example, a chess player could have a policy “function” which prompts her to make a particular move given a particular board state, or a value “function” which allows her to assess whether she is winning or losing.
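To make this concrete, here is a toy sketch (our own illustration, not DeepMind’s code) of Bellman’s update and the value and policy “functions” at work. A hypothetical agent sits on a five-cell track and earns a reward of 1 for reaching the rightmost cell; value iteration repeatedly applies the Bellman update V(s) ← maxₐ [r(s, a) + γ·V(s′)] until the value of each cell settles, and a greedy policy then reads the best action straight out of those learned values.

```python
GAMMA = 0.9      # discount factor: how much future reward is worth today
N_STATES = 5     # cells 0..4; cell 4 is the rewarding terminal state

def step(state, action):
    """Toy dynamics: action is -1 (left) or +1 (right); reward 1 on reaching cell 4."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 and state != N_STATES - 1 else 0.0
    return next_state, reward

def value_iteration(sweeps=50):
    """Learn the value 'function' by repeatedly applying Bellman's update."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):          # terminal state keeps value 0
            V[s] = max(r + GAMMA * V[s2]
                       for s2, r in (step(s, a) for a in (-1, +1)))
    return V

def greedy_policy(V):
    """The policy 'function': in each state, pick the action whose outcome looks best."""
    def action_value(s, a):
        s2, r = step(s, a)
        return r + GAMMA * V[s2]
    return [max((-1, +1), key=lambda a: action_value(s, a))
            for s in range(N_STATES - 1)]

V = value_iteration()
print([round(v, 2) for v in V])   # → [0.73, 0.81, 0.9, 1.0, 0.0]
print(greedy_policy(V))           # → [1, 1, 1, 1]: always move right
```

Notice how the learned values rise smoothly towards the goal - the discount factor makes reward that is further away worth less - and the policy simply follows that gradient. DeepMind’s trick, described next, was to replace the little lookup table `V` with something far more powerful.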
DeepMind revisited RL at a fortuitous time, when advances in other branches of AI and in computer hardware had matured enough for these complex policy and value functions to be modelled for the first time. “Deep” artificial neural networks – learning algorithms inspired by the architecture of biological brains – had proved adept at learning complex functions from historical data, and were now adopted for reinforcement learning’s core requirement: learning from experience how to evaluate situations and develop strategies. The resulting approach, “deep” reinforcement learning (DRL), proved incredibly powerful.
Since their breakthrough applying DRL to Atari and Go, DeepMind have been making their DRL agents increasingly general. AlphaZero, a successor to the Go-specialised algorithm which beat Lee Sedol, taught itself how to play multiple games – Go, chess and shogi (Japanese chess) – from scratch, playing against copies of itself with only the rules of each game to inform its learning process.
It quickly became the strongest chess player in the world, beating all existing chess computers, which are descendants of IBM’s Deep Blue (humans have long since been left in their wake). While those older algorithms rely on heuristics and insights encoded by teams of expert human players, plus brute-force number crunching, AlphaZero carries no such human biases, and chess grandmasters have marvelled at the flair, originality and aggression with which it plays.
It is this facet of deep reinforcement learning which promises so much for the operational research community. Today’s state of the art resembles chess computers like Deep Blue: humans research and codify the rules, heuristics and constraints which describe how logistical or organisational systems work, and then powerful optimisation solvers crunch the numbers to give actions that are optimal according to those models. But such solutions are only as good as the models’ likeness to the real world, and humans come with blind spots and biases which can be baked into the modelling work.
DeepMind’s work with Google’s data centres offered a first tantalising glimpse of the AlphaZero approach applied beyond games – their algorithm spotted strategies no human had considered, for example learning to anticipate winter’s natural cooling effect on the mains water supply and exploiting this effect to save energy by using its water coolers less.
While DRL is still in its infancy – it remains largely confined to working out in the AI “gyms” of virtual games and challenges, or to physical toy examples such as robot arms playing cup-and-ball – these exciting techniques are beginning to show their promise for exactly the kind of large-scale optimal control problems which are the concern of operational research.
Von Neumann, Blackett, Dantzig, Bellman and the other brilliant men and women they inspired in the field of operational research would be thrilled to see the speed with which this new frontier of AI is progressing. From its origins in the existential threat of the world wars of the 20th century, we believe operational research will be crucial in tackling the existential threat of climate change in the 21st so that, perhaps, humans and von Neumann’s machines will be free to spread across the galaxy in the 22nd!