The Importance of Simulation in the Age of Deep Learning

01 Dec 2018 Gregory J. Stein

The modern revolution in machine learning and robotics has been largely enabled by access to massive repositories of labeled image data. AI has become synonymous with big data, chiefly because machine learning approaches to tasks like object detection or automated text translation require enormous amounts of labeled training data. Yet obtaining real-world data can be expensive, time-consuming, and inconvenient. In response, many researchers have turned to simulation tools, which can generate nearly limitless training data. These tools have become fundamental to the development of algorithms, particularly in the fields of robotics and deep reinforcement learning.

This is the first post in a three-part series on the role of simulated image data in the era of Deep Learning. In this post, I discuss the significance of simulation tools in the field of robotics and the promise and limitations of photorealistic simulators.

Why would we simulate data?

Getting data can be a burden. For most object detection tasks, for instance, images containing objects must be hand-annotated: i.e. a trained user must draw a box around each object in the scene and label it. Achieving state-of-the-art performance on an object detection task can require thousands or millions of images, putting the capability out of reach of most academic labs and industry researchers.

As machine learning has grown, both academic researchers and companies are increasingly interested in tackling tasks for which a so-called “off-the-shelf” dataset may not exist. For example, imagine a car company that would like to automate the process of taking inventory in one of its warehouses. Even though many available object-detection datasets support detecting cars, most do not distinguish the particular make and model of the car. Furthermore, some of the cars in the warehouse may not even be on the road yet, so labeling them reliably would almost certainly require the company to collect the data itself. Fortunately, video game engines are now capable of producing photorealistic images incredibly quickly. Given a 3D model of its new car, the company can easily generate a massive number of labeled images that can be used to retrain the object detector.
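To make this concrete, here is a minimal sketch of such a data-generation loop. The `render_scene` function, the label format, and the `new_car_model` class name are all hypothetical stand-ins for whatever 3D engine and pipeline the company actually uses; the key point is that the bounding-box label comes directly from the renderer, with no human annotation.

```python
import json
import random
from PIL import Image

def render_scene(model_path, camera_pose, lighting):
    """Hypothetical stand-in for a 3D engine (Unity, Unreal, a ray tracer, ...).
    A real implementation would place the model in a scene, render it, and
    return the image plus the model's projected 2D bounding box."""
    image = Image.new("RGB", (640, 480))  # dummy frame for this sketch
    bbox = [200, 150, 440, 330]           # [x_min, y_min, x_max, y_max]
    return image, bbox

def generate_dataset(model_path, num_images, out_manifest="labels.json"):
    labels = []
    for i in range(num_images):
        # Randomize viewpoint and lighting so the detector sees the object
        # from many angles and under many conditions.
        camera_pose = [random.uniform(-5, 5) for _ in range(3)]
        lighting = {"intensity": random.uniform(0.3, 1.0)}
        image, bbox = render_scene(model_path, camera_pose, lighting)
        image_name = f"synthetic_{i:06d}.png"
        image.save(image_name)
        # The label comes "for free": the engine already knows exactly where
        # the model sits in the frame, so no human annotation is needed.
        labels.append({"image": image_name, "bbox": bbox, "class": "new_car_model"})
    with open(out_manifest, "w") as f:
        json.dump(labels, f, indent=2)

generate_dataset("new_car_model.obj", num_images=10)
```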

Video games are typically built on top of powerful 3D rendering tools. These images are screenshots from Grand Theft Auto V (from the Driving in the Matrix paper) and are impressively realistic: #[image===gtav-screenshots]#

For applications in robotics, the robot must interact with its environment. Almost by definition, a publicly available dataset is unlikely to be sufficient for testing a robot's performance: except in trivial environments, no dataset is large enough to contain every possible sequence of actions the robot might take. Simulations have been a boon for robotics experiments, since they allow the robot to interact with its environment without risking damage to either the robot or its surroundings. The Build-Measure-Learn loop is usually made dramatically faster with simulation tools; once the simulator is up and running, it is typically trivial to iterate over the design of the robot, the configuration of the environment, or the parameters of the algorithm.

When I say “simulator”, I usually mean a digital simulator, but there are real-world simulators as well. For instance, the Jet Propulsion Laboratory has a scale replica of the Mars Rover here on Earth that they use to simulate the behavior of the robot on Mars. For most problems, however, this sort of simulation does not allow for a high number of trials.

In summary, there are a number of reasons we might prefer to simulate experiments:

Synthetic data is usually cheap and nearly limitless: Digital simulation tools, like 3D game engines, can create synthetic image data extremely quickly and with minimal overhead. The 3D engine can be run on multiple computers simultaneously, if necessary, allowing for parallelization that is not usually achievable for real-world data collection. As in the car company example above, having (1) a 3D model of an object I would like to detect and (2) access to a simulated world means that I can easily generate image data of that object in arbitrarily many different scenes and scenarios.

While we have 3D graphics engines that can create images with objects in them, identifying the location of an object in an arbitrary image — the inverse problem — is quite difficult in general.

Simulations enable rapid prototyping: Because synthetic data is so easy to generate, it can be used to quickly develop new algorithms and tools, particularly for never-before-tried applications. As opposed to hand-collecting data, simulation tools can usually be customized to support new tasks or environments in ways that would take an incredibly long time, or be incredibly expensive, in the real world. Yet since the data is not real, it may differ in key ways from real data, creating risks associated with fine-tuning algorithms on simulated data alone. I discuss this more in the sections below.

Simulations are low-risk: This is particularly important for robotics applications, for which testing a new algorithm may require risking multi-million-dollar hardware. High-quality, physics-driven simulation tools can partially eliminate the need for high-risk testing that may be required before deploying the robot (e.g. the Mars Rover). Furthermore, there is a branch of machine learning known as reinforcement learning (and its deep-learning-based variant: deep reinforcement learning) in which an autonomous agent learns to interact with its environment via trial and error; deploying a completely untrained system may lead to chaos.

Simulations allow researchers to compare algorithms and reproduce experiments: As opposed to the real world, which can be noisy or change in ways that we do not expect, simulated environments are designed to be repeatable and reliable. For this reason, they are perfect platforms for comparing algorithms on identical experiments. This is extremely useful in the field of robotics, where hardware and experiments may differ slightly between labs.
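To illustrate that last point, the sketch below compares two hypothetical controllers on the exact same set of seeded episodes in a toy simulator. Because each episode is fully determined by its seed, any difference in average return is attributable to the controllers themselves rather than to environmental noise. The environment, controllers, and reward are all invented for this illustration.

```python
import random

class ToySimulator:
    """A stand-in for a simulated robot environment: reach a randomly placed
    goal on a 1-D line. Seeding the simulator makes every episode repeatable."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
    def reset(self):
        self.pos, self.goal = 0.0, self.rng.uniform(-10.0, 10.0)
        return self.goal - self.pos           # observation: signed distance to goal
    def step(self, action):
        self.pos += action
        error = abs(self.goal - self.pos)
        return self.goal - self.pos, -error, error < 0.1   # obs, reward, done

def evaluate(policy, seeds):
    """Run one 50-step episode per seed and return the total reward of each."""
    returns = []
    for seed in seeds:
        env = ToySimulator(seed)
        obs, total = env.reset(), 0.0
        for _ in range(50):
            obs, reward, done = env.step(policy(obs))
            total += reward
            if done:
                break
        returns.append(total)
    return returns

# Two candidate controllers evaluated on *identical* simulated episodes:
# any performance difference is due to the controllers, not the environment.
seeds = range(100)
cautious = evaluate(lambda obs: 0.2 * obs, seeds)
aggressive = evaluate(lambda obs: 0.9 * obs, seeds)
print(sum(cautious) / len(cautious), sum(aggressive) / len(aggressive))
```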

Video games and synthetic image generation

Hand-annotated image data, like the ImageNet dataset, has been the biggest enabler of progress in the modern machine learning revolution. Images are rich with information. The ubiquity of cheap, high-quality cameras has resulted in an increasing number of machine learning applications, like self-driving cars and home care robots, that rely on images to make decisions. Yet because of the high-risk and high-cost nature of real-world data collection and experimentation, video games (and the 3D rendering tools that drive them) have become popular test beds for algorithm development.

This overview of how the machine learning community has benefited from the video game industry is by no means an exhaustive survey of the simulation tools available. I hope to provide a more complete list of tools and the types of data they aim to provide in Part 3 of this series.

The 2012 deep learning breakthrough came about when researchers realized that a machine learning data structure called a deep neural network could be trained much more quickly using graphics processing units (GPUs). Video games, for which GPUs were originally designed, have matured at the perfect time for Deep Learning: gaming has become a multi-billion-dollar-per-year business largely devoted to producing the most realistic images possible. Naturally, the machine learning community has piggybacked off of the experience and tools available to the gaming industry to generate large quantities of photorealistic image data.

The SYNTHIA Dataset is an open-source tool for generating data for tasks relevant to autonomous vehicles. The simulated environment provides rather realistic image data with a host of labels, including object annotations and semantic segmentation.

Costing over $250 million to produce, Grand Theft Auto V is one of the best-looking open-world games ever made. Image data from the game is sufficiently photorealistic that unaltered images from the game have been used to train object detection algorithms to locate cars in the real world. Similarly, though it is not a game per se, the SYNTHIA dataset is another simulation platform aimed primarily at tasks relating to self-driving cars. Since the platform is open-source, it has been steadily growing in popularity among researchers for generating data to test algorithms for tasks like object detection, semantic segmentation, and self-driving. The platform supports a myriad of weather and lighting conditions, making it ideal for testing how a real-world system will behave as conditions vary.
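As a rough sketch of how such synthetic data might be consumed, the snippet below fine-tunes an off-the-shelf detector (here, torchvision's Faster R-CNN, chosen only as a convenient example) on a hypothetical manifest of rendered frames and bounding boxes like the one sketched earlier. The `SyntheticCars` dataset class and its label format are assumptions for illustration, not part of either platform.

```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

class SyntheticCars(Dataset):
    """Reads a hypothetical labels.json manifest: a list of entries shaped
    like {"image": path, "bbox": [x_min, y_min, x_max, y_max], "class": name}."""
    def __init__(self, manifest):
        with open(manifest) as f:
            self.items = json.load(f)
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        item = self.items[idx]
        image = to_tensor(Image.open(item["image"]).convert("RGB"))
        target = {
            "boxes": torch.tensor([item["bbox"]], dtype=torch.float32),
            "labels": torch.tensor([1]),  # single foreground class: the new car model
        }
        return image, target

# Start from a detector pretrained on real images and fine-tune it on synthetic
# frames. (Older torchvision versions use pretrained=True instead of weights=.)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loader = DataLoader(SyntheticCars("labels.json"), batch_size=2,
                    collate_fn=lambda batch: tuple(zip(*batch)))

model.train()
for images, targets in loader:
    # In training mode, torchvision's detection models take images plus
    # ground-truth boxes/labels and return a dictionary of losses.
    loss_dict = model(list(images), list(targets))
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```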

Not every simulation tool aims to be an immediate replacement for the real world; some are instead geared specifically toward the development of new algorithms. Deep reinforcement learning describes a family of algorithms that learn to perform arbitrary tasks through trial and error with their environment. Naturally, the development of these algorithms is much easier and faster in simulated environments. The following video shows an agent navigating through a simulated maze and tackling a variety of other interesting tasks:

The AI agent shown in this video was trained on thousands of mazes, something that would be effectively impossible on a real-world platform. The video is supplementary material for the paper Reinforcement Learning with Unsupervised Auxiliary Tasks.
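For readers unfamiliar with the trial-and-error loop underlying these agents, here is a deliberately tiny, tabular Q-learning example in a hand-built maze. It is not DeepMind's method or environment; it is only a sketch of the core cycle (act, observe a reward, update a value estimate) that deep reinforcement learning scales up with neural networks.

```python
import random

# A tiny maze: the agent starts at S and must reach G purely by trial and error.
MAZE = ["S..#",
        ".#.#",
        "...G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#":
        state = (r, c)  # otherwise the move is blocked and the agent stays put
    done = MAZE[state[0]][state[1]] == "G"
    return state, (1.0 if done else -0.01), done

q = {}  # Q-values, keyed by (state, action_index)
for episode in range(2000):
    state = (0, 0)
    for _ in range(100):
        # Epsilon-greedy: mostly exploit what has been learned, sometimes explore.
        if random.random() < 0.1:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda i: q.get((state, i), 0.0))
        nxt, reward, done = step(state, ACTIONS[a])
        best_next = max(q.get((nxt, i), 0.0) for i in range(4))
        q[(state, a)] = q.get((state, a), 0.0) + 0.1 * (reward + 0.95 * best_next - q.get((state, a), 0.0))
        state = nxt
        if done:
            break
```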

The environment in the clip above was developed by the AI powerhouse DeepMind, which has recently announced a partnership with Unity3D, a popular tool for video game development:

DeepMind researchers are addressing huge AI problems, and they have selected Unity as a primary research platform for creating complex virtual environments that will enable the development of algorithms capable of learning to solve complex tasks. We believe the future of AI is being shaped by increasingly sophisticated human-machine interactions, and Unity is proud to be the engine that is enabling these interactions.

In a reference paper published last week, we detailed how to use the Unity platform to create and leverage simulation environments that are rich in sensory and physical complexity, provide compelling cognitive challenges, and support dynamic multi-agent interaction. These environments provide the foundation to accelerate AI research in areas such as computer vision, robotics, natural language instruction, autonomous vehicle development, and many other areas of science and technology. This is just the beginning with much more to come.

Unity3D
September 26, 2018

Easy-to-use, easy-to-modify simulation tools like these are becoming increasingly common, and they are a primary driver of progress in deep reinforcement learning. I look forward to seeing how relationships between the gaming community and machine learning labs continue to evolve; I expect such partnerships will continue to positively impact progress.

The limits of simulated data

There are limitations to using simulated data, namely that *the data is not collected in the real world*. Any discrepancy between the real world and the simulated world may cause machine learning systems to behave in unpredictable ways. For instance, the deep reinforcement learning community is notorious for being [problematically sensitive](https://arxiv.org/pdf/1709.06560.pdf) to even slight changes in the inputs and training procedure. While many of the images I have shown above appear mostly realistic, under scrutiny it is clear that there are discrepancies, such as in material textures or scene lighting, that may present challenges when using simulated data in production. As such, testing *only* in simulation is typically insufficient for real-world applications. There is, however, a body of research devoted to bridging this gap.
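One pragmatic way to keep this risk visible is to measure the gap directly: evaluate the same trained model on a held-out synthetic set and on a small hand-labeled real-world set, and track the difference. The helper below is a generic sketch of that idea for a classifier; `model`, `sim_loader`, and `real_loader` are assumed to exist and are not part of any specific tool discussed above.

```python
import torch

def accuracy(model, loader, device="cpu"):
    """Fraction of examples a classifier labels correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical usage: sim_loader wraps held-out synthetic images, real_loader
# wraps a small hand-labeled real-world set with the same label space.
# sim_acc, real_acc = accuracy(model, sim_loader), accuracy(model, real_loader)
# print(f"sim: {sim_acc:.1%}  real: {real_acc:.1%}  gap: {sim_acc - real_acc:.1%}")
```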

Not discussed here are tasks for which simulation is as difficult as the problem of interest. One example is language translation: if we already knew how to generate sentences in different languages, the problem would be solved.

The next post in this series will be devoted to a discussion of various *sim-to-real* techniques: algorithmic approaches to ensuring that machine learning systems trained on simulated data will behave as expected in the real world. Until then, feel free to leave comments below.