No Free Lunch and Neural Network Architecture

Summary: Machine learning must always balance flexibility and prior assumptions about the data. In neural networks, the network architecture codifies these prior assumptions, yet the precise relationship between them is opaque. Deep learning solutions are therefore difficult to build without a lot of trial and error, and neural nets are far from an out-of-the-box solution for most applications.

Since I entered the machine learning community, I have frequently found myself in conversation with researchers or startup types from other communities about how they can get the most out of their data and, more often than not, we end up talking about neural networks. I get it: the allure is strong. The ability of a neural network to learn complex patterns from massive amounts of data has enabled computers to challenge (and even outperform) humans on general tasks like object detection and games like Go. But reading about the successes of these newer machine learning techniques rarely makes clear one important point:

Nothing is ever free.

When training a neural network (or any machine learning system), a tradeoff is always made between flexibility in the sorts of things the system can learn and the amount of data necessary to train it. Yet in practice, the precise nature of this tradeoff is opaque. That a neural network is capable of learning complex concepts, like what an object looks like from a bunch of images, means that training it effectively requires a large amount of data to convincingly rule out other interpretations of the data and reject the impact of noise. On the face of it, this statement is perhaps obvious: of course it requires more work/data/effort to extract meaning out of more complex problems. Yet, perhaps counterintuitively to many machine learning outsiders, the way in which these systems are designed and the relationships between the many hyperparameters that define them have a profound impact on how well the system performs.

Noise takes many forms. In the case of object detection, noise might include the color of the object: I should be able to identify that a car is a car regardless of its color.

On its own, data cannot provide all the answers. The goal of learning is to extract more general meaning from a limited amount of data, discovering properties about the data that allow new data to be understood. Generalizing to unseen data requires making a decision, whether explicitly or implicitly, about the nature of the data beyond what is provided in the dataset. Consider a very simple dataset consisting of 10 points evenly spaced along the x-axis:

Example: Noisy Data

Suppose your goal is to fit a curve to this data. What might you do? Well, from inspection, the “curve” looks pretty much like a line, so you fit a line to the points and hand off the resulting fit to another system. But what if you were to collect more data and discover that, instead of falling close to a line (left), the new data has a rather dramatic periodic component (right):

Example: More Noisy Data

You might (justifiably) claim that more data is necessary, and insist that a separate test set be provided so that you can choose between these different hypotheses. But at what point should you stop? Collecting data ad infinitum in order to rule out all possible hypotheses is impractical, so the learning algorithm must impose some assumptions about the structure of the data to be effective. Prior knowledge of the structure of the data is often essential to good machine learning performance: if you know that the data is linear with some noise added to it, you know that fitting more complex curves is unnecessary. But if the underlying structure of the data is a mystery, as is often the case for the very high-dimensional problems to which neural networks are applied, this problem is impossible to avoid.

Of course, having test data in addition to the training dataset is essential, and would mitigate the risk of a dramatic surprise.
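To make the ambiguity concrete, here is a minimal sketch in plain Python. Everything specific in it is my own assumption for illustration: the "true" process (a line plus a sine term that happens to vanish at every integer x), the noise level, and the helper names `fit_line`, `mse`, and `true_process`. The two hypotheses are indistinguishable on the 10 training points, and only held-out points between them reveal the difference.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_process(x, rng):
    # Assumed ground truth, chosen for illustration only: a line plus a
    # periodic term that is zero at every integer x, plus Gaussian noise.
    return 0.5 * x + 1.0 + 2.0 * math.sin(math.pi * x) + rng.gauss(0.0, 0.1)


rng = random.Random(0)

# Training set: 10 evenly spaced points, where the periodic term is invisible.
train_x = [float(i) for i in range(10)]
train_y = [true_process(x, rng) for x in train_x]

# Held-out set: points halfway between the training points.
test_x = [i + 0.5 for i in range(9)]
test_y = [true_process(x, rng) for x in test_x]

slope, intercept = fit_line(train_x, train_y)


def line_model(x):
    # Hypothesis 1: the data is a line.
    return slope * x + intercept


def periodic_model(x):
    # Hypothesis 2: the same line plus a periodic component.
    return line_model(x) + 2.0 * math.sin(math.pi * x)


train_line = mse(line_model, train_x, train_y)
train_periodic = mse(periodic_model, train_x, train_y)
test_line = mse(line_model, test_x, test_y)
test_periodic = mse(periodic_model, test_x, test_y)

print(f"training error: line={train_line:.4f}  periodic={train_periodic:.4f}")
print(f"held-out error: line={test_line:.4f}  periodic={test_periodic:.4f}")
```

The training errors agree to within floating-point rounding, so the training data alone cannot prefer one hypothesis over the other; the held-out error is what separates them. The held-out set is still finite, though, so the same argument applies to any structure it happens to miss.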

Neural networks are vastly more complex and opaque than linear curve-fitting, exacerbating the challenges associated with constraining the space of possible solutions. The parameters that define a neural network — including the number of layers and the learning rate — do bias the system towards producing certain types of output, but often precisely how those parameters are related to the structure of the data is unclear. Unfortunately, detailed inspection of the performance of the network is largely impossible, owing to the complexity of the system being trained; most modern deep learning systems have millions of free parameters and no longer are a few coefficients enough to understand how the system performs.
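For a rough sense of scale, the sketch below counts the free parameters in a small fully connected network and compares that with the two coefficients of a line. The layer sizes and the helper name `mlp_param_count` are assumptions I picked for illustration; they do not describe any particular published model.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    Each pair of adjacent layers contributes n_in * n_out weights
    plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


line_params = 2  # slope and intercept

# Hypothetical sizes: 784 inputs, hidden layers of 1024 units, 10 outputs.
small_mlp = mlp_param_count([784, 1024, 1024, 10])
deeper_mlp = mlp_param_count([784, 1024, 1024, 1024, 10])

print(line_params)  # 2
print(small_mlp)    # 1863690
print(deeper_mlp)   # 2913290
```

Adding a single hidden layer here adds over a million parameters, none of which has an interpretation as direct as a slope or an intercept.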

There are methods for neural network inspection, but they remain an area of active research.

Relatedly, prior knowledge takes a markedly different form for more complex types of data, like images. For the dataset above, it was pretty clear from quick inspection of the data that a linear model was probably the right representation for the learning problem. If instead your dataset is billions of stock photos and your objective is to identify which of these contain household pets, it is not even clear how one should go about turning intuition into mathematical operations. In fact, this is ultimately the draw of such systems — the utility of neural network models is their ability to learn otherwise impossible-to-write-down features in the data.

Broadly, there is some well-validated intuition about the differences between vastly different neural network structures. Convolutional Neural Networks (CNNs), for example, are often used for image data, since the primary mode of their operation processes local patches of texture rather than the entire image all at once; this dramatically reduces the number of free parameters, and greatly restricts what the system can learn in a way that has proven effective for many image-based tasks. Yet whether I use six layers in my neural network instead of five, for example, is often decided by how well the system performs on the data, rather than an understanding of how that number of layers reflects the structure of the data.
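The reduction in free parameters is easy to check with some arithmetic. The sketch below compares one fully connected layer with one convolutional layer, each producing 64 features from the same image. The image size, kernel size, and feature count are assumptions chosen for illustration, and `dense_params` and `conv2d_params` are hypothetical helpers rather than functions from any library.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases for a 2D convolutional layer.

    The same kernel is reused at every image location, so the count
    does not depend on the image height or width.
    """
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Assumed input: a 224 x 224 RGB image.
height, width, channels = 224, 224, 3
n_features = 64

# Dense layer: every unit sees every pixel of the flattened image.
dense = dense_params(height * width * channels, n_features)

# Convolutional layer: every filter sees only a 3 x 3 local patch.
conv = conv2d_params(3, 3, channels, n_features)

print(dense)  # 9633856
print(conv)   # 1792
```

The convolutional layer has several thousand times fewer parameters because it builds in the assumption that useful features are local and look the same wherever they appear in the image. That assumption is a prior about the data, expressed through architecture.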

I conclude with a word of caution: neural networks are far from an out-of-the-box solution for most learning problems, and the process required to reach near-state-of-the-art performance often comes as a surprise to non-experts. The design of machine learning systems remains an active area of research, and no one size fits all for neural networks. Many off-the-shelf neural networks are designed or pretrained with a number of unwritten biases or priors about the nature of the dataset. Whether or not a particular learning strategy will work on a given dataset is often a judgment call until validated with experiments, making iterating over the design an intensive process that often requires “expert” intuition or guidance.

As always, I welcome your thoughts in the comments below or on Hacker News.