Deep Learning

Published on 2015-04-18


It is generally hard to make predictions, especially when they are about the future. I'll make one, however: the current fad known as “Deep Learning” will create the biggest wave of over-hyped, irreproducible and hence ultimately wrong publications in decades.

First, what is “Deep Learning”? Deep learning is the name given to a combination of two things: neural networks and lots of data. “Hang on,” I hear you say, “neural networks? We had these like in the seventies and eighties!” and you would be correct. Neural networks were one of the first attempts to use structures that are similar to the brain's in machine learning. Essentially, a neural network consists of neurons. Each neuron gets signal inputs from other neurons and fires its own signal when a threshold is surpassed. Neurons are then layered so that neurons only get signals from neurons in the layer immediately above them and send signals only to neurons in the layer immediately below them. This is then known as a “feed-forward network”.

One nice thing about neural networks is that you can train them to recognise patterns in the input: you can gradually adjust the threshold of individual neurons so that certain output neurons only fire when a certain pattern is present at the input neurons. The other nice thing about neurons is that they're very simple so it's easy to create large neural networks that also work very fast.
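As a rough illustration of this kind of training, here is a minimal sketch of a single threshold neuron learning to fire only when both of its inputs are on. The update rule is the classic perceptron rule, chosen purely for illustration, and the names `fires`, `train` and `AND_PATTERN` are invented for this example.

```python
def fires(weights, threshold, inputs):
    """A neuron fires (returns 1) when its weighted input exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def train(samples, epochs=20, rate=1):
    """Perceptron rule (an illustrative choice): adjust after each wrong answer."""
    weights = [0, 0]
    threshold = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - fires(weights, threshold, inputs)
            # Strengthen or weaken only the connections that were active.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            # A neuron that fired too rarely gets a lower threshold, and vice versa.
            threshold -= rate * error
    return weights, threshold


# The pattern to recognise: fire only when both inputs are on.
AND_PATTERN = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, threshold = train(AND_PATTERN)
outputs = [fires(weights, threshold, inputs) for inputs, _ in AND_PATTERN]
print(outputs)  # → [0, 0, 0, 1]
```

A real network stacks many such neurons in layers and needs a more elaborate rule to train the hidden ones, but the principle of nudging thresholds and connection strengths until the right outputs fire is the same.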

For some reason, neural networks weren't used much after the eighties. I remember taking a course in machine learning as part of my diploma studies and being taught the basic algorithm for threshold adaptation, but it appeared to me to be a cute little trick, nothing more. One problem then seems to have been that it was simply too expensive to create networks large enough to do complex tasks and recognise complex patterns. But today, fast CPUs and large memories are cheap, neural networks are almost trivially parallelisable, and thus truly vast networks are now available to researchers. Fundamental improvements (convolutions and regularisation) make neural networks work even better.

The other thing that has happened in the meantime is the ubiquity of “data”. The chance to discover something interesting in a pile of data is very exciting to many people, myself included, so there is a temptation to use all kinds of methods to tease hidden patterns out of seemingly unstructured and random-looking data.

So now these two things are put in apposition by applying one to the other, and the combination has been named “Deep Learning”. The technique is being explored, applied, and refined; the field has its own conferences and journals; and thus the impact of Deep Learning grows.

And as far as the story goes, this is all good and proper. However, let us now add a few rather toxic ingredients to this mix:

Human nature being what it is, my contention is the following: that academics will jump on this new ship called Deep Learning because it is cool, because it will find out what's important for them, because they can use it without knowing how it works and also because they are not even expected to explain the results. Deep Learning thus gives them a magic box that they can let loose on a pile of data. If they can get a spectacular result, excellent. If not, well, there is this other pile of data that we haven't used yet. This magic box is, if I'm allowed to mix my metaphors, a gigantic carrot that's dangling in front of their noses. The temptation of being able to get potentially spectacular results without investing a lot of brain power will be too much for many academics.

Yet, as usual, the story is not so simple. For example, some problems are pathological. Here is Yann LeCun, one of the premier advocates for Deep Learning, on one such problem class (from an interview):

The limitations [of finding good parameter sets] do not concern just backprop, but all learning algorithms that use gradient-based optimization.

These methods only work to the extent that the landscape of the objective function is well behaved. You can construct pathological cases where the objective function is like a golf course: flat with a tiny hole somewhere[1]. Gradient-based methods won't work with that.

The trick is to stay away from those pathological cases. One trick is to make the network considerably larger than the minimum size required to solve the task. This creates lots and lots of equivalent local minima and makes them easy to find. The problem is that large networks may overfit, and we may have to regularize the hell out of them (e.g. using drop out).
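To make the golf-course picture concrete, here is a minimal sketch of plain gradient descent on two one-dimensional landscapes. Both functions, the hole's position and width, the step size and the starting point are assumptions made up for the example; nothing here comes from the interview.

```python
def bowl(x):
    """Well-behaved landscape: the slope always points towards the minimum at 3."""
    return (x - 3.0) ** 2


def golf_course(x):
    """Flat everywhere except a hole of width 0.02 centred on 3."""
    return -1.0 if abs(x - 3.0) < 0.01 else 0.0


def gradient_descent(f, x, rate=0.1, steps=200, h=1e-4):
    """Plain gradient descent using a central-difference estimate of the slope."""
    for _ in range(steps):
        slope = (f(x + h) - f(x - h)) / (2 * h)
        x -= rate * slope
    return x


x_bowl = gradient_descent(bowl, x=0.0)         # ends up very close to 3
x_golf = gradient_descent(golf_course, x=0.0)  # never leaves the starting point
print(x_bowl, x_golf)
```

On the bowl, every step gets some information about where the minimum lies. On the golf course the slope is zero wherever the search happens to stand, so the method has nothing to follow and stays put unless it starts almost inside the hole.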

Even without understanding all the technical terms, what LeCun is saying, essentially, is “don't apply Deep Learning (or, to be fair, any ‘gradient-based’ method) to problems that aren't of the right shape.” Well, quite. But then he says, “But here's one way to make your problem the right shape, even though you might have to fudge a little, or even a lot.” (By the way, “regularization” is a technique that seems to make many networks better behaved, although no one seems to fundamentally understand why. In this way, it's like pixie dust. Originally, I had also included PKI here, but while it's certainly true that many people believe that more PKI is better, it cannot be argued with a straight face that more PKI makes a system better. But I digress.)

Compare this with Forman S. Acton on model fitting (in Numerical Methods That (usually) Work):

[Researchers that try to fit a 300-parameter model by least squares] leap from the known to the unknown with a terrifying innocence and the perennial self-confidence that every parameter is totally justified. It does no good to point out that several parameters are nearly certain to be competing to “explain” the same variations in the data and hence the equation system will be nearly indeterminate. It does no good to point out that all large least-squares matrices are striving mightily to be proper subsets of the Hilbert matrix—which is virtually indeterminate and uninvertible—and so even if all 300 parameters were beautifully independent, the fitting equations would still be violently unstable. [...]

Most of this instability is unnecessary, for there is usually a reasonable procedure. Unfortunately, it is undramatic, laborious, and requires thought [...]. They should merely fit a five-parameter model, then a six-parameter one. If all goes well and there is a statistically valid reduction of the residual variability, then a somewhat more elaborate model may be tried. Somewhere along the line—and it will be much closer to 15 parameters than to 300—the significant improvement will cease and the fitting operation is over. [...] The computer center's director must prevent the looting of valuable computer time by these would-be fitters of many parameters.
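The procedure Acton describes can be sketched in a few lines. What follows is a toy version under loud assumptions: the data are synthetic (a quadratic plus a small alternating wiggle standing in for noise), the models are polynomials that start at one parameter rather than five, and the crude `min_gain` threshold is a stand-in for a proper test of whether the reduction in residual variability is statistically valid. All function names are invented for the example.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - known) / rows[r][r]
    return solution


def residual_after_fit(xs, ys, n_params):
    """Least-squares polynomial fit with n_params coefficients; returns the
    residual sum of squares. Uses the normal equations, which is acceptable
    only because the models here are tiny."""
    powers = range(n_params)
    normal = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    moments = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    coeffs = solve(normal, moments)
    fitted = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))


def smallest_adequate_model(xs, ys, max_params=8, min_gain=0.5):
    """Add one parameter at a time; stop once the residual no longer drops
    by at least the fraction min_gain (a crude, assumed stopping rule)."""
    n_params = 1
    rss = residual_after_fit(xs, ys, n_params)
    while n_params < max_params:
        next_rss = residual_after_fit(xs, ys, n_params + 1)
        if next_rss > (1.0 - min_gain) * rss:
            break
        n_params, rss = n_params + 1, next_rss
    return n_params


# Synthetic data: a quadratic plus a small alternating wiggle as fake noise.
xs = [-1.0 + i / 10 for i in range(21)]
ys = [1 + 2 * x - 3 * x * x + 0.05 * (-1) ** i for i, x in enumerate(xs)]

chosen = smallest_adequate_model(xs, ys)
print(chosen)  # → 3, i.e. the quadratic; a fourth parameter buys nothing
```

The search could have gone on to eight parameters, but it stops as soon as the extra parameter stops paying for itself, which is the undramatic and laborious part of the advice.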

To me, it's almost as if Acton had channeled the future. To paraphrase Acton again, in the hands of a LeCun, Deep Learning might work wonders; for ordinary mortals, the results will be a mixed bag. (To be fair again, LeCun advocates increasing the size of the network only while the accuracy increases, but how many researchers will listen to that? Or even know that this is the proper way? Or take into account a large network's tendency to overfit? Or be able to balance, on the one hand, the need for a large network to overcome the geometry of their problem space and, on the other, the need not to make the network too large, lest it overfit? And so on.)

I can see two bad developments coming from all this, one unpleasant but tolerable, the other quite alarming.

First, lots of scarce research money will be poured down the drain in pursuit of early, exciting, but ultimately unrepeatable and hence wrong results. That, to me, is unpleasant and annoying, since I naturally tend to think that this research money would be better spent on my own pet ideas. Of course, the pursuit of knowledge for its own sake is part of what makes science such a great field to work in, and wrong results are always more numerous than correct ones. Equally of course, “exciting” is not the same as “important” and much money will be spent on a mere fad. Conferences and journals will be filled with papers that will gather digital dust on the digital shelves as no one reads them. Money will not come my way but instead will be heaped on the undeserving. But all of that is fine, or at least tolerable.

The alarming development is when policy-makers or others with money get wind of this magic box and want to use it for purposes for which it was not designed. For example, one thing that Deep Learning can apparently do is time series forecasting. This offers a sciency-sounding crystal ball to all those “technical” stock analysts, and I'm sure they will eagerly embrace it (if they haven't done so already). Will the current champions of Deep Learning have the honesty to point out that without domain knowledge, any forecast by any technique is just so much nonsense? Or will they, under pressure from their universities' administrations, yield to the temptation of all those research dollars? Or will they perhaps find that, once on the back of the tiger, it's difficult to get off?

2015-04-19. Edited to add: On re-reading this, I seem to be coming down hard on the technique of Deep Learning. This is not my intention. Deep Learning, while having a snappy name that instils suspicion in suspicious bastards like myself, is a perfectly good machine learning technique, and, from what little I know, seems to be a genuine improvement over what I learned at University. Rather, it is the promise of near-effortless results bundled with pressures on academics to acquire funds and to publish that is so toxic.

2015-04-19. Edited: Fixed a few typos and reformulated the footnote.

2015-04-25. Edited: Added irreproducibility to first paragraph.


[1] If a golf course isn't one of the most well-behaved geographies on Earth, I don't know what is.