A Slight Case Of Overfitting

Dominic Fox
5 min read · Oct 11, 2017

Here’s a handwavey description of what a certain class of machine-learning algorithms do: they squish a sensation space down into a representation space. By “sensation space” I mean the total set of possible inputs to the algorithm, which is typically imposingly large; and by “representation space” I mean a set of internal representations which the algorithm uses to decide on its final output, which is typically much smaller; hence the “squishing down”. The “learning” part of the algorithm’s task is to find a squishing-down that yields a representation space in which only the features of the sensation space which make a difference to the outcome are represented. A bad squishing-down randomly forgets, confuses and jumbles up its inputs into a representation that has no bearing on the desired outcome. A good squishing-down warps, declutters and aggregates its inputs to form an opinionated digest, in which characteristics of the input that “matter” for its purposes are represented prominently, while characteristics irrelevant to the desired outcome are smoothed out or entirely eliminated.
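To make the squishing-down a little more concrete, here is a minimal sketch, assuming scikit-learn and its bundled handwritten-digits dataset (my choices for illustration, not anything this essay depends on). PCA stands in for the squishing-down: it collapses a 64-dimensional sensation space into a 2-dimensional representation space.

```python
# Minimal sketch of "squishing down" a sensation space into a representation
# space, using PCA as a stand-in. Assumes scikit-learn is installed; the digits
# dataset is just a convenient example, not anything specific to this essay.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # sensation space: 1797 images, 64 pixels each
pca = PCA(n_components=2)             # representation space: 2 dimensions
Z = pca.fit_transform(X)

print(X.shape, "->", Z.shape)         # (1797, 64) -> (1797, 2)
```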

A trained neural net is a “smart” algorithm manufactured by a “dumb” one. The “dumb” algorithm takes a large set of labelled training data, which pairs every input with a desired outcome, and uses this to try to develop a squishing-down that reproduces the outcomes recorded in the training data. The “smart” algorithm uses the resulting squishing-down to convert inputs it hasn’t seen before to outcomes that hopefully reflect the intention captured by the training data’s labelling. Its “smarts” are all in the particular squishing-down that it has been trained to perform. For example, suppose the training data is a large collection of images of individual handwritten digits, each labelled with the digit that the image represents. During the learning process, the neural net converges on a squishing-down that correlates these inputs with their labelled outcomes as reliably as possible. Subsequently, we can feed it new images of handwritten digits, and it will use the same squishing-down to take a decent guess at which digits those images represent.
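For what it's worth, here is roughly what that looks like in code: a hedged sketch, again assuming scikit-learn and its digits dataset rather than anything specified above. The "dumb" algorithm is the call to fit; the "smart" artefact it manufactures is the trained classifier, which we then ask to guess at digits it hasn't seen.

```python
# The "dumb" algorithm (fit) manufactures the "smart" one (the trained net).
# Sketch only: scikit-learn's small MLP and digits dataset are stand-ins.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)             # learn a squishing-down from labelled examples

print("accuracy on unseen digits:", net.score(X_test, y_test))
```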

One of the big surprises of machine learning in recent years has been the discovery that neural nets trained in this way have enough degrees of freedom in selecting a squishing-down of sensation space into representation space that they can often converge on a “good” squishing-down given a large enough training data set. In other words, neural nets offer a space of possible squishings-down that is both large enough to contain useful squishings-down, and navigable enough that a training algorithm can find them. In fact, different ways of setting up a neural net offer different varieties of possible squishings-down, such that we are dealing with a space (neural net configurations) of spaces (possible “trained” states of a neural net so configured) of transformations between (sensation and representation) spaces. The search for useful configurations has a decidedly empirical and experimental flavour. We don’t know what else is out there.
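A crude way to picture the "space of spaces": try a few configurations, train each one, and see which space of trained states turned out to contain a useful squishing-down. Another sketch under the same assumptions (scikit-learn, digits dataset), not a serious architecture search.

```python
# Each hidden_layer_sizes choice is one configuration: a different space of
# possible trained states. Training searches within one such space; this loop
# searches, very naively, across them. Sketch only, same assumptions as above.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for hidden in [(8,), (32,), (64, 32)]:
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    net.fit(X_train, y_train)
    print(hidden, "->", round(net.score(X_val, y_val), 3))
```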

A useful simplification of the world — a heuristic — is a handy gadget, and human beings are avid collectors of such gadgets. When we find a way of interpreting and predicting the behaviour of something in our environment which offers a lot of predictive power in return for modest cognitive investment — a simple rule-of-thumb that’s right about 80% of the time, say — we tend to get excited about it. This is why Daily Mail headlines of the “X causes cancer” or “Y prevents dementia” variety are so enticing: cancer and dementia are terrifying and unpredictable, and a rule-of-thumb that boils down to “avoid X” or “do more of Y” gives us a feeling of control and safety in the face of what frightens us. The reader who views these headlines with scepticism knows that genuinely effective simplifications of complex phenomena are rare and hard to find, and that the headline is most probably crying up a weak correlation which depends on many other factors. But it’s for just this reason that we cherish effective simplifications when we do find them, and our interest is always piqued when the headline is something like “turns out you can get an 80%-right prediction of whether someone will self-identify as gay or not by analysing the structure of their facial features” (even if the balance of probability is that it’s bullshit).

The machine-learning techniques developed over the past few years enable us to crank out gizmos of this kind at a probably unprecedented rate, and this raises all kinds of ethical alarms about the uses of heuristics: squishing down a complex reality into a simpler one so that you can anticipate and control it sounds awesome if it’s something like being able to predict the weather, and sinister if it’s something like classifying one’s fellow human beings as potential criminals and terrorists. Team awesome regards team sinister as scaremongering ignoramuses; team sinister regards team awesome as irresponsible snake-oil pedlars (being committed to the slightly contradictory view that the gizmo on offer is both ineffective and dangerous). My view is that we need to be both less alarmed and less excited, get used to seeing “AI” gizmos for the cognitive playthings they usually are, and remember that heuristics are a fundamental survival tool that we’ve been living with for a very long time. Part of the task of being human is evaluating the heuristics you use to find your way around against a wider range of considerations than those which the heuristic itself crystallises: learning their failure modes, refining or enlarging them where possible, and discarding them when necessary. Easy bewitchment by heuristics is a mark of immaturity.
