Contextual stories about Learners
Naive Bayes & Viagra
Naive Bayes is a famous learner that performs well with little data. We apply it all the time. Christian and Griffiths state in their book, Algorithms To Live By, that 'our days are full of small data'. Imagine, for example, that you're standing at a bus stop in a foreign city. The other person who is standing there has been waiting for 7 minutes. What do you do? Do you decide to wait? And if so, for how long? When will you initiate other options? Another example. Imagine a friend asking advice about a relationship. He's been together with his new partner for a month. Should he invite the partner to join him at a family wedding?
Having pre-existing beliefs is crucial for Naive Bayes to work. The basic idea is that you calculate the probability of an outcome by combining prior knowledge with the evidence of a specific situation.
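As a minimal sketch of that idea, the snippet below applies Bayes' rule to the bus stop situation. All the numbers are invented for illustration, and the function name `posterior` is hypothetical: we assume a prior belief of 90% that the bus line is running, a 30% chance of seeing someone wait 7 minutes when it is running, and a 100% chance of seeing that when it is not.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    # P(E) sums over both cases: the hypothesis is true, or it is false.
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence


# Hypothetical numbers, chosen only to illustrate the update.
belief = posterior(prior=0.9, likelihood_if_true=0.3, likelihood_if_false=1.0)
print(f"{belief:.2f}")  # about 0.73: still likely running, but less certain than before
```

The prior does most of the work here: with little data, the conclusion depends heavily on what you believed before you arrived at the bus stop.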
The theorem was formulated during the 1740s by Thomas Bayes, a reverend and amateur mathematician. He dedicated his life to solving the question of how to win the lottery. But Bayes' rule was only made famous and known as it is today by the mathematician Pierre-Simon Laplace in France a bit later in the same century. For a long time after Laplace's death, the theory sank into oblivion until it was dug up again during the Second World War in an effort to break the Enigma code.
Most people today have come in contact with Naive Bayes through their email spam folders. Naive Bayes is a widely used algorithm for spam detection. It is a coincidence that Viagra, the erectile dysfunction drug, was approved by the US Food & Drug Administration in 1998, around the same time as about 10 million users worldwide had made free webmail accounts. The selling companies were among the first to make use of email as a medium for advertising: it was an intimate space, at the time reserved for private communication, for an intimate product. In 2001, the first SpamAssassin programme relying on Naive Bayes was uploaded to SourceForge, cutting down on guerrilla email marketing.
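To give a feeling for how such a spam filter can work, here is a small sketch of a Naive Bayes text classifier. It is not the SpamAssassin implementation. The four training messages, the labels 'spam' and 'ham', and the function names `train` and `classify` are all made up for this example. The 'naive' part is the assumption that each word contributes to the score independently of the others.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for illustration only.
TRAINING = [
    ("buy cheap viagra now", "spam"),
    ("cheap pills buy now", "spam"),
    ("lunch at noon tomorrow", "ham"),
    ("see you at the wedding", "ham"),
]


def train(examples):
    """Count the words seen per label and the number of messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, word_counts, label_counts, vocabulary):
    """Return the label with the highest log-probability score."""
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Prior: how common is this label in the training data?
        score = math.log(label_counts[label] / total_messages)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids a zero probability for a word
            # that was seen under one label but not under the other.
            count = word_counts[label][word] + 1
            score += math.log(count / (words_in_label + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("cheap viagra", *model))
print(classify("wedding lunch tomorrow", *model))
```

Working with logarithms turns the product of many small probabilities into a sum, which keeps the numbers manageable for longer messages.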
Machine Learners, by Adrian MacKenzie, MIT Press, Cambridge, US, November 2017.
Naive Bayes & Enigma
This story about Naive Bayes is taken from the book 'The Theory That Would Not Die', written by Sharon Bertsch McGrayne. Among other things, she describes how Naive Bayes was soon forgotten after the death of Pierre-Simon Laplace, its inventor. The mathematician was said to have failed to credit the works of others. As a result, he suffered widely circulated charges against his reputation. Only after 150 years was the accusation refuted.
Fast forward to 1939, when Bayes' rule was still virtually taboo, dead and buried in the field of statistics. When France was occupied in 1940 by Germany, which controlled Europe's factories and farms, Winston Churchill's biggest worry was the U-boat peril. U-boat operations were tightly controlled by German headquarters in France. Each submarine received orders as coded radio messages long after it was out in the Atlantic. The messages were encrypted by word-scrambling machines, called Enigma machines. Enigma looked like a complicated typewriter. It was invented by the German firm Scherbius & Ritter after the First World War, when the need for message-encoding machines had become painfully obvious.
Interestingly, and luckily for Naive Bayes and the world, at that time, the British government and educational systems saw applied mathematics and statistics as largely irrelevant to practical problem-solving. So the British agency charged with cracking German military codes mainly hired men with linguistic skills. Statistical data was seen as bothersome because of its detail-oriented nature. So wartime data was often analysed not by statisticians, but by biologists, physicists, and theoretical mathematicians. None of them knew that Bayes' rule was considered to be unscientific in the field of statistics. Their ignorance proved fortunate.
It was the now famous Alan Turing – a mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist – who used the system of probabilities of Bayes' rule to design the 'bombe'. This was a high-speed electromechanical machine for testing every possible arrangement that an Enigma machine would produce. In order to crack the naval codes of the U-boats, Turing simplified the 'bombe' system using Bayesian methods. It turned the UK headquarters into a code-breaking factory. The story is well illustrated in The Imitation Game, a film by Morten Tyldum dating from 2014.
A story about sweet peas
Throughout history, some models have been invented by people with ideologies that are not to our liking. The idea of regression stems from Sir Francis Galton, an influential nineteenth-century scientist. He spent his life studying the problem of heredity – understanding how strongly the characteristics of one generation of living beings manifested themselves in the following generation. He established the field of eugenics, defining it as 'the study of agencies under social control that may improve or impair the racial qualities of future generations, either physically or mentally'. On Wikipedia, Galton is a prime example of scientific racism.
Galton initially approached the problem of heredity by examining characteristics of the sweet pea plant. He chose this plant because the species can self-fertilize. Daughter plants inherit genetic variations from mother plants without a contribution from a second parent. This characteristic eliminates having to deal with multiple sources.
Galton's research was appreciated by many intellectuals of his time. In 1869, in Hereditary Genius, Galton claimed that genius is mainly a matter of ancestry and he believed that there was a biological explanation for social inequality across races. Galton even influenced his half-cousin Charles Darwin with his ideas. After reading Galton's paper, Darwin stated, 'You have made a convert of an opponent in one sense for I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work'. Luckily, the modern study of heredity managed to eliminate the myth of race-based genetic difference, something Galton tried so hard to maintain.
Galton's major contribution to the field was linear regression analysis, laying the groundwork for much of modern statistics. While we engage with the field of machine learning, Algolit tries not to forget that ordering systems hold power, and that this power has not always been used to the benefit of everyone. Machine learning has inherited many aspects of statistical research, some less agreeable than others. We need to be attentive, because these world views do seep into the algorithmic models that create new orders.
We find ourselves in a moment in time in which neural networks are sparking a lot of attention. But they have been in the spotlight before. The study of neural networks goes back to the 1940s, when the first neuron metaphor emerged. The neuron is not the only biological reference in the field of machine learning: think of the words 'corpus' or 'training'. The artificial neuron was constructed in close connection to its biological counterpart.
Psychologist Frank Rosenblatt was inspired by fellow psychologist Donald Hebb's work on the role of neurons in human learning. Hebb stated that 'cells that fire together wire together'. His theory now lies at the basis of associative human learning, but also unsupervised neural network learning. It moved Rosenblatt to expand on the idea of the artificial neuron.
In the late 1950s, he created the Perceptron, a model that learns through the weighting of inputs. It was set aside by the next generation of researchers, because a single Perceptron can only handle binary classification of data that is linearly separable. This means that the two classes have to be clearly separable, as for example, men and women, black and white. It is clear that this type of data is very rare in the real world. When the so-called first AI winter arrived in the 1970s and the funding decreased, the Perceptron was also neglected. For ten years it stayed dormant. When spring settled at the end of the 1980s, a new generation of researchers picked it up again and used it to construct neural networks. These contain multiple layers of Perceptrons. That is how neural networks saw the light. One could say that the current machine learning season is particularly warm, but it takes another winter to know a summer.
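What 'learning through the weighting of inputs' can look like is sketched below. This is a simplified version of the classic Perceptron learning rule, not Rosenblatt's original machine. The function names, the learning rate and the number of epochs are assumptions made for this example, and the training data is the logical AND, chosen because its two classes are cleanly separable.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else stay silent (0)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Nudge the weights after every wrong prediction (hypothetical defaults)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or 1
            # A correct prediction gives error 0 and leaves the weights alone.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Logical AND: the output is 1 only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, predict(weights, bias, inputs), "expected", target)
```

Replace the targets with those of the exclusive OR (0, 1, 1, 0) and no setting of the weights will get all four cases right: a single straight line cannot split those classes. That limitation is the one that made researchers set the Perceptron aside.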
Some online articles say that the year 2018 marked a turning point for the field of Natural Language Processing (NLP). A series of deep-learning models achieved state-of-the-art results on tasks like question-answering or sentiment-classification. Google’s BERT algorithm entered the machine learning competitions of last year as a sort of 'one model to rule them all'. It showed a superior performance over a wide variety of tasks.
BERT is pre-trained; its weights are learned in advance through two unsupervised tasks (masked language modelling and next-sentence prediction). This means BERT doesn't need to be trained from scratch for each new task. You only have to fine-tune its weights. This also means that a programmer wanting to use BERT no longer knows what parameters BERT is tuned to, nor what data it has seen to learn its performances.
BERT stands for Bidirectional Encoder Representations from Transformers. This means that BERT allows for bidirectional training. The model learns the context of a word based on all of its surroundings, left and right of a word. As such, it can differentiate between 'I accessed the bank account' and 'I accessed the bank of the river'.
- BERT_large, with 340 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with 'only' 110 million parameters.
- To run BERT you need to use TPUs. These are Google's Tensor Processing Units, processors especially engineered for TensorFlow, the deep-learning platform. TPU rental rates range from $8/hr to $394/hr. Algolit doesn't want to work with off-the-shelf packages; we are interested in opening up the black box. In that case, BERT asks for quite some savings in order to be used.