Contextual stories about Readers
Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning with small datasets. But they often require complex Readers. The task the Readers do, is also called feature-engineering. This means that a human needs to spend time on a deep exploratory data analysis of the dataset.
Features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task to be solved, must be carefully selected and passed over to the classical machine learning algorithm. This process marks the difference with Neural Networks. When using a neural network, there is no need for feature-engineering. Humans can pass the data directly to the network and achieve fairly good performances straightaway. This saves a lot of time, energy and money.
The downside of collaborating with Neural Networks is that you need a lot more data to train your prediction model. Think of 1GB or more of plain text files. To give you a reference, 1 A4, a text file of 5000 characters only weighs 5 KB. You would need 8,589,934 pages. More data also requires more access to useful datasets and more, much more processing power.
Imagine … You've been working for a company for more than ten years. You have been writing tons of emails, papers, internal notes and reports on very different topics and in very different genres. All your writings, as well as those of your colleagues, are safely backed-up on the servers of the company.
One day, you fall in love with a colleague. After some time you realize this human is rather mad and hysterical and also very dependent on you. The day you decide to break up, your (now) ex elaborates a plan to kill you. They succeed. This is unfortunate. A suicide letter in your name is left next to your corpse. Because of emotional problems, it says, you decided to end your life. Your best friends don't believe it. They decide to take the case to court. And there, based on the texts you and others produced over ten years, a machine learning model reveals that the suicide letter was written by someone else.
How does a machine analyse texts in order to identify you? The most robust feature for authorship recognition is delivered by the character n-gram technique. It is used in cases with a variety of thematics and genres of the writing. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide', would be, ‘Sui’, ‘uic’, ‘ici’, ‘cid’, etc. Character n-gram features are very simple, they're language-independent and they're tolerant to noise. Furthermore, spelling mistakes do not jeopardize the technique.
Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text, which is important for authorship recognition. Other types of experiments could include measuring the length of words or sentences, the vocabulary richness, the frequencies of function words; even syntax or semantics-related measurements.
This means that not only your physical fingerprint is unique, but also the way you compose your thoughts!
The same n-gram technique discovered that The Cuckoo’s Calling, a novel by Robert Galbraith, was actually written by … J. K. Rowling!
- Paper: On the Robustness of Authorship Attribution Based on Character N-gram Features, Efstathios Stamatatos, in Journal of Law & Policy, Volume 21, Issue 2, 2013.
- News article: https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/
A history of n-grams
The n-gram algorithm can be traced back to the work of Claude Shannon in information theory. In the paper, 'A Mathematical Theory of Communication', published in 1948, Shannon performed the first instance of an n-gram-based model for natural language. He posed the question: given a sequence of letters, what is the likelihood of the next letter?
If you read the following excerpt, can you tell who it was written by? Shakespeare or an n-gram piece of code?
SEBASTIAN: Do I stand till the break off.
BIRON: Hide thy head.
VENTIDIUS: He purposeth to Athens: whither, with the vow I made to handle you.
FALSTAFF: My good knave.
You may have guessed, considering the topic of this story, that an n-gram algorithm generated this text. The model is trained on the compiled works of Shakespeare. While more recent algorithms, such as the recursive neural networks of the CharNN, are becoming famous for their performance, n-grams still execute a lot of NLP tasks. They are used in statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, ...
God in Google Books
This allowed for many socio-linguistic investigations. For example, in October 2018, the New York Times Magazine published an opinion article titled 'It’s Getting Harder to Talk About God'. The author, Jonathan Merritt, had analysed the mention of the word 'God' in Google's dataset using the n-gram viewer. He concluded that there had been a decline in the word's usage since the twentieth century. Google's corpus contains texts from the sixteenth century leading up to the twenty-first. However, what the author missed out on was the growing popularity of scientific journals around the beginning of the twentieth century. This new genre that was not mentioning the word God shifted the dataset. If the scientific literature was taken out of the corpus, the frequency of the word 'God' would again flow like a gentle ripple from a distant wave.
Grammatical features taken from Twitter influence the stock market
The boundaries between academic disciplines are becoming blurred. Economics research mixed with psychology, social science, cognitive and emotional concepts have given rise to a new economics subfield, called 'behavioral economics'. This means that researchers can start to explain stock market mouvement based on factors other than economic factors only. Both the economy and 'public opinion' can influence or be influenced by each other. A lot of research is being done on how to use 'public opinion' to predict tendencies in stock-price changes.
'Public opinion' is estimated from sources of large amounts of public data, like tweets, blogs or online news. Research using machinic data analysis shows that the changes in stock prices can be predicted by looking at 'public opinion', to some degree. There are many scientific articles online, which analyse the press on the 'sentiment' expressed in them. An article can be marked as more or less positive or negative. The annotated press articles are then used to train a machine learning model, which predicts stock market trends, marking them as 'down' or 'up'. When a company gets bad press, traders sell. On the contrary, if the news is good, they buy.
A paper by Haikuan Liu of the Australian National University states that the tense of verbs used in tweets can be an indicator of the frequency of financial transactions. His idea is based on the fact that verb conjugation is used in psychology to detect the early stages of human depression.
Paper: 'Grammatical Feature Extraction and Analysis of Tweet Text: An Application towards Predicting Stock Trends', Haikuan Liu, Research School of Computer Science (RSCS), College of Engineering and Computer Science (CECS), The Australian National University (ANU)
Bag of words
In Natural Language Processing (NLP), 'bag of words' is considered to be an unsophisticated model. It strips text of its context and dismantles it into a collection of unique words. These words are then counted. In the previous sentences, for example, 'words' is mentioned three times, but this is not necessarily an indicator of the text's focus.
The first appearance of the expression 'bag of words' seems to go back to 1954. Zellig Harris, an influential linguist, published a paper called 'Distributional Structure'. In the section called 'Meaning as a function of distribution', he says 'for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use. The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic systems.'