Contextual stories about Oracles
Oracles are prediction or profiling machines. They are widely used in smartphones, computers and tablets.
Oracles can be created using different techniques. One way is to manually define rules for them. As prediction models they are then called rule-based models. Rule-based models are handy for specific tasks, like detecting whether a scientific paper concerns a certain molecule. They can perform well with very little sample data.
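A rule-based Oracle can be sketched in a few lines. The molecule name and the synonym list below are invented for illustration; a real system would maintain much longer, curated lists:

```python
# A minimal rule-based Oracle: hand-written rules instead of learned ones.
# The molecule name and synonym list are invented for illustration.

def mentions_molecule(text, molecule="caffeine"):
    """Return True if a paper abstract seems to concern the molecule."""
    text = text.lower()
    # Rule 1: the molecule is named directly.
    if molecule in text:
        return True
    # Rule 2: a known synonym appears (a manually maintained list).
    synonyms = ["1,3,7-trimethylxanthine"]
    return any(s in text for s in synonyms)

print(mentions_molecule("Effects of caffeine on sleep onset."))    # True
print(mentions_molecule("A survey of transformer architectures.")) # False
```

Because every rule is written by hand, such a model needs no training data at all, only expert knowledge of the domain.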
But there are also machine learning or statistical models, which can be divided into two kinds of Oracle: 'supervised' and 'unsupervised'. To create a supervised machine learning model, humans annotate sample text with labels before feeding it to a machine to learn from. Each sentence, paragraph or text is judged by at least three annotators: is it spam or not spam, positive or negative, etc.? Unsupervised machine learning models don't need this step, but they do need large amounts of data, and it is up to the machine to trace its own patterns or 'grammatical rules'. Finally, experts also distinguish between classical machine learning and neural networks. You'll find out more about this in the Readers zone.
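The supervised approach can be sketched with a toy spam detector. The four labelled samples and the word-counting scheme below are a deliberately naive illustration, not a real statistical model:

```python
# A minimal sketch of supervised learning: humans first label the samples,
# then the machine derives which words signal which label.
from collections import Counter

labelled = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# 'Training': count how often each word occurs under each label.
counts = {"spam": Counter(), "ham": Counter()}
for text, label in labelled:
    counts[label].update(text.split())

def classify(text):
    """Label new text by which class its words appeared in more often."""
    spam_score = sum(counts["spam"][w] for w in text.split())
    ham_score = sum(counts["ham"][w] for w in text.split())
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money"))    # spam
print(classify("noon meeting"))  # ham
```

An unsupervised model would receive the same texts without the 'spam'/'ham' labels and would have to group them by shared vocabulary on its own.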
Humans tend to wrap Oracles in visions of grandeur. Sometimes these Oracles come to the surface when things break down. In press releases, these sometimes dramatic situations are called 'lessons'. However promising their performance seems to be, a lot of issues remain to be solved. How do we make sure that Oracles are fair, that every human can consult them, and that they are understandable to a large public? Even then, existential questions remain. Do we need all types of artificial intelligence (AI) systems? And who defines what is fair or unfair?
A classic 'lesson' in developing Oracles was documented by Latanya Sweeney, a professor of Government and Technology at Harvard University. In 2013, Sweeney, of African American descent, googled her name. She immediately received an advertisement for a service that offered her ‘to see the criminal record of Latanya Sweeney’.
Sweeney, who doesn't have a criminal record, began a study. She started to compare the advertising that Google AdSense serves against different racially identifiable names. She discovered that she received more of these ads when searching for non-white ethnic names than when searching for traditionally perceived white names. You can imagine how damaging it can be when potential employers do a simple name search and receive ads suggesting the existence of a criminal record.
Sweeney based her research on queries of 2,184 racially associated personal names across two websites. 88 per cent of the first names identified as being given to more black babies were found to be predictive of race, against 96 per cent of the white-identifying names. First names that are mainly given to black babies, such as DeShawn, Darnell and Jermaine, generated ads mentioning an arrest in 81 to 86 per cent of name searches on one website and in 92 to 95 per cent on the other. Names that are mainly given to white babies, such as Geoffrey, Jill and Emma, did not generate the same results: the word 'arrest' appeared in only 23 to 29 per cent of white name searches on one site and 0 to 60 per cent on the other.
On the website with the most advertising, a black-identifying name was 25 per cent more likely to get an ad suggestive of an arrest record. A few names did not follow these patterns: Dustin, a name mainly given to white babies, generated an ad suggestive of arrest in 81 and 100 per cent of searches. It is important to keep in mind that the appearance of the ad is linked to the name itself, independently of whether that name has an arrest record in the company's database.
What is a good employee?
Since 2015, Amazon has employed around 575,000 workers, and they need more. So they set up a team of twelve and asked it to create a model that would find the right candidates by crawling job application websites. The tool would give job candidates scores ranging from one to five stars. The potential fed the myth: the team wanted software that would spit out the top five candidates from a list of 100, and those candidates would be hired.
The group created 500 computer models, focused on specific job functions and locations. They taught each model to recognize some 50,000 terms that showed up in past candidates' letters. The algorithms learned to give little importance to skills common across IT applicants, like the ability to write code in various programming languages. But they also learned some serious errors. Before release, the company realized that the models had taught themselves that male candidates were preferable. They penalized applications that included the word 'women's', as in 'women's chess club captain', and they downgraded graduates of two all-women's colleges.
This is because they were trained using the job applications that Amazon had received over a ten-year period. During that time, the company had mostly hired men. Instead of providing the 'fair' decision-making that the Amazon team had promised, the models reflected a biased tendency in the tech industry, amplified it and made it invisible. Activists and critics state that it could be exceedingly difficult to sue an employer over automated hiring: job candidates might never know that intelligent software was used in the process.
Quantifying 100 Years of Gender and Ethnic Stereotypes
Dan Jurafsky is the co-author of 'Speech and Language Processing', one of the most influential books for studying Natural Language Processing (NLP). Together with a few colleagues at Stanford University, he showed in 2017 that word embeddings can be a powerful tool for systematically quantifying common stereotypes and other historical trends.
Word embeddings are a technique that maps words to numerical vectors in a multi-dimensional space. Vectors that appear close to each other indicate similar meaning. Number words are grouped together, as are prepositions, personal names and professions. This allows for calculations with words: you could subtract London from England and your result would be the same as subtracting Paris from France.
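The London/England arithmetic can be shown with toy vectors. The three-dimensional vectors below are hand-made for illustration; real embeddings have hundreds of dimensions and are learned from large text corpora:

```python
# Toy word embeddings, hand-made for illustration; real embeddings are
# learned from text and have hundreds of dimensions.
import math

vec = {
    "england": [1.0, 0.0, 0.2],
    "london":  [1.0, 1.0, 0.2],
    "france":  [0.0, 0.0, 0.9],
    "paris":   [0.0, 1.0, 0.9],
}

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# 'london' minus 'england' and 'paris' minus 'france' give the same
# direction: something like 'capital city of'.
d1 = subtract(vec["london"], vec["england"])
d2 = subtract(vec["paris"], vec["france"])
print(d1, d2)          # [0.0, 1.0, 0.0] [0.0, 1.0, 0.0]
print(cosine(d1, d2))  # 1.0
```

In a real embedding the two difference vectors are not identical, only very close, which is why cosine similarity rather than equality is used to compare them.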
An example in their research shows that the vector for the adjective 'honorable' is closer to the vector for 'man', whereas the vector for 'submissive' is closer to 'woman'. These stereotypes are automatically learned by the algorithm. This becomes problematic when the pre-trained embeddings are used for sensitive applications such as search rankings, product recommendations or translations. The risk is real, because many pre-trained embeddings can be downloaded as off-the-shelf packages.
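The same cosine-similarity measure can act as a bias probe: compare how close an adjective sits to 'man' versus 'woman'. The two-dimensional vectors below are invented to make the effect visible, not taken from any real embedding:

```python
# Probing a (toy) embedding for gender bias. The vectors are invented
# for illustration; a real probe would load pre-trained embeddings.
import math

vec = {
    "man":        [1.0, 0.1],
    "woman":      [0.1, 1.0],
    "honorable":  [0.9, 0.2],
    "submissive": [0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

for adjective in ("honorable", "submissive"):
    to_man = cosine(vec[adjective], vec["man"])
    to_woman = cosine(vec[adjective], vec["woman"])
    closer = "man" if to_man > to_woman else "woman"
    print(adjective, "is closer to", closer)
```

Run against real pre-trained vectors instead of these toys, exactly this kind of comparison is what surfaces the learned stereotypes the researchers describe.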
It is known that language reflects and keeps cultural stereotypes alive. Using word embeddings to spot these stereotypes is less time-consuming and less expensive than manual methods. But the use of these embeddings in concrete prediction models has caused a lot of discussion within the machine learning community: biased models automate discrimination. Is it actually possible to de-bias these models completely? Some say yes, while others disagree: instead of re-engineering the model, we should ask whether we need it in the first place. These researchers followed a third path: by acknowledging the bias that originates in language, embeddings become tools of awareness.
The team developed a model to analyse word embeddings trained over 100 years of texts. For contemporary analysis, they used the standard Google News word2vec vectors, an off-the-shelf downloadable package trained on the Google News dataset. For historical analysis, they used embeddings trained on Google Books and the Corpus of Historical American English (COHA, https://corpus.byu.edu/coha/), with more than 400 million words of text from the 1810s to the 2000s. As a validation set to test the model, they trained embeddings from the New York Times Annotated Corpus for every year between 1988 and 2005.
The research shows that word embeddings capture changes in gender and ethnic stereotypes over time. They quantify how specific biases decrease over time while other stereotypes increase. The major transitions reveal changes in the descriptions of gender and ethnic groups during the women's movement of the 1960s and 1970s and the Asian-American population growth of the 1960s and 1980s.
A few examples:
The top ten occupations most closely associated with each ethnic group in the contemporary Google News dataset:
- Hispanic: housekeeper, mason, artist, janitor, dancer, mechanic, photographer, baker, cashier, driver
- Asian: professor, official, secretary, conductor, physicist, scientist, chemist, tailor, accountant, engineer
- White: smith, blacksmith, surveyor, sheriff, weaver, administrator, mason, statistician, clergy, photographer
The three occupations most closely associated with men in the 1930s: engineer, lawyer, architect. The three most closely associated with women in the 1930s: nurse, housekeeper, attendant.
Not much had changed by the 1990s.
The major male occupations: architect, mathematician and surveyor. The female occupations: nurse, housekeeper and midwife.
Wikimedia's ORES service
Software engineer Amir Sarabadani presented the ORES project in Brussels in November 2017 during the Algoliterary Encounter.
This 'Objective Revision Evaluation Service' uses machine learning to help automate critical work on Wikimedia, like vandalism detection and the removal of articles. Cristina Cochior and Femke Snelting interviewed him.
Femke: To go back to your work. In these days you tried to understand what it means to find bias in machine learning and the proposal of Nicolas Maleve, who gave the workshop yesterday, was neither to try to fix it, nor to refuse to deal with systems that produce bias, but to work with them. He says that bias is inherent to human knowledge, so we need to find ways to somehow work with it. We're just struggling a bit with what would that mean, how would that work... So I was wondering whether you had any thoughts on the question of bias.
Amir: Bias inside Wikipedia is a tricky question because it happens on several levels. One level that has been discussed a lot is the bias in references. Not all references are accessible. So one thing that the Wikimedia Foundation has been trying to do, is to give free access to libraries that are behind a pay wall. They reduce the bias by only using open-access references. Another type of bias is the Internet connection, access to the Internet. There are lots of people who don't have it. One thing about China is that the Internet there is blocked. The content against the government of China inside Chinese Wikipedia is higher because the editors [who can access the website] are not people who are pro government, and try to make it more neutral. So, this happens in lots of places. But in the matter of artificial intelligence (AI) and the model that we use at Wikipedia, it's more a matter of transparency. There is a book about how bias in AI models can break people's lives, it's called 'Weapons of Math Destruction'. It talks about AI models that exist in the US that rank teachers and it's quite horrible because eventually there will be bias. The way to deal with it based on the book and their research was first that the model should be open source, people should be able to see what features are used and the data should be open also, so that people can investigate, find bias, give feedback and report back. There should be a way to fix the system. I think not all companies are moving in that direction, but Wikipedia, because of the values that they hold, are at least more transparent and they push other people to do the same thing.
One of the infamous stories is that of the machine learning programme Tay, designed by Microsoft. Tay was a chatbot that imitated a teenage girl on Twitter. She lived for less than 24 hours before she was shut down. Few people know that before this incident, Microsoft had already trained and released XiaoIce on WeChat, China's most used chat application. XiaoIce's success was so promising that it led to the development of an American version. However, the developers of Tay were not prepared for the platform climate of Twitter. Although the bot knew how to distinguish a noun from an adjective, it had no understanding of the actual meaning of words. It quickly learned to copy racial insults and other discriminatory language it picked up from Twitter users and troll attacks.
Tay's appearance and disappearance were an important moment of awareness. They showed the possible corrupting consequences that machine learning can have when the cultural context in which the algorithm has to live is not taken into account.