Contextual stories about Informants
Datasets as representations
The data-collection processes that lead to the creation of the dataset raise important questions: who is the author of the data? Who has the privilege to collect? For what reason was the selection made? What is missing?
The artist Mimi Onuoha gives a brilliant example of the importance of collection strategies, using statistics on hate crimes. In 2012, the FBI Uniform Crime Reporting (UCR) Program registered almost 6,000 hate crimes. However, the Department of Justice’s Bureau of Justice Statistics came up with about 300,000 reports of such cases. That is roughly 50 times as many. The difference in numbers can be explained by how the data was collected. In the first case, law enforcement agencies across the country voluntarily reported cases. In the second, the Bureau of Justice Statistics distributed the National Crime Victimization Survey form directly to the homes of victims of hate crimes.
In the field of Natural Language Processing (NLP) the material that machine learners work with is text-based, but the same questions still apply: who are the authors of the texts that make up the dataset? During what period were the texts collected? What type of worldview do they represent?
In 2017, Google's Top Stories algorithm pushed a thread from 4chan, an unmoderated content website, to the top of the results page for searches on the Las Vegas shooter. The name and portrait of an innocent person were linked to the terrible crime. Google changed its algorithm just a few hours after the mistake was discovered, but the error had already affected the person. The question is: why did Google not exclude 4chan content from the training dataset of the algorithm?
Labeling for an Oracle that detects vandalism on Wikipedia
This fragment is taken from an interview with Amir Sarabadani, software engineer at Wikimedia. He was in Brussels in November 2017 during the Algoliterary Encounter.
Femke: If you think about Wikipedia as a living community, with every edit the project changes. Every edit is somehow a contribution to a living organism of knowledge. So, if from within that community you try to distinguish what serves the community and what doesn't and you try to generalize that, because I think that's what the good faith-bad faith algorithm is trying to do, to find helper tools to support the project, you do that on the basis of a generalization that is on the abstract idea of what Wikipedia is and not on the living organism of what happens every day. What interests me in the relation between vandalism and debate is how we can understand the conventional drive that sits in these machine-learning processes that we seem to come across in many places. And how can we somehow understand them and deal with them? If you place your separation of good faith-bad faith on pre-existing labelling and then reproduce that in your understanding of what edits are being made, how then to take into account movements that are happening, the life of the actual project?
Amir: OK, I hope that I understood you correctly. It's an interesting discussion. Firstly, what we are calling good faith and bad faith comes from the community itself. We are not doing labelling for them, they are doing labelling for themselves. So, in many different language Wikipedias, the definition of what is good faith and what is bad faith will differ. Wikimedia is trying to reflect what is inside the organism and not to change the organism itself. If the organism changes, and we see that the definition of good faith and helping Wikipedia has been changed, we are implementing this feedback loop that lets people from inside their community pass judgement on their edits and if they disagree with the labelling, we can go back to the model and retrain the algorithm to reflect this change. It's some sort of closed loop: you change things and if someone sees there is a problem, then they tell us and we can change the algorithm back. It's an ongoing project.
How to make your dataset known
NLTK stands for Natural Language Toolkit. For programmers who process natural language using Python, it is an essential library. Many tutorial writers recommend that people learning machine learning start with the built-in NLTK datasets. The library comprises 71 different collections, with a total of almost 6,000 items.
There is, for example, the Movie Review corpus for sentiment analysis. There is the Brown corpus, which was put together in the 1960s by Henry Kučera and W. Nelson Francis at Brown University in Rhode Island. There is also the Universal Declaration of Human Rights corpus, which is commonly used to test whether code can run on multiple languages. The corpus contains the Declaration expressed in 372 languages from around the world.
But what is the process of getting a dataset accepted into the NLTK library nowadays? On its GitHub page, the NLTK team describes the following requirements:
- Only contribute corpora that have obtained a basic level of notability. That means there is a publication that describes the corpus, and a community of programmers who are using it.
- Ensure that you have permission to redistribute the data, and can document this. This means that the dataset is best published on an external website with a licence.
- Use existing NLTK corpus readers where possible, or else contribute a well-documented corpus reader to NLTK. This means you need to organize your data in such a way that it can be easily read using NLTK code.
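The third requirement can be illustrated with a minimal sketch, assuming NLTK is installed. The file names, the texts and the licence note are hypothetical and only serve the example. `PlaintextCorpusReader` is one of NLTK's existing corpus readers: it reads a directory of plain-text files selected by a regular expression.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader


def build_toy_corpus(root: Path) -> None:
    """Write a hypothetical corpus: one plain-text document per file."""
    # A README documenting the licence (the licence here is an assumption).
    (root / "README").write_text(
        "Toy corpus for illustration. Licence: CC0 (assumed).\n", encoding="utf8"
    )
    (root / "review_pos.txt").write_text(
        "An unquestioned masterpiece of film .\n", encoding="utf8"
    )
    (root / "review_neg.txt").write_text(
        "A dull and lifeless sequel .\n", encoding="utf8"
    )


def load_corpus(root: Path) -> PlaintextCorpusReader:
    """Open the directory with an existing NLTK reader."""
    # The second argument is a regular expression that selects the corpus files.
    return PlaintextCorpusReader(str(root), r".*\.txt")


def demo() -> dict:
    """Build the toy corpus in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_toy_corpus(root)
        corpus = load_corpus(root)
        # Corpus views read lazily, so copy them into lists before cleanup.
        return {
            "fileids": list(corpus.fileids()),
            "words": list(corpus.words("review_pos.txt")),
        }


if __name__ == "__main__":
    print(demo())
```

Because the data is stored as one document per file, no custom reader has to be written: the stock reader lists the files and tokenizes their words.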
Extract from a positive IMDb movie review from the NLTK dataset
corpus: NLTK, movie reviews
steven spielberg ' s second epic film on world war ii is an unquestioned masterpiece of film . spielberg , ever the student on film , has managed to resurrect the war genre by producing one of its grittiest , and most powerful entries . he also managed to cast this era ' s greatest answer to jimmy stewart , tom hanks , who delivers a performance that is nothing short of an astonishing miracle . for about 160 out of its 170 minutes , " saving private ryan " is flawless . literally . the plot is simple enough . after the epic d - day invasion ( whose sequences are nothing short of spectacular ) , capt . john miller ( hanks ) and his team are forced to search for a pvt . james ryan ( damon ) , whose brothers have all died in battle . once they find him , they are to bring him back for immediate discharge so that he can go home . accompanying miller are his crew , played with astonishing perfection by a group of character actors that are simply sensational . barry pepper , adam goldberg , vin diesel , giovanni ribisi , davies , and burns are the team sent to find one man , and bring him home . the battle sequences that bookend the film are extraordinary . literally .
The ouroboros of machine learning
Wikipedia has become a source for learning not only for humans, but also for machines. Its articles are prime sources for training models. But very often, the material the machines are trained on is the same content that they helped to write. In fact, at the beginning of Wikipedia, many articles were written by bots. Rambot, for example, was a controversial bot figure on the English-speaking platform. It authored 98 per cent of the pages describing US towns.
As a result of serial and topical robot interventions, the models that are trained on the full Wikipedia dump have a unique view on composing articles. For example, a topic model trained on all Wikipedia articles will associate 'river' with 'Romania' and 'village' with 'Turkey'. This is because there are over 10,000 pages written about villages in Turkey. This should be enough to spark anyone's desire for a visit, but it is far too much compared to the number of articles other countries have on the subject. The asymmetry causes a false correlation and needs to be redressed. Most models try to exclude the work of these prolific robot writers.
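The effect of such an asymmetry can be sketched with simple co-occurrence counts. This is not a topic model, only a toy illustration: the articles, the `author` field and the list of countries are invented for the example, and real counts would have to come from a Wikipedia dump.

```python
from collections import Counter

# Hypothetical country names to look for next to a given term.
COUNTRIES = {"turkey", "france", "peru"}


def make_articles() -> list:
    """Return invented articles: many bot-written stubs and a few human ones."""
    bot = [
        {"author": "bot", "text": f"place {i} is a village in turkey"}
        for i in range(5)
    ]
    human = [
        {"author": "human", "text": "a quiet village in france"},
        {"author": "human", "text": "a mountain village in turkey"},
        {"author": "human", "text": "a fishing village in peru"},
    ]
    return bot + human


def cooccurrence(articles: list, term: str, candidates: set) -> Counter:
    """Count, per candidate word, the articles in which it appears with `term`."""
    counts = Counter()
    for article in articles:
        tokens = set(article["text"].split())
        if term in tokens:
            counts.update(tokens & candidates)
    return counts


def without_bots(articles: list) -> list:
    """Drop the articles whose author is marked as a bot."""
    return [a for a in articles if a["author"] != "bot"]


if __name__ == "__main__":
    articles = make_articles()
    print("all articles:", cooccurrence(articles, "village", COUNTRIES))
    print("humans only: ", cooccurrence(without_bots(articles), "village", COUNTRIES))
```

With the bot-written stubs included, 'village' appears alongside 'turkey' far more often than alongside any other country. Once they are filtered out, the three countries are level.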