Contextual stories about Informants

Extract from a positive IMDb movie review from the NLTK dataset

corpus: NLTK, movie reviews

fileid: pos/cv998_14111.txt

steven spielberg ' s second epic film on world war ii is an unquestioned masterpiece of film . spielberg , ever the student on film , has managed to resurrect the war genre by producing one of its grittiest , and most powerful entries . he also managed to cast this era ' s greatest answer to jimmy stewart , tom hanks , who delivers a performance that is nothing short of an astonishing miracle . for about 160 out of its 170 minutes , " saving private ryan " is flawless . literally . the plot is simple enough . after the epic d - day invasion ( whose sequences are nothing short of spectacular ) , capt . john miller ( hanks ) and his team are forced to search for a pvt . james ryan ( damon ) , whose brothers have all died in battle . once they find him , they are to bring him back for immediate discharge so that he can go home . accompanying miller are his crew , played with astonishing perfection by a group of character actors that are simply sensational . barry pepper , adam goldberg , vin diesel , giovanni ribisi , davies , and burns are the team sent to find one man , and bring him home . the battle sequences that bookend the film are extraordinary . literally .
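
The review above can be retrieved directly with NLTK. The following is a minimal sketch, assuming the movie reviews corpus has been fetched with nltk.download:

  # Minimal sketch: loading the quoted review from the NLTK movie reviews corpus.
  import nltk
  nltk.download('movie_reviews')                 # fetch the corpus once

  from nltk.corpus import movie_reviews

  fileid = 'pos/cv998_14111.txt'                 # the positive review quoted above
  print(movie_reviews.raw(fileid))

  print(movie_reviews.categories())              # ['neg', 'pos']
  print(len(movie_reviews.fileids()))            # 2000 reviews in total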


Datasets as representations

The data collection processes that lead to the creation of a dataset raise important questions: who is the author of the data? Who has the privilege to collect it? For what reason was the selection made? What is missing?

The artist Mimi Onuoha gives a brilliant example of the importance of collection strategies. She chooses the case of statistics related to hate crimes. In 2012, the FBI Uniform Crime Reporting (UCR) Program registered 5,796 hate crimes. However, the Department of Justice's Bureau of Justice Statistics came up with 293,800 reports of such cases, over fifty times as many. The difference in numbers can be explained by how the data was collected. In the first case, law enforcement agencies across the country voluntarily reported cases. For the second number, the Bureau of Justice Statistics distributed the National Crime Victimization Survey directly to the homes of victims of hate crimes.

In the natural language processing field, the material that machine learners work with is text-based, but the same questions still apply: who are the authors of the texts that make up the dataset? During what period were they collected? What type of worldview do they represent?

In 2017, Google's Top Stories algorithm pushed a misleading 4chan thread to the top of the results page when people searched for the Las Vegas shooter. The name and portrait of an innocent person were linked to the terrible crime. Although Google changed its algorithm just a few hours after the mistake was discovered, the error seriously affected that person. Another question remains: why did Google not exclude 4chan from the training dataset in the first place?

Labeling for an oracle that detects vandalism on Wikipedia

This fragment is taken from an interview with Amir Sarabadani, software engineer at Wikimedia. He was in Brussels in November 2017 during the Algoliterary Encounter.

Femke: If you think about Wikipedia as a living community, then every edit changes the project. Every edit is somehow a contribution to a living organism of knowledge. So, if from within that community you try to distinguish what serves the community and what doesn't, and you try to generalise that, because I think that's what the good faith-bad faith algorithm is trying to do, to find helper tools to support the project, you do that on the basis of a generalisation that rests on the abstract idea of what Wikipedia is, and not on the living organism of what happens every day. What I'm interested in, in the relationship between vandalism and debate, is how we can understand the conventional drive that sits in these machine-learning processes that we seem to come across in many places. And how can we somehow understand them and deal with them? If you base your separation of good faith-bad faith on pre-existing labelling, and then reproduce that in your understanding of what edits are being made, how do you then take into account movements that are happening, the life of the actual project?

Amir: OK, I hope that I understood you correctly. It's an interesting discussion. Firstly, what we are calling good faith and bad faith comes from the community itself; we are not doing the labelling for them, they are doing the labelling for themselves. So, in many different language Wikipedias, the definition of what is good faith and what is bad faith will differ. Wikimedia is trying to reflect what is inside the organism and not to change the organism itself. If the organism changes, and we see that the definition of good faith and of helping Wikipedia has changed, we implement this feedback loop that lets people from inside their community pass judgement on their edits, and if they disagree with the labelling, we can go back to the model and retrain the algorithm to reflect this change. It's some sort of closed loop: you change things, and if someone sees there is a problem, then they tell us and we can change the algorithm back. It's an ongoing project.
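
What Amir describes is, at its core, a supervised model that gets refitted whenever the community corrects its labels. The sketch below is purely illustrative: it assumes scikit-learn, and the two numeric 'features' per edit and the labels are invented placeholders, not the actual feature set behind Wikipedia's edit-quality models.

  # Purely illustrative sketch of a label-feedback loop, assuming scikit-learn.
  # The features and labels below are made-up placeholders.
  from sklearn.linear_model import LogisticRegression

  # toy edit features: [characters removed, number of flagged words]
  edits  = [[0, 0], [450, 3], [12, 0], [900, 5]]
  labels = [1, 0, 1, 0]            # 1 = good faith, 0 = bad faith, as labelled by the community

  model = LogisticRegression().fit(edits, labels)

  # the model judges a new edit ...
  new_edit = [600, 0]
  print('good faith?', bool(model.predict([new_edit])[0]))

  # ... an editor disagrees with that judgement, so the corrected label is
  # added to the training data and the model is retrained
  edits.append(new_edit)
  labels.append(1)
  model = LogisticRegression().fit(edits, labels)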


How to make your dataset known

NLTK stands for Natural Language Toolkit. For programmers who process natural language with Python, this is an essential library to work with. Many tutorial writers recommend that machine learning programmers start with the built-in NLTK datasets. The library counts 71 different collections, with a total of almost 6,000 items. There is, for example, the Movie Review corpus for sentiment analysis. Or the Brown corpus, which was put together in the 1960s by Henry Kučera and W. Nelson Francis at Brown University in Rhode Island. There is also the Universal Declaration of Human Rights corpus, which is commonly used to test whether code can run on multiple languages. It contains the Universal Declaration of Human Rights translated into 372 languages from around the world.
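
As a quick illustration, those built-in corpora are only an import away. This is a minimal sketch, assuming the corpora have been downloaded with nltk.download; 'English-Latin1' is one of the standard fileids in the udhr corpus.

  # Minimal sketch: peeking into two of NLTK's built-in corpora,
  # assuming they have been fetched with nltk.download().
  import nltk
  nltk.download('brown')
  nltk.download('udhr')

  from nltk.corpus import brown, udhr

  print(brown.words(categories='news')[:10])     # 1960s American English, by genre
  print(len(udhr.fileids()))                     # one file per language/encoding
  print(udhr.words('English-Latin1')[:10])       # the Declaration in English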

But what is the process of getting a dataset accepted into the NLTK library nowadays? On the GitHub page, the NLTK team describes the following requirements:

  • Only contribute corpora that have obtained a basic level of notability. That means there is a publication that describes them, and a community of programmers who are using them.
  • Ensure that you have permission to redistribute the data, and can document this. This means that the dataset is best published on an external website with a licence.
  • Use existing NLTK corpus readers where possible, or else contribute a well-documented corpus reader to NLTK. This means you need to organise your data in such a way that it can easily be read using NLTK code, as in the sketch below.
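
That last requirement is the most hands-on one. If a new dataset is simply a folder of plain-text files, NLTK's existing PlaintextCorpusReader can already expose it; in this minimal sketch the folder name and file pattern are hypothetical.

  # Minimal sketch: reading your own folder of .txt files through an
  # existing NLTK corpus reader. The folder 'my_corpus' is hypothetical.
  from nltk.corpus.reader import PlaintextCorpusReader

  reader = PlaintextCorpusReader('my_corpus', r'.*\.txt')

  print(reader.fileids())                        # every file matching the pattern
  print(reader.words()[:20])                     # tokenised words, ready for NLTK code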


The ouroboros of machine learning

Wikipedia has become a source of learning not only for humans, but also for machines. Its articles are prime sources for training prediction models. The material the machines are trained on is the same content that they helped to write. In fact, at the beginning of Wikipedia, many articles were written by bots. Rambot, for example, was a controversial bot figure on the English-language platform. It authored 98% of the pages describing US towns.

As a result of serial and topical robot interventions, the prediction models that are trained on the full Wikipedia dump have a skewed view on composing articles. For example, a topic model trained on all Wikipedia articles will associate "river" with "Romania" and "village" with "Turkey". This is because there are over 10,000 pages written about villages in Turkey. That should be enough to spark anyone's desire for a visit, but it is far more than have been written about villages in other countries. The asymmetry causes a false correlation and needs to be redressed. One way to do so is to exclude the work of these prolific robot writers from the training data.
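
The effect can be illustrated with a toy topic model. The sketch below assumes the gensim library; the documents are invented stand-ins for a dump in which a thousand near-identical bot-written stubs about Turkish villages drown out every other context for the word "village".

  # Illustrative sketch, assuming gensim is installed. The documents are
  # invented: 1,000 near-identical bot stubs plus two hand-written articles.
  from gensim import corpora, models

  bot_stubs = [['village', 'turkey', 'district', 'population']] * 1000
  other_articles = [['village', 'france', 'wine', 'region'],
                    ['river', 'romania', 'danube', 'tributary']]
  documents = bot_stubs + other_articles

  dictionary = corpora.Dictionary(documents)
  bow = [dictionary.doc2bow(doc) for doc in documents]
  lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=3)

  # 'village' ends up glued to 'turkey' through sheer repetition;
  # filtering out the bot stubs before training would redress the balance.
  for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
      print(topic_id, [word for word, _ in words])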

Reference

https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/