Contextual stories for Cleaners
Project Gutenberg and Distributed Proofreaders
Project Gutenberg is our Ali Baba cave. It offers more than 58,000 free eBooks to be downloaded or read online. Works are accepted on Gutenberg when their U.S. copyright has expired. Thousands of volunteers digitize and proofread books to help the project. An essential part of the work is done through the Distributed Proofreaders project. This is a web-based interface to help convert public domain books into e-books. Think of text files, EPUBs, Kindle formats. By dividing the workload into individual pages, many volunteers can work on a book at the same time; this speeds up the cleaning process.
During proofreading, volunteers are presented with a scanned image of the page and a version of the text, as it is read by an OCR algorithm trained to recognize letters in images. This allows the text to be easily compared to the image, proofread, and sent back to the site. A second volunteer is then presented with the first volunteer's work. She verifies and corrects the work as necessary, and submits it back to the site. The book then similarly goes through a third proofreading round, plus two more formatting rounds using the same web interface. Once all the pages have completed these steps, a post-processor carefully assembles them into an e-book and submits it to the Project Gutenberg archive.
We collaborated with the Distributed Proofreaders project to clean up the digitized files we received from the Mundaneum collection. From November 2018 until the first upload of the cleaned-up book 'L'Afrique aux Noirs' in February 2019, An Mertens exchanged about 50 emails with Linda Hamilton, Sharon Joiner and Susan Hanlon, all volunteers from the Distributed Proofreaders project. The conversation is published here. It might inspire you to share unavailable books online.
An algoliterary version of the Maintenance Manifesto
In 1969, one year after the birth of her first child, the New York artist Mierle Laderman Ukeles wrote a Manifesto for Maintenance Art. The manifesto calls for a readdressing of the status of maintenance work both in the private, domestic space, and in public. What follows is an altered version of her text inspired by the work of the Cleaners.
A. The Death Instinct and the Life Instinct:
The Death Instinct: separation; categorization; avant-garde par excellence; to follow the predicted path to death – run your own code; dynamic change.
The Life Instinct: unification; the eternal return; the perpetuation and MAINTENANCE of the material; survival systems and operations; equilibrium.
B. Two basic systems: Development and Maintenance.
The sourball of every revolution: after the revolution, who’s going to try to spot the bias in the output?
Development: pure individual creation; the new; change; progress; advance; excitement; flight or fleeing.
Maintenance: keep the dust off the pure individual creation; preserve the new; sustain the change; protect progress; defend and prolong the advance; renew the excitement; repeat the flight; show your work – show it again, keep the git repository groovy, keep the data analysis revealing.
Development systems are partial feedback systems with major room for change.
Maintenance systems are direct feedback systems with little room for alteration.
C. Maintenance is a drag; it takes all the fucking time (lit.)
The mind boggles and chafes at the boredom.
The culture assigns lousy status on maintenance jobs = minimum wages, Amazon Mechanical Turks = virtually no pay.
Clean the set, tag the training data, correct the typos, modify the parameters, finish the report, keep the requester happy, upload the new version, attach words that were wrongly separated by OCR back together, complete those Human Intelligence Tasks, try to guess the meaning of the requester's formatting, you must accept the HIT before you can submit the results, summarize the image, add the bounding box, what's the semantic similarity of this text, check the translation quality, collect your micro-payments, become a hit Mechanical Turk.
A bot panic on Amazon Mechanical Turk
Amazon's Mechanical Turk takes the name of a chess-playing automaton from the eighteenth century. In fact, the Turk wasn't a machine at all. It was a mechanical illusion that allowed a human chess master to hide inside the box and manually operate it. For nearly 84 years, the Turk won most of the games played during its demonstrations around Europe and the Americas. Napoleon Bonaparte is said to have been fooled by this trick too.
The Amazon Mechanical Turk is an online platform for humans to execute tasks that algorithms cannot. Examples include annotating sentences as being positive or negative, spotting number plates, discriminating between face and non-face. The jobs posted on this platform are often paid less than a cent per task. Tasks that are more complex or require more knowledge can be paid up to several cents. To earn a living, Turkers need to finish as many tasks as fast as possible, leading to inevitable mistakes. As a result, the requesters have to incorporate quality checks when they post a job on the platform. They need to test whether the Turker actually has the ability to complete the task, and they also need to verify the results. Many academic researchers use Mechanical Turk as an alternative to have their students execute these tasks.
In August 2018 Max Hui Bai, a psychology student from the University of Minnesota, discovered that the surveys he conducted with Mechanical Turk were full of nonsense answers to open-ended questions. He traced back the wrong answers and found out that they had been submitted by respondents with duplicate GPS locations. This raised suspicion. Though Amazon explicitly prohibits robots from completing jobs on Mechanical Turk, the company does not deal with the problems they cause on their platform. Forums for Turkers are full of conversations about the automation of the work, sharing practices of how to create robots that can even violate Amazon’s terms. You can also find videos on YouTube that show Turkers how to write a bot to fill in answers for you.
Kristy Milland, an Mechanical Turk activist, says: 'Mechanical Turk workers have been treated really, really badly for 12 years, and so in some ways I see this as a point of resistance. If we were paid fairly on the platform, nobody would be risking their account this way.'
Bai is now leading a research project among social scientists to figure out how much bad data is in use, how large the problem is, and how to stop it. But it is impossible at the moment to estimate how many datasets have become unreliable in this way.