Data Workers
Exhibition at the Mundaneum in Mons, from 28 March to 29 April 2019.
About
Data Workers is an exhibition of algoliterary works, of stories told from an ‘algorithmic storyteller point of view’. The works are created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts.
Companies create artificial intelligences to serve, entertain, record and know about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different collectives. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. Robots voice experimental literature, while algorithmic models read data, turn words into numbers, make calculations that define patterns, and are able to endlessly process new texts ever after.
The exhibition foregrounds data workers who impact our daily lives, but are hard to grasp or imagine. It connects stories about algorithms in mainstream media to the storytelling in technical manuals and academic papers. Robots are invited to go into dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.
Data Workers is a creation by Algolit, in co-production with Arts², Constant and Mundaneum. With the support of the Fédération Wallonie-Bruxelles, Arts Numériques, Pneu, Passaporta, Ugent, UA and Distributed Proofreaders Project.
In Mundaneum
The origins of the Mundaneum go back to the late nineteenth century. The project was created by two young Belgian jurists, Paul Otlet (1868-1944), the father of documentation, and Henri La Fontaine (1854-1943), Nobel Peace Prize winner. It aimed at gathering all the world’s knowledge and filing it using the Universal Decimal Classification (UDC) system that they had created. At first it was an International Institutions Bureau dedicated to knowledge and fraternity. In the 20th century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates, postcards and other bibliographic cards. These were put together and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The archive only moved to Mons in 1998.
Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at world level, the institutions of intellectual work: libraries, museums and universities. This project was never realised. The Mundaneum project soon faced the scale of the technical development of its era. It suffered from its own utopia. The Mundaneum is the result of a visionary dream. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is very fragmented and incomplete.
The same can be said for artificial intelligences today. When reading about them, the visionary dream has been there since the beginning of their development in the 1950s. Nowadays the promise has attained mythical dimensions. When looking at the concrete applications, that collection is truly innovative and fascinating, but rather fragmented and incomplete. By combining the tools and 10% of the publications of the International Institutions Bureau, Algolit hopes to poetically open up the discussion about machines, algorithms, and technological infrastructures.
Zones
Writers
Data workers need data to work with. The data used in the context of this exhibition is written language. Machine learning relies on many types of writing. There is the formalized writing of the digitized publications of the Mundaneum. But we need more than that. We could say that every human being who has access to the internet is an algorithm writer each time they interact with it by adding reviews, writing Wikipedia articles, or writing emails.
Machine learning algorithms are not critics: they take whatever they are given, no matter the writing style, no matter the CV of the author, no matter the spelling mistakes. In fact, mistakes make it better: the more variety, the better it can anticipate. Sometimes, the authors are not particularly aware of what happens to their oeuvre: offline material, such as printed literature and organised archives, is digitized too and turned into prediction fodder.
Some writing is in English, some in French, and some in Python. Programmers are writers with intent. The algorithm can be a writer too: some neural networks write their own rules. And for the rest, the code that is still wrestling with the subtleties of human language, there are human editors who take over. Poets, playwrights and novelists start their exciting new careers as ventriloquists for AI assistants.
Works:
Oracles
Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing, or 'Natural Language Processing', in short 'NLP'. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, as well as for summarization and question answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
Here you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
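The text classification task described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a Naive Bayes classifier with add-one smoothing and a naive whitespace tokenizer; the function names `train` and `predict` and the four toy emails are invented for this example and are not taken from the works in the exhibition.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label; `examples` is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the prior: how common is this label in the training data?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, invented for this illustration.
examples = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at the library", "ham"),
    ("notes from the meeting", "ham"),
]
model = train(examples)
print(predict(model, "cheap money"))      # spam
print(predict(model, "library meeting"))  # ham
```

With so little data the oracle only echoes the words it has seen for each label, which is also the point: it knows nothing about the email beyond those counts.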
Works:
- Reverse Algebra
- Pos/Neg Classifier Model
- Topic Analysis Naive Bayes books
- Algoliterator
Cleaners
Algolit chooses to work with texts that are free of copyright. This means that they are published under a Creative Commons 4.0 license, which is rare, or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents, which we help turn into datasets that are available online for others. The disadvantage is that we have to deal with poor text formats, and we are often stuck with cleaning up documents. We are not alone in this.
Books are scanned at high resolution, page by page. This is intense human work and often the reason why archives and libraries transfer their collections to a commercial company like Google. The photos are converted into text via OCR (Optical Character Recognition), software that recognizes letters but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Again, intense human work is needed to improve the texts. This is work for freelancers on low-paying platforms like Mechanical Turk, or for volunteers, such as the community around the Distributed Proofreaders Project, which does fantastic work. Whoever does it, and wherever it is done, cleaning up texts is a huge job for which there is no structural automation yet.
Works:
Informants
All machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extricate relations from.
The study material should be chosen carefully: as we know, a badly written textbook can lead a student to forfeit the whole subject. A good textbook is preferably not a textbook at all. This is where the dataset comes in: neatly arranged, well disciplined rows and columns lining up on the screen waiting to be read by the machine.
Each dataset collects different information about the world, and like all collections, they are imbued with the collector's bias. You will hear this expression very often: data is the new oil. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is clean. With each process, each questionnaire, each column title, each CSS selector scraped, it becomes cleaner and more reduced and finds its residence within the performative logic of the dataset.
Some datasets combine the logic of the machine with the logic of humans. The algorithms that require supervision multiply the subjectivities of both data collectors and annotators, then propel and propagate what they've been taught. You will listen to extracts of some of the datasets that pass as defaults in the machine learning field, as well as other stories of human teaching.
Works:
Readers
We communicate with computers through language. We click on icons that have a description in words, we type words on keyboards, we use our voice to give them instructions. Sometimes we believe that the computer can read our thoughts and forget that it is an extensive calculator. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital "A" is 65, or 01000001 in binary.
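As a small illustration of letters being read as numbers, here is a sketch in Python. It relies only on the built-in functions `ord` and `format`; the helper name `ascii_codes` is made up for this example.

```python
def ascii_codes(text):
    """Return (character, decimal code, 8-bit binary string) for each character."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in text]

for ch, code, bits in ascii_codes("Ab"):
    print(ch, code, bits)
# A 65 01000001
# b 98 01100010
```

Upper and lower case letters get different numbers, so to the machine "A" and "a" are simply two unrelated values until a program decides otherwise.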
In all models, whether rule-based, classical machine learning or neural networks, words undergo some type of translation into numbers, in order to understand the semantic meaning of language. This is done by counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs, or noun or verb phrases. Some just replace words in a text by an index number. Numbers also optimise the speed of the processes. Here we present a few technologies to do so.
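The three kinds of counting mentioned above (single words, combinations of words, and index numbers) can be sketched as follows. This is a minimal example using Python's standard `collections.Counter`; the tokenizer is deliberately naive, and the function names and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def word_counts(tokens):
    """Count how often each single word occurs."""
    return Counter(tokens)

def bigram_counts(tokens):
    """Count how often each pair of neighbouring words occurs."""
    return Counter(zip(tokens, tokens[1:]))

def to_indices(tokens):
    """Replace each word by an index number, assigned in order of first appearance."""
    vocabulary = {}
    indices = []
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
        indices.append(vocabulary[token])
    return indices, vocabulary

tokens = tokenize("the machine reads the text and the machine counts")
print(word_counts(tokens).most_common(2))         # [('the', 3), ('machine', 2)]
print(bigram_counts(tokens)[("the", "machine")])  # 2
print(to_indices(tokens)[0])                      # [0, 1, 2, 0, 3, 4, 0, 1, 5]
```

Counting nouns, adjectives or phrases would need an additional part-of-speech tagger, which is left out of this sketch.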
Works:
Learners
A machine learns from the data it reads. This learning is also called the training & test phase. The machine searches for patterns in the data by reducing it, for example, to the most common or unique words. It does this by making a series of calculations according to existing formulas. The formulas, or 'classifiers', often have a long history, which is embedded in mathematics and statistics.
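To make the training and test phases concrete, here is a minimal sketch in Python. It is not one of the established classifiers: it is a toy word-overlap rule, invented for this illustration, that reduces each label to its most common words during training and is then scored on examples it has not seen. All names and sentences are hypothetical.

```python
from collections import Counter

def most_common_words(texts, n):
    """Reduce a list of texts to the set of its n most frequent words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word for word, _ in counts.most_common(n)}

def train(examples, n=2):
    """Training phase: keep, per label, the most common words seen for that label."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(text)
    return {label: most_common_words(texts, n) for label, texts in by_label.items()}

def predict(patterns, text):
    """Pick the label whose pattern shares the most words with the text."""
    words = set(text.lower().split())
    # Ties go to the alphabetically first label, an arbitrary choice.
    return max(sorted(patterns), key=lambda label: len(words & patterns[label]))

def accuracy(patterns, examples):
    """Test phase: fraction of held-out examples labelled correctly."""
    correct = sum(predict(patterns, text) == label for text, label in examples)
    return correct / len(examples)

# Hypothetical toy data, split by hand into a training set and a test set.
training_set = [
    ("good book good story", "positive"),
    ("good story", "positive"),
    ("bad book bad ending", "negative"),
    ("bad ending", "negative"),
]
test_set = [
    ("good ending story", "positive"),
    ("bad story ending", "negative"),
    ("book", "positive"),
]

patterns = train(training_set)
print(round(accuracy(patterns, test_set), 2))  # 0.67
```

The third test sentence is mislabelled because "book" was too rare to survive the reduction, a reminder that what a model discards during training shapes what it can recognise afterwards.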
In software packages you don't get to see the individual personality of the classifiers. They are packed in underlying modules or libraries, which you can call up as a programmer with one line of code. For this exhibition, we have therefore developed three party games that show in detail the learning process of three simple, but frequently used classifiers and their evaluators.
Works: