Data Workers
From Algolit
Exhibition in Mundaneum in Mons from 28 March till 29 April 2019.
About
Data Workers is an exhibition of algoliterary works, of stories told from an ‘algorithmic storyteller point of view’. The works are created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts.
Companies create artificial intelligences to serve, entertain, record and know about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different collectives. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. Robots voice experimental literature, algorithmic models read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts ever after.
The exhibition foregrounds data workers who impact our daily lives, but are hard to grasp or imagine. It connects stories about algorithms in mainstream media to the storytelling in technical manuals and academic papers. Robots are invited to go into dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.
Contextual Stories about Algolit
Data Workers is a creation by Algolit.
Works by: Cristina Cochior, Gijs de Heij, Sarah Garcin, An Mertens, Javier Lloret, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz, Manetta Berends, Mia Melvær.
A co-production of: Arts², Constant and Mundaneum.
With the support of: Fédération Wallonie-Bruxelles/Arts Numériques, Passa Porta, UGent, DHuF - Digital Humanities Flanders and Distributed Proofreaders Project.
Thanks to: Mike Kestemont, Michel Cleempoel, François Zajéga, Raphaèle Cornille, Kris Rutten, Anne-Laure Buisson, David Stampfli.
In Mundaneum
In the late nineteenth century two young Belgian jurists, Paul Otlet (1868-1944), ‘the father of documentation’, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created the Mundaneum. The project aimed to gather all the world’s knowledge and file it using the Universal Decimal Classification (UDC) system that they had invented. At first it was an International Institutions Bureau dedicated to international knowledge exchange. In the 20th century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates and postcards indexed on millions of cross-referenced cards. The collections were exhibited and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The remains of the archive only moved to Mons in 1998.
Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at a global level, the institutions of intellectual work: libraries, museums and universities. This project was never realised. It suffered from its own utopia. The Mundaneum is the result of a visionary dream of what an infrastructure for universal knowledge exchange could be. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is rather eclectic and situated.
Artificial intelligences today come with their own dreams of universality and practice of knowledge. Reading about them, one finds that the visionary dreams of their makers have been there since the beginning of their development in the 1950s. Nowadays their promise has attained mythical dimensions. When looking at the concrete applications, the collection of tools is truly innovative and fascinating, but also rather eclectic and situated. For Data Workers, Algolit combined some of them with 10% of the digitized publications of the International Institutions Bureau. In this way, we hope to poetically open up a discussion about machines, algorithms, and technological infrastructures.
Zones
Writers
Data workers need data to work with. The data that is used in the context of Algolit is written language. Machine learning relies on many types of writing. Human authors write in the form of publications. These are part of organised archives and are being digitized. But there are other kinds of writing too. We could say that every human being who has access to the internet is a writer each time they interact with algorithms. Adding reviews, writing emails or Wikipedia articles, clicking and liking.
Machine learning algorithms are not critics: they take whatever they're given, no matter the writing style, no matter the CV of the author, no matter their spelling mistakes. In fact, mistakes make it better: the more variety, the better they learn to anticipate unexpected text. But often, human authors are not aware of what happens to their work.
Most of the writing we use is in English, some is in French, some in Dutch. Most often we find ourselves writing in Python, the programming language we use. Algorithms can be writers too. Some neural networks write their own rules and generate their own texts. And for the models that are still wrestling with the ambiguities of natural language, there are human editors to assist them. Poets, playwrights or novelists start their new careers as assistants of AI.
Works
Contextual stories about Writers
Oracles
Machine Learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or 'Natural Language Processing', in short, 'NLP'. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have a say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and question answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
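Such a spam oracle can be sketched in a few lines of Python. The following is a minimal, hand-made illustration — a Naive Bayes classifier trained on four invented example emails, not one of the models shown in the exhibition:

```python
from collections import Counter
import math

# Toy training set: invented example emails with their labels.
train = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow with the team", "ham"),
]

# Count how often each word appears per class.
counts = {"spam": Counter(), "ham": Counter()}
totals = Counter()
for text, label in train:
    for word in text.split():
        counts[label][word] += 1
        totals[label] += 1

def classify(text):
    # Score each class by summed log-probabilities, with add-one
    # smoothing so unseen words do not zero out a class.
    vocab = {w for c in counts.values() for w in c}
    scores = {}
    for label in counts:
        score = 0.0
        for word in text.split():
            p = (counts[label][word] + 1) / (totals[label] + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("win a cheap prize"))       # → spam
print(classify("agenda for the meeting"))  # → ham
```

Real spam filters work on the same principle, only with millions of emails instead of four.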
In this zone you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
Works
Contextual stories about Oracles
Cleaners
Algolit chooses to work with texts that are free of copyright. This means that they are published under a Creative Commons 4.0 license – which is rare – or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents that we helped turn into datasets. They are now available for others online. Sometimes we have to deal with poor text formats, and we are often forced to clean up documents. We are not alone in this.
Books are scanned at high resolution, page by page. This is time-consuming, laborious human work and often the reason why archives and libraries transfer their collections and leave the job to companies like Google. The photos are converted into text via OCR (Optical Character Recognition), software that recognizes letters, but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Yet more wearisome human work is needed to improve the texts. This is often achieved through poorly-paid freelancers via micro-payment platforms like Amazon's Mechanical Turk; or by volunteers, such as the community around the Distributed Proofreaders Project, that does fantastic work. Whoever does it, or wherever it is done, cleaning up texts is a towering job for which there is no structural automation yet.
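A small taste of what this cleaning involves: the sketch below repairs two typical OCR artifacts, words hyphenated across line breaks and stray whitespace. The patterns are illustrative assumptions, not Algolit's actual pipeline:

```python
import re

def clean_ocr(text):
    # Re-join words hyphenated across line breaks: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze runs of spaces left by the scanner.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

raw = "The Mundaneum gathered docu-\nments from all over   the\nworld."
print(clean_ocr(raw))
# → The Mundaneum gathered documents from all over the world.
```

Rules like these catch the easy cases; the rest — misread letters, broken columns, lost accents — still needs human eyes.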
Works
Contextual stories for Cleaners
Informants
Machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extract patterns from. Humans should carefully choose the study material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyse tweets. A badly written textbook can lead a student to give up on the whole subject. A good textbook is preferably not a textbook at all.
This is where the dataset comes in: arranged as neatly as possible, organised in disciplined rows and lined up columns, waiting to be read by the machine. Each dataset collects different information about the world, and like all collections, they are imbued with collectors' bias. You will hear this expression very often: 'data is the new oil'. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is supposed to be clean. With each process, each questionnaire, each column title, it becomes cleaner and cleaner, chipping distinct characteristics until it fits the mould of the dataset.
Some datasets combine the machinic logic with the logic of humans. The models that require supervision multiply the subjectivities of both data collectors and annotators, then propagate what they've been taught. You will encounter some of the datasets that pass as default in the machine learning field, as well as other stories of humans guiding machines.
Works
Contextual stories about Informants
Readers
We communicate with computers through language. We click on icons that have a description in words, we tap words on keyboards, use our voice to give them instructions. Sometimes we believe that the computer can read our thoughts and forget that they are extensive calculators. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital "A" is 65, or 01000001 in binary.
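You can watch this reading happen in Python itself: ord() gives the number behind each character, and format() shows its eight-bit binary form:

```python
# Each character the computer 'reads' is just a number.
for ch in "Aa!":
    print(ch, ord(ch), format(ord(ch), "08b"))
# → A 65 01000001
# → a 97 01100001
# → ! 33 00100001
```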
In all models, rule-based, classical machine learning and neural networks, words undergo some type of translation into numbers, in order to understand the semantic meaning of language. This is done by counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs or noun or verb phrases. Some just replace words in a text by an index number. Numbers also optimise the speed of the processes. Here we present a few technologies to do so.
Works
Contextual stories about Readers
Learners
Learners are the algorithms that distinguish machine learning practices from other algorithmic practices. Learners are also called classifiers. They are pattern finders, capable of crawling through data and generating some kind of specific 'grammar'. Learners are based on statistical techniques. Each one of them holds individual characteristics. Some need a large amount of training data in order to function, others can get away with a small set of annotated data. Some perform well in classification tasks, like spam identification, others are better at predicting numbers, like temperatures, distances, stock market values, and so on.
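One of the simplest Learners is the nearest-neighbour classifier: it memorises its training examples and labels anything new after the example it most resembles. A toy sketch, with invented points and labels:

```python
# Invented training examples: (features, label) pairs.
train = [((1.0, 1.0), "short"), ((1.2, 0.8), "short"),
         ((8.0, 9.0), "long"), ((9.5, 8.5), "long")]

def nearest(point):
    # Predict the label of the closest training example
    # (Euclidean distance).
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return min(train, key=lambda ex: dist(ex[0], point))[1]

print(nearest((1.1, 0.9)))  # → short
print(nearest((9.0, 9.0)))  # → long
```

Its individual character shows immediately: it needs no training phase at all, but it must keep every example in memory and compare against all of them for each prediction.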
The terminology of machine learning is not yet fully established. Depending on the field – statistics, computer science or the humanities – they are called by different names. When we talk about Learners, we talk about the interwoven functions that have the capacity to generate other functions, evaluate and readjust them to fit the data. Learners are good at understanding and revealing patterns. But they don't always distinguish well which of the patterns should be repeated.
In software packages you don't get to see the individual personality of the classifiers. They are hidden in underlying modules or libraries, which you can call up as a programmer with one line of code. For this exhibition, we have therefore developed three party games that show in detail the learning process of three simple, but frequently used classifiers and their evaluators.