Data Workers

From Algolit

Revision as of 20:24, 13 March 2019 by An (talk | contribs)

Exhibition at the Mundaneum in Mons from 28 March until 29 April 2019.

The opening is on Thursday 28 March from 18:00 until 22:00. As part of the exhibition, we have invited Allison Parrish, an algoliterary poet from New York. She will give a talk in Passa Porta on Thursday evening 25 April and a workshop in the Mundaneum on Friday 26 April.


Data Workers is an exhibition of algoliterary works, of stories told from an ‘algorithmic storyteller point of view’. The exhibition was created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts. Some works are by students of Arts² and external participants to the workshop on machine learning and text organized by Algolit in October 2018 at the Mundaneum.

Companies create artificial intelligence (AI) systems to serve, entertain, record and learn about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different collectives. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. The boundaries between these collectives are not fixed; they are porous and permeable. At times, Oracles are also Writers. At other times Readers are also Oracles. Robots voice experimental literature, while algorithmic models read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts ever after.

The exhibition foregrounds data workers who impact our daily lives, but are either hard to grasp and imagine or removed from the imagination altogether. It connects stories about algorithms in mainstream media to the storytelling that is found in technical manuals and academic papers. Robots are invited to engage in dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.

Data Workers was created by Algolit.

Works by: Cristina Cochior, Gijs de Heij, Sarah Garcin, An Mertens, Javier Lloret, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz, Michael Murtaugh, Manetta Berends, Mia Melvær.

Co-produced by: Arts², Constant and Mundaneum.

With the support of: Wallonia-Brussels Federation/Digital Arts, Passa Porta, UGent, DHuF - Digital Humanities Flanders and Distributed Proofreaders Project.

Thanks to: Mike Kestemont, Michel Cleempoel, François Zajéga, Raphaèle Cornille, Kris Rutten, Anne-Laure Buisson, David Stampfli.

At the Mundaneum

In the late nineteenth century two young Belgian jurists, Paul Otlet (1868–1944), the 'father of documentation’, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created the Mundaneum. The project aimed to gather all the world’s knowledge and to file it using the Universal Decimal Classification (UDC) system that they had invented. At first it was an International Institutions Bureau dedicated to international knowledge exchange. In the twentieth century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates and postcards indexed on millions of cross-referenced cards. The collections were exhibited and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The remains of the archive only moved to Mons in 1998.

Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at a global level, the institutions of knowledge: libraries, museums and universities. This project was never realized. It suffered from its own utopia. The Mundaneum is the result of a visionary dream of what an infrastructure for universal knowledge exchange could be. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is rather eclectic and specific.

Artifical intelligence systems today come with their own dreams of universality and knowledge production. When reading about these systems, the visionary dreams of their makers were there from the beginning of their development in the 1950s. Nowadays, their promise has also attained mythical dimensions. When looking at their concrete applications, the collection of tools is truly innovative and fascinating, but at the same time, rather eclectic and specific. For Data Workers, Algolit combined some of the applications with 10 per cent of the digitized publications of the International Institutions Bureau. In this way, we hope to poetically open up a discussion about machines, algorithms, and technological infrastructures.



Data workers need data to work with. The data that used in the context of Algolit is written language. Machine learning relies on many types of writing. Many authors write in the form of publications, such as books or articles. These are part of organized archives and are sometimes digitized. But there are other kinds of writing too. We could say that every human being who has access to the Internet is a writer each time they interact with algorithms. We chat, write, click, like and share. In return for free services, we leave our data that is compiled into profiles and sold for advertising and research purposes.

Machine learning algorithms are not critics: they take whatever they're given, no matter the writing style, no matter the CV of the author, no matter the spelling mistakes. In fact, mistakes make it better: the more variety, the better they learn to anticipate unexpected text. But often, human authors are not aware of what happens to their work.

Most of the writing we use is in English, some in French, some in Dutch. Most often we find ourselves writing in Python, the programming language we use. Algorithms can be writers too. Some neural networks write their own rules and generate their own texts. And for the models that are still wrestling with the ambiguities of natural language, there are human editors to assist them. Poets, playwrights or novelists start their new careers as assistants of AI.



Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or Natural Language Processing' (NLP). These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.

There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and questions & answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.

In this zone you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.



Algolit chooses to work with texts that are free of copyright. This means that they have been published under a Creative Commons 4.0 license – which is rare - or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents that we helped turn into datasets. They are now available for others online. Sometimes we had to deal with poor text formats, and we often dedicated a lot of time to cleaning up documents. We were not alone in doing this.

Books are scanned at high resolution, page by page. This is time-consuming, laborious human work and often the reason why archives and libraries transfer their collections and leave the job to companies like Google. The photos are converted into text via OCR (Optical Character Recognition), a software that recognizes letters, but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Yet more wearisome human work is needed to improve the texts. This is often carried out by poorly-paid freelancers via micro-payment platforms like Amazon's Mechanical Turk; or by volunteers, like the community around the Distributed Proofreaders Project, which does fantastic work. Whoever does it, or wherever it is done, cleaning up texts is a towering job for which no structural automation yet exists.



Machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extract patterns from. One should carefully choose the study material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyse tweets. A badly written textbook can lead a student to give up on the subject altogether. A good textbook is preferably not a textbook at all.

This is where the dataset comes in: arranged as neatly as possible, organized in disciplined rows and lined-up columns, waiting to be read by the machine. Each dataset collects different information about the world, and like all collections, they are imbued with collectors' bias. You will hear this expression very often: 'data is the new oil'. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is supposed to be clean. With each process, each questionnaire, each column title, it becomes cleaner and cleaner, chipping distinct characteristics until it fits the mould of the dataset.

Some datasets combine the machinic logic with the human logic. The models that require supervision multiply the subjectivities of both data collectors and annotators, then propagate what they've been taught. You will encounter some of the datasets that pass as default in the machine learning field, as well as other stories of humans guiding machines.



We communicate with computers through language. We click on icons that have a description in words, we tap words on keyboards, use our voice to give them instructions. Sometimes we trust our computer with our most intimate thoughts and forget that they are extensive calculators. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital 'A' is 001.

In all models, rule-based, classical machine learning, and neural networks, words undergo some type of translation into numbers in order to understand the semantic meaning of language. This is done through counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs or noun and verb phrases. Some just replace the words in a text by their index numbers. Numbers optimize the operative speed of computer processes, leading to fast predictions, but they also remove the symbolic links that words might have. Here we present a few techniques that are dedicated to making text readable to a machine.



Learners are the algorithms that distinguish machine learning practices from other types of practices. They are pattern finders, capable of crawling through data and generating some kind of specific 'grammar'. Learners are based on statistical techniques. Some need a large amount of training data in order to function, others can work with a small annotated set. Some perform well in classification tasks, like spam identification, others are better at predicting numbers, like temperatures, distances, stock market values, and so on.

The terminology of machine learning is not yet fully established. Depending on the field, whether statistics, computer science or the humanities, different terms are used. Learners are also called classifiers. When we talk about Learners, we talk about the interwoven functions that have the capacity to generate other functions, evaluate and readjust them to fit the data. They are good at understanding and revealing patterns. But they don't always distinguish well which of the patterns should be repeated.

In software packages, it is not always possible to distinguish the characteristic elements of the classifiers, because they are hidden in underlying modules or libraries. Programmers can invoke them using a single line of code. For this exhibition, we therefore developed three table games that show in detail the learning process of simple, but frequently used classifiers.