Cleaning for Poems

From Algolit

 

Latest revision as of 17:52, 4 June 2019

by Algolit

Sources on Gitlab: https://gitlab.constantvzw.org/algolit/mundaneum/tree/master/exhibition/3-Cleaners/the-cleaner

For this exhibition we worked with 3 per cent of the Mundaneum's archive. These documents were first scanned or photographed. To make them searchable, they were transformed into text using Optical Character Recognition (OCR) software. OCR engines are algorithmic models trained on other texts: they have learned to identify characters, words, sentences and paragraphs. The software often makes 'mistakes'. It might recognize the wrong character, or it might get confused by a stain, an unusual font, or the reverse side of the page showing through.
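Some of these mistakes can be partly repaired automatically. The sketch below is illustrative only and is not the project's actual cleaning code; the function name and the regular expressions are our own assumptions. It rejoins words that were split by a hyphen at a linebreak and collapses runs where the engine inserted a space between every character of a word:

```python
import re

def clean_ocr_text(text):
    """Repair two frequent OCR mistakes in scanned archive text:
    words hyphenated across a linebreak, and words where the engine
    put a space between every character."""
    # Rejoin words split at a linebreak: "docu-\nment" -> "document"
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    # Collapse runs of single characters separated by spaces:
    # "a r c h i v e" -> "archive". Requiring at least three
    # characters limits (but does not eliminate) false positives.
    text = re.sub(r'\b(?:\w ){2,}\w\b',
                  lambda m: m.group(0).replace(' ', ''),
                  text)
    return text
```

Such heuristics are deliberately conservative: a rule that is too eager would also destroy legitimate text, which is exactly why the installation asks a human to verify each correction.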

While these mistakes are often considered noise that confuses the training, they can also be seen as poetic interpretations by the algorithm. They show us the limits of the machine. They also reveal how the algorithm might work, what material it saw in training and what is new to it. They say something about the standards of its makers. In this installation we ask for your help in verifying our dataset. As a reward, we'll present you with a personal algorithmic improvisation.


Concept, code, interface: Gijs de Heij