Full email conversation
From November 2018 until February 2019, An Mertens exchanged emails with volunteers from the Distributed Proofreaders project. The exchange might inspire other people to share books online.
For an exhibition in March at the Mundaneum in Mons (http://www.mundaneum.org/en), we received 210 OCR'ed PDFs with texts that represent 3% of the Mundaneum's archive.
All these documents are in the public domain, and we would like to upload them to Gutenberg or other platforms afterwards.
We're looking for a way to proofread these documents. They are mainly in French.
Your interface and community seem to be a very efficient place to do this.
Is it an option to use your platform?
If so, how would we proceed?
Many thanks for your answer!
Thank you for contacting us! It definitely sounds like an interesting project.
Our process can be somewhat slow -- there are 3 rounds of proofreading, 2 of formatting, and a post-processor who brings all the pages together into a single document ready for Project Gutenberg. We do have French volunteers at all levels, but not so many as our English-speaking volunteers. So we could not necessarily meet a March timeframe for all documents.
If you're still interested, however, and could send me some samples of the original images in the PDFs, I can talk to our Project Managers and post-processors and pull together a team to coordinate with you.
Distributed Proofreaders General Manager
On 2018-11-10 06:22, An Mertens wrote:
Many thanks for your generous answer!
I'll send you some samples at the beginning of this week.
All the best,
On 10/11/18 17:54 firstname.lastname@example.org wrote:
Wonderful! Thanks, An!
On 13/11/18 22:02 Wetransfer wrote:
Your files were sent successfully to email@example.com
On 2018-11-13 16:15, An Mertens wrote:
I just sent you a selection of samples by WeTransfer.
The zip contains:
- a folder with sample pdfs that represent 175 publications of different types:
ARC-MUND-PUB-BULLETIN-IIB-1-1895-1896.pdf: 1 sample of a bulletin published by the Institut International de Bibliographie (Mundaneum), there are 44 of these files
ARC-MUND-PUB-IIB-25-3-1899.pdf & ARC-MUND-PUB-IIB-CODE-TELEGRAPHIQUE-PORTRAITS.pdf: 2 samples of 'other publications' by the same institute, there are 62 of these files
ARC-MUND-PUB-MONITEUR-PHOTOGRAPHIE-15121903.pdf: 1 sample of a magazine on photography, there are 21 of these files, published between 1903 and 1912
ARC-MUND-PUB-PO-AFRIQUE-NOIRS.pdf & ARC-MUND-PUB-PO-ILE-LEVANT.pdf: 2 samples of books by Paul Otlet, there are 10 of these, among others Le Traité de Documentation, which is already available on Wikisource
ARC-MUND-PUB-UAI-CONGRES-PROSTITUTION-1910.pdf: 1 sample of a congress report published by Union des Associations Internationales (also Mundaneum), there are 38 of these.
- a folder of original txt files as they were sent to us by the Mundaneum, most of these files have severe column issues, due to ocr
- a folder of utf-8 files based on the original txt files
- a folder of clean files
In order to solve the main column issues, a colleague wrote a script that sorts out the white space in the lines and rearranges the text.
Sometimes it worked well, sometimes not. The html files serve as a reference.
If this would help convince the work group, we can work on cleaning up the columns as soon as possible.
I don't think it is necessary to have all files corrected by March.
We can make another selection, of course.
It would be great to collaborate. And I hope this helps clarify the work.
Thanks a lot for your assistance!
Looking forward to hearing from you and exchanging more.
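The column clean-up script mentioned in the message above is not reproduced in this exchange. As an illustration only, a minimal sketch of one way such a script might work is given here, assuming two-column OCR output in which the columns are separated by a run of at least four spaces. The function name `split_columns` and the `min_gap` threshold are hypothetical and are not taken from the original script.

```python
import re


def split_columns(text: str, min_gap: int = 4) -> str:
    """Rearrange two-column OCR text into one column (left first, then right).

    Assumption: a run of `min_gap` or more spaces inside a line marks the
    gutter between the two columns. Real scans will often break this rule.
    """
    gutter = re.compile(r" {%d,}" % min_gap)
    left, right = [], []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip empty lines
        indent = len(line) - len(line.lstrip(" "))
        parts = gutter.split(line.strip(), maxsplit=1)
        if len(parts) == 2:
            # Both columns are present on this line.
            left.append(parts[0])
            right.append(parts[1])
        elif indent >= min_gap:
            # Heavily indented line: assume it belongs to the right column.
            right.append(parts[0])
        else:
            left.append(parts[0])
    return "\n".join(left + right)


sample = (
    "Le livre      La fiche\n"
    "est ancien    est neuve\n"
    "              seule ligne\n"
)
result = split_columns(sample)
print(result)
```

A heuristic like this fails when a column itself contains wide gaps or when the OCR collapses the gutter, which is consistent with the remark that the script sometimes worked well and sometimes did not.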
On 18/11/18 21:23, dp-genmgr wrote:
One of our system administrators, Sharon Joiner, has agreed to review the samples and get back to you shortly. I'll be away for part of this month but she will be in touch with you before too long.
On 20/11/18 06:58, Sharon Joiner wrote:
I've done a preliminary evaluation based on extracting images from the pdfs using Photoshop, but there's a very wide range of quality from unreadable to very good, and I can only judge by what I see in the pdf and my extracted images. We always prefer to start with the best possible scans we can get, and if you have access to the original tiff scans, they would be quite welcome. That way I can give you a better evaluation.
As for redoing the OCR -- let's wait on that until I get a look at the tiff files.
On 21/11/18 00:00, WeTransfer wrote:
Your files were sent successfully to firstname.lastname@example.org
On 26/11/18 05:35, Sharon Joiner wrote:
First of all, my apologies for not replying sooner -- I got started on the reply, but got distracted by our long holiday weekend. The tiffs are far superior to the images I was able to harvest from the pdfs. If all the tiffs are of the quality of what you provided, then quality should not be an issue if we can work with the original tiffs.
It looks like the selection of tiffs you sent included the first few pages of only one of the pdfs you provided, "L'Afrique aux noirs", and these are definitely superior to what I was able to get from the pdf you originally sent Linda. One of the examples, ARC-MUND-PUB-PO-BANQUE-MONDIALE, shows a publication date of 1932, and as such would not be considered Public Domain in the US.
Before I go further, I'd like to explain Distributed Proofreaders a bit. DP came officially into being in October of 2000, under Project Gutenberg's umbrella, and was originally housed on a server in the founder's garage. As time passed, those running the site learned a fair amount, and as DP grew, it became obvious that if quality was going to be maintained, rules would be needed. There was also the question of balancing "good enough" against "perfection", and "quantity" against "quality". (This is a *very* short synopsis!)
As a result, we have certain rules that project managers and content providers need to follow when creating projects.
1. They need to be public domain in the US, and clearable by Project Gutenberg.
2. If the original source is incomplete, missing pages can only be provided from a matching edition.
3. We work with complete documents. This can be broken down into those documents that are missing pages (see 2, above) and those that are complete, but may be periodical in nature. It is not always obvious how periodicals should be treated -- sometimes they're processed as a complete volume, and sometimes as individual issues. How to treat them is further complicated by libraries' tendency to remove individual covers when sending a set of issues to the bindery to be bound together. Also, if a multi-volume set includes multiple titles in one volume, we don't generally break that volume into multiple ebooks, but match the physical source volume. (Note that this has not always been the case, but is part of what has evolved.)
I wanted to mention the above, because that may affect what we can run. Below are my comments on each of the pdfs you originally sent Linda. I think that we should plan to use the original tiffs for what we do wind up processing.
Of the original sample pdfs you sent Linda, all but ARC-MUND-PUB-PO-AFRIQUE-NOIRS and ARC-MUND-PUB-PO-ILE-LEVANT appear to be individual issues of a periodical, or parts of multi-part publications.
ARC-MUND-PUB-PO-AFRIQUE-NOIRS and ARC-MUND-PUB-PO-ILE-LEVANT -- I was looking on the internet to see if I could find some confirmation of the length of these publications, and found this: https://www.ideals.illinois.edu/bitstream/handle/2142/652/Paul%20Otlet%20Bibiliography.htm (I mention it in case it's of any interest to you, if you don't already know about it). Also https://www.researchgate.net/publication/32955347_Bibliography_of_the_Works_of_Paul_Otlet According to that, both of these publications really are that short. With a complete set of tiff images for both (and assuming that the faded text can be repaired with the better scans), these two would be an ideal place to start.
ARC-MUND-PUB-IIB-25-3-1899 -- the pdf was apparently created using the original scans. I'd rather use the original tiffs instead of the extracted images. However, it's one of those multi-part ones, so the questions are: are all the parts available? Are they stored as individual publications, or are they bound together? (The University of Texas appears to have the 35 parts bound into two volumes: http://catalog.lib.utexas.edu/record=b6066775~S29) If it's run and posted as an individual part, that means that all of them should be run and posted individually. I should be able to check out the UT Austin copy for comparison, if that would help.
ARC-MUND-PUB-IIB-CODE-TELEGRAPHIQUE-PORTRAITS -- This appears to be part of a series, but a standalone publication? The Internet Archive has it: https://archive.org/details/uncodetlgraphiq00reisgoog, and it's also available as a direct Google document, if you can see it: https://books.google.com/books?id=idMMAAAAYAAJ (some Google documents are viewable in Europe, some not). The pdf doesn't have the cover that is in the Internet Archive/Google documents. This one is fairly short, and would likely be a good candidate, too. As with all the others, it would be good to have the tiffs.
ARC-MUND-PUB-MONITEUR-PHOTOGRAPHIE-15121903 -- This one appears to be an individual issue of a periodical. Since my French is limited to a few words and phrases, I can't tell what sort of periodical, but it has front matter and publication information, should be clearable, and can stand alone. Your description says that there are 21 of them, published between 1903 and 1912, so it doesn't sound like there's a complete set, but if all of them are as complete as this one seems to be, they should be doable.
When I look at the pdf of ARC-MUND-PUB-BULLETIN-IIB-1-1895-1896, it appears complete, but when I look at the images that Photoshop extracted, it's very confusing, because some of the pdf pages are broken into multiple images, so I'd really need to see the tiffs for these to know for sure whether it's workable. It's fairly long (over 300 pages), so I doubt that it would be finished quickly.
ARC-MUND-PUB-UAI-CONGRES-PROSTITUTION-1910 -- This one has no front matter or cover, and appears to be a report that was part of a world congress publication? You say that there are 38 of these. The part number for this one appears to be 43. Are these all individual pamphlets that are pretty much as this one appears? I.e. the first page has all the publication information?
I would recommend starting with ARC-MUND-PUB-PO-AFRIQUE-NOIRS and ARC-MUND-PUB-PO-ILE-LEVANT, and see where to go from there. Please let me know if you have any questions about what I've written--I hope I haven't used too much DP-specific "jargon". I look forward to hearing from you.
On 1/12/18 22:48, WeTransfer wrote:
Your files were sent successfully to email@example.com
On 2018 Dec 01, at 07:05, An Mertens <firstname.lastname@example.org> wrote:
You can find the zip files with the tiffs of Afrique Noire & Ile Levant here: http://www.algolit.net/mundaneum/tiffs/
The tiffs of Monnaie Internationale & The IIB will arrive by WeTransfer.
For the series of the International Institute of Bibliography, they are archived as individual files.
The Mundaneum is ready to complete the series with the missing tiffs in January.
The same for ARC-MUND-PUB-BULLETIN-IIB
They don't have the original scans for:
But some of the participants in the exhibition will be working with these texts, so we'll clean them up anyway.
We can send you the cleaned up txt-files if you can add them to gutenberg.
For the series of the magazine Moniteur de la photographie (ARC-MUND-PUB-MONITEUR-PHOTOGRAPHIE-15121903), they don't have the full collection and not all of the numbers they have are digitized. So let's take that out.
Just out of interest: would you consider adding the txt file of Traité de Documentation, that is published on Wikisource?
That would make a nice collection already on Gutenberg.
Thanks for all the work!
On 14/01/19 15:55, An Mertens wrote:
First of all, my very best wishes for 2019!
We were wondering, have you already started the project?
It would be great to get some news and maybe join you.
Also, for the exhibition, we're thinking that it could be nice to interview one of you about the cleaning/revision process.
Would you be interested?
Many thanks for your replies.
All the best,
On 18/01/19 08:20, Susan Hanlon wrote:
Best wishes to you too! I'm not sure why but it seems like my previous emails haven't been getting through (no one on the list has received them) so I'm trying again with a new chain.
The first project (L'Afrique aux Noirs) is ready to go and I will make it available for our volunteers to work on shortly. If you have signed up to the site, you'll be able to see it here. I'll add L'Ile du Levant to the site soon as well.
Before I make it available, can you please provide some additional information about how you would like to be credited by answering the questions below? For some of the questions, I have suggested possible answers.
What website address should we show for volunteers who want to find out more about the Mundaneum? http://www.mundaneum.org
Every Project Gutenberg book can have a credit line at the start with details of the volunteers who produced it. The standard Distributed Proofreaders credit looks something like: "Produced by Susan Skinner and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images generously made available by the Mundaneum.)" Is this okay for you or would you like us to use a different description?
Do we have permission to continue to store your images on our servers after the projects have been published to Project Gutenberg? Do we have permission to publish your images as part of an open online archive of page images of our projects? Let me know if you need more information about what this would involve.
To answer a question in your earlier emails, we won't be able to work on "Traité de Documentation". It's not yet public domain in the United States because it wasn't published till 1934. (We can generally only work with texts which were published in 1923 or earlier.)
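The publication-year rule of thumb described in the message above can be sketched as a simple filter. This is only an illustration of the "1923 or earlier" guideline as stated in this exchange in January 2019, not a legal check and not part of the Project Gutenberg clearance process. The function name `is_candidate` is hypothetical. The years 1932 and 1934 come from the emails; the other years are assumptions read off the sample file names.

```python
# Cutoff year as stated in the email of January 2019 (a general guideline only).
CUTOFF_YEAR = 1923


def is_candidate(year: int, cutoff: int = CUTOFF_YEAR) -> bool:
    """Return True if a work published in `year` passes the rough cutoff."""
    return year <= cutoff


# Publication years: 1932 and 1934 are given in the emails; the rest are
# assumptions taken from the file names of the samples.
samples = {
    "ARC-MUND-PUB-BULLETIN-IIB-1-1895-1896": 1896,
    "ARC-MUND-PUB-IIB-25-3-1899": 1899,
    "ARC-MUND-PUB-UAI-CONGRES-PROSTITUTION-1910": 1910,
    "ARC-MUND-PUB-PO-BANQUE-MONDIALE": 1932,
    "Traité de Documentation": 1934,
}

eligible = [name for name, year in samples.items() if is_candidate(year)]
for name in eligible:
    print(name)
```

A filter like this only narrows down candidates; each title would still need to be cleared individually before a project is created.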
On 6/02/19 19:27, Susan Hanlon wrote:
I wanted to give you an update on the status of these projects since the first project, L'Afrique aux Noirs, has now been posted to Project Gutenberg.
There are three more projects currently making their way through our site:
L'Ile du Levant
Création d'un Répertoire Bibliographique Universel
The only other scans that I have are for Institut International de Bibliographie Publication No. 25: Fascicule No. 1, Exposé et Règles de la Classification Décimale and it will be uploaded to our site shortly.
Do you have other scans available that I could start preparing for upload? As previously discussed with Sharon, these need to be works which were published in 1923 or earlier and we would prefer to work from the original tiff files.
Let me know if you have any questions.