A quick guide to transcribing documents from PDF to HTML

The best tool I have found that generally works without too much effort is DocMorph. This service is provided by the US National Library of Medicine, is free and works very well, and it offers more than just PDF to text conversion.

Steps to follow when using this service:

1.        Download the PDF from a reputable website. There are a number of sites I would not obtain PDFs from, for a number of reasons.

2.        Log in to the DocMorph website and upload the PDF. Create an account if needed; I have never received any spam, or any emails at all for that matter, after providing my email address. The website will then process the PDF, which takes about 10 seconds per page, so if you have a large document you might want to get a drink.

3.        The website then provides a download link for the text it was able to recognise, including page breaks.

What to do if the resulting text is not what you expected

4.        Although the OCR software used by the site is very good, a poor quality PDF might need some pre-processing before DocMorph can read it (and here I mean very poor, as I am amazed at how well DocMorph copes with some bad documents). The best method I have found is to save each page of the PDF as a separate image and then convert them all to 2 colour (black and white) images. This can be done as a batch process using a number of different tools, which I won’t go into here.

a.        An additional step you might want to perform is a sharpen to clean up the text. If you have time you can of course edit the images individually to remove stray marks, but doing this would probably be slower than re-typing the relevant section of text.
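
As a rough illustration of the 2 colour conversion in step 4, here is a minimal Python sketch using only the standard library. It rests on several assumptions that do not come from DocMorph or from the tools mentioned above: the pages have already been exported as plain-text PGM (P2) grayscale files, a fixed cutoff of 128 is good enough, and the function names (threshold, to_pbm, read_pgm, convert_folder) are made up for this example. It does not cover the sharpen step.

```python
from pathlib import Path


def threshold(rows, cutoff=128):
    """Map each grayscale value (0-255) to 1 (black) or 0 (white)."""
    # In PGM 0 is black; in PBM 1 is black, so dark pixels become 1.
    return [[1 if value < cutoff else 0 for value in row] for row in rows]


def to_pbm(bits):
    """Serialise a grid of 0/1 values as a plain-text PBM (P1) image."""
    height = len(bits)
    width = len(bits[0]) if height else 0
    lines = ["P1", f"{width} {height}"]
    lines += [" ".join(str(bit) for bit in row) for row in bits]
    return "\n".join(lines) + "\n"


def read_pgm(text):
    """Parse a plain-text PGM (P2) image into rows of 0-255 values."""
    tokens = []
    for line in text.splitlines():
        # Anything after '#' on a line is a comment
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("expected a plain-text PGM (P2) file")
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    # Rescale to 0-255 so the cutoff means the same for every file
    values = [int(t) * 255 // maxval for t in tokens[4:4 + width * height]]
    return [values[r * width:(r + 1) * width] for r in range(height)]


def convert_folder(src_dir, dst_dir, cutoff=128):
    """Threshold every .pgm page in src_dir and write a .pbm into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(Path(src_dir).glob("*.pgm")):
        bits = threshold(read_pgm(page.read_text()), cutoff)
        out = dst / (page.stem + ".pbm")
        out.write_text(to_pbm(bits))
        written.append(out)
    return written
```

Calling convert_folder("pages", "pages_bw") would process a whole folder in one go (both folder names are placeholders). A lower cutoff keeps only the darkest marks, which may help with grey backgrounds, while a higher one preserves faint text at the cost of more noise.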

If you have large numbers of documents

5.        If you have a large number of documents to process, you might be interested in MyMorph, a Windows application that you can download to convert large numbers of files. It may also work with emulation on Linux, but I am yet to try this.

What to do after you have the extracted text

6.        The fastest way I have found to convert the text to HTML is to paste it directly into Dreamweaver (not the code section). Dreamweaver then automatically inserts line breaks as needed, and it is then a matter of creating the relevant structures (tables etc.) and cutting and pasting the relevant text into each cell.

a.        I have created a number of templates with the relevant page break formatting, which speeds up this process a lot.

b.        I am looking for a good Linux alternative to Dreamweaver, or a way to get a reasonably new version of Dreamweaver running on Linux with emulation.
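
For anyone without Dreamweaver, the basic text to HTML step can be roughed out in a few lines of Python. This is only a sketch under stated assumptions, not a replacement for the workflow in step 6: it assumes the extracted text marks page breaks with form feed characters and separates paragraphs with blank lines, neither of which is confirmed for DocMorph's output, and the function name text_to_html and the "page" class are invented for the example. Tables would still need to be built by hand.

```python
from html import escape


def text_to_html(text, title="Transcribed document"):
    """Wrap OCR text in minimal HTML: one <div> per page, one <p> per paragraph."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    pages = text.split("\f")  # assumption: pages are separated by form feeds
    parts = ["<!DOCTYPE html>", "<html>", "<head>",
             f"<title>{escape(title)}</title>", "</head>", "<body>"]
    for number, page in enumerate(pages, start=1):
        parts.append(f'<div class="page" id="page-{number}">')
        # Blank lines separate paragraphs; single newlines become <br>
        for block in page.strip().split("\n\n"):
            lines = [escape(line.strip())
                     for line in block.splitlines() if line.strip()]
            if lines:
                parts.append("<p>" + "<br>\n".join(lines) + "</p>")
        parts.append("</div>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)
```

Escaping with html.escape matters here because legal and technical documents often contain characters such as & and <, which would otherwise break the markup.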

Additional information

DocMorph is a publicly available website that allows users to convert more than 50 types of electronic files into 5 possible outputs:
Portable Document Format (PDF)
Multi-page Tagged Image File Format (TIFF)
Single-page TIFF
Text (the OCR output used in the steps above)
Synthesized speech

I have not tried the synthesized speech output yet, but it might prove extremely useful for providing documents in audio format for those who rely on screen readers.

Examples of transcribed documents

I used DocMorph to transcribe IBM's Objections to SCO's Privilege Log (that table of numbers was a nightmare). In that case I didn't just upload the PDF. Instead I extracted just the parts of the PDF I wanted (the table) as images, and DocMorph was able to transcribe these. I used black and white (2 colour) images, which seemed to work best.

