PDFs come in a number of forms, and these often pose challenges for translation or reference use. Discussions of PDF handling tend to raise my blood pressure considerably, because very often the parties involved make blanket recommendations that do not consider the type of PDF or potential complications such as protection in some form or bizarre rendering by certain applications that create PDFs. It’s important to remember that no single solution will cope effectively with every possible PDF scenario.
My personal recommendations are based on about 25 years of dealing with these messes and watching so many people fall confidently on their faces with sub-optimal “solutions” that work for them in many cases. And even so, I still find interesting surprises, as I did once again while preparing examples for this article.
There are three basic types of PDF that you are likely to encounter:
1. Pure image PDFs (essentially a PDF wrapper around some scanned images)
2. Text-on-image PDFs (same as #1 but with hidden optical character recognition results that make the file searchable and text selectable)
3. Accessible text PDFs (where blocks of text can be selected and copied, or even edited with the right software)
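One way to make this three-way distinction concrete is a small triage step before choosing a tool. What follows is only a sketch: `PageStats`, `classify_pdf`, and both thresholds are hypothetical names and values, and the per-page numbers are assumed to come from whatever PDF library or tool you already use. Expect it to misjudge odd files, such as protected PDFs or ones with a garbage text layer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PageStats:
    """Hypothetical per-page measurements supplied by your PDF tool of choice."""
    text_chars: int        # characters found in the page's text layer
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def classify_pdf(pages: Iterable[PageStats],
                 full_page: float = 0.9,
                 min_chars: int = 20) -> str:
    """Guess the PDF type from page statistics. The thresholds are assumptions."""
    pages = list(pages)
    if not pages:
        raise ValueError("no pages to classify")

    # Count pages that look like full-page scans and pages that carry real text.
    scanned = sum(p.image_coverage >= full_page for p in pages)
    with_text = sum(p.text_chars >= min_chars for p in pages)
    mostly_scanned = scanned > len(pages) / 2
    mostly_text = with_text > len(pages) / 2

    if mostly_scanned and not mostly_text:
        return "image-only"        # needs OCR
    if mostly_scanned and mostly_text:
        return "text-on-image"     # OCR layer hidden behind the scan
    if mostly_text:
        return "accessible-text"   # text can be extracted directly
    return "unclear"               # inspect by hand
```

A mixed file (a few scanned appendices in an otherwise normal document, for instance) is decided by majority here, so it is still worth looking at the pages yourself.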
For the first case, OCR is usually your best solution. For years I used ABBYY FineReader for this task, because it was the best for the European languages I work with, but the company’s Russian history has made me hesitant to use it in recent years, despite the re-branding as an American company. Good conversion from images is not a trivial skill either; it requires a lot of practice and some fine-tuning for best results with many formats. Careless use of default settings inevitably causes enormous headaches later in processing.
If you are converting image PDFs for translation, or if you plan to use them for reference, it is a very good idea in most cases to make the second type of PDF out of them as well. Converting image PDFs to text or formats like *.docx for reference is often a bad idea, because OCR errors may reduce the usefulness for reference purposes or even mislead you as to the original content. Here’s a video I made many years ago showing how this is done:
With a PDF like the example in the video, I would probably produce clean plain text in a DOCX file for translation and use the text-on-image PDF with the memoQ PDF Preview tool so I can see possible conversion errors or formatting to add to the target text as I work.
What if you are given a PDF of the second type (text-on-image) to translate? Well, the conversion to text has already been done, but how would you access that text?
The screenshot above is from a textbook in a graduate course I took a few years ago. I was rather horrified when the professor distributed its 300+ pages as a double-page scanned PDF mess of the first type. We were expected to do a lot with that text, and often it was necessary to quote passages from it in online discussions. I also wanted to study the terminology carefully by doing statistical term extractions with good examples to ensure that my Portuguese would be fit to defend my position in class chats.
Simply selecting and copying the text is not a great idea.
Every line wrap becomes a paragraph break. Now this can be cleaned up by doing a clever find and replace or using TransTools+, but I find that something like iceni InFix with its ability to extract XML from a PDF gives a faster, cleaner result.
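For anyone who prefers a script to find and replace, the cleanup can be sketched in Python. This is a heuristic under stated assumptions, not what TransTools+ or InFix actually do: the function name `unwrap_copied_text` is made up for illustration, and it assumes that a line ending without sentence-final punctuation continues on the next line.

```python
import re


def unwrap_copied_text(text: str) -> str:
    """Rejoin lines that were hard-wrapped when text was copied out of a PDF."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at the end of a line ("para-\ngraph").
    # Caveat: a genuine hyphenated compound broken at its hyphen is joined too.
    text = re.sub(r"(?<=[^\W\d_])-\n(?=[^\W\d_])", "", text)

    # Turn a single line break into a space unless the line ends with
    # sentence-final punctuation. Blank lines (real paragraph breaks) are kept.
    text = re.sub(r'(?<![.!?:"”)\n])\n(?!\n)', " ", text)

    # Collapse any doubled spaces the joins may have produced.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


sample = (
    "Every line wrap becomes a para-\n"
    "graph break when you copy\n"
    "text from a PDF.\n"
    "A new paragraph starts here."
)
print(unwrap_copied_text(sample))
```

A line that happens to end with a period in the middle of a paragraph will still be treated as a paragraph end, so the result needs a quick read-through.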
The third type of PDF (accessible text) is the easiest to deal with. Unfortunately, this advantage is often squandered by doing bad OCR on the text, introducing errors or formatting problems of many kinds, so having a copy of the original PDF is still extremely important. I see very few cases where translation customers or agencies who prepare an OCR text from a PDF do sufficient checking and correction of the results. This is particularly true of texts involving a lot of subscripts and superscripts or Greek letters for variables, for example.
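If you have a text extract from the original PDF, a short script can point you to the lines most worth comparing against an OCR version. The sketch below is an illustration only: the function names are hypothetical, and it catches only subscripts, superscripts, and Greek letters stored as Unicode characters, not ones produced by formatting (a small raised "2" in a normal font will be missed).

```python
import unicodedata

# Unicode name fragments for characters that OCR frequently gets wrong.
RISKY = ("GREEK", "SUPERSCRIPT", "SUBSCRIPT")


def risky_characters(line: str) -> list[str]:
    """Return the Greek letters and Unicode sub/superscripts found in a line."""
    return [ch for ch in line
            if any(key in unicodedata.name(ch, "") for key in RISKY)]


def lines_to_proofread(text: str) -> list[tuple[int, str]]:
    """List (line number, risky characters) for every line that needs a check."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        found = risky_characters(line)
        if found:
            flagged.append((number, "".join(found)))
    return flagged
```

Running `lines_to_proofread` on the extract from the original gives a checklist of line numbers to look up in the OCR result.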
My favorite approach to converting PDF files with accessible text is to use iceni InFix. With the unlicensed (demo) version, I can get a decent plain text extract without much ado, and the licensed version offers a “story export” feature that is rather good, its main flaw being occasional breaks in the middle of a sentence (a common issue with any PDF conversion method). The application also offers OCR, though the results generally weren’t as good as those from ABBYY FineReader in my tests.
I did a rather long webinar showing all the uses of iceni InFix for translation, which I recommend watching on YouTube so you can use the time-coded menu in the Description field to jump to the parts that interest you most:
One thing I really, really like about iceni InFix is that it defeats the protection on a PDF. Several times customers have given me passworded PDFs to translate late on a Friday, then disappeared for the weekend expecting the result on Monday morning. Most conversion methods fail when there is protection like that, but with InFix I was able to get at the text and get to work quickly.
But what about Microsoft Word?
Microsoft Word can not only make PDF files from a word processing document; in some cases it’s also a rather good converter of PDFs. Not always though.
That was Microsoft Word’s conversion of this document:
If at first you don’t succeed, just go use iceni InFix!
But my CAT tool can import PDFs!
I’ve come close to slapping some people who say that. The same people eventually try to import an image PDF (type #1) and wonder why that doesn’t work.
But even where the imports kinda sorta work there are inevitably problems. A dozen years ago, I wrote an article about this, comparing various CAT tools and their PDF imports, and the situation really hasn’t improved since then. The results of using import filters in a CAT tool are generally worse than any other option discussed above, with the possible exception of memoQ’s plain text import option on a simple, single column file. Many Trados Studio users are happy with the trashy DOCX files that get made from PDFs, but all those extra spaces and other format screwups are a headache that diminishes TM leverage. Clean OCR or an XML extract with InFix (which I might convert to a good plain text file by importing to a CAT tool with paragraph segmentation, then copying source to target and saving as text) is always better.
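To see why stray whitespace costs TM leverage, consider two segments that differ only in conversion debris: an exact-match lookup treats them as different strings. A minimal sketch of a normalization pass follows. The name `normalize_segment` and the choice of characters to strip are assumptions for illustration, and no CAT tool is claimed to work this way.

```python
import re

# Invisible characters that PDF conversion often leaves behind:
# soft hyphen, zero-width space, byte order mark.
INVISIBLE = dict.fromkeys(map(ord, "\u00ad\u200b\ufeff"))


def normalize_segment(segment: str) -> str:
    """Strip invisible debris and collapse all whitespace runs to one space."""
    segment = segment.translate(INVISIBLE)
    # \s also matches tabs and non-breaking spaces in Python 3 strings.
    return re.sub(r"\s+", " ", segment).strip()


clean = "The pump is switched off."
from_pdf = "The  pump\u00a0 is switched\u00ad off.\t"

print(clean == from_pdf)                     # the raw strings differ
print(clean == normalize_segment(from_pdf))  # identical after normalization
```

Non-breaking spaces are sometimes intentional (French punctuation, numbers with units), so a pass like this belongs on the source text before import, not on a finished translation.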
And what about iceni’s TransPDF technology, which is even integrated in some CAT tools and which makes XLIFF files? It’s stupid in most cases. Well, so is the XML for that matter. Both processes are intended to produce PDF as a translated file, and unless you have a graphics-heavy poster about to go to press or something similar with just a bit of touch-up needed, the end result is usually not something the customer will want.
What are your go-to methods for PDF?
Do you have a better way or one which works well in special cases? Tell us about it in the comments!