The PDF (Portable Document Format) file format was developed by Adobe and is an open standard. PDF files can be read (and in some cases edited) with Adobe Acrobat Reader, a free application, or with Acrobat Pro. The purpose of the format is to produce readable documents from any application or paper document, in a form that can be read on all computers and platforms without access to the source files and without having to buy the originating application. This also means that no documents are produced in Acrobat as source files. All PDFs originate from other source file formats.
The problem for the translator is that customers often send a PDF without being aware that there is always another document in the background. In most cases it is easy to obtain the source documents from e.g. FrameMaker, InDesign, AutoCAD or Word, which of course are better suited to the translation workflow.
But Studio is able to translate PDF directly?
Well, yes, I am aware that the advertisements tell you that. The problem is that if you use the built-in PDF converter in Studio, you cannot do proper pre-editing, which I find essential. Do feel free to try; it might work OK in some cases. But if you find your document full of rogue tags and in-line paragraph breaks, then follow my advice in this document and pre-process your PDF.
First rule of PDF translation - Do not do it!
Never agree to translate from PDF without first putting in a certain amount of asking, begging, investigating and extorting to get the source files from your customer! Tell your customer that the pre-processing and post-processing of every PDF will add 1 h to the translation bill, and that the result will still be of poor quality or a text-only file.
If this fails, and you do not have any other customers to work for, then it is actually possible to work from PDF as well. In some cases there really are good reasons why the source files cannot be obtained.
PDF conversion to text
In contrast to converting other file formats to and from your CAT tool, conversion from PDF is an irreversible process. You will never get back the same formatting that you started from. So never promise anything other than a plain text file without pictures and formatting when working with PDF. That said, in many cases it is actually possible to keep the formatting fairly close to the source, if you know how. There are two main types of PDFs: graphics based and text based. If you scan from paper and save as PDF, you get a graphic image of the text. If you make a PDF by electronically "printing" the result from an InDesign or a Word document, you get a better format, as the PDF still retains the text as editable text. In the following I will call the two types graphic PDF and text PDF.
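If you want to check which of the two types you have been sent before choosing a tool, a rough test is whether any text can be extracted at all. The following is only a minimal sketch under my own assumptions: the function names and the threshold of 20 characters per page are hypothetical, and the extraction helper relies on the third-party pypdf library, which is not part of the workflow described here.

```python
def looks_like_graphic_pdf(page_texts, min_chars_per_page=20):
    """Return True when the pages carry (almost) no extractable text.

    page_texts is a list with one extracted string per page.
    min_chars_per_page is an arbitrary threshold, not a standard value.
    """
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_page_texts(path):
    """Extract the text of every page using pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # imported here so the check above works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as looks_like_graphic_pdf(extract_page_texts("customer.pdf")), where the file name is just an example, returning True suggests a scan that needs OCR. Note that a scan that has already been run through OCR and carries a hidden text layer will look like a text PDF to this check.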
Conversion of text PDF
For this type of PDF you will need a good converter. I have tested several, and I now use Nuance PDF Converter Professional. PDF Converter manages to keep sentences in one piece (no paragraph breaks at the ends of lines) and also handles most pictures, tables and columns. I always try to save in full graphic format first, but if that looks too ugly, I save as text only.
Conversion of graphic PDF
If the text is a picture of text, you have to use OCR. There is some kind of OCR function in the above-mentioned PDF converter too, but a real OCR program like ABBYY FineReader is much better for this kind of document. In some cases I have also found that the PDF converter cannot handle a text PDF very well, and then it is better to use FineReader instead. FineReader also gives you the option of saving as a fully formatted or a plain text Word document.
Pre-editing of PDF
After the conversion, and before translating a PDF, there is always some pre-editing to do. I have found that it saves the most time to do as much pre-editing as possible before translation. If you save as plain text, some of the points below can of course be skipped.
1) Set Word to "Show invisibles" and look for misplaced paragraph breaks in the middle of sentences. Also check for new line characters and either change them to paragraph breaks or remove them. This avoids having sentences split into several segments. I also recommend searching for soft hyphens and replacing them with nothing. That reduces the rogue tagging.
2) Sometimes you can also get problems with language-specific diacritics, e.g. åäö or ü. In such cases, select all and set a font that contains your specific letters. Note that some special symbols might change when you do this, if they use other fonts. These must be restored after translation.
If you save as formatted text:
3) Select all of the text, select Home/Font/Character spacing and apply the following settings:
Scale 100 %
Distance Normal
Position Normal
Kerning remove tick
This will prevent many of the rogue tags in the CAT editor.
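If you saved as plain text, the clean-up in step 1 can also be scripted instead of being done by hand in Word. The following is only a sketch of one possible heuristic, not a tested tool: the function name pre_edit is my own, and it assumes that a break is "misplaced" when the line before it does not end in sentence-ending punctuation and the line after it starts with a lowercase letter. Headings and list items can fool that rule, so check the result by eye.

```python
SOFT_HYPHEN = "\u00ad"
SENTENCE_END = ".!?:;"


def pre_edit(text):
    """Remove soft hyphens and join lines that were broken mid-sentence.

    Returns the text with one paragraph per line. Empty lines are dropped.
    """
    text = text.replace(SOFT_HYPHEN, "")
    # Treat every kind of line break as a plain newline. The vertical tab is
    # included on the assumption that it stands for a manual line break.
    for line_break in ("\r\n", "\r", "\v"):
        text = text.replace(line_break, "\n")

    paragraphs = []
    for line in (raw.strip() for raw in text.split("\n")):
        if not line:
            continue
        previous = paragraphs[-1] if paragraphs else ""
        # Heuristic: no sentence end before the break and a lowercase letter
        # after it means the break sits in the middle of a sentence.
        if previous and previous[-1] not in SENTENCE_END and line[0].islower():
            paragraphs[-1] = previous + " " + line
        else:
            paragraphs.append(line)
    return "\n".join(paragraphs)
```

Read the converted text file, pass its contents through pre_edit, and save the result under a new name, so that the untouched conversion is still there if the heuristic joins something it should not.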
Post-editing PDF
It is also a good idea to proofread your translated document and compare it with the source PDF. In most cases you will find some deviations to correct, even if you have done thorough pre-editing.