Invisible gremlins in translation!
Zero-width characters can induce madness in some text processing and translation environments.
From time to time in the course of more than two decades of commercial translation work, I would encounter some strange phenomena in the text I worked with.
Search functions failed to find a word even when I could see it right in front of me on the screen.
Expected automatic glossary lookups failed to happen.
Full translation memory matches of the source text were shown as fuzzy matches, and not even “high fuzzies”.
Simple regular expressions would fail to perform as expected for some groups of characters but worked fine for others.
Attempts to select a word by double-clicking resulted in only a substring of the word being highlighted. But if I overwrote the word, the subsequent selection behavior was normal.
One day I copied such a “troubled word” onto a page for converting character strings into Unicode code points.
When I compared the output code points to the number of letters I could see, I found more code points than letters: there was an extra, unfamiliar code, U+200B, in the list. I looked that up and saw that it was a zero-width space.
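The same check can be done locally rather than on a conversion page. The following is a minimal sketch in Python; the sample word and the helper name describe_code_points are made up for illustration and are not taken from any real project.

```python
import unicodedata

def describe_code_points(text):
    """Return one (code point, Unicode name) pair per character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Hypothetical "troubled word": it looks like "word" on screen,
# but a zero-width space (U+200B) hides between "wo" and "rd".
troubled = "wo\u200brd"

for code, name in describe_code_points(troubled):
    print(code, name)

print(len(troubled))       # 5 code points for 4 visible letters
print("word" in troubled)  # False: a plain search for "word" misses it
```

The last line shows why search functions and glossary lookups can fail on a word that looks perfectly normal on the screen.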
These little gremlins weren’t found in every text, but once I was aware of what they were, I realized that many past problems I had experienced, some of which allowed translation errors to slip through, occurred because of these invisible characters. Well, invisible in some environments at least, as you’ll see below.
Stanislav Okhvat provides a “Pre-flight Checker” utility in his Transtools+ suite which detects and eliminates these gremlins in Microsoft Word files before they can inflict mischief on a project. But if these characters occur in other file formats, you may need other tools or techniques for a cleanup.
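For plain text extracted from other formats, one possible technique is a small script that reports where invisible characters sit. The sketch below is only an illustration and not part of any tool named here: it assumes the text is already available as a Python string, and it flags every character in the Unicode general category Cf (format characters), which includes U+200B but also others such as the soft hyphen. The function name find_format_characters and the sample text are hypothetical.

```python
import unicodedata

def find_format_characters(text):
    """Yield (line, column, code point, name) for each Unicode format
    character (general category Cf), a group that includes U+200B."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                yield (line_no, col, f"U+{ord(ch):04X}",
                       unicodedata.name(ch, "<unnamed>"))

# Hypothetical sample: line 1 hides a zero-width space, line 3 a soft hyphen.
sample = "Zero\u200bwidth\nplain line\nsoft\u00adhyphen"

for hit in find_format_characters(sample):
    print(hit)
```

Reporting rather than deleting is a deliberate choice in this sketch: some format characters are legitimate in certain languages and scripts, so it is safer to look at what was found before removing anything.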
Some environments can be configured to show these usually invisible zero-width characters.
The response to zero-width spaces or other zero-width characters is different in various translation environment tools. In memoQ (as of the current version 11.2), they are not seen at all, even when the display of non-printing characters is enabled, though unexpected breaking of words can hint at their presence.
Those zero-width spaces can be made apparent by converting them to tags with the memoQ Regex Tagger:
Regex Find & Replace can be used in memoQ to eliminate the zero-width spaces by using the Unicode value (\u200b) in the Find expression and nothing (empty) in the Replace field. Or, if you have used the Regex Tagger and there are only a few of them, simply delete the tags. A tagging configuration like the one above can be saved and employed in a cascading filter to make all those zero-width characters immediately apparent.
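Outside memoQ, the same find-and-replace idea can be sketched with Python's re module. This is an illustration under stated assumptions, not a description of how memoQ works internally: the function name remove_zero_width_spaces and the sample segment are hypothetical, and only U+200B is targeted.

```python
import re

# Matches only the zero-width space, the same \u200b used in the Find expression.
ZWSP_PATTERN = re.compile("\u200b")

def remove_zero_width_spaces(text):
    """Return (cleaned text, number of zero-width spaces removed)."""
    return ZWSP_PATTERN.subn("", text)

# Hypothetical segment with two hidden zero-width spaces.
segment = "trans\u200blation memo\u200bry"
cleaned, removed = remove_zero_width_spaces(segment)

print(cleaned)
print(removed)
```

Other zero-width characters, such as U+200C and U+200D, could be added to the pattern, but they carry meaning in some scripts, so this sketch leaves them alone.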
OmegaT likewise offers no obvious indications of zero-width characters.
The Phrase TMS makes a real hash of an imported text with zero-width spaces:
That’s no surprise given that I’ve found the programming of Phrase (formerly Memsource) to be sloppy since I first became acquainted with it 12 or 13 years ago.
Trados Studio displays gaps in the translation and editing grid, with the preview pane showing apparently normal text:

I didn’t examine cases of zero-width characters in any browser-based environment other than Phrase in a default configuration.
I have encountered problems caused by zero-width spaces in numerous translation environment tools over the years, with source texts in many different formats deriving from various sources. WalkMe™ content management systems show a lot of them in translation exports I’ve seen in several formats. Older formats for MS Word (DOC) and RTF frequently have them. Macros created by the EU DGT for splitting multilingual files in translation processes caused these zero-width characters to be inserted randomly in words if versions of Microsoft Office later than 2003 were used. This caused hundreds of thousands of euros in losses for a contracted LSP, because match rates in the files given to translators for work were lower than the ones used by the DGT for costing.
In short, these invisible characters can mean real trouble if they aren’t eliminated at the right point in a process. And if they are not identified, the problems they cause may go unsolved, resulting in economic damage and considerable frustration.
Development teams for some translation environments such as memoQ and OmegaT should consider explicit measures to reveal and handle these zero-width characters. As the screenshot above shows, Phrase has perhaps the most severe problems with them, and recommendations and measures should be developed to address this, beginning with the segmentation effects.
Have you had experience with problems caused by zero-width characters in your texts? If so, in what environments, and how did you deal with the trouble?
Many non-technical wordworkers who encounter such difficulties regard them as mysterious divine (or in this case diabolical) doings of the electronic deities that watch and rule over modern life. But these mysteries have perfectly rational explanations and solutions which generally don’t require black chickens to be sacrificed.
Interesting. I was recently working with a pdf file (in a pdf reader, not a CAT tool) where the search function refused to find words that I could clearly see were present in the file. I'm guessing this could well have been caused by the same (or a similar) issue.
Many times I have run into something similar in German source texts that were not composed with Unicode fonts. You'd think everything would be Unicode now, but sometimes things still arrive in a pre-Unicode font.
In those texts, a word like "früher" will look fine, like just one word, but the programs will perceive it as something like "fru<¨>her". The programs will also not recognize it for glossary display, among other problems. Luckily, Word will select the whole thing as a misspelled word in spellcheck, so a quick spellcheck before import usually fixes the problem.
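If the hidden piece in such a word is a Unicode combining diaeresis (U+0308) following a plain "u", which is an assumption and may not match what a legacy font actually produces, Unicode normalization can merge the pair into a single "ü". A minimal Python sketch of that case:

```python
import unicodedata

# Assumed decomposed form: "u" followed by U+0308 COMBINING DIAERESIS.
decomposed = "fru\u0308her"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 7 6
print(decomposed == "fr\u00fcher")     # False: looks the same, compares unequal
print(composed == "fr\u00fcher")       # True after NFC normalization
```

If the stray character is something else, for example a spacing diaeresis, normalization alone would not join it to the letter, and the word would need to be retyped or fixed with a targeted replacement.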
One time, I handed over a job and was told by the project manager, with the utmost urgency, that I hadn't finished it. It turned out that import into both MemoQ and Trados would stop at a certain point and the remaining text would not display. Trados wouldn't show me the text it was importing (these were text files and not Word files), so I couldn't see what was causing the glitch. However, MemoQ's text import dialog, where you can choose the text encoding, shows you exactly what the text you're importing will look like. I scrolled down to the point where import stopped, and I found a strange Chinese-looking character that didn't belong there. I opened up the text file, deleted and rewrote the area where the invisible character was, and then everything imported fine. The client always wanted everything done in Trados, but if I hadn't been using MemoQ, I'd never have found the problem, and the project manager could have blamed me, even though she didn't know how to fix it herself.