WTF? Regex education in translation :-(

Managing the impenetrable sensibly

Jun 23, 2024

I’ve been at odds with the teaching of regular expressions (regex, /ˈrɛɡɛks/) to translators, project managers and language service support staff for quite some time now. The definition I usually offer to those unfamiliar with regex is that it is a pattern description language used to match recurring structures in a text so that something can be done with the matched text. Maybe not a definition up to Oxford or Webster’s standards, but useful nonetheless.

What kind of patterns are we talking about? Well, currency amounts, for instance. We can usually recognize these in a text based on the context and the ways in which they are written. They might be expressed as whole numbers, or as decimal fractions, typically with two places after the decimal marker. But there can be exceptions depending on the context:

Another typical characteristic of a currency expression in a text is some indication of the type of currency involved. In the example above, euros are indicated by the marker EUR, US dollars by USD. But there are other ways those currencies might be specified, some language-independent, some specific to a given language. And the position of the currency marker may also vary, so that 1 EUR for the conversion might have been written as EUR 1, €1, 1 €, 1 euro, 1 євро, 1 ευρώ, ١ يورو or something else. Decimal numbers in a currency expression may also be encountered with different decimal separators, or the numbers might be written out as text: one euro. Such a simple thing, that single euro, but in the context of a quality assurance check in translation, it might be somewhat complex. In a text written by a careless author or by multiple authors of different backgrounds, the “same” currency amount might be written in many different ways, potentially complicating the creation of a regular expression to match the patterns for that amount.
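For the curious, here is a minimal Python sketch of what such a pattern might look like. It is an illustration under loud assumptions, not a complete solution: it covers only three of the euro markers listed above (EUR, € and euro), allows a comma or a period as the decimal marker, and ignores amounts written out as text. The names EURO_AMOUNT and find_euro_amounts and the sample sentence are made up for this example.

```python
import re

# Hypothetical pattern for illustration; it covers only a few euro notations.
EURO_AMOUNT = re.compile(
    r"""
    (?:\bEUR\s?|€\s?)\d+(?:[.,]\d{2})?          # marker first: EUR 1, €1
    |
    \d+(?:[.,]\d{2})?\s?(?:EUR\b|euro\b|€)      # marker last: 1 EUR, 1 euro, 1 €
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_euro_amounts(text):
    """Return every euro amount the sketch pattern recognizes in text."""
    return [match.group(0) for match in EURO_AMOUNT.finditer(text)]


sample = "Pay EUR 1, €1, 1 €, 1 euro or 1,50 EUR."
print(find_euro_amounts(sample))
```

Even this small sketch needs two alternatives and several optional parts, and it still misses the Ukrainian, Greek and Arabic forms.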

That’s probably already a lot more than most of you want to know about regex. So if I were to start going on about this:

\d+,\d{2}\b

or this:

(?i)^fig[ures\.]*\s*\d+\p{L}?

you would probably excuse yourself to visit the restroom and then discreetly slip out the back door to escape the conversation.

And yet all the time I hear people say to translators and others that they should learn regex!

No they bloody well shouldn’t. At least not the damned syntax, however fascinating some nerd might find it. What a waste of time, really.

A typical regex workshop in Translation World will tell you that a single digit might be represented as [0-9] or \d or something else like the Unicode character range for some weird number set you’ve never seen before, but you can look that sort of thing up on a cheat sheet rather quickly if you really want to know.

I would really be wasting your time telling you such things in a class, especially as I know you won’t remember it if I call you on the phone three hours later and ask you about it after you’ve stowed your notes away somewhere. And I would waste far more of your time walking you through examples of how to identify some currency expression or dates in one format or another. You simply won’t learn the syntax well without months or even years of daily practice, and if you’re that mission-focused, well, you probably don’t need my instruction, just a good book like one of those written by Anthony Rudd or a few web page tutorials.

So is there any point to regex education for those involved with translation processes? Why certainly. Just not for syntax.

The most important thing to learn about regular expressions is to recognize when you are confronted with a problem to which they might apply. Can you describe the issue with some level of abstraction that defines, however loosely, some sort of pattern? In a teaching situation, an instructor might offer examples to raise awareness (or you can just open one of Anthony’s books and browse the chapter headings, ignoring all the syntax details).

A typical translation workspace tool like Phrase, Trados Studio, Wordfast Pro or memoQ uses regular expressions in different contexts. Filtering text to find all segments that have only numbers in them, for example. Or identifying dates written according to British conventions in a text intended for readers in Alabama, maybe in some kind of automated quality assurance module. Or offering a ready-to-go insertable currency expression for your target text based on whatever occurs in the corresponding source text. Or finding target text that exceeds certain limits, perhaps a number of characters based on the source text and an allowed expansion percentage.
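Two of these uses can be sketched in a few lines of Python. This is only an illustration of the idea, not how any of the tools named above implement it. The function names, the set of characters treated as "numeric" and the percentage-based limit are all assumptions made for the example.

```python
import re

# Assumption: a "numbers only" segment consists of digits plus spaces,
# periods and commas, and contains at least one digit.
NUMBERS_ONLY = re.compile(r"[\d\s.,]+")


def is_numbers_only(segment):
    """True if the segment holds nothing but digits and separators."""
    has_digit = any(ch.isdigit() for ch in segment)
    return has_digit and NUMBERS_ONLY.fullmatch(segment) is not None


def exceeds_limit(source, target, allowed_expansion_percent):
    """True if the target is longer than the source plus the allowed expansion."""
    limit = len(source) * (1 + allowed_expansion_percent / 100)
    return len(target) > limit


print(is_numbers_only("1 234,56"))
print(is_numbers_only("Page 12"))
print(exceeds_limit("abcde", "abcdefgh", 50))
```

Note that the length check needs no regex at all, only a character count. That is part of recognizing which problems regex actually applies to.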

In each of these applications and others, the regex does not get used as a disembodied expression in your head. It is applied in certain fields, perhaps in particular dialogs, and the way in which it is interpreted might not follow the rules you expect or were taught in a class. If you are doing something that involves a non-breaking space (maybe between a number and its unit), maybe you can represent that non-breaking space by its Unicode value in one form or another (like \u00a0) or maybe you have to copy and paste an actual non-breaking space into a field where you have no hope of seeing non-printing characters. My God. What a mess. But these are details for a technician, not for the average translator or project manager.
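For the technician, then, a minimal Python sketch of the non-breaking space case. Python’s re module accepts the \u00a0 notation in a pattern; whether a given field in a translation tool does is exactly the kind of detail that has to be tested there. The unit list and the names NUMBER_NBSP_UNIT and show_nbsp are made up for the example.

```python
import re

# Number, one non-breaking space (U+00A0), then a unit. The unit list is made up.
NUMBER_NBSP_UNIT = re.compile(r"\d+\u00a0(?:kg|km|%)")


def show_nbsp(text):
    """Make non-breaking spaces visible by replacing them with a tag."""
    return text.replace("\u00a0", "[NBSP]")


sample = "10\u00a0kg shipped, 20 kg pending"
print(NUMBER_NBSP_UNIT.findall(sample))  # only the amount with the non-breaking space
print(show_nbsp(sample))
```

The second function is there because the two kinds of space look identical on screen; tagging them is one way to check what a text really contains.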

So what do these people really need to have? Really?

An understanding that they have a problem that regex might solve
The ability to describe that problem to themselves or someone else
A means of recording and looking up solutions that they find or that are supplied to them
Descriptions in their personal records that tell them exactly what they need to know to use an expression by copying and pasting, perhaps with some minimal adjustment

Nothing more.

“But, but, but… !!!” I hear the earnest lecturegeeks exclaim. No buts. Just leave it at that. My experience as a “teacher” of regex is that if I give you a library of typical expressions that each have descriptive labels and instructions for use, you’ll be able to solve ninety percent or so of typical issues you encounter in your routine work.

And as you apply these pre-cooked solutions, you’ll begin to acquire a familiarity with the structure of this scary pattern description language we call regex. If Stephen Krashen were a math geek, he might say that by exposure to the comprehensible input of your regex library (made comprehensible by those descriptive labels and instructions) you begin to acquire the regex language naturally, just as you would a human language for which you are exposed to input that is not overwhelming. This works. Time and again. And until it does, you just get a lot of stuff done using regex.

Regex Buddy (which I don’t use but may investigate at some point and explain to y’all) does something like this, but it seems rather geeky to me and would probably scare the bejeezus out of many people. memoQ offers the Regex Assistant; my public discussion of this on YouTube might interest some, and I share some tools that can convert the libraries of that assistant into easily readable HTML files where one can look up solutions and copy/paste them into other environments such as Trados Studio or Phrase.

But really, you can do all this in a simple text file or Excel spreadsheet and use the Find function to locate the expression and explanation you need to use it. It can be that simple.
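If you prefer something scriptable, the same idea fits in a few lines of Python. This is a minimal sketch of a labeled library with a keyword lookup. The entries, labels, notes and the names REGEX_LIBRARY and find_entries are invented for illustration, and the expressions are stored as plain text for copying and pasting, not executed.

```python
# Hypothetical personal library: each expression carries a descriptive label
# and usage notes, as a text file or spreadsheet row would.
REGEX_LIBRARY = [
    {
        "label": "Number with comma decimal marker and two decimal places",
        "pattern": r"\d+,\d{2}\b",
        "notes": "Use in a filter or QA check; swap the comma for \\. if needed.",
    },
    {
        "label": "Segments that do NOT contain a given text",
        "pattern": r"^((?!string).)*$",
        "notes": "Replace 'string' with the text that should be absent.",
    },
]


def find_entries(keyword, library=REGEX_LIBRARY):
    """Return the entries whose label or notes mention the keyword."""
    keyword = keyword.lower()
    return [
        entry
        for entry in library
        if keyword in entry["label"].lower() or keyword in entry["notes"].lower()
    ]


for entry in find_entries("decimal"):
    print(entry["label"])
    print(entry["pattern"])
```

The lookup does nothing more than the Find function in a text editor would; the value is in the labels and notes, not the code.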

A good filing and lookup system for regexes is far more important than any understanding of regex syntax, and you can solve most routine problems in translation work with that and nothing more. I keep a Kindle copy of both of Anthony’s books on my laptop and just use the search function to locate what I need if it isn’t already in my personal library.

And if you need more, the best thing to have at your disposal is the ability to communicate the problem, with appropriate examples, to an expert who has worked with regex for many years and might be able to dictate a solution just out of her head in a minute or so, while you might take hours or days with the problem and find no good solution. Really.

José

Jun 24, 2024

Your cheat sheet is almost all we need. I would only add two of the most important ones:

"^": segment start; and

"$": segment end

Riccardo Schiaffino

Jul 24

Hi Kevin,

When I present on regex, I have a simpler definition of regex than yours: I call it "Search (and Replace) on Steroids" (see my presentations here: https://www.aboutranslation.com/p/regular-expressions.html).

Also, for memoQ and Trados users, I recommend Expresso, a free regex tool (similar to Regex Buddy): https://ultrapico.com/expresso.htm

Generally speaking, for technically-minded translators I actually do recommend that they learn at least the basics of regex, so that they may build a short library of useful expressions, such as

^((?!string).)*$

which allows searching for a string that is not there (useful in filters, when you can filter for a certain expression in the source and show only those segments that do not contain a certain translation in the target).
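A minimal Python sketch of how that expression behaves as a filter, assuming the segments are plain strings without line breaks. The function name segments_without and the sample segments are made up for the example, and a translation tool may interpret the same expression slightly differently.

```python
import re


def segments_without(term, segments):
    """Return the segments in which term does not occur anywhere."""
    # The lookahead (?!...) is checked before every character, so the match
    # can only reach the end of the segment if term never appears.
    pattern = re.compile(r"^((?!" + re.escape(term) + r").)*$")
    return [segment for segment in segments if pattern.search(segment)]


targets = ["Send the invoice today", "Send it today"]
print(segments_without("invoice", targets))
```

The re.escape call is there so that a search term containing characters such as a period or a question mark is treated as literal text.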

For myself, whenever I discover or create a new useful search string, I write it down in a repository, together with an explanation, so that I may reuse it in the future.

Riccardo

Translation Tribulations Substack
