Mo' money in translation!
That last article about finding currency expressions is hardly the last word.
Yesterday’s article discussing examples of how to find currency expressions in German source texts, published on the memoQuickies Substack and cross-posted here, was written largely to serve as a reference for my upcoming terminology workshops, providing some details which I doubted I would have time for in the 90-minute talks.
I suggested this regex as a useful example to screen documents and corpora (such as translation memories) for expressions of currency amounts:
(?i)((EUR|€|US\s?[D\$]|\$|GBP|(br[\.a-z]+)?\s?Pfund|£|RUB|₽|CHF|S?Fr|₣)\s?(\d+)|(\d+)\s?(EUR|€|US\s?[D\$]|\$|GBP|(br[\.a-z]+)?\s?Pfund|£|RUB|₽|CHF|S?Fr|₣))
And it is in fact useful and would adequately cover common cases encountered by most of the translators I know.
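To make the behavior concrete, here is a quick check of that screening regex in Python. The sample segments are my own, invented for illustration:

```python
import re

# The screening regex proposed in the article, verbatim
pattern = re.compile(
    r"(?i)((EUR|€|US\s?[D\$]|\$|GBP|(br[\.a-z]+)?\s?Pfund|£|RUB|₽|CHF|S?Fr|₣)\s?(\d+)"
    r"|(\d+)\s?(EUR|€|US\s?[D\$]|\$|GBP|(br[\.a-z]+)?\s?Pfund|£|RUB|₽|CHF|S?Fr|₣))"
)

# Invented sample segments for illustration
for s in ["Der Preis beträgt 250 EUR.", "a fee of $75 was charged", "rund 42 € pro Stück"]:
    m = pattern.search(s)
    print(m.group(0) if m else "no match")
```

Note that the expression matches the currency token whether it comes before or after the number, which is exactly the kind of flexibility the built-in pattern below lacks.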
That example is certainly an improvement over the one currently included by the software development team in the built-in regular expressions library of the memoQ Regex Assistant:
(\p{Sc}\s?|((USD|EUR|GBP)\s?))\d+(\.\d{2})?
That one:
- assumes that the currency symbol or code always precedes the amount, which is often not the case;
- doesn’t account for letter-based currency symbols that aren’t included in the Unicode character class \p{Sc}; and
- fails to recognize that many countries use a decimal comma rather than a decimal point, though the decimal delimiter is really irrelevant in that expression, since all it is intended to do is screen for possible currency amounts anyway.
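The order limitation is easy to demonstrate with a simplified stand-in pattern (Python’s re module doesn’t support \p{Sc}, so only the letter codes are kept here; this is my own illustration, not the built-in rule itself):

```python
import re

# Simplified stand-in for the built-in pattern: currency code first, then amount
builtin_like = re.compile(r"(?:USD|EUR|GBP)\s?\d+(?:\.\d{2})?")

print(builtin_like.search("EUR 100"))  # matches
print(builtin_like.search("100 EUR"))  # None: the amount-first order is missed
```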
And it’s better than an expression I found on Stack Overflow, which gives me a huge number of false positives in my data collections (especially in discussions of ISO standards):
(?:[A-Z]{3} [0-9]+(?:\.[0-9]+)?)|(?:[0-9]+(?:\.[0-9]+)? [A-Z]{3})
I’ve never been fond of writing [0-9] instead of \d for numbers, because I’m haunted by other forms of written numbers. That solution also sucks for me because it assumes the presence of spaces that I often see omitted. But the suckiest part of that sucky solution is that it generalizes the three-letter currency codes as any three capital letters: when I test it against my own TMs, I see a huge number of three-letter acronyms before or after numbers, and most of these have nothing to do with money.
Bottom line: using actual, known three-letter codes likely to occur in your documents will reduce the number of false positives. And without a word boundary \b in front of the regex for the three letters, something like UNESCO 2025 will be selected. Yuck.
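A short Python sketch shows both problems, and how known codes plus a word boundary help. The restricted pattern here is only an illustration with a handful of assumed codes, not the article’s solution:

```python
import re

# The Stack Overflow pattern quoted above
so_pattern = re.compile(r"(?:[A-Z]{3} [0-9]+(?:\.[0-9]+)?)|(?:[0-9]+(?:\.[0-9]+)? [A-Z]{3})")

# False positives: any three capitals next to a number will do
print(so_pattern.search("per ISO 9001 certification").group(0))  # ISO 9001
print(so_pattern.search("UNESCO 2025 conference").group(0))      # SCO 2025 (!)

# Illustration only: known codes plus a word boundary avoid both traps
known = re.compile(r"\b(?:EUR|USD|GBP|CHF) ?\d+|\d+ ?(?:EUR|USD|GBP|CHF)\b")
print(known.search("UNESCO 2025 conference"))    # None
print(known.search("Betrag: 250 EUR").group(0))  # 250 EUR
```

The UNESCO example is particularly instructive: without \b, the engine happily matches the last three capitals inside the longer acronym.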
The funniest part of my own proposed solution, for me, was that I wrote that regex in an article in which I had intended to include a discussion of some recent auto-translation rules I created for currency expressions like 3,2 Mrd. € (one way that €3.2 billion might be written in German), and those are missed entirely by my suggestion!
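As a rough sketch of what screening for such scaled amounts involves (the scale markers Mio./Mrd./Bio. and the decimal comma are my assumptions here, not the author’s own rules):

```python
import re

# Sketch: German-style scaled amounts such as "3,2 Mrd. €" or "15 Mio. EUR".
# Assumed scale markers: Mio. (million), Mrd. (billion), Bio. (trillion).
scaled = re.compile(r"\d+(?:,\d+)?\s?(?:Mio|Mrd|Bio)\.?\s?(?:€|EUR)")

print(scaled.search("Der Umsatz stieg auf 3,2 Mrd. €.").group(0))  # 3,2 Mrd. €
print(scaled.search("15 Mio. EUR wurden investiert").group(0))     # 15 Mio. EUR
```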
Even for “experts”, developing good, fit-for-purpose expressions is usually an iterative process with enough steps to meet many a daily fitness goal. And fitness for one’s purpose depends very much on the nature of the texts at hand. One of the reasons I have such a big collection of solutions from my past work and consulting is that a change of author with one of my clients very often entails some adaptation or updating of what was previously a 100% accurate solution.
Careful documentation of assumptions and limits in the records for your own resource archives can make this process of adaptation a lot less painful.
As I examined my own TMs to prepare the upcoming presentations and other articles, I also found the currency markers TEUR and T€ (both for thousands of euros), which I’ve accounted for in the past in QA rules and auto-translation rules but simply forgot as I drafted the regex for filtering.
Oversights like this are why I sound like a broken record in tutorial sessions where I help others with regex solutions: no matter how good you are at writing these damned expressions, it’s a sure thing that you’ll forget to apply things you actually know and have used for years. So the safe, responsible thing to do is record and curate your solutions to the best of your ability, and focus on improving your organization and solution curation skills over your syntax coding abilities.
So how did I update my currency filtering expression? I’m gonna keep y’all in suspense over that for now. Maybe I’ll talk about that a little in the workshops later this month. But for sure I’ll be incorporating these realizations in the many examples that are part of my upcoming book of regular expression solutions that I had hoped to finish last December before the US election results and the loss of my old dog knocked the wind out of me for a while.
But for those who want something similar now, I’ll remind you of the review article I wrote last May about two books from my esteemed colleague Anthony Rudd:
Regex recipes for translation (books reviewed)
That “practical cookbook” guide of his is a diamond mine of ready-to-use solutions, and his books also contain specific, practical tips for a variety of translation environment (CAT) tools. The guide I am working on covers a narrower range of solutions than Anthony’s works do, though it goes into more depth and detail for specific languages and discusses specific quirks of memoQ’s regex implementation that he doesn’t cover.