If ya gotta go there with REGEX....

A few tips on how to keep from going mad

Jun 30, 2024

It is entirely in line with the bang-my-head-on-the-wall experience of discussing regular expressions for translation and localization challenges that my post last week, about how productive one can be applying these expressions in some pre-packaged form without learning any of that awful syntax, has drawn only one comment so far: that this cheat sheet

is almost all one needs, with just two more “missing” syntax elements

which actually are in that particular regex cheat sheet (part of memoQ’s Regex Assistant library). Well… no, actually.

I’m not going to go deep into the syntax weeds here — I will be doing that a little, another time in that other Substack (memoQuickies, tips for memoQ productivity) — but trust me, to solve the kind of translation placeable and quality assurance issues I’m asked to help with, you need a lot more than that cheat sheet. You also need many months, possibly years of pretty intense practice, and the experience of falling on your face many times because you didn’t understand the scope of a problem as well as you thought you did.

I have hundreds of pre-cooked regex-based resources for use in memoQ and other translation workspaces, which I will share as I have time, as I have done on many occasions in the past in courses, public lectures, blog posts, private help initiatives and more. Along with each ready-to-use solution, I’ll walk you through the weeds a bit to show the logic behind what was done, the vulnerabilities of that particular solution and what might be done about them. And sometimes I’ll talk about some funky syntax you won’t find in that cheat sheet. All those syntax elements you see above related to letters? Pretty much out the window for a lot of things you might want to do with Hebrew, for example. Start getting into the Unicode category notation for that and other things, back references, lookarounds and such, and those damned weeds are gonna surround your lost self until some anaconda (or maybe Python) shows up to swallow you whole….
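To make the Hebrew point concrete, here is a minimal sketch in Python (my choice for illustration, not what memoQ runs). The sample sentence is made up. Python's standard re module has no \p{L} notation, so the sketch uses the usual workaround for "any Unicode letter" plus an explicit range for the Hebrew block; in the .NET flavor that memoQ uses, \p{L} and \p{IsHebrew} would be the closer equivalents.

```python
import re

# Made-up sample text mixing Latin and Hebrew words.
text = "Version 2 of שלום עולם costs 15 EUR"

# ASCII-only letter class, the kind a basic cheat sheet teaches.
# It silently skips every Hebrew word.
ascii_words = re.findall(r"[A-Za-z]+", text)

# Any Unicode letter: "word character, but not a digit or underscore".
# This stands in for \p{L}, which Python's stdlib re does not support.
unicode_words = re.findall(r"[^\W\d_]+", text)

# Hebrew only, using the Unicode block range U+0590 to U+05FF.
# The range also covers vowel points, which are marks rather than letters.
hebrew_words = re.findall(r"[\u0590-\u05FF]+", text)

print(ascii_words)
print(unicode_words)
print(hebrew_words)
```

The first pattern finds four words, the second finds six, and the third finds only the two Hebrew ones. Same text, three different ideas of what a "letter" is.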

But… but… but… I want to emphasize again a point I made last week about the importance of documentation.

… means of recording and looking up solutions…
… records that tell… exactly [how]… to use an expression by copying and pasting, perhaps with some minimal adjustment

I keep most of that in the entries of my memoQ Regex Assistant library. Not all. You might keep such records in a handy Excel or Word file, maybe organized in a table, or in whatever easily searchable form you prefer. That’s important, because I promise you will forget a lot of important application details otherwise.

But as I begin to share these regex-based resources, which are mostly for memoQ but which can be easily cannibalized for use elsewhere, you are going to encounter another important form of documenting regex work so it can be understood and adapted as necessary.

What is that scary-looking shite? It’s part of a cleaned-up version of some regex-based auto-translation rules for taking numbers of many forms and converting them to correct numbers as they are typically expected in English texts. With a special rule to avoid messing up section numbering. These rules can be used to generate placeable elements for direct use in translation and review, or they can be part of an automated quality check. But take a look at those green parts. Those are comments explaining parts of that code salad or telling me where I can go find some useful resources to help with further work.
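For anyone who has never seen a commented regex, here is a much smaller sketch of the same idea. It is not the rule set described above, just a Python illustration under my own assumptions: the source numbers are in German-style notation (dot for thousands, comma for decimals), and the names NUMBER, to_english and convert are invented for the example. Python's re.VERBOSE flag lets the comments live inside the pattern itself.

```python
import re

# Assumption: source numbers look like 1.234.567,89 or 12,5.
NUMBER = re.compile(r"""
    (?<![\d.,])              # do not start in the middle of a longer number
    (?P<whole>
        \d{1,3}(?:\.\d{3})+  # 1.234 or 1.234.567: dots as thousands separators
      | \d+                  # or plain digits with no separators
    )
    (?:,(?P<frac>\d+))?      # optional decimal comma and decimal places
    (?![.,]?\d)              # do not stop while more digits follow; this is
                             # what leaves section numbers like 3.2.1 alone
""", re.VERBOSE)


def to_english(match):
    """Swap the separators: dots become commas, the decimal comma a point."""
    whole = match.group("whole").replace(".", ",")
    frac = match.group("frac")
    return whole if frac is None else f"{whole}.{frac}"


def convert(text):
    return NUMBER.sub(to_english, text)


print(convert("Siehe 3.2.1: Umsatz 1.234.567,89 EUR, Anteil 12,5 %"))
# Known weak spot: a bare "1.234" is treated as a thousands-separated number,
# so a section numbered 1.234 would be changed too.
```

The weak spot noted at the end is exactly the kind of thing a comment should record, because six months from now nobody will remember it.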

The screenshot above may look more familiar to some memoQ users. It’s the same set of rules, but in a crappy integrated editor in memoQ that not only allows no comments, but strips out any comments from resources imported to memoQ. And you can’t even see all of the long expressions; editing is a nightmare. Other tools have similar issues.

In situations like this, it is important to have a good code editor on your computer. Even if you aren’t planning to edit the code. Free tools (and there are many) such as Notepad++ can be configured to display colors for tags, comments, etc., improving legibility and helping you find exactly the information you want. In an editor like the one in the screenshot just above it’s really hard to tell what’s what.

When I share regex-based resources on that other Substack, they will usually be in *.mqres files, and the XML code in them will be extensively commented (those green texts). Anyone interested in understanding what was done, how it might be adapted easily or what other issues there may be can open up those files (*.mqres files are just text, nothing compiled) and read the commentary like a badly written textbook. And maybe add their own comments for future use.

Treating configuration files as documentation resources is an important “trick” for future maintenance. Resource configurations developed in lousy editors that don’t allow comments should be exported, opened in a code editor and commented well enough that there will be no doubt about what all the regex code does, why it is written the way you see it and how you might want to change it sometime.
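If you want to skim just the commentary in a commented XML resource, a few lines of Python will pull it out. This is only a sketch: the XML below is a made-up stand-in, not the real *.mqres structure, and the element names rules and rule are invented. It needs Python 3.8 or later, because by default the standard XML parser throws comments away.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an exported, hand-commented resource file.
SAMPLE = r"""<rules>
  <!-- Rule 1: decimal comma to decimal point; leaves section numbers alone -->
  <rule>(\d+),(\d+)</rule>
  <!-- Rule 2: check the notes on thousands separators before changing this -->
  <rule>(\d{1,3})\.(\d{3})</rule>
</rules>"""


def list_comments(xml_text):
    """Return the text of every comment inside the root element."""
    # insert_comments=True keeps comment nodes in the parsed tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    return [node.text.strip() for node in root.iter() if node.tag is ET.Comment]


for note in list_comments(SAMPLE):
    print(note)
```

For a real file you would read its text from disk and pass it to list_comments in place of SAMPLE.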

Awwwww, but that’s so much work! But in this case, the work really will make you free from the madness that might ensue if you have some complicated regexes like the stuff you see in the screenshots of this article and you have to make some improvements to the solution.

Translation Tribulations Substack
