Preparing half-translated bilingual XML for Trados Studio – with XSLT

More and more translation clients, especially in the Web industry but also in application I18N/L10N, use the versatile XML standard for translation purposes. The market leader among computer-aided translation (CAT) tools, SDL’s Trados Studio, lets you translate XML with an “Any XML” input filter, which includes an assistant that lets you choose which XML tags and attributes will be visible in the editor as “translatables”. Unfortunately, this means that the source strings will be overwritten with the translation, which is a bad idea if the source file is already bilingual XML that contains source and target language strings in matched tags.

If the target strings are empty, you can easily copy the content over and translate right away. But if the file is already partly translated, things get trickier, since you want to avoid overwriting existing translations. Worse, if the client happily announces that the source of some of the translated strings has changed, things get more than just a bit tricky. Let’s have a look at how to prepare those files with XSLT!

Alright, you have seen XML already, haven’t you? Right. It looks similar to HTML, but you get to define the valid structure and tags in your own DTD. This basically means that while HTML is mainly used to display structured information to the human user, XML’s primary purpose is to contain structured information of any kind for humans and machines alike, and let separate stylesheets worry about how it will be displayed (e.g., as XHTML, PDF, LaTeX, CSV tables, plain text, you name it). If you want to know more, have a look at the XML and XSLT pages over at W3Schools.

The XML file

Let’s first have a look at the file we want to translate with Trados (or the free/open source OmegaT plus Okapi Rainbow combo, or any other CAT tool):

<?xml version="1.0" encoding="UTF-8"?>
<uistrings text="de-DE" translation="en-US">
  <string id="001">
    <text>Die Verbindung konnte nicht aufgebaut werden: {0}.</text>
    <translation>Couldn't establish the connection: {0}.</translation>
  </string>
  <string id="002">
    <text>Falsches Datum {0} im Feld "{1}": Geben sie ein Datum nach dem {2} an.</text>
    <translation></translation>
  </string>
  <string id="003">
    <text>Ihre Eingaben werden von der Heisenberg &amp; Söhne GmbH verarbeitet.</text>
    <translation>Your entries will be processed by Heisenberg &amp; Planck GmbH.</translation>
  </string>
</uistrings>

Sometimes, clients will wrap HTML into those tags as character data (<![CDATA[ ]]>), which means you will get to see every HTML tag in the translation environment as plain text. Be careful with those tags! Dear clients: Using CDATA may lead to messed-up code during translation. Please try to use namespaces instead to enclose HTML in XML; the HTML tags will then be correctly parsed and displayed as immutable tags, and the translator is less likely to forget or mangle a tag somewhere.
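To make the difference concrete, here is a small sketch of both variants. The sample sentence and the h: prefix are invented for illustration; only the XHTML namespace URI is the standard one.

```xml
<!-- Variant 1: HTML hidden in CDATA. The parser treats "<b>" as plain
     text, so the translator sees (and can break) the raw markup. -->
<text><![CDATA[Bitte <b>nicht</b> ausschalten.]]></text>

<!-- Variant 2: XHTML in its own namespace. The parser sees a real b
     element, which a CAT tool can display as a protected inline tag. -->
<text xmlns:h="http://www.w3.org/1999/xhtml">Bitte <h:b>nicht</h:b> ausschalten.</text>
```

In practice, the namespace declaration would more likely sit once on the root element rather than on every text element.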

The file starts with the XML declaration including version and encoding. The mandatory “root element” uistrings encompasses all other tags; it also holds the source and target languages as attributes. Inside, we can see three string tags with their IDs as attributes, each with one text and one translation tag holding the actual source and target content. Attention: If the file is saved as ANSI instead of UTF-8, the umlauts will typically throw parsing errors. Either save the file as UTF-8 or replace them with numeric character references such as &#246; for ö. A literal ampersand must always be written as the entity &amp;, whatever the encoding.

I have inserted three use cases: The string is already accurately translated, the string is untranslated (empty translation tag) and the string is translated but the translation doesn’t match the text (here: the company name has changed). Unfortunately, our virtual client has not marked that string as modified, for example by setting something like a new or modified="yes" attribute on the string or text element.
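For illustration, a marker of that kind could look like the fragment below. The modified attribute is the hypothetical one just mentioned and is not part of the sample file; any unambiguous flag would do the job.

```xml
<!-- Hypothetical: the client flags a changed source text with modified="yes" -->
<string id="003" modified="yes">
  <text>Ihre Eingaben werden von der Heisenberg &amp; Söhne GmbH verarbeitet.</text>
  <translation>Your entries will be processed by Heisenberg &amp; Planck GmbH.</translation>
</string>
```

With such a flag in place, the strings that need a second look could be found with a simple search instead of a full comparison of source and target.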

So, we have already translated strings, empty strings and strings that need to be edited. Usually, you would want to write your translations into the translation elements. However, telling Trados to parse the translation elements as translatables will lead to English text in TagEditor’s German source column for strings 001 and 003, and you won’t get to see string 002 at all, because it’s empty and nobody would ever need to translate “nothing”, right? And on top of that, you won’t ever get to see the German source text.

File preparation

Apparently, what we need to do before translating the translation elements is to copy the source text over, preferably without destroying extant translations. One way to achieve this is to use a text editor with regular expression search & replace: turn the whole XML thing into a tab-separated table, save it as .TXT, use Trados’ text table input filter to read and translate the file, and turn it back into an XML document with another RegEx. Been there (article in German); it works quite nicely, and you automatically have the source text in TagEditor’s source column and any existing translations in the target column. But let’s try using only XML this time, shall we?
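Jumping ahead a little: the plain copy step on its own could be sketched in XSLT as an identity transform with one override. This is a minimal sketch, assuming the element names of the sample file above. It only fills empty translation elements and leaves existing translations untouched, so it does nothing for translations whose source has changed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8"/>

<!-- Identity template: copy every node and attribute unchanged -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<!-- Override: a translation element without any non-whitespace text
     keeps its attributes and receives a copy of the content of its
     sibling text element -->
<xsl:template match="translation[not(normalize-space())]">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:copy-of select="../text/node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

The more specific match pattern wins over the identity template, so only string 002 of the sample file would be changed.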

XML and XSL are like HTML and CSS on steroids. Not only can XSL present XML data in a number of other languages, it also lets you convert one XML file into another, use variables, copy and move elements, and even use control structures such as if. One (good) use is to convert our XML file into an HTML file showing us three columns: ID, text and translation – and tell Trados in the file type options to use that .xsl stylesheet to display the preview window. Trados will even mark the currently edited segment with a red box in that preview, and we have our source and target sitting nicely side by side instead of having to stare at XML code. Example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- Sample XSL Stylesheet to display the above XML file as a HTML table in Trados' preview window -->
<xsl:output method="html" indent="yes" encoding="utf-8" doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"/>
<!-- What follows is regular HTML, except in tags with the namespace prefix xsl: -->
<xsl:template match="/uistrings"> <!-- XML root element of our file -->
<html>
  <head>
    <title>Preview</title>
  </head>
  <body>
    <table width="100%" style="border:1px solid #999;" cellpadding="2px" cellspacing="0px">
       <tr style="background-color:black;">
          <th style="color:white;" width="15%">ID</th>
          <th style="color:white;">Source</th>
          <th style="color:white;">Target</th>
       </tr>
      <xsl:for-each select="string"> <!-- Loop through all string elements in XML file and create table rows: -->
       <tr>
          <td style="color:blue;" width="15%"><xsl:value-of select="./@id"/></td>
          <td><xsl:value-of select="./text"/></td>
          <td><xsl:value-of select="./translation"/></td>
       </tr>
      </xsl:for-each>
    </table>
  </body>
</html>
</xsl:template>
</xsl:stylesheet>

But it gets even better: As I have said already, such an XSL sheet can also transform one XML file into another XML file, and that’s where we can make that whole CAT translation thingy work, because Trados actually has a special XML filetype that is bilingual and that is read, displayed and edited correctly: XLIFF, the XML Localization Interchange File Format, which is used by Trados and almost all other major CAT tools (as an import/export format if not natively). XLIFF is for bilingual texts what TMX is for translation memories.

This is how XLIFF can be generated by XSLT from our XML file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xliff="urn:oasis:names:tc:xliff:document:1.1" exclude-result-prefixes="xliff">
<!--This XSL Stylesheet will output XLIFF, XLIFF prefixes shall not appear in the resulting file -->
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/uistrings"> <!-- Start with our root element and create XLIFF: -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.1" version="1.1">
<file source-language="de-DE" datatype="plaintext" target-language="en-US"> <!-- datatype could also be "html"! -->
  <body>
    <xsl:for-each select="string">
      <trans-unit id="{./@id}">
         <source xml:lang="de-DE"><xsl:value-of select="./text"/></source>
         <target xml:lang="en-US"><xsl:value-of select="./translation"/></target>
      </trans-unit>
    </xsl:for-each>
  </body>
</file>
</xliff>
</xsl:template>
</xsl:stylesheet>

How it works: We begin with our usual XML declaration, followed by the declaration that this is going to be an xsl:stylesheet, including the XSL version and the namespaces for XSL and XLIFF. With exclude-result-prefixes, we also state that we don’t want the xliff prefix to show up in the output file. Then we proceed to the desired output, which is going to be XML and shall be indented for better readability. To define what our XLIFF file should look like, we begin our xsl:template at the uistrings root element (the one which holds all other elements) of our sample XML file.

The first thing written into the new file is its root element (xliff), together with its namespace and version, followed by a file element that names the source and target languages, and a body element. Then, we begin iterating through our strings from the XML file: For each string, we write one trans-unit carrying the id as its attribute. Each one will contain one source and one target element with the content of the original text and translation elements. Then we end the loop, neatly close our body, file and xliff tags and end the xsl:template. And that’s also the end of our xsl:stylesheet. Easy, isn’t it? You just need to know what your desired file must look like and insert the content into the corresponding places – the for-each statement does the rest.
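One possible refinement, sketched here as a variant and not as the sheet used in this article: since the uistrings root element already carries the language codes in its text and translation attributes, they can be read from there instead of being hard-coded. The unused xliff prefix declaration has been dropped in this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/uistrings">
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.1" version="1.1">
<!-- {@text} and {@translation} are attribute value templates: they are
     evaluated against the current node, here the uistrings root element -->
<file source-language="{@text}" datatype="plaintext" target-language="{@translation}">
  <body>
    <xsl:for-each select="string">
      <trans-unit id="{@id}">
         <!-- Inside the loop the current node is a string element,
              so the root attributes are reached via the parent (..) -->
         <source xml:lang="{../@text}"><xsl:value-of select="text"/></source>
         <target xml:lang="{../@translation}"><xsl:value-of select="translation"/></target>
      </trans-unit>
    </xsl:for-each>
  </body>
</file>
</xliff>
</xsl:template>
</xsl:stylesheet>
```

The same sheet would then work unchanged for a file with, say, text="en-US" and translation="fr-FR" on its root element.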

Convert to XLIFF, translate, convert back

Now let’s see if it works! You can download Apache’s Xalan XSLT processor either as a C++ binary or as a Java app (Xalan-C / Xalan-J). Personally, I find Xalan-C to be less of a hassle: You download the xalan-comb-… package for your system (usually x86-windows or amd64-windows) right from here, extract the archive, drop the contents from the Xerces directory into the Xalan directory (integrate the folders bin, include and lib) and there you go. There are other XSLT processors, but Xalan is open source, free, libre and easy to work with.

Once you are done extracting (no real “installation” required), copy the above XML file code (the first code box) and paste it into an empty text file. Save that as test.xml. Likewise, copy the code from the last sample XSL sheet and save that as test2xliff.xsl. From the Xalan “bin” directory, run:

xalan.exe -o testoutput.xlf test.xml test2xliff.xsl

Be sure to include the full path to where you saved the test files, e.g.:

xalan.exe -o C:\Users\Me\Documents\XMLtest\testoutput.xlf C:\Users\Me\Documents\XMLtest\test.xml C:\Users\Me\Documents\XMLtest\test2xliff.xsl
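If the transformation runs through, testoutput.xlf should contain roughly the following. Treat this as a sketch of the expected result: indentation and the way the empty target element is written (self-closing or as an open/close pair) depend on the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.1" version="1.1">
  <file source-language="de-DE" datatype="plaintext" target-language="en-US">
    <body>
      <trans-unit id="001">
        <source xml:lang="de-DE">Die Verbindung konnte nicht aufgebaut werden: {0}.</source>
        <target xml:lang="en-US">Couldn't establish the connection: {0}.</target>
      </trans-unit>
      <trans-unit id="002">
        <source xml:lang="de-DE">Falsches Datum {0} im Feld "{1}": Geben sie ein Datum nach dem {2} an.</source>
        <target xml:lang="en-US"/>
      </trans-unit>
      <trans-unit id="003">
        <source xml:lang="de-DE">Ihre Eingaben werden von der Heisenberg &amp; Söhne GmbH verarbeitet.</source>
        <target xml:lang="en-US">Your entries will be processed by Heisenberg &amp; Planck GmbH.</target>
      </trans-unit>
    </body>
  </file>
</xliff>
```

Note that all three use cases survive the conversion: the existing translations sit in their target elements, and the untranslated string 002 now has an empty target next to its German source.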

Afterwards, you can open the .xlf file (.xlf is the short form of .xliff) with Trados’ File/Open command and translate it. For me, it worked without a hitch.
Wham!
Now you know how to write an XSL transformation into XLIFF. Will you be able to write a similar XSL transformation to convert the XLIFF back into the original XML file format? Try it out and tell me!

Cheers,
Christopher Köbel of DeFrEnT
