A quick introduction
Perhaps Machine Translation would be solved by now if each word had only one meaning and, thus, only one possible translation. That is obviously not the case in the world we live in. Words have multiple meanings, and that is what we call polysemy. In the same way a concept can be expressed using different words (what we call synonyms), a word can be associated with multiple concepts or meanings.
Take the word bat, for example. It may have different meanings or senses, and it’s hard to provide a translation for it without any context at all. This is hardly ever a problem for human communication, though: when we communicate with others, the context is there surrounding us.
For machines, on the other hand, this is a big issue: they are just not as good as humans at dealing with meaning, which is a very intricate concept. Without the right context information, machines are forced to present either more than one output or, in the case of machine translation, the most probable result. This is especially evident with short text strings, like search queries. Try searching for bow in Google Images. Try the same search on eBay, and you will be presented with different categories to search the term in, like Sporting Goods and Crafts. Identifying which sense of a polysemous word is meant in a given context, and therefore which translation is right, is called word-sense disambiguation.
Google Translate results for “bow” – how many possible translations?
Challenges for statistical machine translation
It is then not surprising that polysemy poses certain challenges for machine translation, especially for e-commerce content, the area I work the most in. Let’s examine some typical examples:
- Case: when a user searches for “case”, most probably they are looking for a protective cover, for example, for their smartphone. Since most SMT engines are trained with general content, it is possible that this term gets translated as a legal case. Failing to correctly translate the item being sold is definitely not the best experience for the user.
- Silver: it can be either a material or a color. Without enough context, an MT system may easily produce the wrong translation for this term, and this could lead users to buy an item assuming it is made of silver, when it actually is silver-colored. The same goes for “gold”.
- Brands: Apple, Coach, Puma, and many other brand names are also common nouns. If an MT system is not properly trained, it’s likely that these words are translated when, for most language combinations, they should be kept untranslated. Machine translation language specialists at eBay work with a list of thousands of brands for Quality Assurance purposes. Here’s a tip: it’s way easier to find translation errors in brands composed of two or more words, as one-word brands may cause a lot of false positives.
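A brand check of this kind could be sketched in Python as follows. This is only an illustration, not the QA process used at eBay: the brand list, the function name missing_brands, and the sample segments are all hypothetical, and the match is a plain case-insensitive whole-word search. It sticks to multi-word brands, following the tip above.

```python
import re

# Hypothetical brand list; a real one would hold thousands of entries.
BRANDS = ["Under Armour", "Michael Kors", "The North Face"]

def missing_brands(source, target, brands=BRANDS):
    """Return brands found in the source segment but absent from the target."""
    missing = []
    for brand in brands:
        # Whole-word, case-insensitive match of the brand name.
        pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
        if pattern.search(source) and not pattern.search(target):
            missing.append(brand)
    return missing

# Invented example: the brand was translated instead of being kept as-is.
result = missing_brands(
    "The North Face jacket for men",
    "Chaqueta La Cara Norte para hombre",
)
```

Here result contains "The North Face", because the brand appears in the source but not in the target. Adding a one-word brand such as Coach to the list would not flag translations of the common noun, since the check only looks for the brand string itself; that distinction still needs context.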
Other typical examples are:
- Free (not costing any money, freedom, not containing a certain material, etc.)
- High (high value, high risk, high pitch, high quality, etc.)
- Hot (fashionable, high temperature, sexy, spicy, etc.)
- Light (daylight, bright, having little weight, etc.)
- Ship (boat, airplane, sending something, etc.)
How can you find polysemy-related translation errors?
Here’s my preferred method, and it involves regular expressions. If you want to determine whether the translation of a polysemous term is incorrect in a given context, you need to find out two things:
1) the incorrect translation(s)
2) context keywords
You can find out both by analyzing your MT output, and a tool like AntConc can be of great help. Let’s use one of our previous examples, the term “case”, and assume it was incorrectly translated into Spanish as “caso” in the context of cell phone cases. Knowing this, we want to create a regular expression that finds “caso” when it’s relatively close to keywords like Samsung, iPhone, Motorola, phone, etc. It could be something like this:
(?i)caso.+(iphone|samsung|phone|motorola)|(iphone|samsung|phone|motorola).+caso
Let’s break this down:
- (?i): case insensitive; find these words ignoring capitalization
- caso.+: the word “caso” (incorrect translation) followed by any one or more characters
- (iphone|samsung|phone|motorola): find iphone or samsung or phone, etc…
- |: stands for “or”; find “caso” followed by these keywords OR these keywords followed by “caso”
These regular expressions can be used in a QA tool, like Checkmate, to automate the process.
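To try the check outside a QA tool, the pattern can be assembled from the pieces in the breakdown above and run with Python's re module. This is a minimal sketch: the function name flag_suspect_segments and the sample segments are made up for illustration.

```python
import re

# Context keywords and the known incorrect translation from the example above.
KEYWORDS = r"(iphone|samsung|phone|motorola)"

# "caso" followed by a keyword, OR a keyword followed by "caso", ignoring case.
PATTERN = re.compile(rf"(?i)caso.+{KEYWORDS}|{KEYWORDS}.+caso")

def flag_suspect_segments(segments):
    """Return the segments where "caso" appears alongside a context keyword."""
    return [s for s in segments if PATTERN.search(s)]

# Hypothetical MT output, invented for illustration.
mt_output = [
    "Caso protector para iPhone 6",
    "Funda de silicona para Samsung Galaxy",
    "El caso fue cerrado por el juez",
    "Motorola Moto G caso de cuero",
]

suspects = flag_suspect_segments(mt_output)
```

With this sample, the first and last segments are flagged. Note that .+ puts no limit on the distance between the two words, and that caso also matches inside longer words such as “ocaso”; wrapping it in word boundaries (\bcaso\b) is one way to tighten the check if that causes false positives.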
How can you find context keywords in a smart way?
Identifying polysemy-related translation errors is not that complicated. Finding context keywords may be a bit more complex. Here’s a technique that may save you time and effort: we can use AntConc, a corpus analyzer, to find collocates (words that usually go together with a frequency greater than chance).
In the example below, I’m assuming “case” is an incorrect translation and I want to find out which terms occur in the same sentence with the highest frequency.
We see here that “for” occurs 27 times near “case”, “cover” 15 times, etc. Prepositions don’t help much and can be ignored. However, terms like Samsung or cover have richer semantic information and tell us more about the context, which is just what we need.
Using the Concordance tab, you can see the results in context, and use this information to create QA checks.
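The same idea can be approximated in a few lines of Python. This is a simplified sketch and not what AntConc does internally: it counts raw co-occurrences within a segment instead of using a word window or a statistical measure. The stopword list, the function name collocates, and the sample segments are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; extend it for real data.
STOPWORDS = {"for", "the", "a", "with", "and", "of"}

def collocates(segments, term, stopwords=STOPWORDS):
    """Count words that share a segment with `term`, skipping stopwords."""
    counts = Counter()
    for segment in segments:
        words = re.findall(r"\w+", segment.lower())
        if term in words:
            counts.update(w for w in words if w != term and w not in stopwords)
    return counts

# Hypothetical segments, invented for illustration.
segments = [
    "Leather case for Samsung Galaxy",
    "Samsung phone case cover",
    "Hard cover case for iPhone",
    "Court case documents",
]

top = collocates(segments, "case").most_common(2)
```

On this sample, “samsung” and “cover” come out on top with two occurrences each, while “for” is filtered out as a stopword. The most frequent content words are candidates for the keyword group in a QA regular expression.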
Using Polysemy for Quality Estimation
Machine Translation Quality Estimation (QE) is an automated way to estimate the quality of MT output without using a human reference translation. This means that, with QE, you don’t need to compare your MT output with one or more human translations (gold standard) to see how similar or different they are. QE uses features of the translation, like length, terminology, or spelling, for this purpose. And one of those features can be polysemous terms: the presence of a polysemous word in the source may be an indication of a potential quality issue. The signal is stronger still when a polysemous word in the source appears together with a known incorrect translation in the output.
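As a rough illustration, these two signals could be turned into numeric features like this. The sketch is hypothetical and not taken from any particular QE framework: the RISKY_TERMS lookup, the feature names, and the sample sentences are all assumptions.

```python
import re

# Hypothetical lookup: polysemous source term -> known incorrect translations.
RISKY_TERMS = {
    "case": ["caso"],
    "bat": ["murciélago"],
}

def polysemy_features(source, target, risky_terms=RISKY_TERMS):
    """Return two simple QE features for one source/target pair."""
    src_words = set(re.findall(r"\w+", source.lower()))
    tgt_words = set(re.findall(r"\w+", target.lower()))
    # Polysemous terms present in the source.
    polysemous = sorted(src_words & set(risky_terms))
    # Of those, the ones whose known incorrect translation is in the target.
    known_bad = [t for t in polysemous if tgt_words & set(risky_terms[t])]
    return {
        "polysemous_in_source": len(polysemous),
        "known_bad_translation": len(known_bad),
    }

features = polysemy_features(
    "Leather case for iPhone",
    "Caso de cuero para iPhone",
)
```

For this pair both features are 1. If the target were “Funda de cuero para iPhone”, only polysemous_in_source would be set, which marks the segment as worth a look without flagging a confirmed error.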