The Usual Quick Introduction
If you are familiar with our posts, you probably know by now that, at eBay, we use Machine Translation (MT) to translate specific types of content. If you had no idea, this is a great place to start. Why MT and not Human Translation (HT)? We actually do both, but some types of content have characteristics that make them impossible to process without the help of a machine. For example, there are over 800 million listings on eBay, each with a title of around 12 words – by the time a legion of human translators finished translating just the titles, all the items would probably already be sold.
There are different parts to any typical listing on any e-commerce site, such as a title, a description, and reviews. In this article, we cover the challenges that product reviews pose for MT.
Reviews are comments that buyers leave for items they have purchased. The fact that they are written by buyers is precisely the first, general challenge: user-generated content is difficult to translate. Some users will take the time to write a great review, in perfect English, using full sentences and proper grammar. Some just won’t, for many different reasons – typing a review on a phone, for example.
There is sentiment involved in reviews, which makes the content much more difficult to translate correctly. Reviews can get very intense from buyers who hated a product and will not hesitate to hurl epithets here and there. Some sites have filters in place to prevent reviews with profanities from being published, but users always come up with creative ways around them.
– Slang and idiomatic expressions: not so common in other content types, slang, colloquial language, and idiomatic phrases are all typical traits of reviews. Idiomatic phrases are not compositional, i.e., they have an arbitrary meaning that differs from the meaning of each of the words in the phrase. Deciding whether a phrase like “hands down” should be translated literally is a challenge for MT.
A common issue with these expressions is that most MT systems are not trained on this type of language. On the contrary, they are trained on more formal corpora, like the European Parliament Proceedings. The problem is that a Statistical Machine Translation (SMT) system simply cannot translate something it has not seen before, something that is not part of its training corpus.
In general, MT doesn’t deal too well with slang, and that is one of the reasons why Facebook trained their systems in a very specific way. After all, the language used in FB posts and comments is not so different from reviews.
- “My bad, evidently did not read or understand the systems supported.” (In most cases, this will be translated literally.)
- “Identify the new parts and reassemble. Piece of cake. Even a caveman could do it.” (This is a very common phrase, yet it gets mistranslated in major online MT systems. This is possibly due to the lack of context, which is another of the challenges we are covering below).
- “It doesn’t split and rip on the plastic ends like those cheap-@ss import straps.” (A very common issue with terms like this is tone: the resulting translation is usually “stronger” in tone than the original.)
– Context: or lack thereof! Lack of context is a common problem for MT in general. This issue goes hand-in-hand with ambiguity. “Clutch” will have different translations in categories like Clothing, Shoes & Accessories and Business & Industrial. And in a short review like “Very happy with new clutch master” it is really hard to get the translation right.
- “Great buy. This is now the only axe I play. Good luck finding one of the remaining items from this run.” (As you’ve probably guessed, this review was written by a guitar player and not by a lumberjack).
- “This is a sharp looking bat.” (Not one but three polysemous terms that can (and probably will) go wrong here. I can tell you that this review is not about a nocturnal flying animal that has been honed and is being looked at.)
You may want to use context-dependent regular expressions to catch some of these errors.
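One minimal way to sketch that idea is a small glossary keyed by listing category, flagging ambiguous terms for review. The `AMBIGUOUS_TERMS` mapping, the category names, and the sense labels below are illustrative assumptions, not eBay’s actual data or pipeline:

```python
import re

# Hypothetical per-category glossary: which sense an ambiguous term
# should take in each listing category (illustrative only).
AMBIGUOUS_TERMS = {
    "clutch": {
        "Clothing, Shoes & Accessories": "handbag sense",
        "Business & Industrial": "mechanical sense",
    },
    "bat": {
        "Sporting Goods": "baseball-bat sense",
    },
}

def flag_ambiguous_terms(review, category):
    """Return (term, expected sense) pairs worth a second look."""
    flags = []
    for term, senses in AMBIGUOUS_TERMS.items():
        # Whole-word, case-insensitive match on the ambiguous term.
        if re.search(rf"\b{term}\b", review, re.IGNORECASE):
            flags.append((term, senses.get(category, "unknown sense")))
    return flags

print(flag_ambiguous_terms("Very happy with new clutch master",
                           "Business & Industrial"))
# → [('clutch', 'mechanical sense')]
```

In practice such a table would be generated from category-specific glossaries rather than written by hand, but the matching logic stays the same.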
– Humor: may pose the same challenges as slang above. Jokes, puns, plays on words, etc., will probably get translated literally. Humor is really hard to resolve even for human translators. This is, again, a known issue in SMT, and I’d say probably not a huge one, but the real problem arises when the translation turns out to be offensive.
- “This pooper scooper is the sh*t.“
- “Great book on hooking for the intermediate level. Ms M. is the consummate wool textile educator! A must have for any serious h**ker!”
You may want to create a blacklist of offensive terms, profanities, etc., to stop these words from being displayed to your audience.
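A minimal sketch of such a filter, assuming a hand-maintained blacklist; the terms and the masking scheme below are illustrative only:

```python
import re

# Tiny illustrative blacklist; a production list would be much larger
# and maintained per target language (obfuscated spellings included).
BLACKLIST = {"sh*t", "h**ker", "cheap-@ss"}

def mask_blacklisted(text, mask="***"):
    """Replace blacklisted terms before the text is displayed."""
    for term in BLACKLIST:
        # re.escape keeps characters like * and @ literal.
        text = re.sub(re.escape(term), mask, text, flags=re.IGNORECASE)
    return text

print(mask_blacklisted("This pooper scooper is the sh*t."))
# → This pooper scooper is the ***.
```

Whether you mask, withhold the whole review, or route it for human review is a product decision; the lookup itself is the easy part.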
– Odd sentence endings: it is not unusual to find English reviews with sentences ending in a preposition. Since languages have different sentence structures, prepositions at the end of a sentence tend to cause translation problems in some language combinations.
Other common issues are truncated sentences, elliptical sentences, and symbols and emoji replacing words.
- “Nice case for storing Hot Wheels that are played with.”
- “The modem has already paid for itself compared to renting the one from xFinity. Router on the other hand is $$”
- “It may be mildly entertaining, the ball is a rattle, but the golf club is not.”
In some languages like Spanish, sentences never end in a preposition. You may want to create regular expressions to find unexpected words at the end of a sentence (for example, word$ or word\.).
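A simple last-word check along these lines might look like the sketch below; the preposition list is a small illustrative sample, not an exhaustive one:

```python
import re

# A few common English prepositions; extend as needed.
PREPOSITIONS = {"with", "for", "on", "in", "at", "to", "of", "about"}

def ends_in_preposition(sentence):
    """True if the sentence's last word is a (listed) preposition."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return bool(words) and words[-1] in PREPOSITIONS

print(ends_in_preposition(
    "Nice case for storing Hot Wheels that are played with."))
# → True
```

Flagged sentences could then be rephrased before translation, or at least sampled for quality review in language pairs where this pattern is known to break.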
– Punctuation/Spacing/Typos: they’re everywhere. All these issues may cause untranslated words in the output, simply because the error creates a “word” in the source that the system has not seen before.
- “Just what I needed to blend my penci coloredl artwork using GAMSON blending oil.”
- “Thanku luv it, it’s really been wonderful for health promotion 😉 sleeping so much better & para leaving too!”
- “This productos look beatyfull on the picture but the reality when werecieived them complete diffrent too small plants sprouds cutted on top and the mail pakage one envelope.”
Where possible, it helps to spell-check and fix the source to get better-quality output. At the same time, however, you also want your system to learn how to deal with misspellings: if you train it only on perfectly spelled source text, it will never be able to handle spelling variations.
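A toy version of that source-side cleanup can be built with Python’s standard `difflib`, snapping near-misses onto a known vocabulary. The vocabulary and the similarity cutoff here are illustrative assumptions; a real pipeline would use a full dictionary plus domain terms (brand names, model numbers) so it doesn’t “fix” words it shouldn’t:

```python
import difflib

# Toy vocabulary; illustrative only.
VOCAB = {"just", "what", "i", "needed", "to", "blend", "my",
         "pencil", "colored", "artwork", "using", "blending", "oil"}

def correct_word(word, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged."""
    lower = word.lower()
    if lower in VOCAB:
        return word
    matches = difflib.get_close_matches(lower, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("penci"))   # → pencil
print(correct_word("GAMSON"))  # unknown term left as-is → GAMSON
```

Note the trade-off discussed above: this cleanup belongs on the inference side, while the training data should still contain realistic misspellings.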
– Gender/number agreement: some MT systems seem to lose their train of thought with longer sentences and randomly mix up things like gender and number. A better, if simplistic, explanation is that, currently, most MT systems follow the phrase-based model and don’t process an entire sentence at a time but rather sequences of words (phrases). If one phrase is far from another in a sentence, some level of agreement may be lost. In a nutshell, anaphoric references and coreferences tend to cause translation issues.
- “The(fem) photo(fem) on the(masc) package(masc) is totally deceiving(fem/masc?)”
- “Very informative books(pl) on the American Revolution(sing). Helpful(sing/plural?) for genealogical purposes.”
- “This hat brought many laughs at the annual thanksgiving pie baking contest” (annual thanksgiving, annual pie, annual baking, annual contest?)
LanguageTool will easily find gender and number issues, as it uses part-of-speech tagging to find errors. Unlikely combinations, like a preposition followed by a conjugated verb, are easily detected.
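To illustrate the idea behind such POS-based checks without LanguageTool itself, here is a toy sketch with a hand-rolled lexicon; the tag set, word list, and example sentence are purely illustrative, not how LanguageTool is actually implemented:

```python
# Tiny stand-in lexicon mapping words to part-of-speech tags.
TOY_LEXICON = {
    "on": "PREP", "in": "PREP", "with": "PREP",
    "bought": "VERB_PAST", "played": "VERB_PAST",
    "the": "DET", "photo": "NOUN", "package": "NOUN",
}

def unlikely_pairs(sentence):
    """Flag a preposition immediately followed by a conjugated verb."""
    words = sentence.split()
    tags = [TOY_LEXICON.get(w.strip(".,").lower(), "UNK") for w in words]
    return [(words[i], words[i + 1])
            for i in range(len(tags) - 1)
            if tags[i] == "PREP" and tags[i + 1] == "VERB_PAST"]

# A deliberately garbled sentence of the kind a bad translation produces.
print(unlikely_pairs("The photo on bought the package"))
# → [('on', 'bought')]
```

A real tagger generalizes far beyond a lookup table, of course, but the rule layer on top works much like this: scan the tag sequence for combinations that should not occur.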