This article is based on a quality estimation method I developed and presented at AMTA in 2015. The premise of the method is a different approach to machine translation quality estimation (MTQE) created entirely from a linguist’s perspective. I will briefly cover some basic aspects of QE and summarize the main points of this process, but if you are new to the topic, I encourage you to read this post first.
What is MTQE?
MTQE is a method to automatically provide a quality indication for machine translation output without depending on human reference translations. Traditionally, to determine the quality of any given MT output, one or more reference translations created by human translators (a “gold standard”) were needed. The differences and similarities between the MT output and the human reference translation could then be turned into a score to determine the quality of the output. Examples of these methods are BLEU and NIST.
What’s the purpose of MTQE?
MTQE can be used for several purposes. One is to estimate post-editing effort, i.e., how hard it will be to post-edit a text and how long it may take. QE can provide valuable information in an automated manner, for example, which segments have a very low-quality translation and should be discarded instead of post-edited. It can also answer a very common question: Can I use MT for this?
With QE you can:
- estimate the quality of a translation at the segment/file level,
- target post-editing (choose sections, segments, or files to post-edit),
- discard bad content that makes no sense to post-edit,
- estimate post-editing effort/time,
- compare MT systems to identify the best performing system for a given content,
- monitor a system’s quality progress over time, and more.
Why a Linguist’s Approach?
Standard approaches to QE can involve complex formulas and concepts most linguists are not familiar with, like Naive Bayes, Gaussian processes, neural networks, decision trees, etc. So far, QE has been mostly dealt with by scientists. It is also true that traditional QE models are technically hard to create and implement.
For this reason, I decided to try a different approach, developed entirely from a linguist’s perspective. This implies that this method may have its advantages and disadvantages compared to others, but coming from a linguistic background, the aim was to create a process that translators and linguists in the L10N industry could actually use.
I read in Linguistic Indicators for Quality Estimation of Machine Translations how linguistic and shallow features in the source text, the MT system, and the target text can help estimate the quality of the content. In a nutshell, finding potential issues in these three dimensions of the content can help get an idea of the output quality. These three dimensions are:
- complexity (source text, how complex it is, how difficult it will be for MT to translate it),
- adequacy (the translation itself, how accurate it is), and
- fluency (target text only).
The next step was to identify features in these three dimensions of my content that would provide an accurate estimation of the output quality. After some trial and error, I decided to use this set of features:
- Length: is a certain maximum length exceeded? Is there a significant difference between source and target? The idea here is that the longer a sentence is, the harder it may be for the MT system to get it right.
- Polysemy: words that can have multiple meanings (and therefore, multiple translations). With millions of listings across several categories, this is a big issue for eBay content. For example, if you search for lime on eBay.com, you will get results from Clothing categories (lime color), from Home & Garden (lime seeds), from Health & Beauty (there’s a men’s fragrance called Lime), from Recorded Music (there’s a band called Lime), etc. The key here is that the presence of a polysemous word in the source is an indication of a potential issue. Another key: if a given translation of a source term occurs near certain words, that is also a potential error. Let me make that clearer: “clutch” can be translated (a) as that pedal in your car or (b) as a small handbag; if translation (a) appears in your target next to words like bag, leather, purse, or Hermes, that’s most likely a problem. Here’s an interesting piece on polysemy if you want to learn more.
- Terminology: basically checking that some terms are correctly translated. For eBay content, things like brands, e-commerce typical acronyms, and company terminology are critical. Brands may be tricky to deal with, as some have common names, like Coach or Apple, as opposed to exclusively proper names like Adidas or Nike.
- Patterns: any set of words or characters that can be identified as an error. Patterns can be duplicate words, tripled letters, missing punctuation signs, formal/informal style indicators, words that shouldn’t occur in the same sentence, and more. The use of regular expressions gives you a great deal of flexibility to look for these error patterns. For example, in Spanish, sentences don’t typically end in prepositions, so it’s not hard to create a regular expression that finds ES prepositions at the end of a sentence: (prep1|prep2|prep3|etc)\.$
- Blacklists: terms that shouldn’t occur in the target language. A typical example of these would be offensive words. In the case of languages like Spanish, this is useful to detect regionalisms.
- Numbers: numbers occurring in the source should also appear in the target.
- Automated post-editing: we use a list of known errors in our MT output to create rules for automatic post-editing, i.e., fixing errors automatically mainly with search and replace operations. This list of known errors can be used to identify potential issues.
- Spelling: misspellings.
- Grammar: potential grammar errors, unlikely word combinations, like a preposition followed by a conjugated verb.
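As a rough illustration, a few of these feature checks can be sketched as simple functions. This is a minimal sketch, not the actual implementation: the blacklist, the polysemy example, and the preposition list are hypothetical placeholders, and a real model would load such resources from maintained files.

```python
import re

# Hypothetical resources for illustration only; a real QE model
# would load curated lists per language from external files.
BLACKLIST = {"offensiveword1", "regionalism1"}
ES_PREPOSITIONS = r"(de|en|con|para|por)"

def check_length(source, target, max_words=40, max_ratio=2.0):
    """Flag overly long sentences and large source/target length gaps."""
    s, t = len(source.split()), len(target.split())
    issues = []
    if s > max_words:
        issues.append("source exceeds maximum length")
    if s and t and max(s, t) / min(s, t) > max_ratio:
        issues.append("large source/target length difference")
    return issues

def check_blacklist(target):
    """Flag terms that should not occur in the target language."""
    words = set(re.findall(r"\w+", target.lower()))
    return [f"blacklisted term: {w}" for w in words & BLACKLIST]

def check_numbers(source, target):
    """Numbers occurring in the source should also appear in the target."""
    missing = set(re.findall(r"\d+", source)) - set(re.findall(r"\d+", target))
    return [f"number missing in target: {n}" for n in sorted(missing)]

def check_patterns(target):
    """Regex-based error patterns: duplicate words, sentence-final prepositions."""
    issues = []
    if re.search(r"\b(\w+) \1\b", target, re.IGNORECASE):
        issues.append("duplicate word")
    if re.search(rf"\b{ES_PREPOSITIONS}\.$", target):
        issues.append("sentence ends in a preposition")
    return issues
```

Each function returns a list of potential issues for a segment; adding a new feature is just a matter of adding another small check function.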
After some trial and error runs, I discarded ideas like named entity recognition and part-of-speech tagging. I couldn’t get any reliable information that would help with the estimation, but this doesn’t mean these two can be completely discarded as features. They would, of course, introduce a higher level of complexity to the method but could yield positive results. This list is not final.
All these features, with all their checks, make up your QE model.
How do you use the model?
The idea is simple, let me break it down for you:
- The goal is to get a score for each segment that can be used as an indication of the quality level.
- The presence of any of the above-mentioned features indicates a potential error.
- Each error can be assigned a number of points, a certain weight. (During my tests I assigned one point to each type of error, but this can be customized.)
- The number of errors is divided by the number of words to obtain a score.
- The ideal score, meaning no potential errors were detected, is 0.
Quality estimation must be automatic; it makes no sense to check for each of these features manually. A very easy and inexpensive way to find potential issues is using Checkmate, which also integrates LanguageTool, a spelling and grammar checker. Both are open source.
There is a way to account for each of the mentioned linguistic features in Checkmate: terminology and blacklists can be set up in the Terminology tab, spelling and grammar in the LanguageTool tab, patterns can be created in the Patterns tab, etc. The set of checks you create can be saved as a profile and be reused. You just need to create a profile once, and you can update it when necessary.
Checkmate will verify one or more files at the same time, and display a report of all potential issues found. By knowing how many errors were detected in a file, you can get a score at the document level.
Getting scores at the segment level involves an extra step. What we need at this point is to add up all the potential errors found for each segment (every translation unit is assigned an ID by Checkmate, and that makes the task easier), count the number of words in each segment, and divide those values to get scores. All the necessary data can be taken from Checkmate’s report, which is available in several formats.
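The aggregation step can be sketched as follows. This is a simplified stand-in for my Excel/VBA template, and the input format is an assumption: it takes a flat list of flagged segment IDs (Checkmate assigns each translation unit an ID) plus a word count per segment, rather than parsing an actual Checkmate report.

```python
from collections import Counter

def segment_scores(flagged_ids, word_counts):
    """flagged_ids: one segment ID per potential issue found.
    word_counts: {segment_id: number of words in that segment}.
    Returns a per-segment score (errors / words); 0 is the ideal score."""
    errors = Counter(flagged_ids)
    return {seg: errors.get(seg, 0) / words
            for seg, words in word_counts.items() if words > 0}

scores = segment_scores(
    flagged_ids=["tu1", "tu1", "tu3"],          # hypothetical report data
    word_counts={"tu1": 10, "tu2": 8, "tu3": 6},
)
# tu1 -> 0.2, tu2 -> 0.0, tu3 -> 0.1666...
```

A document-level score falls out of the same data: total errors divided by total words.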
To carry out this step of the process with little effort, I created an Excel template and put together a VBA macro that, after you copy and paste the contents of the Checkmate report, gets the job done for you. The results should look similar to this, with the highest and lowest scores in red and green:
Several tests were run to check the effectiveness of this approach. We took content samples of roughly the same size with different levels of quality, from perfect (good quality human translation) to very poor (MT output with extra errors injected). Each sample was post-edited by two post-editors, recording the time required for each sample. Post-editors didn’t know that the samples had different levels of quality. At the same time, we obtained the QE score of each sample.
Results showed that the post-editing time and the score were aligned. In the example below, 5 Spanish samples were post-edited. Sample 5 was the gold standard (human translation) and Sample 7 the worst-quality one (due to a file naming error, there’s no Sample 9). All these samples were about 1,000 words. Red bars represent the time taken by post-editor #1 to post-edit each sample; green bars are for post-editor #2. The blue line represents the score obtained for each sample.
With the help of colleagues, similar tests were run for 3 additional languages (BPT, RU, and ZH) with similar results. The only language with inconsistent results was Chinese. We later discovered that Checkmate had some issues with double-byte characters. Also, the set of features we had for Chinese was rather small compared to other languages.
Challenges of using this model
Depending on the nature of the content, a high number of false positives may occur. For example, some brand names may be considered spelling errors by certain spellcheckers. LanguageTool lets you add such terms to an ignore list so they are not incorrectly flagged. Overall, it’s virtually impossible to avoid false positives in quality checks in any language; the goal is to minimize them as much as possible.
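The ignore-list idea is simple enough to sketch. The brand names below are just illustrative examples; in practice the list would mirror whatever terms your spellchecker keeps flagging incorrectly.

```python
# Hypothetical ignore list: brand names a spellchecker may wrongly flag.
IGNORE_LIST = {"coach", "apple", "adidas"}

def filter_false_positives(flagged_terms, ignore_list=IGNORE_LIST):
    """Drop flagged terms that are on the ignore list (case-insensitive)."""
    return [t for t in flagged_terms if t.lower() not in ignore_list]

filter_false_positives(["Coach", "misspeling"])  # ["misspeling"]
```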
Another challenge is matching a score to a post-editing effort measurement: it’s not easy to come up with a metric that accurately predicts the number of words per second that can be post-edited given a certain score. It’s surely not impossible, but a lot of data is required for precise metrics.
The model is flexible enough to let you assign a custom weight to each feature, which helps keep the results reliable for your particular content.