Quality Estimation (QE) is a method for automatically providing a quality indication for machine translation (MT) output without depending on human reference translations. In simpler terms, it is a way to find out how good or bad the translations produced by an MT system are, without human intervention.
Before going into more detail on QE, it is worth clarifying the difference between evaluation and estimation. There are two main ways to evaluate the quality of MT output: human evaluation (a person checks the translation and provides feedback) and automatic evaluation (a method produces a score for the translation quality without human intervention).
Traditionally, automatically evaluating the quality of any given MT output requires a reference translation created by a human translator. The differences and similarities between the MT output and the reference translation can then be turned into a score that reflects the quality of that output. This is the approach followed by metrics such as BLEU and NIST.
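To make the reference-based idea concrete, here is a minimal sketch of clipped n-gram precision, which is one ingredient of BLEU. It is not full BLEU: there is no brevity penalty and no combination of several n-gram orders. The function name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def clipped_ngram_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of n-grams in the MT output that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference ("clipping"), so repeating a correct word does not inflate
    the score.
    """
    # Naive lowercase whitespace tokenization (an assumption, not a standard).
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()

    hyp_ngrams = Counter(
        tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )

    total = sum(hyp_ngrams.values())
    if total == 0:
        # The output is empty or shorter than n tokens.
        return 0.0

    matched = sum(min(count, ref_ngrams[ngram]) for ngram, count in hyp_ngrams.items())
    return matched / total
```

Note that the function cannot run without the `reference` argument, which is exactly the dependency on a human translation that quality estimation removes.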
The main differentiator of quality estimation is that it does not require a human reference translation.
QE is a prediction of the quality based on certain features. These features can be, for example, the number of noun or prepositional phrases in the source and target (and their difference), the number of named entities (names of places, people, companies, etc.), and many more. With these features, using techniques like machine learning, a QE model can be created to obtain a score that represents the estimation of the translation quality.
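As a rough sketch of how features and machine learning fit together, the example below extracts a few shallow features from a source/target pair and predicts a score with a simple k-nearest-neighbours regressor. Everything here is an assumption made for illustration: the feature set uses crude proxies (capitalized tokens standing in for named entities, since real noun-phrase or entity counts would need a parser or tagger), the function names are hypothetical, and a production QE model would use richer features, a stronger learner, and a training set labeled by humans.

```python
import math
from typing import List, Sequence, Tuple


def extract_features(source: str, target: str) -> List[float]:
    """Turn a source/target pair into a small numeric feature vector."""
    src = source.split()
    tgt = target.split()

    # Crude named-entity proxy: tokens that start with an uppercase letter.
    src_caps = sum(1 for tok in src if tok[:1].isupper())
    tgt_caps = sum(1 for tok in tgt if tok[:1].isupper())

    # Tokens containing digits (sizes, model numbers, quantities).
    src_nums = sum(1 for tok in src if any(ch.isdigit() for ch in tok))
    tgt_nums = sum(1 for tok in tgt if any(ch.isdigit() for ch in tok))

    return [
        float(len(src)),                # source length in tokens
        float(len(tgt)),                # target length in tokens
        len(tgt) / max(len(src), 1),    # target/source length ratio
        float(abs(src_caps - tgt_caps)),  # mismatch in capitalized tokens
        float(abs(src_nums - tgt_nums)),  # mismatch in numeric tokens
    ]


def predict_quality(
    features: Sequence[float],
    training_set: Sequence[Tuple[Sequence[float], float]],
    k: int = 3,
) -> float:
    """Average the quality labels of the k training examples closest to `features`."""
    if not training_set:
        raise ValueError("training_set must contain at least one labeled example")
    nearest = sorted(training_set, key=lambda ex: math.dist(features, ex[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)
```

In use, `training_set` would hold feature vectors for segments whose quality has already been judged, for example by post-editors, and `predict_quality(extract_features(src, mt), training_set)` would then estimate the score of a new, unjudged segment without any reference translation.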
At eBay, we use MT to translate search queries, item titles and item descriptions. To train our MT systems, we work with vendors that help us post-edit content. Due to the challenging nature of our content (user-generated, diversity of categories, millions of listings, etc.), a method to estimate how much effort post-editing will require definitely adds value. QE can provide this information in an automated manner. For example, one can estimate how many segments have such a low-quality translation that they could simply be discarded instead of post-edited.
So, what can you do with the help of QE? First and foremost, estimate the quality of translations at the segment and file level. Segment-level scores can help you target post-editing, focusing only on content that makes sense to post-edit. You can also estimate post-editing effort or time: it is reasonably safe to assume that segments with a low quality score take more time to post-edit. It is also possible to compare MT systems based on QE scores and see which one performs better. This is especially helpful if you are trying to decide which engine you should use, or whether a new version of an engine is working better than its predecessor.
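These uses can be sketched in a few lines once segment-level scores are available. The example below assumes scores between 0 and 1 where higher means better; the threshold value, the function names, and the use of a plain mean as the file-level score are all hypothetical choices for illustration, not a description of a particular QE tool.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

ScoredSegment = Tuple[str, float]  # (segment id, estimated quality in [0, 1])


def triage_segments(
    scored_segments: Sequence[ScoredSegment],
    discard_below: float = 0.3,  # hypothetical cut-off; tune it on your own data
) -> Dict[str, List[str]]:
    """Split segments into those worth post-editing and those to discard."""
    buckets: Dict[str, List[str]] = {"discard": [], "post_edit": []}
    for segment_id, score in scored_segments:
        key = "discard" if score < discard_below else "post_edit"
        buckets[key].append(segment_id)
    return buckets


def file_level_score(scored_segments: Sequence[ScoredSegment]) -> float:
    """Summarize a file as the mean of its segment scores (raises on an empty file)."""
    return mean(score for _, score in scored_segments)


def better_system(scores_by_system: Dict[str, Sequence[ScoredSegment]]) -> str:
    """Return the name of the MT system with the highest file-level score."""
    return max(scores_by_system, key=lambda name: file_level_score(scores_by_system[name]))
```

For a fair comparison, `better_system` should be given scores for the same source segments translated by each engine. A small difference in mean score may not be meaningful, so it is worth checking it against a sample of human judgments before switching engines.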