Corpus Analysis – Part I

The Introduction

As you probably know, Statistical Machine Translation (SMT) needs considerably big amounts of text data to produce good translations. We are talking about millions of words. At the same time, SMT has the ability to translate millions of words relatively fast (VERY fast, in comparison to human translators). In this scenario, and speaking mainly from a linguist’s perspective, the challenge is how can one make any sense of all of these millions of words? What do you do if you want to find out whether a corpus is good enough to be used in your MT system? How do you know what to improve if you realize a corpus is not good? How can you know what are the main topics covered in your corpus?

It’s unrealistic to try to understand your corpus by reading every single line or word.

Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.

In this series, I will cover some analysis techniques and tips that I believe are useful and effective to understand your corpus better. For simplicity purposes, I will call corpus to any text sample, either used to produce translations or being the result of a translation-related process.

The Tool

AntConc. There may be some other tools out there, but AntConc is, as far as I know, the best and the most widely used tool for corpus analysis. As defined by its website, AntConc is a freeware corpus analysis toolkit for concordancing and text analysis. It’s really simple to use, it contains 7 main tools for analysis and has several interesting features. I will not go into more details about the tool as I hope to illustrate how it can be used in the following examples.

AntConc can be used in Windows, MAc and Linux.

Getting a Word List

A great way to know more about your corpus is getting a list of all the words that appear in it. AntConc can easily create a list with all the words that appear in your corpus, and show important additional information about them, like how many tokens are there and the frequency of each. Knowing which words appear in your corpus can help you identify what it is about; the frequency can help you determine which are the most important words.

You can also see how many tokens (individual words) and word types (unique words) are there in a corpus. This is important to determine how varied (how many different words) your text is.

To create a Word List, after loading your corpus file(s), click the Word List tab and click Start. You’ll see a list of words sorted by frequency by default. You can change the sorting order in the Sort by drop down. Besides frequency, you can sort alphabetically and by word ending.

Frequency is often a good indicator of important words – it makes sense to assume that tokens that appear many times have a more relevant role in the text.

But what about prepositions or determiners and other words that don’t really add any meaning to the analysis? You can define a word list range, i.e., you can add stopwords (words you want to exclude from your analysis) individually or entire lists.

Word lists are also a very good resource to create glossaries. You can either use the frequency to identify key words or just go through the list to identify words that may be difficult to translate.

Keyword Lists

This feature allows you to compare a reference corpus and a target corpus, and calculate words that are unusually frequent or infrequent. What’s the use for this? Well, this can help you get a better insight on post-editing changes, for example, and try to identify words and phrases that were consistently changed by post-editors. It’s safe to assume that the MT system is not producing a correct translation for such words and phrases. You can add these to any blacklists, QA checks, or automated post-editing rules you may be using.

A typical scenario would be this: you use your MT output as target corpus, and post-edited/ human translation (for the same source text, of course) as source corpus; the comparison will tell you which words are frequent in the MT output that are not so frequent in the PE/HT content.

Vintage here is at the top of the list. In my file with MT output segments, it occurs 705 times. If I do the same with the post-edited content, there are 0 occurrences. This means post-editors have consistently changed “vintage” to something else. It’s safe to add this word to my blacklist then, as I’m sure I don’t want to see it in my translated content. If I know how it should be translated, it could be part of an automated post-processing rule. Of course, if you re-train your engine with the post-edited content, “vintage” should become less common in the output.

To add a reference corpus, in the Tool Preferences menu, select Add Directory or Add Files to choose your corpus file(s). Click the Load button after adding your files.


Collocates are simply words that occur together. This feature allows you to search for a word in a corpus and get a list of results that show other words that appear next to the search term.  You can see how frequent a collocate is and also choose if results should include collocates appearing to the right of the term, to the left, or both. What’s really interesting about this is that it can help you find occurrences of words that occur nearyour search term, and not necessarily next to it. For example, in eBay’s listing titles, the word clutch can be sometimes mistranslated. It’s a polysemous word and it can be either a small purse or an auto part. I can do some analysis on the collocate results for clutch (auto parts) and see if terms like bag, leather, purse, etc., occur near it.

You can also select how frequent a collocate needs to be in order to be included in the results.

This is very useful to spot unusual combinations of words, as well. It obviously depends on the language, but a clear example could be a preposition followed by another preposition.

To use this feature, load your corpus files, and click the Collocates tab. Select the From and To ranges – values here contain a number and a letter: L(eft)/R(ight). The number indicates how many words away from the search terms should be included in the results,and L/R indicates the direction in which collocates must appear. You can also select a frequency value. Enter a search term and click start.

All the results obtained with any of the tools AntConc provides can be exported into several formats. This allows you to take your data and process it in any other tool.

In the next post, I will cover the N-grams tool, by far one of the most useful features and Concordance.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s