Corpus Analysis – Part II

In the first part of the series, I covered the importance of corpus analysis and how a tool called AntConc can be used to learn more about your corpus in a smart, efficient way. This second part focuses on the Clusters/N-Grams feature in AntConc. Tips and techniques to use it effectively are included.

Clusters/N-Grams

This is perhaps one of the most useful features in AntConc. Why? Because it allows you to find patterns. Remember that, when working with MT output, most of the times it’s not realistic to try to find and/or fix every single issue. There may be tons of errors with varying levels of severity in the MT output (especially considering the volumes of content processed by MT), so it does make sense to focus first on those that occur more frequently or that have a higher severity.

Here’s a simple example: let’s assume that by looking at your MT output you realize that your MT system is translating the word “inches” into “centimeters”, without making any changes to the numbers that usually precede that word, i.e., 10 inches is being consistently translated as 10 centimeters. You could try to find and fix 1 centimeter, 2 centimeters, 3 centimeters, etc. Rather, a much better choice would be to identify a pattern: “any number” followed by the word “centimeter” should be instead “any number” “inches”. This is an oversimplification, but the point is that identifying an error pattern is a much better approach than fixing individual errors.

Once you have identified a pattern, the next step is to figure out how you can create some sort of rule to find/fix such pattern. Simple patterns made of word or phrases are pretty straightforward – find all instances of “red dress” and replace with “blue dress”, for example. Now, you can take this to the next level by using regular expressions. Going back to the inches example you could easily find all instances of “any number” followed by centimeters with a simple regex like \d+ centimeters, where \d stands for any number and the + signs stands for 1 or more (numbers).

Clusters

Using the Clusters/N-Grams tool helps you find strings of text based on their length (number of tokens or words), frequency, and even the occurrence of any specific word. Once you open your corpus, AntConc can find a word or a pattern in it and cluster the results in a list. If you search for a word in your corpus, you can opt to see words that precede or follow the word you searched for.

Results can be sorted:

  • by frequency (ideal to find recurring patterns – the more frequent a pattern is, the more relevant it might be),
  • by word (ideal to see how your MT system is dealing with the translation of a particular term),
  • by word end (sorted alphabetically based off the last word in the string),
  • by range (if your corpus is composed of more than one file, in how many of those files the search term appears), and
  • by transitional probability (how likely it is that word2 will occur after word1; e.g., the probability of “Am” occurring after “I” is much higher than “dishwasher” occurring after “I”.).

Let’s see how the Clusters tool can be used. I’ve loaded my corpus in AntConc and I want to see how my system is dealing with the word case. Under the Cluster/Ngrams tab, let’s check the box Word, as I want to enter a specific search term. I want to see clusters that are 3 to 4 words long. And very important here, the Search Term Position option: if you select Left, your search term will be the first word in the cluster; if you select Right, it’ll be the last one instead. Notice in the screenshots, how the Left/Right option selection affects the results.

On left

On right

We can also use regular expressions here for cases in which we need more powerful searches. Remember the example about numbers and inches above? Well, numbers, words, spaces, letters, punctuation – all these can be covered with regular expressions.

Let’s take a look at a few examples:

Here, I want to see all 2-word clusters that start with the word “original”, so I’m going to use a boundary (\b) before “original”. I don’t know the second word, it’s actually what I want to find out, so I’m going to use \w, which stands for “any word”. All my results will then have the following form: original+word.

Now, I want to see all clusters, regardless of their frequency, that contain the words “price” OR “quality”. So, in addition to adding the boundaries, I’m going to separate these words with | that simply stands for “or”.

This is really useful when you want to check how the system is dealing with certain words – there’s no need to run separate searches since you can combine any number of words with | between them. Check the Global Settings menu for reference.

For seasoned regex users, note that regex capabilities in AntConc are pretty modest and that some operators are not standard.

N-grams

If you are not familiar with this term, in a nutshell, an n-gram is any word or sequence of words of any size; a 1-gram is composed of one element, a 2-gram is composed of 2 elements, etc. It’s a term that defines the length of a string rather than its content.

What’s great about this feature is that you can find recurring phrases without specifying any search terms. That is, you can easily obtain a list of, for example, all the 6-grams to 3-grams that occur more than 10 times in your corpus. Remember that clusters work in the opposite way – you find words that surround a specific search term.

The n-gram search is definitely an advantage when you don’t know your corpus very well and you still don’t know what kind of issues to expect. It’s usually a good choice if it’s the first time you are analyzing a corpus – it finds patterns for you: common expressions, repeated phrases, etc.

When working with n-grams, it’s really important to consider frequency. You want to focus your analysis on n-grams that occur frequently first, so you can cover a higher number of issues.

What can you do with your findings, besides the obvious fact of knowing your corpus better? You can find recurring issues and create automated post-editing rules.Automated post-editing is a technique that consists in applying search and replace operations on the MT output. For instance, going back to our initial inches vs. centimeters example, you could create a rule that replaces all instances of number+centimeters with number+inches. Using regular expressions, you can create very powerful, flexible rules. Even though this technique was particularly effective when working with RBMT, it’s still pretty useful for SMT between training cycles (the process in which you feed new data to your system so it learns to produce better translations).

You can also create blacklists with issues found in your MT output. A blacklist is simply a list of terms that you don’t want to see in your target so, for example, if your system is consistently mistranslating the word “case” as a legal case instead of a protective case, you can add the incorrect terms to the blacklists and easily detect when they occur in your output. In the same way, you can create QA checks to run in tools like Checkmate or Xbench.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s