Corpus Analysis III – Python

In the previous posts of this series, we discussed how you can use AntConc to understand your corpora better. This time, we'll take a look at how Python can also be used to perform some corpus analysis techniques.

For those of you not familiar with Python, it’s a programming language that has been gaining more and more popularity for several reasons: it’s easy to learn and easy to read, it can be run in different environments and operating systems, and there’s a significant number of modules that can be imported and used.

Modules are files that contain classes, functions, and other data. Without getting too technical, a module is code that has already been written and that you can reuse, instead of writing it from scratch. Quick example: if you want to write a program that uses regular expressions, you can simply import the re module and Python will know how to deal with them thanks to the code in the module.
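For instance, here's a minimal sketch of the re module in action (the pattern and sample sentence are just illustrations):

```python
import re

# Find every word that starts with a capital letter
sentence = "Python and NLTK make corpus analysis easier"
matches = re.findall(r"\b[A-Z]\w*", sentence)
print(matches)  # ['Python', 'NLTK']
```

All the pattern-matching logic lives in the module; we just import it and call its functions.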

Enter Natural Language Processing Toolkit

Modules are the perfect segue to introduce the Natural Language Processing Toolkit (NLTK). Let me just steal the definition from their site: “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries […].”

Using Python and NLTK, there are quite a few interesting things you can do to learn more about your corpora. I have to assume you are somewhat familiar with Python (not an expert!), as a full tutorial would simply exceed the purpose of this post. If you want to learn more, there are really good courses on Coursera, Udemy, YouTube, etc. I personally like Codecademy's hands-on approach.

The Mise en Place

To follow these examples, you’ll need the following installed:

  • Python 3.5 (version 2.7 works too, but some of these examples may need tweaking)
  • NLTK
  • Numpy (optional)

To get corpora, you can follow these steps:

You have two options here: you can choose to use corpora provided by NLTK (ideal if you just want to try these examples, see how Python works, etc.) or you can use your own files. Let me walk you through both cases.

If you want to use corpora from NLTK, open your Python’s IDLE, import the nltk module (you’ll do this every time you want to use nltk) and then download the corpora:

>>> import nltk
>>> nltk.download()

A new window will open, and you’ll be able to download one or more corpora, as well as other packages. You can find the entire list on the NLTK website.

When working in Python, you can import (a) all available corpora at the same time or (b) a single corpus. Notice that (a) will import books (like Moby Dick, The Book of Genesis, etc…)

a	>>> from nltk.book import *
b	>>> from nltk.corpus import brown

If you want to use your own files, you’ll have to tell Python where they are, so it can read them. Follow these steps if you want to work with one file (remember to import nltk!):

>>> f = open(r'c:\reviews.txt', 'r')
>>> raw = f.read()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)

Basically, I’m telling Python to open my file called reviews.txt saved in C: (the r in front of the path marks a raw string, so the backslashes in the path aren’t treated as escape characters). I’m also telling Python I want to read, not write to, this file.

Then, I’m asking Python to read the contents of my file and store them in a variable called raw, to tokenize that content (“identify” the words in it) and store the tokens in a variable called tokens, and finally to wrap those tokens in an nltk.Text object called text. Don’t get scared by the technical lingo at this point: a variable is just a name that we assign to a bucket where we store information, so we can refer to it later.

What if you have more than one file? You can use the Plain Text Corpus Reader to deal with several plaintext documents. Note that, if you follow the example below, you’ll need to replace the placeholder values with your own information, like your path, your file extension, etc…

>>> from nltk.corpus import PlaintextCorpusReader
>>> files = ".*\.txt"
>>> corpus0 = PlaintextCorpusReader(r"C:/corpus", files)
>>> corpus = nltk.Text(corpus0.words())

Here, I’m importing PlaintextCorpusReader, telling Python that my files have the .txt extension and where they are stored, and finally storing the data from my files in a variable called corpus.

You can test if your data was correctly read just by typing the name of the variable containing it:

>>> corpus
<Text: This black and silver Toshiba Excite is a...>
>>> text
<Text:`` It 's a Motorola StarTac , there...>

corpus and text are the variables I used to store data in the examples above.

Analyzing (finally!)

Now that we are all set up and have our corpora imported, let’s see some of the things we can do to analyze them:

We can get a word count using the len function. It is important to know the size of our corpus, basically to understand what we are dealing with. What we’ll obtain is a count of all words and symbols, repeated ones included.

>>> len(text)
>>> len(corpus)

If we wanted to count unique tokens, excluding repeated elements:

>>> len(set(corpus))

With the “set” function, we can get a list of all the words used in our corpus, i.e., a vocabulary:

>>> set(corpus)
{'knowledge', 'Lord', 'stolen', 'one', ':', 'threat', 'PEN', 'gunslingers', 'missions', 'extracting', 'ensuring', 'Players', 'player', 'must', 'constantly', 'except', 'Domino', 'odds', 'Core', 'SuperSponge', ...}

A list of words is definitely useful, but it’s usually better to have them alphabetically sorted. We can also do that easily:

>>> sorted(set(corpus))
["'", '(', ').', ',', '-', '--', '.', '3', '98', ':', 'Ancaria', 'Apocalypse', 'Ashen', 'Barnacle', 'Bikini', 'Black', 'Bond', 'Bottom', 'Boy', 'Core', 'Croft', 'Croy', 'D', 'Dalmatian', 'Domino', 'Egyptian', ...]

Note that Python will put capitalized words at the beginning of your list.
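If you’d rather have an alphabetical order that ignores case, you can pass a key function to sorted. A quick sketch with made-up tokens:

```python
# A tiny sample vocabulary (illustrative only)
tokens = {'Apocalypse', 'apple', 'Black', 'bond'}

# Default sort groups capitalized words first
print(sorted(tokens))                 # ['Apocalypse', 'Black', 'apple', 'bond']

# Lowercasing each word for comparison mixes them together
print(sorted(tokens, key=str.lower))  # ['Apocalypse', 'apple', 'Black', 'bond']
```

The key function is only used for comparison; the original spelling of each token is preserved in the output.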

We can check how many times a word is used on average, which relates to what we call lexical richness. From a corpus analysis perspective, it’s good for a corpus to be lexically rich as, theoretically, the MT system will “learn” how to deal with a broader range of words. This indicator can be obtained by dividing the total number of words by the number of unique words (so the lower the result, the richer the corpus):

>>> len(text)/len(set(text))
>>> len(corpus)/len(set(corpus))
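As a sanity check, here’s the same calculation on a tiny made-up token list:

```python
# Six tokens, five of them unique
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
richness = len(tokens) / len(set(tokens))
print(richness)  # 1.2, i.e., each word is used 1.2 times on average
```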

If you need to find out how many times a word occurs in your corpus, you can try the following (notice this is case-sensitive):

>>> text.count("leave")
>>> text.count("Leave")
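Since count is case-sensitive, “leave” and “Leave” are tallied separately. If you want a single case-insensitive total, one option (sketched here with sample tokens) is to lowercase each token before comparing:

```python
# Sample token list (illustrative only)
tokens = ['Leave', 'it', 'or', 'leave', 'it', 'LEAVE']

# Lowercase each token so every casing of "leave" matches
total = sum(1 for t in tokens if t.lower() == 'leave')
print(total)  # 3
```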

One key piece of information you probably want to get is the number of occurrences of each token or vocabulary item. As we mentioned previously, frequent words say a lot about your corpus and can be used to create glossaries, etc. One way to do this is using frequency distributions. You can also use this method to find how many times a certain word occurs.

>>> fdistcorpus = nltk.FreqDist(corpus)
>>> fdistcorpus
FreqDist({',': 33, 'the': 27, 'and': 24, '.': 20, 'a': 20, 'of': 17, 'to': 16, '-': 12, 'in': 8, 'is': 8, ...})

>>> fdistcorpus['a']
20

A similar way to do this is using the vocab function:

>>> text.vocab()
FreqDist({',': 2094, '.': 1919, 'the': 1735, 'a': 1009, 'of': 978, 'and': 912, 'to': 896, 'is': 597, 'in': 543, 'that': 518, ...})

Conversely, if you wanted to see the words that appear only once (known as hapaxes):

>>> fdistcorpus.hapaxes()
['knowledge', 'opening', 'mystical', 'return', 'bound']

If you only want to see, for example, the 10 most common tokens from your corpus, there’s a function for that:

>>> fdistcorpus.most_common(10)
[(',', 33), ('the', 27), ('and', 24), ('.', 20), ('a', 20), ('of', 17), ('to', 16), ('-', 12), ('in', 8), ('is', 8)]

We can have the frequency distribution results presented in many ways:

  • one column:
>>> for sample in fdistcorpus:
	print(sample)

Here, I’m using a for loop. Loops are typically used when you want to repeat or iterate an action. In this case, I’m asking Python, for each token or sample in my corpus, to print said sample. The loop will perform the same action for all the tokens, one at a time, and stop when it has covered every single one of them.

  • tab-separated:
>>> fdistcorpus.tabulate()
   ,  the  and    .    a   of   to    -   in   is  ...
  33   27   24   20   20   17   16   12    8    8  ...
  • chart (requires the matplotlib package):
>>> fdistcorpus.plot()

There’s a lot you can do with Python and NLTK. In the next post, I’ll cover collocations, ngrams, and more, so stay tuned.

If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.

