My Language Lead Bot

I recently started a new initiative. My goal was to create a language lead bot using my knowledge of Python and Natural Language Processing.

Before a localization project starts, language leads are usually required to do a linguistic analysis of the content. The language lead advises on different aspects of the language, such as the nature and complexity of the content (is it technical? is it marketing? was something similar translated before?), terminology (do we need a glossary? do we have a glossary we can leverage?), style and tone (formal/informal? what’s the target audience?), and many others.

Relevant information from this analysis is then used to put together language resources like style guides.

So, the goal of the Language Lead Bot is to assist in these tasks, checking information automatically and generating reports.

This very first version is an MVP, but it already includes the following features:


  • word count
  • sentence count
  • paragraph count
  • character count
  • token count
  • unique word count
  • average sentence length
  • longest sentence
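
As an illustration of how counts like these can be computed, here is a minimal sketch that uses only the Python standard library. It is not the bot's actual code: the function name basic_stats and the regex-based splitting are assumptions made for this example, and the naive sentence splitter will trip over abbreviations such as "e.g.".

```python
import re

WORD_RE = r"[A-Za-z]+(?:'[A-Za-z]+)?"  # alphabetic words, optional internal apostrophe


def basic_stats(text):
    """Return simple corpus counts using regex-based splitting (an approximation)."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: split on whitespace that follows ., ! or ? (naive).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Tokens: runs of word characters plus individual punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Words: alphabetic tokens only, lowercased so "The" and "the" match.
    words = [w.lower() for w in re.findall(WORD_RE, text)]
    # Sentence length is measured in words.
    lengths = [len(re.findall(WORD_RE, s)) for s in sentences]

    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "characters": len(text),
        "tokens": len(tokens),
        "unique_words": len(set(words)),
        "average_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "longest_sentence": max(
            sentences, key=lambda s: len(re.findall(WORD_RE, s)), default=""
        ),
    }
```

Calling basic_stats() on the text of a corpus file returns a dictionary that can be printed line by line as a small report.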


  • most frequent words
  • least frequent words
  • frequent nouns (+ count)
  • frequent adjectives (+ count)
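
To give an idea of how the frequency features can work, here is a small sketch built on collections.Counter from the standard library. The function names and the ignore parameter are hypothetical, chosen for this example only. Noun and adjective counts are left out because they need a part-of-speech tagger, which this sketch does not include.

```python
import re
from collections import Counter


def word_frequencies(text, ignore=frozenset()):
    """Count lowercased words, skipping anything in the ignore set."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(w for w in words if w not in ignore)


def most_frequent(freqs, n=10):
    """Return the n most common (word, count) pairs."""
    return freqs.most_common(n)


def least_frequent(freqs, n=10):
    """Return the n rarest (word, count) pairs, ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda item: (item[1], item[0]))[:n]


def hapaxes(freqs):
    """Return the words that occur exactly once, in alphabetical order."""
    return sorted(word for word, count in freqs.items() if count == 1)
```

Passing a set of stopwords or project-specific terms as ignore keeps them out of every list that is built from the resulting counter.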

Glossaries and Dictionaries:

  • glossary matching
  • definitions for the least frequent words


  • hapaxes are spellchecked
  • list of ignore terms
  • stopwords

Lexical information:

  • frequent collocations
  • lexical richness
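
The two lexical features can be approximated in a few lines. The sketch below assumes that lexical richness means the type-token ratio (unique words divided by total words) and that collocations are simply the most frequent adjacent word pairs; the bot may well use different measures. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_richness(words):
    """Type-token ratio: unique words divided by total words."""
    return len(set(words)) / len(words) if words else 0.0


def frequent_bigrams(words, stopwords=frozenset(), min_count=2):
    """Return adjacent word pairs seen at least min_count times, most frequent first.

    Pairs containing a stopword are skipped. Punctuation is already gone,
    so a pair may span a sentence boundary.
    """
    pairs = zip(words, words[1:])
    counts = Counter(
        pair for pair in pairs
        if pair[0] not in stopwords and pair[1] not in stopwords
    )
    return [(pair, count) for pair, count in counts.most_common() if count >= min_count]
```

A ratio close to 1 means that almost every word is new, while a low ratio points to repetitive content, which is often good news for translation memory leverage.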


How to use:

  • In the .py file, update the paths to your files (corpus, ignore list, glossary) as required.
  • Running the code will generate a report with all available features.
  • Some features are simple and can be run directly from the relevant print() line.

Individual functions:

  • fdistlen(): reports the number of words of each length in your corpus
  • findhapaxes(): prints the 50 least frequent words in your corpus
  • hapaxdef(): prints hapaxes (words that appear only once in your corpus), each followed by its definition
  • spell(): spellchecks hapaxes; if confidence is not 100%, it presents all spelling suggestions
  • longestsent(): prints the longest sentence in your corpus
  • averagesentlen(): reports the average sentence length
  • findNN(): prints frequent nouns
  • findJJ(): prints frequent adjectives
  • glossarymatch(): finds words from your corpus that are included in a glossary (CSV with source,target columns)
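
To make two of these functions more concrete, here is a rough standard-library sketch of a glossary lookup and a word-length report. It is an illustration only, not the code behind glossarymatch() or fdistlen(): the names load_glossary, match_glossary and words_by_length are hypothetical, and the sketch only matches single-word glossary terms.

```python
import csv
import io
import re
from collections import Counter


def load_glossary(csv_file):
    """Read source,target rows from an open CSV file into a dict.

    Keys are lowercased source terms; rows with fewer than two columns
    or an empty source term are skipped.
    """
    glossary = {}
    for row in csv.reader(csv_file):
        if len(row) >= 2 and row[0].strip():
            glossary[row[0].strip().lower()] = row[1].strip()
    return glossary


def match_glossary(text, glossary):
    """Return the glossary entries whose source term occurs in the text."""
    words = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return {term: target for term, target in glossary.items() if term in words}


def words_by_length(text):
    """Map each word length to the number of words of that length."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return dict(sorted(Counter(len(w) for w in words).items()))


# Demo with an in-memory CSV; a real glossary would be opened with
# open(path, newline="", encoding="utf-8"), where path is your own file.
demo_glossary = load_glossary(io.StringIO("file,archivo\nprinter,impresora\nsave,guardar\n"))
demo_matches = match_glossary("Save the file before closing.", demo_glossary)
```

In the demo, demo_matches contains the entries for "file" and "save", because "printer" does not occur in the sample sentence.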


The code is written in Python 3 and is available on GitHub:

For the time being, it only works for English (EN) content.

