
Conjugation

Conjugation refers to how a verb changes to show a different person, tense, number or mood. The term conjugation is applied only to the inflection of verbs, and not of other parts of speech (inflection of nouns and adjectives is known as declension).

For instance, the verb ‘to be’ is a good example of conjugation because it is highly irregular:

I am
You are
He/she/it/one is
We are
You are
They are

We can also conjugate for different tenses (past, present, future).

I was, I am, I will be
You were, you are, you will be
He was, he is, he will be
We were, we are, we will be
They were, they are, they will be
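
The forms listed above can be captured in a small lookup table keyed by tense, number and person. The following is only an illustrative sketch: the TO_BE table and the conjugateToBe helper are hypothetical names and are not part of RiTa.

```javascript
// Hypothetical lookup table for the irregular verb "to be" (not part of RiTa).
// Each tense holds the singular and plural forms, ordered by person (1st, 2nd, 3rd).
const TO_BE = {
  past:    { singular: ["was", "were", "was"], plural: ["were", "were", "were"] },
  present: { singular: ["am", "are", "is"],    plural: ["are", "are", "are"] },
  future:  { singular: ["will be", "will be", "will be"],
             plural:   ["will be", "will be", "will be"] }
};

// tense: "past" | "present" | "future"; number: "singular" | "plural"; person: 1, 2 or 3
function conjugateToBe(tense, number, person) {
  return TO_BE[tense][number][person - 1];
}

console.log(conjugateToBe("past", "singular", 3));   // "was"
console.log(conjugateToBe("present", "plural", 1));  // "are"
```

A table like this only works for a single verb; a general conjugator needs rules for regular verbs plus exception lists for irregular ones.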


You can use RiTa.conjugate to conjugate a verb for the particular form you want by specifying the tense, number and person.

TENSE:   PAST_TENSE, PRESENT_TENSE, FUTURE_TENSE

NUMBER:   SINGULAR, PLURAL

PERSON:   FIRST_PERSON, SECOND_PERSON, THIRD_PERSON


An example, in JavaScript:

  var args = {
    tense: RiTa.PAST_TENSE,
    number: RiTa.SINGULAR,
    person: RiTa.THIRD_PERSON
  };
  var result = RiTa.conjugate("swim", args);

The outcome of this example will be "swam".


Stemming

Stemming means reducing a word to its base, or stem. For example, the words 'writing', 'wrote' and 'written' all have the stem 'write'. A stemmer takes a word or list of words as input and returns the stem(s).

Stemming is useful for any kind of text analysis, because different verb conjugations, different endings for singular and plural, different adjective forms, etc. can all make it difficult to discern the importance of specific words in a text.

Let's take the following paragraph as an example:

I wrote a book about cats, after I had written a short article on cats. Currently I'm writing about dogs. Next year I'll write poems. But I produced my best writings when I was younger.

It's obvious (to a human) that this text is mainly concerned with writing. But when you run a program to analyze it, the program will come to the conclusion that it's about cats, because the word 'cats' is the only word (aside from words like 'I', 'a', etc.) that occurs more than once.

But if you stem the text before analyzing it, replacing all words with their stems, the program will correctly tell you that it's a text about writing. This is because, after stemming, the word 'write' will appear five times (because 'writing', 'written', 'writings' and 'wrote' have been replaced by 'write').
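
The effect can be sketched with a simple word counter run with and without stemming. The stem table below is hand-written for this paragraph only and is an assumption made for illustration; it is not RiTa's stemmer, and countWords and toyStem are hypothetical helper names.

```javascript
// Toy stem table, hand-written for this one paragraph (an assumption, not RiTa's stemmer).
const TOY_STEMS = new Map([
  ["wrote", "write"], ["written", "write"], ["writing", "write"], ["writings", "write"],
  ["cats", "cat"], ["dogs", "dog"], ["poems", "poem"]
]);

// Return the toy stem if we know one, otherwise the word unchanged.
const toyStem = (word) => TOY_STEMS.get(word) || word;

// Lower-case the text, extract the words, optionally stem them, and count occurrences.
function countWords(text, stemFn) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    const key = stemFn ? stemFn(word) : word;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const paragraph =
  "I wrote a book about cats, after I had written a short article on cats. " +
  "Currently I'm writing about dogs. Next year I'll write poems. " +
  "But I produced my best writings when I was younger.";

const rawCounts = countWords(paragraph);
const stemmedCounts = countWords(paragraph, toyStem);

console.log(rawCounts.get("cats"), rawCounts.get("write"));        // 2 1
console.log(stemmedCounts.get("cat"), stemmedCounts.get("write")); // 2 5
```

Without stemming, 'cats' outnumbers every form of 'write'; with stemming, 'write' becomes the most frequent content word.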

In RiTa, you can stem words with RiTa.stem:

RiTa.stem("write written writing");

The result would be: writ writ writ.

In RiTa, there are three different stemming algorithms for you to choose from:

RiTa.LANCASTER (the default), RiTa.PORTER, or RiTa.PLING

Note: see http://text-processing.com/demo/stem/ for a comparison of the Lancaster and Porter algorithms, or http://mpii.de/yago-naga/javatools for information on the PlingStemmer.


Plurals/Singulars

RiTa.pluralize returns the plural form of a noun. It uses a combination of letter-based rules and a lookup table of irregular exceptions.

RiTa.pluralize("apple")

'apple' -> 'apples'
'child' -> 'children'
'appendix' -> 'appendices'
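
To illustrate how letter-based rules and an exception table can work together, here is a much-simplified sketch. The rules and the IRREGULAR_PLURALS table are assumptions chosen for illustration, and toyPluralize is a hypothetical name; RiTa's actual rule set is considerably larger.

```javascript
// Hypothetical exception table, checked before any letter-based rule.
const IRREGULAR_PLURALS = new Map([
  ["child", "children"],
  ["appendix", "appendices"],
  ["man", "men"],
  ["mouse", "mice"]
]);

function toyPluralize(noun) {
  // 1. Irregular nouns come straight from the lookup table.
  if (IRREGULAR_PLURALS.has(noun)) return IRREGULAR_PLURALS.get(noun);
  // 2. Sibilant endings take "es": box -> boxes, church -> churches.
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + "es";
  // 3. Consonant + "y" becomes "ies": city -> cities (but day -> days).
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + "ies";
  // 4. Default rule: append "s".
  return noun + "s";
}

console.log(toyPluralize("apple"));    // "apples"
console.log(toyPluralize("child"));    // "children"
console.log(toyPluralize("appendix")); // "appendices"
```

The exception table must be consulted first; otherwise 'appendix' would fall through to the sibilant rule and come out as 'appendixes'.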


RiTa.singularize does the reverse, taking the plural form of a noun and returning the singular.

RiTa.singularize("apples");

'apples' -> 'apple'
'children' -> 'child'
'appendices' -> 'appendix'


Splitting sentences

Sentence-splitting means dividing a span of text into sentences.

A question mark or exclamation mark always ends a sentence. A period followed by an upper-case letter generally ends a sentence, but there are a number of exceptions. For example, if the period is part of an abbreviated title ("Mr.", "Gen.", ...), it does not end a sentence. A period following a single capitalized letter is assumed to be a person's initial, and is not considered the end of a sentence.
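
The heuristics described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not RiTa's implementation: toySplitSentences and the TITLES list are hypothetical, the text is split on whitespace, and only the last character of each word is inspected, so a question mark followed by a closing quote does not trigger a split.

```javascript
// Hypothetical list of abbreviated titles that never end a sentence.
const TITLES = new Set(["Mr.", "Mrs.", "Ms.", "Dr.", "Gen.", "Prof."]);

function toySplitSentences(text) {
  if (!text.trim()) return [];
  const words = text.trim().split(/\s+/);
  const sentences = [];
  let current = [];

  words.forEach((word, i) => {
    current.push(word);
    const next = words[i + 1];
    const isLast = next === undefined;

    // Rule 1: a question mark or exclamation mark ends a sentence.
    const endsWithMark = /[?!]$/.test(word);
    // Rule 2: a period followed by an upper-case letter generally ends a sentence...
    const periodBeforeCapital = /\.$/.test(word) && !isLast && /^[A-Z]/.test(next);
    // ...unless the word is an abbreviated title or a single-letter initial.
    const isTitle = TITLES.has(word);
    const isInitial = /^[A-Z]\.$/.test(word);

    if (isLast || endsWithMark || (periodBeforeCapital && !isTitle && !isInitial)) {
      sentences.push(current.join(" "));
      current = [];
    }
  });
  return sentences;
}

console.log(toySplitSentences("Mr. Smith met J. K. Rowling. Was it fun? Yes! They talked for hours."));
// [ 'Mr. Smith met J. K. Rowling.', 'Was it fun?', 'Yes!', 'They talked for hours.' ]
```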

RiTa.splitSentences splits a text into sentences according to the Penn Treebank conventions.

// Concatenate the pieces so that no stray indentation ends up inside the string
RiTa.splitSentences("'What's happened to me?' he thought. It wasn't a dream. " +
  "His room, a proper human room although a little too small, lay peacefully " +
  "between its four familiar walls.");

This will return an array of three sentences:

[0] 'What's happened to me?' he thought.
[1] It wasn't a dream.
[2] His room, a proper human room although a little too small, lay peacefully between its four familiar walls.

More about PENN Treebank tokenization


Tokenizing/Untokenizing
Tokenizing

Tokenizing is the task of chopping a text up into smaller pieces called tokens. In RiTa such tokens are usually words (and punctuation characters). There are different tokenizing conventions, but the one RiTa uses is called the Penn Treebank convention.


An example of tokenizing in RiTa looks like this:

  RiTa.tokenize("I want to have a cup of coffee.");

The output will be: [ 'I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.' ]

The default RiTa.tokenize function will split a line of text into words and punctuation. You can also choose to use a RegexTokenizer (with a regular expression pattern of your choice):

  RiTa.tokenize(words, regex);
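
As a rough illustration of the two modes, the sketch below separates words from punctuation by default and splits on a caller-supplied pattern when one is given. Both behaviours are assumptions made for this example: toyTokenize is a hypothetical function, it does not follow the full Penn Treebank conventions (contractions are kept whole, for instance), and treating the pattern as a delimiter is only one possible reading of the regex option.

```javascript
function toyTokenize(text, regex) {
  // With a pattern: treat it as a delimiter and drop empty pieces (an assumption).
  if (regex) return text.split(regex).filter((token) => token.length > 0);
  // Default: runs of letters, digits and apostrophes are words;
  // any other non-space character is a token of its own.
  return text.match(/[A-Za-z0-9']+|[^\sA-Za-z0-9']/g) || [];
}

console.log(toyTokenize("I want to have a cup of coffee."));
// [ 'I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.' ]

console.log(toyTokenize("one, two,three", /,\s*/));
// [ 'one', 'two', 'three' ]
```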

To go in the other direction, from an array of words and punctuation to a sentence, you can use RiTa.untokenize.


Untokenizing

Untokenizing is simply the reverse of tokenizing: putting the individual tokens (in our case, words) back into a sequence.

RiTa.untokenize takes an array of words and punctuation and joins them together into a sentence, preserving punctuation position and adding spaces as necessary.

An example of untokenizing in RiTa:

  var words = ['I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.'];
  RiTa.untokenize(words);

The output will be: "I want to have a cup of coffee."
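
The idea of adding spaces only where they belong can be sketched in a few lines. This is a simplified illustration, not RiTa's implementation: toyUntokenize is a hypothetical name, and it only handles closing punctuation, which is attached to the preceding token without a space.

```javascript
function toyUntokenize(tokens) {
  let result = "";
  for (const token of tokens) {
    // Closing punctuation attaches directly to the previous token.
    const attachesLeft = /^[.,;:!?)\]]$/.test(token);
    if (result !== "" && !attachesLeft) result += " ";
    result += token;
  }
  return result;
}

console.log(toyUntokenize(["I", "want", "to", "have", "a", "cup", "of", "coffee", "."]));
// "I want to have a cup of coffee."
```

A fuller version would also need rules for opening brackets, quotation marks and contractions.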