Tutorial > Transformation


Conjugation

Conjugation refers to how a verb changes to show a different person, tense, number or mood. The term conjugation applies only to the inflection of verbs, not to other parts of speech (the inflection of nouns and adjectives is known as declension).

For instance, the verb ‘to be’ is a notable example because its conjugation is so irregular.

I am
You are
He/she/it/one is
We are
You are
They are

We can also conjugate for different tenses (past, present, future).

I was, I am, I will be
You were, you are, you will be
He was, he is, he will be
We were, we are, we will be
They were, they are, they will be


You can use RiTa.conjugate to conjugate a verb for the particular form you want by specifying the tense, number and person.

TENSE:   PAST, PRESENT, FUTURE

NUMBER:   SINGULAR, PLURAL

PERSON:   FIRST, SECOND, THIRD


An example, in JavaScript:

  let args = {
    tense: RiTa.PAST,
    number: RiTa.SINGULAR,
    person: RiTa.THIRD
  };
  let result = RiTa.conjugate("swim", args);

The outcome of this example will be "swam".
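RiTa handles irregular verbs like 'swim'/'swam' internally. For regular verbs, though, the core idea behind conjugation rules can be sketched in a few lines of plain JavaScript. This is a toy illustration only, not RiTa's implementation, and it ignores all irregular forms:

```javascript
// Toy conjugator for *regular* verbs only (a sketch, not RiTa's implementation).
// Irregular verbs like "swim"/"swam" need a lookup table, which RiTa provides.
function conjugateRegular(verb, { tense, person, number }) {
  if (tense === 'past') {
    return verb.endsWith('e') ? verb + 'd' : verb + 'ed';
  }
  if (tense === 'future') {
    return 'will ' + verb;
  }
  // present tense: only third-person singular changes, taking -s
  if (person === 'third' && number === 'singular') {
    return verb + 's';
  }
  return verb;
}

console.log(conjugateRegular('walk', { tense: 'past' }));   // "walked"
console.log(conjugateRegular('walk', {
  tense: 'present', person: 'third', number: 'singular'
}));                                                        // "walks"
```

A real conjugator consults its exception table before applying rules like these, which is why RiTa returns "swam" rather than "swimmed".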


Stemming

Stemming means reducing a word to its base, or stem. For example, the words 'write', 'writes' and 'writing' all have the stem 'write'. A stemmer takes a word as input and returns the stem.

Stemming is useful when you are doing text analysis, because different verb conjugations, different endings for singular and plural, different adjective forms, and so on can all make it difficult to discern the importance of words in a text.

Let's take the following paragraph as an example:

I wrote a book about cats, after I had written a short article on cats. Currently I'm writing about dogs. Next year I'll write poems. But I produced my best writings when I was younger.

It's obvious (to a human) that this text is mainly concerned with writing. But when you run a program to analyze it, the program could come to the conclusion that it's about cats, because the word 'cats' is the only word (aside from words like 'I', 'a', etc.) that occurs more than once.

But if you stem the text before analyzing it, replacing all words with their stems, the program can correctly tell you that it's a text about writing. This is because, after stemming (depending on the stemmer you use), the word 'write' will appear five times ('writing', 'written', 'writings' and 'wrote' having all been replaced by 'write'). You can see a similar example below in RiTa, which uses the so-called 'Snowball Stemmer':

RiTa.stem("write writes writing writings.");

The result would be: write write write write.
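A real Snowball stemmer applies an ordered series of suffix-stripping rules with repair steps. A drastically simplified sketch of the underlying idea, in plain JavaScript, might look like this. Note that, unlike Snowball, this toy version leaves 'writing' as 'writ' rather than 'write', which is exactly why real stemmers need their extra repair rules:

```javascript
// Naive suffix-stripping stemmer: a toy sketch of the idea behind Snowball,
// not the real algorithm (which has ordered rule sets and repair steps).
function naiveStem(word) {
  const suffixes = ['ings', 'ing', 's']; // longest suffix first
  for (const suffix of suffixes) {
    // only strip if a reasonably long stem remains
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

console.log(['write', 'writes', 'writing', 'writings'].map(naiveStem));
// → [ 'write', 'write', 'writ', 'writ' ]
```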


Plurals/Singulars

RiTa.pluralize is a simple pluralizer for nouns. It uses a combination of letter-based rules and a lookup table of irregular exceptions.

RiTa.pluralize("apple")

'apple' -> 'apples'
'child' -> 'children'
'appendix' -> 'appendices'


RiTa.singularize does the reverse, taking the plural form of a noun and returning the singular.

RiTa.singularize("apples");

apples -> apple
children -> child
appendices -> appendix
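The rules-plus-exceptions approach described above can be sketched in a few lines of plain JavaScript. This is a toy illustration with only a handful of rules and exceptions, not RiTa's actual rule set:

```javascript
// Sketch of letter-based rules plus an irregular-exceptions table
// (a toy illustration, not RiTa's actual implementation).
const IRREGULAR = { child: 'children', appendix: 'appendices', mouse: 'mice' };

function toyPluralize(noun) {
  if (IRREGULAR[noun]) return IRREGULAR[noun];                   // exceptions first
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + 'es';           // box -> boxes
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + 'ies'; // city -> cities
  return noun + 's';                                             // apple -> apples
}

console.log(toyPluralize('apple'));   // "apples"
console.log(toyPluralize('child'));   // "children"
console.log(toyPluralize('city'));    // "cities"
```

Singularizing works the same way in reverse: check an exceptions table first, then undo the letter-based rules.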


Splitting-sentences

Sentence-splitting means dividing a span of text into sentences.

A question mark or exclamation mark always ends a sentence. A period followed by an upper-case letter generally ends a sentence, but there are a number of exceptions. For example, if the period is part of an abbreviated title ("Mr.", "Gen.", ...), it does not end a sentence. A period following a single capitalized letter is assumed to be a person's initial, and is not considered the end of a sentence.

RiTa.sentences splits 'text' into sentences according to the Penn Treebank conventions.

RiTa.sentences("'What's happened to me?' he thought. It wasn't a dream. "
  + "His room, a proper human room although a little too small, lay peacefully "
  + "between its four familiar walls.");

This will return an array of three sentences:

[0] 'What's happened to me?' he thought.
[1] It wasn't a dream.
[2] His room, a proper human room although a little too small, lay peacefully between its four familiar walls.
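The heuristics described above (question/exclamation marks always split; periods split unless they belong to an abbreviation or a single-letter initial) can be sketched in plain JavaScript. This is a toy version with a tiny, illustrative abbreviation list; it does not check the case of the following word, and RiTa's splitter handles many more edge cases:

```javascript
// Toy sentence splitter implementing the heuristics described above
// (a sketch; RiTa's splitter handles many more cases).
const ABBREVIATIONS = new Set(['Mr.', 'Mrs.', 'Dr.', 'Gen.', 'St.']); // partial list

function toySentences(text) {
  const tokens = text.split(/\s+/);
  const sentences = [];
  let current = [];
  for (const tok of tokens) {
    current.push(tok);
    const endsWithBoundary = /[.?!]['"]?$/.test(tok); // ., ? or ! (maybe quoted)
    const isAbbrev = ABBREVIATIONS.has(tok);
    const isInitial = /^[A-Z]\.$/.test(tok);          // e.g. the "J." in "J. Smith"
    if (endsWithBoundary && !isAbbrev && !isInitial) {
      sentences.push(current.join(' '));
      current = [];
    }
  }
  if (current.length) sentences.push(current.join(' '));
  return sentences;
}

console.log(toySentences("It wasn't a dream. Mr. Samsa knew it. Really!"));
// → [ "It wasn't a dream.", 'Mr. Samsa knew it.', 'Really!' ]
```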


Tokenizing/Untokenizing
Tokenizing

Tokenizing is the task of chopping a text up into smaller pieces called tokens. In RiTa such tokens are usually words (and punctuation characters). There are different tokenizing conventions, but the one RiTa uses is called the Penn Treebank convention.

An example of tokenizing in RiTa looks like this:

  RiTa.tokenize("I want to have a cup of coffee.");

The output will be: [ 'I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.' ]

The default RiTa.tokenize function will split a line of text into words and punctuation. You can also choose to use a RegexTokenizer (with a regular expression pattern of your choice):

  RiTa.tokenize(text, regex);
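To see what regex-based tokenization does, here is a sketch in plain JavaScript rather than RiTa. The pattern below treats runs of word characters and single punctuation marks as tokens; it is a simplification and does not reproduce RiTa's Penn Treebank rules (which, for example, split contractions like "don't" into "do" and "n't"):

```javascript
// Sketch of regex-based tokenization: words and punctuation become separate
// tokens (a simplification, not RiTa's Penn Treebank rules).
function regexTokenize(text, pattern = /\w+|[^\w\s]/g) {
  return text.match(pattern) || [];
}

console.log(regexTokenize('I want to have a cup of coffee.'));
// → [ 'I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.' ]
```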

To go in the other direction, from an array of words and punctuation to a sentence, you can use RiTa.untokenize.


Untokenizing

Untokenizing is simply the reverse of tokenizing: putting the individual tokens (in our case, words and punctuation) back into a sequence.

RiTa.untokenize takes an array of words and punctuation and joins them together into a sentence, preserving punctuation position and adding spaces as necessary.

An example of untokenizing in RiTa:

  let words = ['I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.'];
  RiTa.untokenize(words);

The output will be: "I want to have a cup of coffee."
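The spacing logic can be sketched in a few lines of plain JavaScript: join tokens with spaces, but attach closing punctuation directly to the preceding word. This is a toy version of the idea, not RiTa's rules (which also handle quotes, brackets and other cases):

```javascript
// Toy untokenizer: spaces between words, but no space before
// closing punctuation (a sketch of the idea, not RiTa's rules).
function toyUntokenize(tokens) {
  const NO_SPACE_BEFORE = new Set(['.', ',', '!', '?', ';', ':']);
  let out = '';
  for (const tok of tokens) {
    if (out === '' || NO_SPACE_BEFORE.has(tok)) out += tok;
    else out += ' ' + tok;
  }
  return out;
}

console.log(toyUntokenize(['I', 'want', 'to', 'have', 'a', 'cup', 'of', 'coffee', '.']));
// → "I want to have a cup of coffee."
```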