N-Grams

More Technical Definition

An N-Gram grammar is a representation of an N-th order Markov language model in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of N-1 other symbols. N-Gram grammars are typically constructed from statistics obtained from a large corpus of text using the co-occurrences of words in the corpus to determine word sequence probabilities. N-Gram grammars have the advantage of being able to cover a much larger language than would normally be derived directly from a corpus.

               (from the W3C Stochastic Language Models Specification)
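
As a rough illustration of "conditioned upon the prior occurrence of N-1 other symbols", the probability of a word is commonly estimated from corpus counts as shown below. The notation is an assumption made here for illustration and is not taken from the W3C specification: w_i stands for the i-th word and C(...) for the number of times a sequence occurs in the corpus. This is the plain maximum-likelihood estimate, with no smoothing.

```latex
P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
  \approx
  \frac{C(w_{i-N+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-N+1}, \ldots, w_{i-1})}
```

For N = 2 this reduces to C(w_{i-1}, w_i) / C(w_{i-1}): the number of times the pair occurs, divided by the number of times its first word occurs.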

Less Technical Definition

N-Grams (or Markov models) are computational models that represent the probabilities that various textual events (short sequences of letters, words, or even phrases) will occur given some number (N-1, for an N-Gram) of preceding letters, words, or phrases. N-Gram-based approaches can be quite useful in text generation (as well as in other areas such as categorization and stylometrics), assuming 'good' input texts. For a good introduction to text generation using N-Grams (with letter sequences), see Hartman's 'Virtual Muse' (chapter 5).
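
A minimal sketch of letter-level generation, under some stated assumptions: the function names (build_model, generate), the sample text, and the choice of N = 3 are all hypothetical and chosen only for illustration; this is not the method from 'Virtual Muse' or from any particular library. The model records which letter follows each (N-1)-letter context, then extends a seed by repeatedly sampling a follower.

```python
import random
from collections import defaultdict

def build_model(text, n=3):
    """Map each (n-1)-letter context to the list of letters that follow it.

    Assumes n >= 2. Repeated followers are kept, so sampling from the
    list reproduces the frequencies observed in the text.
    """
    model = defaultdict(list)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        model[context].append(next_char)
    return model

def generate(model, seed, length=80, rng=None):
    """Extend `seed` one letter at a time by sampling from the model.

    `seed` must be exactly n-1 letters long (the context length).
    """
    rng = rng or random.Random()
    k = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-k:])
        if not followers:  # context was never followed by anything: stop early
            break
        out += rng.choice(followers)
    return out

# Hypothetical toy input; real use needs far more text.
sample_text = "the cat sat on the mat and the rat sat on the cat"
model = build_model(sample_text, n=3)
print(generate(model, seed="th", length=60, rng=random.Random(0)))
```

Every three-letter window of the output also occurs somewhere in the input, which is why the result tends to look locally plausible even when it is globally nonsense.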

Generally speaking, word-level N-Grams tend to produce 'better' results for generative applications, but they present a different set of challenges. For example, if two input texts use very different vocabularies (say, a scientific paper and a literary novel), N-Gram generation will be less effective, because the texts share few word sequences at which the model can move from one to the other. Further, word-level N-Grams require comparatively large inputs (hundreds or thousands of pages of text), as well as potentially complex strategies for tokenization, stemming, and/or lemmatization of the input texts.
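
A small sketch of the word-level case, again under stated assumptions: the tokenizer is a deliberately crude regular expression (lowercase letters and apostrophes only), the two sample sentences are invented stand-ins for a scientific paper and a novel, and the helper names are hypothetical. It does no stemming or lemmatization. It builds word-pair (N = 2) counts from both texts and lists the words the texts have in common.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def bigram_counts(token_lists):
    """Count, for each word, how often each other word follows it.

    Each text is counted separately so that no pair is formed across
    the boundary between two texts.
    """
    counts = defaultdict(Counter)
    for tokens in token_lists:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def shared_vocabulary(tokens_a, tokens_b):
    """Return the set of words that occur in both texts."""
    return set(tokens_a) & set(tokens_b)

# Invented sample inputs, far too short for real use.
paper = "The enzyme catalyzes the reaction. The reaction rate depends on temperature."
novel = "The night was cold, and the wind carried the smell of rain."

a, b = tokenize(paper), tokenize(novel)
counts = bigram_counts([a, b])
print(sorted(shared_vocabulary(a, b)))
print(counts["the"].most_common(3))
```

Here the only word the two samples share is "the", so a generator built from these counts could switch between the two sources only at that word.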

[We'll go into more depth in our class discussion.]

Instructor: Daniel C. Howe



Page last modified on October 31, 2007, at 02:48 PM EST