Types of text mining and digital analysis of text

The variety of text mining methods is continuously growing. Below is a presentation of central types of text mining you will encounter across text mining tools and platforms.

Frequency distribution and n-grams

A frequency distribution tells us how often each vocabulary item occurs in a text or defined corpus.
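As a minimal sketch of the idea, the following Python snippet counts word occurrences in a short sample text. The tokenization rule (lowercasing and splitting on non-word characters) and the sample sentence are assumptions made for illustration; text mining tools typically offer more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def frequency_distribution(text):
    """Count how often each vocabulary item occurs in the text."""
    return Counter(tokenize(text))

sample = "The cat sat on the mat. The dog sat too."
freq = frequency_distribution(sample)

# List the two most frequent vocabulary items with their counts.
print(freq.most_common(2))
```

Here "the" occurs three times and "sat" twice, so these two items head the distribution.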

An n-gram is a sequence of n elements, generally words: a single word when n is 1, a two-word phrase when n is 2, and so on. N-gram search services permit swift examination of how often words and phrases occur over time. The Norwegian National Library provides an n-gram service for searches in its digitized collection: NB N-gram. Google Books also has an n-gram service, Google Books Ngram Viewer, for searches in its corpus.
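The snippet below sketches how n-grams can be extracted and how the relative frequency of a phrase can be followed over time. It is only an illustration: the two "yearly" texts, the helper names and the simple tokenization are assumptions, and it does not reflect how NB N-gram or Google Books Ngram Viewer are implemented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(phrase, text):
    """Share of the text's n-grams that match the phrase."""
    target = tuple(tokenize(phrase))
    grams = ngrams(tokenize(text), len(target))
    return grams.count(target) / len(grams) if grams else 0.0

# Hypothetical miniature corpus: one short text per year.
corpus_by_year = {
    1900: "the steam engine and the steam ship",
    1950: "the jet engine replaced the piston engine",
}

trend = {
    year: relative_frequency("steam engine", text)
    for year, text in corpus_by_year.items()
}
print(trend)
```

In this toy corpus the bigram "steam engine" accounts for one of six bigrams in the 1900 text and does not occur in the 1950 text.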

Corpus comparison

Corpus comparison as a method for digital text analysis consists of examining which words are overrepresented in a given part of the corpus compared to a larger reference corpus. One way to do corpus comparison is through frequency counts, comparing how often different words or expressions occur in the different corpora.
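A frequency-based comparison can be sketched as follows. The snippet ranks the words of a target text by the ratio between their relative frequency in the target and in a reference text. The add-one smoothing (so that words missing from the reference do not cause division by zero), the function names and the two sample texts are assumptions for illustration; dedicated tools often use other statistics.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def overrepresentation(target_text, reference_text):
    """Rank target words by smoothed relative frequency ratio (target / reference)."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    vocabulary = set(target) | set(reference)

    # Add-one smoothing: every vocabulary item gets one extra count in both corpora.
    n_target = sum(target.values()) + len(vocabulary)
    n_reference = sum(reference.values()) + len(vocabulary)

    ratios = {}
    for word in target:
        p_target = (target[word] + 1) / n_target
        p_reference = (reference[word] + 1) / n_reference
        ratios[word] = p_target / p_reference
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

target_text = "whale whale ship sea the the"
reference_text = "the the the the house garden sea the"

ranked = overrepresentation(target_text, reference_text)
for word, ratio in ranked:
    print(f"{word:10s} {ratio:.2f}")
```

A ratio above 1 suggests that the word is more typical of the target text than of the reference, and a ratio below 1 suggests the opposite.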

Concordance analysis

In corpus linguistics, text mining and digital text analysis, a “concordance” is a generated list of every occurrence of a given word in a digital corpus, each shown with the context (a certain number of words before and after the keyword) in which it appears. Concordances are also referred to as “keyword(s) in context” (KWIC).
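A keyword-in-context listing can be sketched in a few lines of Python. The function name, the `width` parameter (number of context words on each side) and the sample text are assumptions for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of the keyword."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

text = "The whale surfaced. Ahab saw the whale and cursed the white whale again."
hits = concordance(text, "whale", width=2)

# Align the keyword in a column, as KWIC displays usually do.
for left, keyword, right in hits:
    print(f"{left:>12} [{keyword}] {right}")
```

Note that this simple version ignores punctuation, so the context window may run across sentence boundaries.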

Collocation analysis

“Collocation” is a term used to describe words that are associated with one another, meaning that they often appear together. In corpus linguistics, text mining, and digital text analysis, a collocation analysis gives a statistical overview of the words that co-occur relatively often with a particular keyword.
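The counting step behind a collocation analysis can be sketched as follows: for every occurrence of the keyword, the words within a fixed window on either side are tallied. The window size, function name and sample text are assumptions for illustration. This sketch reports raw co-occurrence counts only, whereas collocation tools usually go on to weigh such counts with an association measure.

```python
import re
from collections import Counter

def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts

text = "strong tea and strong coffee but powerful computer and strong tea"
tokens = re.findall(r"\w+", text.lower())

counts = collocates(tokens, "strong", window=1)
print(counts)
```

With a window of one word, "tea" is counted twice next to "strong" in this sample, while "computer" never is.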

Topic modeling

Topic modeling, sometimes referred to as “theme modeling”, is a method that enables analysis of words’ co-occurrence patterns in texts. Statistical calculations are performed by an algorithm, the output of which allows for grouping, or clustering, of words under the concept of a topic. Despite these clusters being nothing more than words grouped by statistical analysis, a researcher may glean interesting information about the thematic structure of texts through this method.

There are several algorithms used in topic modeling. Two examples are Latent Dirichlet Allocation (LDA), the most common, and BERTopic, which is newer. Although the two algorithms work differently, their outputs are broadly comparable: groups of words that can be interpreted as topics. The LDA algorithm is implemented in the open-source Python package Gensim and in the open-source software toolkit Mallet.
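To give a feel for how LDA groups words, here is a toy sketch in plain Python that uses collapsed Gibbs sampling, one common way of fitting an LDA model. It is not the Gensim or Mallet implementation and is far too slow for real corpora. The function name, the hyperparameters `alpha` and `beta`, the number of iterations and the four miniature documents are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA: return the three highest-count words for each topic."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]           # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                         # total words per topic
    assignments = []

    # Start from a random topic assignment for every word token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    topics = []
    for k in range(num_topics):
        ranked = sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)
        topics.append(ranked[:3])
    return topics

docs = [
    "ship sea whale sea ship".split(),
    "whale sea ship whale".split(),
    "vote party election vote".split(),
    "election party vote party".split(),
]
topics = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(topics):
    print(f"Topic {k}: {', '.join(words)}")
```

The result is a list of word groups, one per topic. Because the procedure is random, the exact groups depend on the seed and the settings, and it is up to the researcher to judge whether a group forms a meaningful theme.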

Automatic Name Recognition

Automatic name recognition, also known as Named Entity Recognition (NER), relies on different models for identifying names of persons, products, places and the like in texts. The Natural Language Toolkit (NLTK) provides a classifier that has been trained to recognize named entities.

Part-of-Speech tagging

Part-of-speech (POS) tagging assigns each word a label, the POS tag, that indicates its part of speech, such as noun or verb, as well as grammatical categories such as tense, number (plural/singular) and case. Once a text is tagged, words with a particular part of speech can be extracted. The Natural Language Toolkit (NLTK) provides a POS tagger.
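The extraction step can be sketched as follows, assuming the text has already been tagged. The (word, tag) pairs below are written by hand for illustration and use Penn Treebank style tags, where for example NN is a singular noun, NNS a plural noun and VBD a verb in the past tense. The function name is hypothetical.

```python
# Hand-tagged sample sentence; in practice a POS tagger would produce these pairs.
tagged_tokens = [
    ("The", "DT"),
    ("whale", "NN"),
    ("sank", "VBD"),
    ("two", "CD"),
    ("ships", "NNS"),
]

def words_with_pos(tagged, tag_prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(tag_prefix)]

nouns = words_with_pos(tagged_tokens, "NN")   # singular and plural nouns
verbs = words_with_pos(tagged_tokens, "VB")   # verbs in any tense

print("Nouns:", nouns)
print("Verbs:", verbs)
```

Matching on the tag prefix collects all nouns or all verbs regardless of number or tense, while matching on the full tag (for example NNS) would narrow the selection to one grammatical category.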

Sentiment analysis

Also known as “opinion mining”, sentiment analysis describes automated methods to identify affective states in data sets. This is done by systematically identifying expressions of subjective opinions and emotional evaluations in the material. Sentiment analysis is popular in marketing and advertising, and is also used to examine the tone of political communication, public debate and social media, as well as in studies of plot and genre in literary corpora.

Digital sentiment analysis uses word lists and data sets in which words and expressions are assigned a score based on their perceived emotional meaning.
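A word-list approach can be sketched as follows. The tiny lexicon, its scores and the averaging rule are invented for illustration and do not come from any published sentiment resource.

```python
import re

# Hypothetical sentiment lexicon: positive scores for positive words, negative for negative.
LEXICON = {
    "good": 1, "great": 2, "happy": 2,
    "bad": -1, "terrible": -2, "sad": -2,
}

def sentiment_score(text, lexicon=LEXICON):
    """Average the lexicon scores of the scored words in the text (0.0 if none)."""
    tokens = re.findall(r"\w+", text.lower())
    scores = [lexicon[token] for token in tokens if token in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

review = "The food was great but the service was terrible and sad"
print(sentiment_score(review))
```

A score above zero indicates a mostly positive text and a score below zero a mostly negative one. This sketch ignores negation ("not good") and context, which is one reason why simple word-list scores should be read with care.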

Published Sep. 7, 2021 10:42 AM - Last modified July 26, 2023 9:01 AM