Frequency distribution and n-grams
A frequency distribution tells us the frequency of each vocabulary item in a text or defined corpus.
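A frequency distribution can be sketched with Python's standard library alone; the example text below is invented for illustration:

```python
from collections import Counter

# Toy corpus (hypothetical example text).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# A frequency distribution: each vocabulary item mapped to its count.
freq = Counter(tokens)
print(freq.most_common(3))  # → [('the', 4), ('sat', 2), ('on', 2)]
```

NLTK offers the same idea as `nltk.FreqDist`, which behaves much like a `Counter` with extra corpus-oriented methods.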
An n-gram is a contiguous sequence of one or more elements (“n elements”), generally words. N-gram search services permit swift examination of how the frequency of words and phrases develops over time. The Norwegian National Library provides an n-gram service, NB N-gram, for searches in its digitized collection. Google Books also has an n-gram service, the Google Books Ngram Viewer, for searches in its corpus.
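Extracting n-grams from a token sequence is a short exercise; this sketch uses an invented sentence and a hypothetical helper named `ngrams`:

```python
# Extract n-grams (here bigrams, n=2) from a toy token list.
tokens = "to be or not to be".split()

def ngrams(seq, n):
    """Return the list of contiguous n-element sequences in seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

bigrams = ngrams(tokens, 2)
print(bigrams)
# → [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

A service like NB N-gram counts such n-grams per publication year, which is what makes it possible to plot their frequency over time.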
Corpus comparison
Corpus comparison as a method for digital text analysis consists of examining which words are overrepresented in a given part of the corpus compared to a larger reference corpus. One way to do corpus comparison is through frequency counts, comparing different corpora in terms of occurrences of different words or expressions.
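As a minimal sketch of comparison by frequency counts, one can compare relative frequencies between a focus corpus and a reference corpus; both corpora and the `relative_freq` helper below are invented for illustration:

```python
from collections import Counter

# Toy example: a focus corpus compared against a reference corpus.
focus = "steam engine railway engine coal steam".split()
reference = "engine horse road coal field harvest horse road".split()

focus_counts, ref_counts = Counter(focus), Counter(reference)

def relative_freq(counts, word):
    """Occurrences of word as a share of all tokens in the corpus."""
    return counts[word] / sum(counts.values())

# Overrepresentation as a ratio of relative frequencies (a small constant
# avoids division by zero for words absent from the reference corpus).
for word in focus_counts:
    ratio = relative_freq(focus_counts, word) / (relative_freq(ref_counts, word) + 1e-9)
    print(word, round(ratio, 1))
```

Words like "steam" and "railway", absent from the reference, come out with very large ratios, which is exactly the overrepresentation signal the method looks for. Real studies typically use statistical measures such as log-likelihood rather than raw ratios.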
Concordance analysis
In corpus linguistics, text mining, and digital text analysis, a “concordance” is a generated list of every occurrence of a given word in a digital corpus, together with the context (a certain number of words before and after the keyword) in which each occurrence appears. Concordances are also referred to as “keyword(s) in context” (KWIC).
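A KWIC view can be sketched in a few lines; the token list and the `concordance` helper are invented for illustration (NLTK offers a ready-made version as `nltk.Text.concordance`):

```python
def concordance(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context on each side."""
    hits = []
    for i, w in enumerate(tokens):
        if w == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{w}] {right}")
    return hits

tokens = "the old ship left the harbor and the ship sank".split()
hits = concordance(tokens, "ship")
for line in hits:
    print(line)
# → the old [ship] left the harbor
# → harbor and the [ship] sank
```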
Collocation analysis
“Collocation” is a term used to describe words that are associated with one another, meaning that they often appear together. In corpus linguistics, text mining, and digital text analysis, a collocation analysis produces a statistical overview of words that co-occur relatively frequently with a particular keyword.
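The core of a collocation analysis — counting words near a keyword — can be sketched as below; the text and the `collocates` helper are invented for illustration:

```python
from collections import Counter

# Toy sketch: count words co-occurring within a window around a keyword.
tokens = "strong tea and strong coffee please weak tea but strong coffee".split()

def collocates(tokens, keyword, window=1):
    """Count words appearing within `window` positions of the keyword."""
    counts = Counter()
    for i, w in enumerate(tokens):
        if w == keyword:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

print(collocates(tokens, "strong").most_common(2))
```

Real collocation tools go a step further and rank co-occurring words with an association measure such as pointwise mutual information, so that frequent function words do not dominate the list.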
Topic modeling
Topic modeling, sometimes referred to as “theme modeling”, is a method that enables analysis of words’ co-occurrence patterns in texts. Statistical calculations are performed by an algorithm, the output of which allows for grouping, or clustering, of words under the concept of a topic. Despite these clusters being nothing more than words grouped by statistical analysis, a researcher may glean interesting information about the thematic structure of texts through this method.
There are several algorithms used in topic modeling. Two examples are Latent Dirichlet Allocation (LDA), the most common, and the newer BERTopic. Although the two algorithms work quite differently, their outputs are broadly comparable. The LDA algorithm is implemented in the open-source Gensim Python package and in the open-source software toolkit Mallet.
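To make the idea of word clusters concrete without depending on Gensim or Mallet, here is a deliberately tiny, self-contained LDA sketch using collapsed Gibbs sampling; the four three-word "documents" are invented, and `toy_lda` is a teaching toy, not a substitute for the real implementations:

```python
import random
from collections import defaultdict

def toy_lda(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed-Gibbs LDA: returns the top words of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    z = []                                              # topic of every token
    ndk = [[0] * n_topics for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # topic totals
    for di, doc in enumerate(docs):                     # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                              # Gibbs sweeps
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                           # remove token's count
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(n_topics)]    # conditional distribution
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t                           # resample and restore
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[k], key=nkw[k].get, reverse=True)[:3] for k in range(n_topics)]

docs = [["ship", "sea", "sail"], ["sea", "ship", "harbor"],
        ["vote", "election", "party"], ["party", "vote", "candidate"]]
topics = toy_lda(docs)
print(topics)
```

With luck, the maritime words and the political words end up in separate clusters — the "topics" the method is named after — but as the surrounding text notes, the clusters themselves are nothing more than statistically grouped words.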
Automatic Name Recognition
Also known as Named Entity Recognition (NER), this method relies on trained models to identify names of persons, products, places, and the like in texts. The Natural Language Toolkit (NLTK) provides a classifier that has been trained to recognize named entities.
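NLTK's trained classifier requires downloaded model data, so as a self-contained illustration of the *idea* only, here is a deliberately naive capitalization heuristic; `naive_ner` and its example sentence are invented, and real NER uses statistical models, not rules like this:

```python
# Simplified sketch: flag capitalized tokens that do not start a sentence.
# Real NER (e.g. NLTK's ne_chunk) uses trained statistical models instead.
def naive_ner(tokens):
    """Return capitalized tokens that are not in sentence-initial position."""
    entities = []
    for i, tok in enumerate(tokens):
        sentence_start = i == 0 or tokens[i - 1].endswith(".")
        if tok[:1].isupper() and not sentence_start:
            entities.append(tok)
    return entities

tokens = "Yesterday Kari visited the National Library in Oslo .".split()
print(naive_ner(tokens))  # → ['Kari', 'National', 'Library', 'Oslo']
```

The heuristic's failure modes (it would miss a name at the start of a sentence, and it cannot tell a person from a place) show why trained models are used in practice.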
Part-of-Speech tagging
Part-of-speech (POS) tagging assigns each word a label indicating its part of speech, such as noun or verb, together with grammatical categories such as tense, number (singular/plural), and case. Tagged text makes it possible to extract all words with a particular part of speech. The Natural Language Toolkit (NLTK) provides a POS tagger.
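As a self-contained sketch of what a tagger's output looks like, here is a lexicon lookup over a tiny made-up word list; real taggers such as NLTK's use trained models and sentence context to disambiguate words that can carry several tags:

```python
# Tiny invented lexicon for illustration only.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "on": "ADP",
}

def pos_tag(tokens):
    """Attach a (word, tag) pair to each token; unknown words get 'X'."""
    return [(w, LEXICON.get(w, "X")) for w in tokens]

tagged = pos_tag("the cat sat on the mat".split())
print(tagged)
# → [('the', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
#    ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```

Given such (word, tag) pairs, extracting all nouns is a simple filter: `[w for w, t in tagged if t == "NOUN"]`.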
Sentiment analysis
Also known as “opinion mining”, sentiment analysis describes automated methods for identifying affective states in data sets. This is done through systematic selection of expressions of subjective opinions and emotional evaluations in the material. Sentiment analysis is popular in marketing and advertising, in examining the tone of political communication, public debate, and social media, and in studies of plot and genre in literary corpora.
Digital sentiment analysis of text data relies on word lists and data sets in which words and expressions are assigned a score based on their perceived emotional meaning.
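The word-list approach can be sketched in a few lines; the lexicon and its scores below are made up for illustration, and real sentiment lexicons are far larger and handle negation, intensifiers, and context:

```python
# Minimal lexicon-based sentiment sketch; the scores are invented.
SENTIMENT = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment_score(tokens):
    """Sum the lexicon scores of all tokens; unknown words score 0."""
    return sum(SENTIMENT.get(w, 0) for w in tokens)

print(sentiment_score("the plot was great but the ending was bad".split()))  # → 1
```

A positive total suggests an overall positive tone, a negative total the opposite — the per-word scoring that, scaled up, drives studies of plot arcs in literary corpora.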