
Tools and software bundles for text mining

There are several tools for text mining. Below are the two leading programming languages and central software to get you started.

Python (with Jupyter Notebook in Anaconda)

Python is a free, open-source programming language that often serves as the foundation for text analysis projects, typically together with Jupyter Notebook, an application available through the Anaconda navigator suite. The Natural Language Toolkit (NLTK), a leading platform for building Python programs that work with human language data (natural language processing, or NLP), is among the many packages and libraries you can select to install through Anaconda and run in Jupyter Notebook.

Download and install Anaconda. Open Anaconda, launch Jupyter Notebook from Anaconda, and select new Notebook in Python.
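
Once a new notebook is open, a first cell might look like the following. This is only a minimal sketch of a basic text mining step (counting word frequencies) using Python's standard library, so it runs without installing anything extra. The function names and the sample sentence are made up for illustration; NLTK offers richer tokenizers and frequency tools for real projects.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words as a list.

    The pattern matches runs of letters only (no digits or underscores)
    and is Unicode-aware, so letters such as æ, ø and å are kept.
    """
    return re.findall(r"[^\W\d_]+", text.lower())

def top_words(text, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(n)

# Hypothetical sample text; replace it with the contents of your own file.
sample = "The cat sat on the mat. The mat was warm."
print(top_words(sample, 2))  # prints [('the', 3), ('mat', 2)]
```

To analyze your own text, read a file into a string (for example with open("myfile.txt", encoding="utf-8").read(), where the file name is a placeholder) and pass it to top_words.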

Get started

Natural Language Processing with Python (updated 2019) by Steven Bird, Ewan Klein and Edward Loper

From the developers of the Natural Language Toolkit (NLTK), this book is a solid introduction to natural language processing (NLP) with Python and shows how to use it for text mining. Working through the examples yourself in Jupyter Notebook with NLTK, in the Anaconda environment, makes the book even easier to follow.

Humanities Data Analysis: Case Studies with Python (online 2022) by Folgert Karsdorp, Mike Kestemont and Allen Riddell

A practical guide to textual data analysis with Python in the environment of Anaconda with Jupyter Notebook, this book begins by describing the essential techniques for gathering and cleaning textual data, before presenting a variety of detailed case studies where a range of text mining methods are employed. A comprehensive resource for humanities students and scholars aiming to take their Python skills to the next level.

The Programming Historian publishes open-access, peer-reviewed tutorials on digital tools and techniques for research in the humanities. Many of them use Python, and the site includes an introduction to Python series. Several lessons are devoted to distant reading.

R (with RStudio)

R is a free, open-source programming language created specifically for statistical computing and graphics. It is widely adopted in academia, especially among statisticians and social science researchers. R is commonly used with RStudio, whose interface lets the user view R code, output, graphs, and data tables at the same time. R also has a rich collection of packages produced by its community.

Download and install R and RStudio, or access them via UiO Programkiosk. Open RStudio and select R Script under New File.

Get started

Text Analysis with R for Students of Literature (2020) by Matthew L. Jockers and Rosamond Thalken

Written with students and scholars of literature in mind, this book will also be applicable to other humanists and social scientists wishing to extend their methodological tool kit to include quantitative and computational approaches to the study of text.

The Programming Historian publishes open-access, peer-reviewed tutorials on digital tools and techniques for research in the humanities, including distant reading. Although it initially focused on programming skills in Python, its catalog of lessons in R is growing.

Voyant Tools

Voyant Tools is an open-source, web-based application for performing automated computational text analysis on documents. It can analyze online texts you link to as well as texts you upload, and it accepts a number of file formats, such as docx, pdf, and txt. Voyant is popular among scholars in the digital humanities and has a large, international user base. It does not perform linguistic text analysis such as part-of-speech tagging or named entity recognition, but it is highly user-friendly, produces clear visualizations, and offers a rich set of functions.

Acknowledging the limitations of its pre-programmed functionalities, Voyant welcomes users to develop their own tools using Voyant's functionality and code, and endorses the use of other tools, in particular Python with Jupyter Notebook.
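
As an illustration of what moving from a ready-made tool to your own code in a notebook can look like, here is a minimal sketch of a keyword-in-context (KWIC) listing in plain Python. It is not Voyant's code or API; the function name, the window parameter, and the sample sentence are assumptions made for this example.

```python
import re

def kwic(text, keyword, window=3):
    """Return every occurrence of keyword with its surrounding context.

    Each hit is a (left, word, right) tuple holding up to `window`
    words on either side. Matching ignores case.
    """
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Hypothetical sample text for illustration only.
sample = "The whale surfaced. Ahab saw the whale and the whale saw Ahab."
for left, word, right in kwic(sample, "whale", window=2):
    print(f"{left:>10} | {word} | {right}")
```

Writing the function yourself means you control details such as the size of the context window and how words are split, which a pre-programmed tool fixes for you.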

Access Voyant Tools at voyant-tools.org

Get started

Voyant has an extensive help menu covering all of its functions, as well as a tutorial/workshop page.

“A Beginner’s Guide to Using Voyant for Digital Theme Analysis” (2022) by Randa El Khatib and Shawna Ross, published on Humanities Commons, provides a case-based illustration of how to use Voyant to carry out a digital thematic analysis in literary criticism.

DH-Lab Python Apps and Notebooks

The DH-Lab at the National Library of Norway has written example code for text mining the National Library’s huge digitized collection, and it is developing web-based apps that offer a simpler introduction to text mining the collection. The code is written in Python and shared as Jupyter notebooks, and the apps are built with Streamlit, a free and open-source app framework for Python.

Access DH-Lab Apps and Notebooks at nb.no/dh-lab

Get started

To run code to text mine the National Library’s digitized collections:

Download the example notebook from DH-Lab for the type of text mining you want to do (begin at the top and follow the instructions).
Download and install Anaconda. Open Anaconda, launch Jupyter Notebook from Anaconda, and open the downloaded notebook. 
Run all cells in the notebook.

To use an app to text mine the National Library’s digitized collections:

Go to the app page at DH-Lab and select the app for the type of text mining you want to do.

Published Sep. 7, 2021 10:42 AM - Last modified Apr. 9, 2023 7:12 PM