Introduction to Text Mining with Python and Apps

Get a fundamental understanding of how to use Python programming to text mine corpora, along with an introduction to simpler tools such as Voyant and the DH-Lab apps.


Course Description

This day-long advanced course in methods will give you a fundamental understanding of how to use Python programming to text mine corpora. It will also introduce you to the web applications from Voyant and DH-Lab for text mining corpora. The course will prepare you to conduct your own text mining project, using either Python or web applications, whether you approach your texts from a literary, linguistic, cultural, historical, or other humanities research perspective.

After the in-class day, you will receive individual guidance for your assignment. In the assignment, you will text mine a corpus of your choice, using the tool(s) and types of text mining that suit the project and goals you define. You will conclude the assignment by presenting your project and findings to the other course participants.

This is a 1 ECTS course.

In Class

You will find the program for the day, including the code and exercises we will go through, on GitHub, published here.

Instructors

Anne Sæbø, PhD, together with Ragnhild Sundsbak, Elisa Pierfederici, PhD fellow, and Sofie Gilbert (all from UB/Digital Scholarship Centre), and Andrea Dale Wefring (from Humit - Centre for digital development at HF).

Course Preparations

Please prepare for the course by downloading and installing Anaconda (free and simple to install, but a fairly large package). You may need admin rights from your local IT to download Anaconda. DEADLINE: Thursday November 30 at 12. By this time, you need to make sure that Anaconda is ready to run on your computer and that you can open Anaconda, launch Jupyter Notebook from Anaconda, and create a new Python notebook as instructed in the guide here. We will not have time to help you with these steps during the course, so if you have any questions about any of this, please contact us for help before the deadline.

Please also prepare by visiting and familiarizing yourself with this course's site on GitHub. Read at least the front welcome page and the two linked pages you are asked to consult: "Types of text mining and digital analysis of text" and "Tools and software bundles for text mining".

Please also read the required book chapters and articles from the reading list prior to our in-class day.

Language

This course will be taught in English. We can also answer questions in Norwegian.

Course Readings

Book Chapters:

Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. The book is recommended as a whole. Required chapter readings are 1. Language Processing and Python, 2. Accessing Text Corpora and Lexical Resources and 3. Processing Raw Text.

Humanities Data Analysis: Case Studies with Python. This book is also recommended as a whole. Required chapter readings are three subchapters of Part I, DATA ANALYSIS ESSENTIALS: the Introduction, Parsing and Manipulating Structured Data, and Exploring Texts using the Vector Space Model. See in particular What you should know (on variables, strings, loops, lists, dictionaries, conditional expressions such as if, elif, and else, and reading files) and the sections on data formats and text preprocessing.

Articles:

A Beginner’s Guide to Using Voyant for Digital Theme Analysis

The Rise of Health. A Collocation Analysis of Conceptual Changes in News Discourse, 1950-2010 (text mining the National Library of Norway)

Assignment

To understand what text mining is and can be, it is essential to see examples of what other researchers have done. The required readings, in-class demonstrations, and course exercises will contribute to this, and with this assignment all course participants will be able to further inspire each other with ideas for material that can be text mined, how to text mine, and with what goals in mind.

First you need to identify your corpus and what you wish to text mine your corpus for. What type or types of text mining will you need to use, and with what method(s)? What sort of preprocessing will you need to do of your dataset before you can begin the actual text mining?
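As a rough illustration of what preprocessing before "the actual text mining" can look like, here is a minimal sketch in plain Python. It is not taken from the course material: the stop word list, the sample sentence, and the function names `preprocess` and `word_frequencies` are all made up for this example, and a real project would likely need a fuller stop word list and more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical, very short stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters such as æ, ø, å.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def word_frequencies(text, n=5):
    """Return the n most frequent tokens after preprocessing."""
    return Counter(preprocess(text)).most_common(n)


# Made-up sample text standing in for a corpus.
sample = "The whale is a whale, and the sea is the sea of the whale."
print(word_frequencies(sample, n=2))  # [('whale', 3), ('sea', 2)]
```

Counting word frequencies is only one simple type of text mining. The same preprocessing step could feed other analyses, depending on the goals you define for your project.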

Then follows the actual text mining. Allow for trial and error, and expect to have to rethink and revise your approach and project until your text mining results in findings that lend themselves to your analysis.

Write a brief reflection paper on the entire process, including goals and challenges, and the direction you see your project going in at this point. Prepare to present your project, covering the entire process as well as your findings and reflections. Reflection papers are due at the time of presentations.

The presentations will take place towards the end of the semester. Be prepared to give each other feedback and ideas for future work on each project.

Deadlines

In-class course: Friday December 1

Individual guidance on the project for the assignment, including corpus selection, type and tool of text mining, and defined research goals: Individual half-hour time slots throughout Friday December 8 and Monday December 11.

Presentations: Friday December 15 or Monday December 18. We will decide on a day and time that works for most. If you are unable to make it to the group presentations, you can make a 5-10 minute video recording of your presentation and submit it to me by Friday December 15. Reflection papers are due at presentations.

Registration

The course is open to PhD fellows, completion grant holders, and post-doctoral fellows at the Faculty of Humanities, other UiO faculties, and external PhD fellows. Registration opens on September 20, and priority is given to PhDs and postdocs from the Faculty of Humanities. We ask that PhDs and postdocs from other faculties at UiO, and other applicants, wait until October 1 to register. Registration closes when the course has reached its maximum number of participants.

Sign up here

Course Convener

Contact person: Anne Sæbø, PhD at DSC/UB

Published Aug. 23, 2023 11:58 AM - Last modified Mar. 12, 2024 9:49 AM