April 17, 2026 2:00 pm - 4:00 pm ET
Data and Software, Research Methodology, Semester Workshops
Online Event - Login credentials via email for registered participants

Instructor: Venkat Dasari

Text analysis is a powerful approach to understanding large collections of documents, from political speeches and news articles to social media posts and historical records. This workshop introduces participants to computational text analysis using R and the quanteda package. Through hands-on examples and real data, participants will learn how to transform raw text into structured data ready for quantitative analysis.

We'll begin with practical text preprocessing techniques: tokenization, lowercasing, removing punctuation and stopwords, lemmatization, and stemming. Participants will then learn to construct document-feature matrices (DFMs), a key data structure in computational text analysis, and explore methods for trimming sparse features from them. The workshop culminates in applications such as computing term frequencies, identifying top features across documents, and interpreting results for substantive insights.
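
As a rough preview of the preprocessing steps listed above, here is a minimal sketch using quanteda. The two toy documents are invented for illustration and are not workshop data; the actual workshop materials may order or configure these steps differently. Lemmatization is left out because it usually relies on an extra dictionary or package.

```r
library(quanteda)

# Hypothetical toy documents, invented for illustration only
texts <- c(doc1 = "The senators debated the new budget.",
           doc2 = "Budget debates continued in the Senate!")

corp <- corpus(texts)                         # wrap the raw text in a corpus object
toks <- tokens(corp, remove_punct = TRUE)     # tokenize and drop punctuation
toks <- tokens_tolower(toks)                  # lowercase every token
toks <- tokens_remove(toks, stopwords("en"))  # remove English stopwords
toks <- tokens_wordstem(toks)                 # reduce tokens to their stems

print(toks)
```

Each step returns a new tokens object, so the stages can be inspected one at a time while exploring a corpus.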

Learning Objectives:

By the end of this workshop, participants will be able to:

  • Load and manipulate text data in R using tidyverse tools
  • Create a corpus object and understand its structure
  • Tokenize text and apply preprocessing techniques (lowercasing, removing punctuation, stopwords, stemming, lemmatization)
  • Build a document-feature matrix (DFM) and understand its structure
  • Trim and filter a DFM to manage sparsity and focus on meaningful features
  • Extract and interpret descriptive statistics (term frequencies, top features, document statistics)
  • Create reproducible workflows for text preprocessing and exploration
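
For a sense of what the DFM-related objectives look like in practice, the sketch below builds a document-feature matrix from three invented sentences, trims it, and pulls a few descriptive statistics. The texts and the trimming threshold are assumptions chosen for illustration, not values taken from the workshop.

```r
library(quanteda)

# Hypothetical toy documents, invented for illustration only
texts <- c(doc1 = "The senators debated the new budget.",
           doc2 = "Budget debates continued in the Senate!",
           doc3 = "Voters followed the budget debate closely.")

toks <- tokens(texts, remove_punct = TRUE)    # tokenize and drop punctuation
toks <- tokens_tolower(toks)                  # lowercase every token
toks <- tokens_remove(toks, stopwords("en"))  # remove English stopwords

dfmat <- dfm(toks)                            # rows are documents, columns are features

# Keep only features that occur in at least 2 documents (illustrative threshold)
dfmat_trimmed <- dfm_trim(dfmat, min_docfreq = 2)

topfeatures(dfmat, n = 5)                     # most frequent features overall
ntoken(dfmat)                                 # token count per document
c(before = nfeat(dfmat), after = nfeat(dfmat_trimmed))  # feature counts before and after trimming
```

Comparing the feature counts before and after trimming is a quick way to see how much of the vocabulary a given threshold removes.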

Prerequisites:

Prior experience with R is expected. Familiarity with basic R commands, data frames, and working with libraries will help you get the most out of the session. Some exposure to the tidyverse ecosystem (particularly dplyr) is helpful but not required. No prior experience with text analysis or natural language processing is necessary.