In this two-day course, participants will learn how to efficiently automate the collection of data from large numbers of websites and text files using R. Topics will include the practical and computational issues associated with scraping large amounts of data in a timely manner, as well as potential legal issues and how to address them. The course will also focus on giving participants the skills to work at a low level with HTML and text data once it has been collected, and will include mini-units on text processing in R and on scraping Twitter using R. By the end of the workshop, participants should possess the basic skills necessary to scrape a large amount of web or text data and extract useful information from it. No previous experience with web scraping is required, but participants are expected to be familiar with data management in R at the level covered in the Data Management in R short course. The two courses are designed to work in sequence; however, participants with a strong R programming background should be prepared to step directly into this course.
Matt Denny is a PhD student in Political Science and Social Data Analytics, and an NSF Big Data Social Science IGERT fellow at Penn State. He holds master's degrees in political science and resource economics from UMass-Amherst, where he was a statistical methods consultant for ISSR from 2013 to 2015. He has taught a number of workshops on topics ranging from social network theory to big data analytics, and his research focuses primarily on developing statistical models for text, networks, and text-valued networks. You can find more of his work at www.mjdenny.com.