Data Science Student Aims to Improve Inclusion of African-American English


At the annual meeting of the Association for Computational Linguistics on July 17 in Melbourne, Australia, computer science Ph.D. student Su Lin Blodgett presented a paper on improving English-language parsing tools by analyzing tweets written in African-American English. The paper represents work she did with undergraduate Johnny Tian-Zhen Wei and her advisor, assistant professor Brendan O’Connor.

Current natural language processing (NLP) tools are trained largely on mainstream American English and, as a result, do not perform well on Twitter, where text deviates from this standard in many ways, including non-standard spelling, punctuation, capitalization, syntax and hashtags, the authors point out. They add that dialects such as African-American English, spoken by millions of individuals, contain language features not present in standard English.

Expanding the linguistic coverage of NLP tools to include minority and colloquial dialects would allow the thoughts and ideas of more individuals and groups to be included in areas such as opinion and sentiment analysis, Blodgett points out. For example, if a political campaign were to use a standard NLP tool to analyze opinions on Twitter, and that tool did not capture what African-Americans are saying, the campaign could miss a significant portion of the electorate and vastly misinterpret overall sentiment.

The paper concludes that the performance gap between NLP tools applied to standard English text and to African-American English text can be narrowed by using cross-domain strategies, but that there remains a trade-off between reducing the disparity and maintaining overall accuracy.