
Honors 214: Interrogating Inequality: Text Analytics

Glossary

 

Corpus:  A collection of written texts.

Data mining: The application of statistical and computational methods to large data sets in order to unearth new information for research purposes.

Data visualization: Using graphics to display information in new and innovative ways.

"Distant reading":  Phrase coined by literary theorist Franco Moretti to describe the process of using computational methods to analyze massive quantities of text.  Often used as a counterpoint to "close reading."

Links:  Represents the collocation of terms in a corpus by depicting them as a network in a force-directed graph. In this graph, the frequency of a word is indicated by the relative size of the term.

Word cloud: A visualization of word frequencies. The more frequently a word appears in a given text, the larger its size in the visualization.
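To make the idea of collocation behind the Links entry more concrete, here is a minimal Python sketch that counts how often two words appear near each other. The function names, the window size, and the sample sentence are all made up for illustration; this is not how Voyant itself computes its graph.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def collocation_counts(text, window=2):
    """Count unordered pairs of different words that occur within
    `window` tokens of each other (window size is an assumption)."""
    tokens = tokenize(text)
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only at the next `window` tokens so each pair is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Hypothetical sample text, not drawn from any real corpus.
sample = "poverty and family; family and work; poverty and work"
pairs = collocation_counts(sample, window=2)
print(pairs.most_common(3))
```

In a network view, each pair would become a line between two terms, and pairs with higher counts would be drawn closer together or with heavier lines.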

 

 

Embedded Visualization Tools

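Many full-text databases now offer embedded tools for basic text analytics.  Two such databases, both used in the examples below, are the New York Times Historical database and the Archives of Sexuality and Gender.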

 

Example 1:  Word frequencies over time

Using the New York Times Historical database, search for the term "poverty," with no date limitations. Look for the term frequency chart on the left of the results page. 

What patterns do you notice?  What questions do you have?  How might you go about answering these questions?

Click on one of the decades to see term frequencies by year. What questions do you have?
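If you are curious what a term frequency chart is counting, the sketch below groups a handful of articles by decade and counts those that mention a term. The article list is invented sample data, and the database may well count matches differently (for example, by indexing fields rather than full text), so treat this as an approximation of the idea only.

```python
from collections import Counter

def matches_by_decade(articles, term):
    """Count articles that mention `term`, grouped by decade.

    `articles` is a list of (year, text) pairs. Matching is a simple
    case-insensitive substring test, which is an assumption made here.
    """
    term = term.lower()
    counts = Counter()
    for year, text in articles:
        if term in text.lower():
            decade = (year // 10) * 10  # e.g. 1964 -> 1960
            counts[decade] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample data for illustration only.
articles = [
    (1933, "Relief programs aim to reduce poverty in cities."),
    (1938, "New bridge opens to traffic."),
    (1964, "President declares a war on poverty."),
    (1965, "Poverty programs expand."),
]
print(matches_by_decade(articles, "poverty"))
```

Notice that a raw count like this does not account for how many articles were published in each decade, which is one of the questions worth asking about any frequency chart.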

Example 2:  Word clusters over time

Using the New York Times Historical database, search for the terms:  Islam* and terror*.  (The asterisk tells the database to search for all forms of the words.)  Are the results what you expect?  Why or why not?  What other terms might you want to search?
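The asterisk search can be approximated with a regular expression. The sketch below is only an illustration of the general idea of truncation; each database implements wildcards in its own way, and the function names and sample sentence are made up.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a search term with an optional trailing asterisk into a
    case-insensitive, whole-word regular expression."""
    if pattern.endswith("*"):
        body = re.escape(pattern[:-1]) + r"\w*"  # stem plus any word characters
    else:
        body = re.escape(pattern)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def find_forms(pattern, text):
    """Return the distinct word forms in `text` that match the pattern."""
    regex = wildcard_to_regex(pattern)
    return sorted({m.group(0).lower() for m in regex.finditer(text)})

# Hypothetical sample sentence for illustration only.
text = "Terror, terrorism, and terrorist appear; so do Islam, Islamic, and Islamist."
print(find_forms("terror*", text))
print(find_forms("Islam*", text))
```

Because truncation matches every word that begins with the stem, it can also pull in words you did not intend, which is worth keeping in mind when you evaluate the results.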

Example 3:  Word clusters in an archival corpus

Using the Archives of Sexuality and Gender, search for the term "discrimination."  On the left navigation bar, look for "Analyze Results," and then select "Term clusters."

What does the visualization wheel tell you?  What can't it tell you?  What other term searches might you want to do?

Voyant

Voyant Tools is a powerful, free, web-based tool for large-scale text analysis and "distant reading." Voyant is an easy entry point into text analysis because it does not require advanced technical skills.  To begin working with Voyant, first gather the digital text(s) you want to analyze.

Voyant provides excellent online documentation and tutorials.

To practice, we will be using the full text of The Moynihan Report, which can be found here.

1.  Copy the full text of the report and paste it into Voyant Tools.

2.  Refine and apply your stopword list.

3.  Experiment with the various visualization options until you find one that seems to offer the best insight into the text or that raises new avenues of inquiry.
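To see why refining the stopword list in step 2 matters, here is a small Python sketch of counting word frequencies with and without extra stopwords. The short stopword set and the sample sentence are assumptions made for illustration; they are not Voyant's built-in list, and the sentence is not a quotation from the report.

```python
import re
from collections import Counter

# A deliberately tiny stopword set, assumed here for illustration.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "that"}

def top_terms(text, stopwords=STOPWORDS, extra_stopwords=(), n=5):
    """Return the `n` most frequent words after removing stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    skip = set(stopwords) | set(extra_stopwords)
    return Counter(w for w in words if w not in skip).most_common(n)

# Made-up sample sentence, not taken from any real document.
sample = "The family is the unit of the community, and the family shapes the community."
print(top_terms(sample))
# Adding a dominant term to the stopword list lets other words surface.
print(top_terms(sample, extra_stopwords={"family"}))
```

Adding a word to the stopword list is an interpretive choice: removing a term that dominates the text can reveal other patterns, but it also hides something the text says often.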

Now, imagine that you'd like to compare this visualization with one based on the editorial responses to the report that appeared in the African American press.

Step 1:  Build your corpus.

  • Select the articles you want to include in your corpus.
  • You will need to OCR the PDFs of the articles.  You can do this in several ways: 
    • Print the PDF, then use the library's scanner to create "searchable PDFs."
    • Import the PDF into Google Docs and export it from there as HTML, RTF, or another format that Voyant can read.
    • Save the PDF to the desktop or to Dropbox and use an online OCR generator (these tend to be less reliable).
  • Clean up the text.
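Some of the cleanup can be done by hand, but for longer texts a short script may help. The Python sketch below shows a few common repairs for OCR output; the function name and the specific rules are assumptions, and you should check the results against the original, since OCR errors vary from scan to scan.

```python
import re

def clean_ocr_text(raw):
    """Apply a few common repairs to OCR output (illustrative rules only)."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain only digits (assumed to be page numbers).
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Turn single line breaks inside a paragraph into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse extra blank lines and repeated spaces.
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Made-up sample of messy OCR output.
raw = "The report drew edi-\ntorial responses\nfrom many papers.\n\n12\n\nA second   paragraph."
print(clean_ocr_text(raw))
```

Two cautions: the hyphen rule will also merge genuine compounds that happen to break at the end of a line, and the page-number rule will remove any line that consists only of digits, such as a year standing alone.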

Step 2:  Copy and paste the cleaned-up text into Voyant and run the program.

Step 3:  Experiment with the various visualization options until you find one that seems to offer the best insight into the text.
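When you compare two texts of different lengths, raw counts can mislead, so it often helps to compare relative frequencies instead. The Python sketch below shows one simple way to do that; the function names and the two tiny sample texts are invented for illustration and stand in for the report and your corpus of editorials.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def compare(text_a, text_b, terms):
    """For each term, return its relative frequency in both texts."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    return {t: (freq_a.get(t, 0.0), freq_b.get(t, 0.0)) for t in terms}

# Hypothetical stand-ins for the two texts being compared.
report = "family family work poverty"
editorials = "poverty poverty rights work"
print(compare(report, editorials, ["family", "poverty"]))
```

A term that is prominent in one text and rare or absent in the other can be a good starting point for a closer reading of both.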