Monthly Archives: November 2022

Praxis: An Experiment in Text Mining – A Small Corpus of Scathing Reviews

For my first foray into Voyant, I decided to take a look at a guilty pleasure of mine: the scathing book review. I know I’m not the only one who likes reading a sharp, uncompromising take on a book. Reviews along the lines of Lauren Oyler’s excoriation of Jia Tolentino’s debut essay collection go viral semi-regularly; indeed, when Oyler herself wrote a novel, some critics were ready to add to the genre. (“It’s a book I’d expect her to flambé, had she not written it.”)

I was curious if these reviews differed from less passionate (and less negative) reviews in any way that would show up in text mining.

Initial Process

To investigate, I assembled two corpora: a small set of scathing reviews, and a small set of mildly positive ones. The parameters:

  • Each corpus included 5 reviews.
  • I drew the reviews from Literary Hub’s Bookmarks site, which aggregates book reviews from notable publications. Bookmarks categorizes reviews as “pan,” “mixed,” “positive,” and “rave.” I drew the “scathing” corpus from reviews tagged as pans, and the control set from positive reviews.
  • I used the same 5 reviewers for both sets. (This was tricky, since some of the reviewers I initially chose tended to only review books that they either truly hated or — more rarely — unequivocally loved. Call this “the hater factor.”)
  • I only selected books I hadn’t read yet — unintentional at first, but I decided to stick with it once I noticed the pattern.
  • I tried to select only reviews of books that fall under the umbrella of “literary fiction” (though what that means, if anything, is a debate in itself). American Dirt is arguably not a great fit there, but it was reviewed so scathingly in so many places that I couldn’t bear to leave it out — maybe call this “the hater’s compromise.”
  • I specifically stayed away from memoir/essays/nonfiction, since those are so often critiqued on political, moral, or fact-based grounds, which I didn’t want to mix with literary reviews (but if this post has made you hungry for some takedowns along these lines, the Bookmarks Most Scathing Book Reviews of 2020 list contains memoirs by Michael Cohen, John Bolton, and Woody Allen — you’re welcome).

Here are my corpora:

The Scathing Corpus

Review Title | Review Author | Published In | Book | Book Author
I couldn’t live normally | Christian Lorentzen | London Review of Books | Beautiful World, Where Are You | Sally Rooney
Ad Nauseam | Rebecca Panokova | Harper’s | To Paradise | Hanya Yanagihara
Yesterday’s Mythologies | Ryan Ruby | New Left Review | Crossroads | Jonathan Franzen
American Dirt | Parul Sehgal | The New York Times | American Dirt | Jeanine Cummins
Pressure to Please | Lauren Oyler | London Review of Books | You Know You Want This | Kristen Roupenian

The Control/Positive Corpus

Review Title | Review Author | Published In | Book | Book Author
Can You Still Write a Novel About Love? | Christian Lorentzen | Vulture | The Answers | Catherine Lacey
Either/Or | Rebecca Panokova | Bookforum | Either/Or | Elif Batuman
When Nations Disappear, What Happens to Nationalities? | Ryan Ruby | New York Times Book Review | Scattered All Over the Earth | Yoko Tawada
Lauren Oyler’s ‘Fake Accounts’ Captures the Relentlessness of Online Life | Parul Sehgal | The New York Times | Fake Accounts | Lauren Oyler
Why are some people punks? | Lauren Oyler | London Review of Books | Detransition, Baby | Torrey Peters

Analyzing each corpus in Voyant

Voyant was pretty easy to get started with — but I quickly realized how much more I’ll need to know in order to really get something out of it.

I created my corpora by pasting the review URLs into two Voyant tabs. My first observation: an out-of-the-box word cloud or word frequency list, perhaps especially for a small corpus of short texts like mine, is going to contain a lot of words that tell you nothing about the content or sentiment.

I read Voyant’s guide to stopwords, then attempted to eliminate the most obvious unhelpful words, including:

  • Author names (Yanagihara, Franzen) and parts of book or review titles (Paradise, Detransition) – these would have been way less significant if I’d started with a large corpus
  • Words scraped from the host websites, rather than review text (email, share, review, account)
  • Words with no sentiment attached to them (it’s, book, read)

If I were doing this as part of a bigger project, I’d look for lists of stopwords, or do more investigation into tools (within or beyond Voyant) related to sentiment analysis, which would fit better with what I was hoping to learn.

Even with my stopword lists, I didn’t see much difference between the two corpora. However, when I compared the two corpora by adding the Comparison column to my Terms grid on one of them, I did start to see some small but interesting differences — more on that below.
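
If you’re curious what that Comparison column is computing under the hood, here is a minimal sketch of the same idea in plain Python (not Voyant itself): each term’s relative frequency in one corpus minus its relative frequency in the other. The file names and the tiny stopword set are placeholders.

```python
# A minimal sketch (plain Python, not Voyant) of the comparison idea: each
# term's relative frequency in one corpus minus its relative frequency in the
# other. The file names and the tiny stopword set are placeholders.
from collections import Counter
import re

def relative_freqs(paths, stopwords):
    counts, total = Counter(), 0
    for path in paths:
        text = open(path, encoding="utf-8").read().lower()
        for token in re.findall(r"[a-z']+", text):
            if token not in stopwords:
                counts[token] += 1
                total += 1
    return {term: n / total for term, n in counts.items()}

stopwords = {"the", "a", "and", "of", "book", "read"}  # plus author names, site chrome, etc.
scathing = relative_freqs(["scathing1.txt", "scathing2.txt"], stopwords)
positive = relative_freqs(["positive1.txt", "positive2.txt"], stopwords)

# Terms that are relatively more frequent in the scathing corpus come out on top.
diffs = {term: freq - positive.get(term, 0.0) for term, freq in scathing.items()}
for term, diff in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{term}\t{diff:+.4f}")
```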

What I learned

Frankly, not much about these reviews — I’ll get to that in a second — but a little more about Voyant and the amount of effort it takes to learn something significant from text mining.

Some things I would do differently next time:

  • Add much more data to each corpus. I knew five article-length items wasn’t much text, but I hadn’t fully realized how little I’d be able to get from corpora this small.
  • Set aside more time to research Voyant’s options or tools, both before and after uploading my corpora.
  • Create stopword lists and save them separately. Who would have guessed that my stopwords would vanish on later visits to my corpus? Not me, until now.
  • Use what I’ve learned to refine my goals/questions, and pose questions within each corpus (rather than just comparing them). Even with this brief foray into Voyant, I was able to come up with more specific questions, like: Are scathing reviews more alike or unlike each other than other types of reviews? (i.e., could Tolstoy’s aphorism about happy and unhappy families be applied to positive and negative reviews?) I think having a better understanding of the scathing corpus on its own would help me come up with more thoughtful ways to compare it against other reviews.

Very minor takeaways about scathing book reviews

As I mentioned, I did gain some insight into the book reviews I chose to analyze. However, I want to point out that none of what I noticed is particularly novel. It was helpful to have read Maria Sachiko Cecire’s essay in the Data Sitters Club project, “The Truth About Digital Humanities Collaborations (and Textual Variants).” When Cecire’s colleagues are excited to “discover” an aspect of kid lit publishing that all scholars (and most observers) of the field already know, she points out that their discovery isn’t a discovery at all, and notes:

“To me, presenting these differences as a major finding seemed like we’d be recreating exactly the kind of blind spot that people have criticized digital humanities projects for: claiming something that’s already known in the field as exciting new knowledge just because it’s been found digitally.”

So here’s my caveat: nothing that I found was revelatory. Even as a non-expert in popular literary criticism, the insights I gained seemed like something I could have gotten from a close read just as easily as from a distant one.

The main thing I really found interesting — and potentially a thread to tug on — came from comparing word frequency in the two corpora. There weren’t any huge differences, but a few caught my eye.

  • The word “bad” appeared with a comparative frequency of +0.0060 in the “scathing” corpus, compared to the “positive” corpus. When I used the contexts tool to see how these uses occurred, I saw that all but one of the “scathing” uses of “bad” came from Christian Lorentzen and Lauren Oyler’s reviews. Neither reviewer used the word to refer to the quality of the text they’re reviewing (e.g., “bad writing”). Instead, “bad” was used as a way to paraphrase the authors’ moral framing of their characters’ actions. For Lorentzen, Sally Rooney’s Beautiful World, Where Are You includes “…a relentless keeping score… not only of who is a ‘normal’ person, but of who is a ‘good’ or ‘bad’ or ‘nice’ or ‘evil’ person.” Oyler uses the word to discuss Kristen Roupenian’s focus on stories that feature “bad sex” and “bad” endings. That literary reviewers pan something for being didactic, or moralistic, or too obvious is nothing surprising, but it’s interesting to see it lightly quantified. I’d be interested to see whether this trend would hold up with a larger corpus, as well as how it would change over time with literary trends toward or against “morals” in literature.
  • I found something similar, but weaker, with “power” — both Oyler and Ryan Ruby (panning a Jonathan Franzen novel) characterize their authors’ attempts to capture power dynamics as a bit of shallow didacticism, but the word doesn’t show up at all in the positive corpus.
  • Words like “personality” and “motivation” only show up in the negative critiques. It’s also unsurprising that a pan might focus more on how well or poorly an author deals with the mechanics of storytelling and character than a positive review, where it’s a given that the author knows how to handle those things.
  • Even “character” shows up more often in the scathing corpus, which surprised me a bit, since it’s a presumably neutral and useful way to discuss the principals in a novel. To add further (weak!) evidence to the idea that mechanics are less important in a positive review: even when “character” does turn up in an otherwise positive review, it is more likely to be mentioned in a critical context. For example, in Oyler’s overall positive review of Torrey Peters’ Detransition, Baby, she notes that “the situation Peters has constructed depends on the characters having a lot of tedious conversations, carefully explaining their perspectives to one another.” As with everything else, I would want to see if this still held in a larger corpus before making much of it. It’s very possible that the reviewers I chose are most concerned with what they find dull. From my other reading of her work, that certainly seems true of Oyler.

To test these observations further, I’d need to build a bigger corpus, and maybe even a more diverse set of corpora. For example, if I were to scrape scathing reviews from Goodreads — which is much more democratic than, say, the London Review of Books — what would I find? My initial suspicion is that Goodreads reviewers have somewhat different concerns than professional reviewers. A glance at the one-star reviews of Beautiful World, Where Are You seems to bear this out, though it’s interesting to note that discussion of character and other mechanics shows up there, too.

This would be a fun project to explore further, and I could see doing it with professional vs. popular reviews on other subjects, like restaurant reviews. Fellow review lovers (or maybe “review-loving haters”?), let me know if you’re interested in poking into this a bit more.

Using Voyant for Insights into Frank Ocean’s Albums Blond/e and Channel Orange

For this praxis assignment, I struggled at first with deciding what texts I wanted to explore. I wanted to see which corpora were publicly available, in hopes of using an existing corpus instead of building my own. However, I ended up making my own corpora after realizing that I couldn’t force myself to be interested in the existing free-to-the-public corpora I had seen. Though I’m sure there must be something out there that speaks to my interests, I had a tough time effectively searching for resources on my own as a novice to text mining. Despite the fact that building my own corpus would require more work, it was a fun exercise that made the experience more rewarding.

I used Voyant in hopes of gaining some insights into themes within Frank Ocean’s Blond/e and Channel Orange albums. These albums are personally some of my favorites of all time. I was interested in seeing what insights Voyant could provide into the “bigger picture” between (and among) the lyrics on each album, especially in comparison to my subjective experience of listening to these albums.

My first instinct was to figure out how I would retrieve the lyrics for Frank Ocean’s songs. I figured I could use an API to pull song lyric information. I was planning to use Genius, but their API focuses more on annotations than on the lyrics themselves. So instead, I manually copied and pasted the lyrics to all his songs on both albums into .txt files within separate folders representing the albums. Initially, I was planning to pass each album to Voyant separately. I wanted to explore how Frank Ocean’s lyrics changed from Channel Orange (2012) to Blond/e (2016), especially with regard to his queerness (which I’ve personally felt is explored more fully in his later music on Blond/e, but is definitely subtly present within Channel Orange). But after reflecting more on the insights I was hoping to gain, I decided to pass in all the songs from both albums combined, in order to see what themes might overlap in the context of one another. Additionally, I passed in the two halves of Blond/e separately. The album is actually split into two parts: it is exactly one hour long, with the transition in “Nights” occurring at the 30-minute mark, directly splitting it in half. Hence there are two spellings of the album; the cover art says “blond” while the album title is listed as Blonde. Duality is a major theme explored in Blond/e, with regard to the album branding, the song lyrics, and the musical composition. I think the duality themes present could be interpreted in reference to Frank Ocean’s bisexuality!
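
For what it’s worth, here is a rough sketch of how that folder-of-.txt-files prep can be stitched together in Python so that each album (and the combined set) becomes a single document to upload to Voyant. The folder and output file names are hypothetical stand-ins for my actual setup.

```python
# A minimal sketch, assuming the folder layout described above: one folder per
# album, one .txt lyrics file per song. Folder and output file names are
# hypothetical stand-ins for the actual setup.
from pathlib import Path

def combine_album(folder):
    """Concatenate every song's lyrics file in a folder into one corpus string."""
    return "\n".join(txt.read_text(encoding="utf-8")
                     for txt in sorted(Path(folder).glob("*.txt")))

channel_orange = combine_album("channel_orange")
blonde = combine_album("blonde")

# Write out single-document corpora that can each be uploaded to Voyant.
Path("channel_orange_all.txt").write_text(channel_orange, encoding="utf-8")
Path("blonde_all.txt").write_text(blonde, encoding="utf-8")
Path("both_albums.txt").write_text(channel_orange + "\n" + blonde, encoding="utf-8")
```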

After passing 5 different corpora into Voyant, the resulting cirrus and link visualizations follow.

Channel Orange

  • Some notes:
    • With regards to love, it seems as though Frank Ocean is looking to make something real happen
    • In the cirrus, I come to understand Channel Orange to be about thinking and looking for real love, and being lost in the process.

Blond/e

  • Some notes:
    • Blond/e appears to be about being on his own, and learning to navigate that
    • As a whole, Blond/e could be interpreted as a farewell to a past fond lover, and trying to make it through the days (and nights)

Blond/e Part I

  • A note:
    • In the first part of Blond/e, there tend to be more words regarding struggle, such as solo, night, hell, leave, etc., as well as references to marijuana, likely as a way to cope with heartbreak.

Blond/e Part II

  • Some notes:
    • In the second part of Blond/e, there are more references to day (which ties back to my original statement about the two album parts representing duality)
    • In the overall word links, there are also links to Frank being “brave” and thinking about “god” – knowing the album myself, I interpret the references to god as being about learning to let go (hence his song “Godspeed”)

Channel Orange + Blond/e

  • Some notes:
    • Overall when combining the two albums, there seems to be prevalent references to love, god, and night/day.

Overall, playing around with Voyant was a fun experience. I hope to explore more, especially with regard to music analysis. I’m wondering if there are similar analysis tools that can mine both text AND audio at bigger scales (though I know audio files are more difficult, potentially due to data constraints). I wish I had more time to analyze the visualizations, and to dig deeper into formulating some insights that align (or contradict!) with my own personal close listening.

Quick Dive into Dalloway

Comparing Close and Distant reading outcomes of Virginia Woolf’s Mrs. Dalloway.

For the text analysis praxis, I chose to test out an observation made while close reading Virginia Woolf’s Mrs. Dalloway. For another DH class, I’ve been working on a group project investigating annotation methods using Mrs. Dalloway as the basis. Despite Woolf suggesting that she simply let the novel flow through her with no great structure in mind, when reading it’s hard not to feel the seemingly concrete structure on top of which the meandering stream-of-consciousness dance flows. Of course, that stream of consciousness surfaces themes again and again via the minds of various characters until it becomes plain that Woolf wants us to explore certain aspects of our humanity—youth and age, spontaneity and duty among them.

When close reading, I felt and remarked that the color grey turned up repeatedly. At first I suspected that the London setting and widely accepted stereotypical assumptions about English weather made me more attuned to it (ahh yes, confirmation of that stereotype, I’m picturing the right thing)—maybe even made me overestimate its prevalence. But then I began to notice it being applied to all manner of things—most often in a manner denoting stature, wisdom, age, and respectability. I also began to notice the mention of roses—a kind of sprinkling evenly throughout the text in relationship to various characters. These flowers were derided, gifted, and displayed—even used by Septimus, a character with a tenuous connection to this Earth, to ground himself in the moment and counteract his tendency to lose himself to incoherent thought. I decided that Voyant would be a great way to see what the algorithms had to say about my observations. Would they be reinforced or minimized—would Voyant help suss out a pattern I hadn’t observed?

Looking at the frequency of grey (38 mentions), which appears more than any other color in the text after white (58 mentions), I did see continued usage throughout the text, and the contexts revealed what I had suspected. It is often used to denote age (grey hair)—and to suggest a more regular and fixed time of life. It is also used to signal a more refined air—a standing in the world. In relation to the text, I’d even go so far as to say it showcased the kind of fixed and respectable striving of 1923 London. Weather does come in, for sure (grey-blue sky), but more often it describes the appearance of understated and refined clothing, vehicles, and homes of established ladies and gentlemen.

Indeed it was—Sir William Bradshaw’s motor car; low, powerful, grey with plain initials interlocked on the panel, as if the pomps of heraldry were incongruous, this man being the ghostly helper, the priest of science; and, as the motor car was grey, so to match its sober suavity, grey furs, silver grey rugs were heaped in it, to keep her ladyship warm while she waited.

He had worked very hard; he had won his position by sheer ability (being the son of a shopkeeper); loved his profession; made a fine figurehead at ceremonies and spoke well—all of which had by the time he was knighted given him a heavy look, a weary look (the stream of patients being so incessant, the responsibilities and privileges of his profession so onerous), which weariness, together with his grey hairs, increased the extraordinary distinction of his presence and gave him the reputation (of the utmost importance in dealing with nerve cases) not merely of lightning skill, and almost infallible accuracy in diagnosis but of sympathy; tact; understanding of the human soul. 

In considering roses, I wondered if it might not be more congruous, given my investigation of the color grey, to shift to investigating the color red. The roses mentioned throughout the book are red—that classic color of love, emotion, youth and intensity. When I made this adjustment something very striking was revealed. The frequency of the mention of red mirrored that of grey—almost as though they went hand in hand. Instead of black and white, Woolf seemed to have leaned into a contrast of grey and red—old vs young, passion vs resignation. 

Red and Grey going hand in hand

The striking overlap is even more apparent when contrasted with the color most mentioned in the book—white. There is a clear correlation between grey and red.

Red and Grey— best buddies
Peter and Richard chase and try to harness the power of red

Investigating red in the TermsBerry doesn’t give too much away, but digging into the contexts gives everything away. It’s often mentioned in descriptions with an almost riotous abundance of color (in stark contrast to drab and monotonous grey) and often paired with descriptions of flowers and bodily features (again hair and clothing, but also nostrils, lips, and cheeks). Red seems to be used to mark the characters’ experience in the present moment of the novel, in contrast to the experience of reminiscence that makes up much of it.

and it was the moment between six and seven when every flower—roses, carnations, irises, lilac—glows; white, violet, red, deep orange; every flower seems to burn by itself, softly, purely in the misty beds; and how she loved the grey-white moths spinning in and out, over the cherry pie, over the evening primroses!

But she’s not married; she’s young; quite young, thought Peter, the red carnation he had seen her wear as she came across Trafalgar Square burning again in his eyes and making her lips red.

…through all the commodities of the world, perishable and permanent, hams, drugs, flowers, stationery, variously smelling, now sweet, now sour she lurched; saw herself thus lurching with her hat askew, very red in the face, full length in a looking-glass; and at last came out into the street.

And that very handsome, very self-possessed young woman was Elizabeth, over there, by the curtains, in red.

Overall, this investigation made me curious about mapping and investigating the many, many overlapping patterns and structures in the novel. In fact, this exercise and the one in my other class have pushed me into a peculiar space of looking at the novel as a specimen to be poked, prodded, labeled, graphed, and displayed. In this way I might possess and express my own experience of it. The idea of thinking one could “master” a novel in this way feels a bit like a delusion. It’s like trying to create a bot of Woolf’s thought process, but when you press go she doesn’t pass the Turing test. I know that there is value to this work—and as I become more comfortable, and perhaps apply these tools to non-fiction work as well, I can better manage the dissonance that flutters about this exercise.

BONUS:

If you haven’t read Mrs. Dalloway, setting the Voyant TermsBerry to the smallest word sample gives a pretty good summary:

Mrs. Dalloway TLDR

Voyant and Fortune-telling Poetry

What is Omikuji?

If you visit shrines and temples in Asia, you may often see people praying for good wishes and taking Omikuji (fortunes written on paper strips) from boxes or even coin-slot machines. The Omikuji predicts your fortunes in health, work, study, marriage, etc. There are many kinds of words written on Omikuji to describe fortunes, and I am interested in the method of using classical Japanese poetry (waka) as divination.

Figure 1

I decided to run some fortune-telling poems through Voyant to see the results. The Omikuji strips are usually rolled up and folded; you will need to unroll them to see the result. Before you read the fortune-telling poems, you will see a general indicator that tells you whether you are lucky today. Among many categorization methods, the examples I am using are divided as follows:

Figure 2
  • Dai-kichi大吉 (excellent luck)
  • Kichi吉 (good luck)
  • Chu-kichi中吉 (moderate luck)
  • Sho-kichi小吉 (a little bit of luck)
  • Sue-kichi末吉 (bad luck; good luck yet to come)

Failure?

I retrieved the data from the Omikuji-joshidosya website. It is said that 70% of current Omikuji strips in Japan are made by the Nishoyamada Shrine, where the organization Joshidosha (Women’s Road Company) is located.

Figure 3

My first attempt was a total disappointment. See Figure 4:

Figure 4

The high-frequency words that appeared in Cirrus, TermsBerry, and Trends are single hiragana characters instead of object names and verbs. These words are similar to determiners and prepositions (stopwords in Voyant) in English (the, a, in, to, from, etc.). I then also realized that stopwords are not the only problem in analyzing Japanese text. Text segmentation is also different in Japanese: this issue is already super complicated in modern Japanese, not to mention that the poems in my mini-project are written in classical Japanese. So I referred to the article “Basic Python for Japanese Studies: Using fugashi for Text Segmentation” to see if I could reframe the textual structure of my text for Voyant. For example, I could clean my text before uploading it to Voyant by removing auxiliary verbs, particles, suffixes, prefixes, etc. I also learned about a more manageable solution involving a Japanese version of stopwords from the Japanese DH scholar Nagasaki Kiyonori in his post.
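
As a rough illustration of that cleaning step, here is a minimal Python sketch using fugashi to segment the text and drop particles, auxiliary verbs, prefixes, and suffixes before uploading to Voyant. It assumes fugashi is installed with a UniDic dictionary, and since UniDic targets modern Japanese, the segmentation of classical poems would only be approximate.

```python
# A minimal sketch of that cleaning step, assuming fugashi is installed with a
# UniDic dictionary (e.g., unidic-lite). UniDic targets modern Japanese, so the
# segmentation of classical poems is only approximate.
import fugashi

tagger = fugashi.Tagger()

# Parts of speech to drop: particles, auxiliary verbs, prefixes, suffixes.
DROP_POS = {"助詞", "助動詞", "接頭辞", "接尾辞"}

def clean_for_voyant(text):
    kept = [word.surface for word in tagger(text)
            if word.feature.pos1 not in DROP_POS]
    # Space-separated tokens give Voyant explicit word boundaries to count.
    return " ".join(kept)

print(clean_for_voyant("朝日かげたゞさす庭の松が枝に千代よぶ鶴のこえののどけさ"))
```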

Inspired by Nagasaki Kiyonori, I started to create a stopword list by myself (Figure 5). The default stopword list in Voyant’s Japanese mode is based on modern Japanese. See some examples here:

あそこ あの あのかた あの人 あります あれ います え おります から

何 彼 彼女 我々 私 私達 貴方 貴方方

Unlike in modern Japanese or the Japanese of the Meiji period (1868–1912), auxiliary verbs and particles are used in an almost completely different system in classical Japanese. See some examples from my stopword list here:

が て して で つつ つ ぬ たり り き けり む

Figure 5

But I am glad I chose poetry for this Voyant experiment, because waka poetry has a relatively easy text segmentation method: a poem always breaks into phrases of 5/7/5/7/7.

Example: 朝日かげ たゞさす庭の 松が枝に 千代よぶ鶴の こえののどけさ

Asahikage (5)   tadasasuniwano (7)     matsugaeni (5)   chiyoyobutsuruno (7)     koenonodokesa (7)

Research Questions and Result

Okay, now we have a feasible approach! The next question is about the purpose of this analysis. Should I do a full-text analysis, or should I do several studies with questions that could be asked about those poems? For example, what seasons and figurative language are chosen for good luck and bad luck respectively?

I decided to do a comparison of imagery/actions used in the excellent luck group and the bad luck group. See the number of poems in each group:

  • Dai-kichi大吉 (excellent luck) 17
  • Kichi吉 (good luck) 6
  • Chu-kichi中吉 (moderate luck) 7
  • Sho-kichi小吉 (a little bit of luck) 9
  • Sue-kichi末吉 (bad luck; good luck yet to come) 11

The result of Dai-kichi大吉 (excellent luck)

Keywords: Spring, breeze, sakura, shadow, rain, sunny, garden, peacefully

Figure 6

The result of Sue-kichi 末吉 (bad luck)

Keywords: autumn, quiet, see, moon, shadow, flower, scatter, reality, top of a tree

Figure 7

The keywords above already show a sharp contrast between what the creators consider good luck and bad luck. I am very satisfied with the result, even though I know there is a lot to be improved. I also went on to try TermsBerry and Trends in Voyant and realized that I can do further studies using these features. For example, although the keywords “flower” and “shadow” appear in both groups, what associations do they have that make them different in the good luck and bad luck groups? The example in Figure 8 shows a clear association between flower, sakura (in both hiragana and kanji characters), and peach flower:

Figure 8

Conclusion

The Getting Started with Text Mining guide was very helpful. I started my mini-project without big data but with the idea that I needed to prepare my data (cleaning and removing). If I want to use Voyant to do deeper and larger-scale analysis of poems and classical Japanese texts, it will definitely require a huge amount of preparation work. For example, I believe that if I expand the stopword list to account for conjugations, the results will probably be more accurate. I think this tool is great for learning about intertextuality and imagery in poetry writing.

There are also Sinitic poetry (kanshi 漢詩) fortune-telling Omikuji! Oh, that would be a totally different linguistic structure, but it is worth a try next time.

PRAXIS_a dispatch from the mines of my text

Throughout this week’s readings, I noticed the separation between those who analyze and those who provide the “content” that is analyzed, i.e., the separation between the distant readers/researchers and the authors.

Shall these two never meet?

As a person who writes, I began to wonder whether distant reading my own novel draft might yield some productive insight.

While writing, I often find myself in forest-for-the-trees situations, meaning I am deeply in the mud of the moment of my creation (the frog’s perspective) and feel like I am losing my grasp on the story’s overall arc (the bird’s perspective).

To stay connected to the bird’s eye view, authors who work on longer creative projects (and I suppose longer academic projects, too) will often have either a pinboard with index cards or a writing program like Scrivener with features showing the spine of a story digitally.

However, both of these approaches (index card and/or writing software) are still tied to chronology. And one of the intriguing aspects of distant reading is its promise of simultaneity, of translating a time-based piece into a single image. (Or if not entirely ditching chronology, distant reading at least speeds things up.)

How much of a literary work’s overall concept trickles down/is visible in its more fundamental building blocks (words and sentences)? This is a point of curiosity, a question I had not considered before familiarizing myself with distant reading.

I decided to use Voyant and Word Tree to learn a bit more about my own text-in-progress and see how its micro and macro aspects inform each other.

I uploaded the first chapter of a novel I am working on to Voyant and tried to see whether anything interesting would emerge in the “reveal”. I did not have any specific questions, only many vague curiosities. My belief (based on various experiments conducted for Praxis Assignments throughout the semester) remains that having good initial questions is necessary for these tools to serve us well. Hopefully, the questions get refined in the process of working with the tools, but an initial curiosity is productive and propulsive.

Here is what I learned about my text:

Most insightful was the Mandala feature in Voyant:
It centered the chapter title “The Idea” and showed a list of salient/defining terms within the chapter. The resulting diagram gave me a snapshot of the chapter as a network. It was satisfying to see the story’s main ingredients, almost as if someone had reverse-engineered the text and created the beginnings of a recipe.

Via the Cirrus feature, I learned that I had certainly established my protagonists/main players in the first chapter. Via Trends I saw that the arcs of the main players intersected in ways that confirmed my intentions and intuitions.

So far, I had received mostly confirmations.

More interesting insight came from looking at the “second tier” of usages, the second largest words in the cloud. I noticed that the program treats the possessive of a noun as a distinct entity. I.e., “Paul” and “Paul’s” are different entities as far as Voyant is concerned. Considering both forms together influences (and in this case amplifies) the overall presence of Paul — which aligns with my intentions but was more difficult to see. The strong presence of “Paul’s” also says something about Paul that I hadn’t explicitly considered: He owns more than others. (More characteristics or more goods? Tbd.)
I can see fruitful research questions emerging around the use of the possessive form in my text and the texts of others.
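
To make the possessive point concrete, here is a small sketch of how the counts differ when “Paul” and “Paul’s” are counted separately (as Voyant does) versus folded together. The file name is hypothetical and the tokenization deliberately crude.

```python
# A small sketch of the difference between counting "Paul" and "Paul's"
# separately (as Voyant does) and folding the possessive into the base name.
# chapter.txt is a hypothetical file name, and the tokenization is crude.
import re
from collections import Counter

text = open("chapter.txt", encoding="utf-8").read()
tokens = re.findall(r"[A-Za-z'’]+", text)

separate = Counter(tokens)
# Strip a trailing 's (straight or curly apostrophe) so possessives merge with the name.
merged = Counter(re.sub(r"['’]s$|['’]$", "", t) for t in tokens)

print(separate["Paul"], separate["Paul's"])  # two entities, as in Voyant
print(merged["Paul"])                        # combined presence of Paul + Paul's
```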

Another aspect that surprised me was the frequent presence of the word “like”. I would not have anticipated this. Here, Word Tree provided an opportunity to look at the context of these “likes” in more detail.

Based on the specific usages, which I could easily surveil via the word tree above, I might consider stylistic changes. Or perhaps I might notice that simile carries outsized responsibility in my text. The frequency of “like” might point to a theme I could make more explicit in revision. (I am thinking about this.)

In summary:

I can see Voyant being especially helpful in the later stages of the revision process for a novel & when evaluating and implementing editorial feedback. Even when using tools like Voyant on your own writing, the insights distant reading affords are most useful when paired with close reading. The data visualization can be an impetus for returning to specific sections and close-reading those. (See also Richard Jean So and Edwin Roland’s text “Race and Distant Reading,” which details a constructive relationship between close and distant reading by looking more closely at Baldwin’s novel Giovanni’s Room.)

And one sidebar Q:

I am wondering how Voyant’s Cirrus chooses colors. My text is very much about gender, and I noticed that nouns I had designated male kept coming up as blue and green, female coded nouns as pink. Hmmm. Coincidence? This observation made me want to try the software with a text in a very gendered language (like German).

Praxis Assignment: The Decline and Fall of my nascent Text Analysis abilities? I hope not!

For my text analysis assignment, I decided to use Voyant to look at one of the English language’s great historical works: The History of the Decline and Fall of the Roman Empire by Edward Gibbon, published in six volumes (further divided into 71 chapters) over thirteen years from 1776 to 1789. Gibbon’s magisterial study spans a period of over 1,300 years, examining the Roman-Mediterranean world from the height of the classical empire to the fall of Constantinople, capital of the eastern Roman empire (which western authors have anachronistically called the ‘Byzantine’ empire since early modernity), to the Ottoman armies of Mehmet II in 1453. Gibbon’s scholarly rigor, sense of historical continuity, and dispassionate, meticulous examination of original, extant sources contributed to the development of the historical method in western scholarship. Nevertheless, some of Gibbon’s conclusions in the Decline and Fall are also a product of the eighteenth century in which the author lived and wrote, and Gibbon’s writing is occasionally punctuated with moralizing statements (briefly touched on below).

The majority of text analysis experiments seem to focus on works of fiction, paying particular attention to their stylistic and aesthetic dimensions. I asked myself: what about an historical work, which is also a narrative construction? Would running Decline and Fall through Voyant allow a reader to observe trends in the stylistic or moralizing dimensions of Gibbon’s grand historical narrative, beyond what might already be grasped by an ordinary reading of the text? (Disclosure: I have by no means read all six volumes of Decline and Fall in their entirety.)

I used a plaintext file of Decline and Fall from Project Gutenberg that features an 1836–45 revised American edition containing the text of all six volumes. Prior to uploading this file to Voyant, I removed the hundreds of instances of the word ‘return’ in parentheses (which in the HTML version of the file link to the paragraph location in the text), in addition to the preface and the legal note at the end of the work authored by Project Gutenberg. After uploading the file, I added additional stopwords to Voyant’s auto-detected list. The terms I removed relate to chapter headings (e.g., Roman numerals), citations (‘tom’ for tome, ‘orat’ for oration, ‘epist’ for epistle, ‘hist’ for historia, ‘edit’ for edition, and so on), and occasional Latinate (‘ad’, ‘et’) and French (‘sur’, ‘des’) words. To this end, the word cloud tool was helpful for identifying terms that should be added to the stopword list.
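
For anyone who wants to replicate that pre-cleaning, here is a minimal sketch of the idea in Python. The file name, the literal ‘(return)’ pattern, and the Project Gutenberg start/end markers are assumptions that would need checking against the actual file.

```python
# A minimal sketch of the pre-cleaning described above. The file name, the
# literal "(return)" pattern, and the Project Gutenberg start/end markers are
# assumptions that would need checking against the actual file.
import re

raw = open("decline_and_fall.txt", encoding="utf-8").read()

# Drop the "(return)" paragraph-navigation links.
cleaned = re.sub(r"\(return\)", "", raw, flags=re.IGNORECASE)

# Keep only the body between the Gutenberg start/end markers, which removes
# the preface and the legal note at the end.
start = cleaned.find("*** START OF")
end = cleaned.find("*** END OF")
if start != -1 and end != -1:
    cleaned = cleaned[cleaned.find("\n", start) + 1 : end]

open("decline_and_fall_clean.txt", "w", encoding="utf-8").write(cleaned)
```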

The resulting word cloud was, to say the least, neither surprising nor particularly revealing nor useful, most of the terms referring to the work’s predominant subjects: 

Standing out as one of the only abstract terms visible in the cloud limited to the top 75 words, however, was “character,” with 828 occurrences. Navigating to the ‘Contexts’ tab, I generated a concordance displaying instances of ‘character,’ which revealed a plethora of specific adjectives used in Gibbon’s text, for example, “manly.” Running ‘manly’ through the ‘links’ generator reveals a network of terms that reflect classical Roman definitions of masculine virtue (‘virtue,’ ‘spirit,’ ‘resolution,’ ‘freedom’) and one usage related to physical appearance (‘countenance’):

These results are once again neither surprising nor interesting, since the ancient (male) writers informing Gibbon’s work were themselves concerned with writing about the meritorious qualities and/or vices of individual male leaders for moralizing, didactic purposes, be they emperors, generals or bishops. This calls to mind Michael Witmore’s observation that “what makes a text a text–its susceptibility to varying levels of address–is a feature of book culture and the flexibility of the textual imagination” (2012). These particular examples may demonstrate the influence of classical authors on Gibbon’s narrative, but they do not necessarily convey anything about what is original in Gibbon’s prose (or, differently stated, original to Gibbon’s contemporary setting).

Perhaps one could get closer to an analysis that better reflects Gibbon’s original, polemical thesis and writing style by first 1) identifying a comprehensive list of moralizing terms (including adjectives like ‘superstitious’ and its variants) harvested from the whole text, and then 2) analyzing the occurrences of those terms in the text, and 3) looking for trends in how those terms are employed throughout the text to describe different social, ethnic, religious or occupational groups. As an enlightenment scholar critical of organized religion, Gibbon maintained that the rise of Christianity led to the fall of the western empire by making its denizens less interested in the present life, including in things civic, commercial, and military, the latter of which would have obvious consequences for the defense of the empire against invasion (Gibbon’s thesis is not generally shared by scholars today). Would such an experiment reveal more exactingly where Gibbon’s moralizing emphases change, based on the chapter or volume of the text where such terms occur?
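
As a rough sketch of what steps 2) and 3) might look like, one could count occurrences of a hand-built list of moralizing terms chapter by chapter; both the term list and the chapter-splitting pattern below are hypothetical and would need checking against the actual text.

```python
# A rough sketch of steps 2) and 3): count occurrences of a hand-built list of
# moralizing terms chapter by chapter. The term list is hypothetical, and the
# chapter-splitting pattern assumes headings like "Chapter I", which would need
# checking against the actual Gutenberg text.
import re
from collections import Counter

moralizing = ["superstition", "superstitious", "fanaticism", "enthusiasm",
              "virtue", "vice", "corruption", "manly"]

text = open("decline_and_fall_clean.txt", encoding="utf-8").read().lower()
chapters = re.split(r"\bchapter\s+[ivxlc]+\b", text)[1:]

for number, chapter in enumerate(chapters, start=1):
    words = Counter(re.findall(r"[a-z]+", chapter))
    hits = {term: words[term] for term in moralizing if words[term]}
    print(f"Chapter {number}: {hits}")
```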

Praxis: Topic Modeling of Historical Newspaper

“What is Distant Reading?”, the title of a NY Times article by Kathryn Schulz, provides one of the simplest ways to understand the topic: “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.” One might wonder how to put distant reading to use. In this praxis assignment, I have used topic modeling, a distant reading approach, to analyze historical newspapers.

Newspapers that have survived the course of time are among the most valuable sources of information available to academics researching the civilizations and cultures of the past, especially in the context of historical research. Virtually all significant dialogues and disputes in society and culture throughout history were brought to light in newspapers. This is because, as early as the mid-nineteenth century, almost every town, regardless of its size, saw the establishment of at least one newspaper. Within a newspaper, every facet of the social and daily life of the people is covered, with articles on political debate, the promotion of goods and services, and so on. To this date, no other repository has been found that brings together scholarly editorials covering controversial political issues, marketing promotions for fashionable clothing, news on major sporting events, and poetry by a local poet all in one place. In a nutshell, for contemporary historians, newspapers record the entire spectrum of human experience better than any other source, giving them access to the past unlike any other medium.

However, despite their importance, newspapers have remained one of the most underused historical resources for a long time. Historians have found it a daunting, and sometimes impossible, task to study historical newspapers page by page for a specific research topic due to the enormous amount and range of material they provide. For instance, just to address one research topic, a historian might have to go through hundreds of thousands of newspaper articles. Ironically, after all this effort there is still no guarantee of finding the required information.

In this praxis, an attempt will be made to uncover the most interesting and potentially important topics from a period of time using topic modeling on the Papers Past database from DigitalNZ. As opposed to utilizing the knowledge of historians at the very beginning to find relevant topics by going through an abundance of newspapers, this praxis relies on a reverse approach. A computational approach will be used to cluster the data into topics, which will then be evaluated from a historian’s point of view to identify specific patterns in the dataset.

The Experiment

Dataset

Papers Past on DigitalNZ comprises the following four sections:

  1. Newspapers

This section contains digitized newspaper issues from the eighteenth, nineteenth, and twentieth centuries from the New Zealand and Pacific regions. Each newspaper has a page dedicated to providing details about the publication, such as the period for which it is available online. There is also an Explore all newspapers page where one can find the URLs of all the newspapers. Papers Past contains only a small sample of New Zealand’s total newspaper output during the period covered by the site, but it is more than sufficient for the intended term paper.

During 2015, the National Library of New Zealand incorporated into its collection a compilation of historical newspapers that were predominantly targeted at a Māori audience between 1842 and 1935. To carry out this task, the University of Waikato’s Computer Science Department used the digital Niupepa Archive, which was created and made accessible by the New Zealand Digital Library Project in 2000.

  2. Magazines and Journals
  3. Letters and Diaries
  4. Parliamentary Papers

Newspaper articles are the particular topic of interest for this praxis. More specifically, the praxis will build a topic model with newspaper articles ranging from 1830 to 1845. This timeframe was selected because New Zealand announced its declaration of independence in 1835, and this praxis is particularly targeted at finding the topics that emerged in society during the pre-independence and post-independence declaration periods. Papers Past provides an excellent API that is open to all. I gathered 12,596 newspaper articles available in the Papers Past database using the API. The data was migrated into a pandas data frame for further processing and building a topic model on top of it.
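
A minimal sketch of what that harvesting step can look like follows. The endpoint, query parameters, and response fields are assumptions based on the DigitalNZ v3 records API and should be checked against the current documentation; the API key is a placeholder.

```python
# A minimal sketch of the harvesting step. The endpoint, query parameters, and
# response fields are assumptions based on the DigitalNZ v3 records API and
# should be checked against the current documentation; the API key is a placeholder.
import requests
import pandas as pd

API_URL = "https://api.digitalnz.org/v3/records.json"
API_KEY = "YOUR_API_KEY"

rows = []
for page in range(1, 6):  # a few pages, just as a demonstration
    response = requests.get(API_URL, params={
        "api_key": API_KEY,
        "text": "*",
        "and[collection][]": "Papers Past",
        "per_page": 100,
        "page": page,
    })
    response.raise_for_status()
    for record in response.json()["search"]["results"]:
        rows.append({
            "title": record.get("title"),
            "date": record.get("display_date"),
            "text": record.get("fulltext"),
        })

df = pd.DataFrame(rows)
print(df.shape)
```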

I will not discuss the nitty-gritty of model building and the technical details in this article. Instead, I will focus on evaluation and discussion.

Topic Visualization

The visualization is interactive. If you want to check out the visualization, please follow the URL below.

https://zicoabhidey.github.io/pyldavis-topic-modeling-visualization#topic=0&lambda=1&term=
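
For readers curious how an interactive page like this is typically generated, here is a minimal sketch using gensim and pyLDAvis. It is not the exact pipeline used for this praxis: preprocessing is reduced to lowercasing and simple tokenization, and df is assumed to be the pandas data frame of articles described earlier.

```python
# A minimal sketch (not the exact pipeline used here) of how a pyLDAvis page
# like the one linked above is typically produced from a gensim LDA model.
# Preprocessing is reduced to lowercasing and simple tokenization, and df is
# assumed to be the pandas data frame of articles described earlier.
import re
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models

docs = [re.findall(r"[a-z]+", str(text).lower()) for text in df["text"].dropna()]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare and very common words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary,
                      num_topics=10, passes=10, random_state=42)

vis = pyLDAvis.gensim_models.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "pyldavis-topic-modeling-visualization.html")
```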

Evaluation and Discussion

The evaluation of the topic model results in this praxis was done by browsing online for the history of New Zealand during the period 1830–1845, along with general intuition. A historian with special knowledge of New Zealand’s history might have judged better. Some of the topic groups that emerged from the topic model results, along with explanations, are provided in Table 1 for the gensim LDA model and Table 2 for the gensim Mallet model.

Topics | Explanation
“gun”, “heavy”, “colonial_secretary”, “news”, “urge”, “tax”, “thank”, “mail”, “night” | Implying political movement and communication during the pre-independence declaration period.
“bill”, “payment”, “say”, “issue”, “sum”, “notice”, “pay”, “deed”, “amount”, “person” | Business-related affairs after the independence declaration.
“distance”, “iron”, “firm”, “dress”, “black”, “mill”, “cloth”, “box”, “wool”, “bar” | Representing industrial affairs mostly related to garments.
“vessel”, “day”, “take”, “place”, “leave”, “fire”, “ship”, “native”, “water”, “captain” | Representing maritime activities or war from a port city like Wellington.
“land”, “acre”, “company”, “town”, “sale”, “road”, “country”, “plan”, “district”, “section” | Representing real-estate-related activities.
“year”, “make”, “receive”, “take”, “last”, “state”, “new”, “colony”, “great”, “give” | No clear association.
“sail”, “master”, “day”, “passage”, “auckland”, “port”, “brig”, “passenger”, “agent”, “freight” | Representing shipping activities related to Auckland port.
“say”, “go”, “court”, “take”, “kill”, “prisoner”, “try”, “come”, “witness”, “give” | Representing judicial activities and crime news.
“boy”, “pull”, “flag_staff”, “mount_albert”, “white_pendant”, “descriptive_signal”, “lip”, “battle”, “bride”, “signals_use” | Representing traditional stories about Māori myth and legend regarding Mount Albert.

Table 1: some of the topics and explanations from gensim LDA model

Topics | Explanation
‘land’, ‘company’, ‘purchase’, ‘colony’, ‘claim’, ‘price’, ‘acre’, ‘make’, ‘system’, ‘title’ | Representing real-estate-related activities.
‘native’, ‘man’, ‘fire’, ‘captain’, ‘leave’, ‘place’, ‘officer’, ‘arrive’, ‘chief’, ‘make’ | Representing news regarding the New Zealand War.
‘government’, ‘native’, ‘country’, ‘settler’, ‘colony’, ‘man’, ‘act’, ‘people’, ‘law’ | Representing news about the sovereignty treaty signed in 1835.
‘mile’, ‘water’, ‘river’, ‘vessel’, ‘foot’, ‘island’, ‘native’, ‘side’, ‘boat’, ‘harbour’ | Representing maritime activities from a port city like Wellington.
‘settlement’, ‘company’, ‘make’, ‘war’, ‘place’, ‘port_nicholson’, ‘settler’, ‘state’, ‘colonist’, ‘colony’ | Representing news about Port Nicholson during the war in Wellington, 1839.

Table 2: some of the topics and explanations from gensim Mallet model

PRAXIS: Tinkering and Text Mining

Since starting my academic journey in DH over two years ago, I’ve been awaiting the moment when I’ll get to learn about text mining/analysis tools. I’ve worked in the “content” space my entire career and I’ve always been interested in the myriad tools out there that allow for new ways to look at the written word. I spent nearly a decade as an editor in the publishing world, and while I never leveraged an actual text analysis tool, I jerry-rigged my own approach for scouring the web for proper usage when I found myself confused about how best to render a phrase. My go-to text analysis “hack” has always been to search Google for a phrase I’m unsure of in quotes, and then to add “nytimes.com” to my search query. This is based on my trust that the copyeditors at the NYT are top notch and that whatever usage they use most often is likely the correct one. For instance, if I encounter the usage of “effect change” in some copy I’m editing and I’m not sure whether it should be “affect change,” I would do two separate searches in Google.

  1. “affect change” nytimes.com
  2. “effect change” nytimes.com

The first search comes up with 72,000 results. The second comes up with 412,000 results. Thus, after combing through the way the term is used in some of the top results, I can confidently STET the use of “effect change” and move on without worrying that I’ve let an error fly in the text. I’ve used this trick for years and it’s served me well, and it’s really as far as my experiments in text mining had gone until this Praxis assignment.

Diving into this Praxis assignment, I was immediately excited to see the Google NGram Viewer. I had never heard of this tool despite working in a fairly adjacent space for years. Obviously, the most exciting aspect of this tool is its absolute ease of use. It runs on simple Boolean logic and spits out digestible data visualizations immediately. I decided to test it out by using some “new” words to see how they’ve gained in published usage over the years. I follow the OED on Twitter and recall their annual new words list announcement, which for 2022 was produced as a blog post doing its best to leverage the newest additions in its text. You can read the post here: https://public.oed.com/blog/oed-september-2022-release-notes-new-words/

The Ngram Viewer has a maximum number of terms you can input, so I chose the words and phrases that jumped out at me as most interesting.

The words I chose from the post were (in order of their recent frequency as spit out by the Ngram Viewer): jabbed, influencer, energy poverty, side hustle, top banana, Damfino, mandem, and medical indigency. As you can see, these terms are all quite new to the published lexicon — all but “jabbed.” However, jabbed in the early 20th century likely had more to do with boxing literature than vaccinations.

Moving along in this vein, I then looked up the “word of the year” winners dating back the last decade. These words were: omnishambles, selfie, vape, post-truth, youthquake, toxic, climate emergency, and vax. 2020 did not have a word of the year for reasons I suspect have to do with the global pandemic. Looking at the prominence of these words in published literature over the years showed a fairly similar result as the “new” words list.

What I found surprising is that these words and phrases are actually “newer” than the ones I pulled from the new words list. There’s barely a ripple for any of these words outside of “toxic,” which has held popular usage for over a century now according to the Ngram Viewer.

Needless to say, as a person who routinely looks up usages for professional purposes, I’m elated to discover this tool. It will not only help me in my DH studies, but will also assist me in editorial work as I look for the more popular usage of terms. Instead of having to use Google’s own search engine and discern the results myself, I can now see simple visualizations that will prove one usage’s prominence over another.

The Ngram Viewer is all well and good, but I could tell this was a bit of a cop-out when it came to learning the ins and outs of text mining. So I decided to test out Voyant Tools to see if I could get a handle on that. As was noted in the documentation, it is best to use a text I am familiar with so I can make some qualitative observations on the data that is spit out. I decided to use my recently submitted thesis in educational psychology, as there’s likely not much else I’m more familiar with. My thesis is titled “User Experience Content Strategies for Mitigating the Digital Divide on Digitally Based Assessments.” Voyant spat out a word cloud that basically spelled out my title via word vomit in a pretty gratifying manner.

This honestly would have been a wonderful tool to leverage for the thesis itself. Because I tested 200 students on their ability to identify what certain accessibility tools offered on digital exams do, I had a ton of written data from these students, and I could have created some highly interesting visualizations of all the different descriptive words the students used when trying to describe what a screen reader button does.

I’ve always known that text analysis tools existed and were somewhat at my disposal, yet I’ve never even ventured to read about them until this assignment. I’m surprised by how easy they are to get started with and am excited about leveraging more throughout my DH studies.

Praxis assignment: how to ‘read’ a book without reading it

For my text mining praxis assignment, I decided to use Python’s Natural Language Toolkit (NLTK), a package for natural language processing (NLP). Following last month’s text analysis workshop, I thought it would be a good idea to put into practice what I had learnt.

I picked Jane Eyre, a book I have read a few times in both the original language and a couple of translations, to ensure I could review the results with a critical eye. The idea was to use NLP tools to get an understanding of the sentiment of the book and a summary of its contents.

In an attempt to practice as much as possible, I opted for an exploratory approach to trial different features and libraries. At the bottom of this post, you will find my Jupyter notebook (uploaded on Google Colab) – please note that some of the outputs exceed the output space limit, hence you might need to open them in your editor. In terms of steps undertaken, I was able to create a way to “assess” words (positive, negative, neutral) and run a cumulative analysis on large parts of the text to get a sense of how the story develops. Separately, I wrote a function to summarise the book, which brought its length down from 351 pages to 217 (preface, credits, and references included); however, I am not sure about the quality of the result!! Here, in PDF, is this “magic” summary!
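
For a flavour of the approach without opening the notebook, here is a minimal sketch (not the notebook itself) of a cumulative sentiment pass using NLTK’s VADER analyzer over fixed-size chunks of the text. The file name is hypothetical, and VADER’s lexicon is tuned to modern English, so the scores on a nineteenth-century novel are rough at best.

```python
# A minimal sketch (not the notebook itself) of a cumulative sentiment pass
# using NLTK's VADER analyzer over fixed-size chunks of the novel.
# jane_eyre.txt is a hypothetical file name, and VADER's lexicon is tuned to
# modern English, so scores on a nineteenth-century novel are rough.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
nltk.download("punkt", quiet=True)

text = open("jane_eyre.txt", encoding="utf-8").read()
sentences = nltk.sent_tokenize(text)
sia = SentimentIntensityAnalyzer()

chunk_size = 200  # sentences per chunk
cumulative = 0.0
for i in range(0, len(sentences), chunk_size):
    chunk = " ".join(sentences[i:i + chunk_size])
    score = sia.polarity_scores(chunk)["compound"]  # ranges from -1 (negative) to +1 (positive)
    cumulative += score
    print(f"sentences {i}-{i + chunk_size}: compound {score:+.3f}, cumulative {cumulative:+.3f}")
```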

Clearly the title of my blog post is meant to be provocative, but I can see how these algorithms could be used as shortcuts!

Before diving into the notebook, though, I would like to share a few thoughts on my takeaways from text analysis. On the one hand, it is impressive to witness the extent to which the latest technologies can take unstructured data, interpret it, translate it into a machine-readable format, and then make it available to the user for further analysis or manipulation. Machines can now perform information extraction (IE) tasks by mining meaningful pieces of information from texts with the aim of identifying specific data, or targets, in contexts where the language used is natural and the starting dataset is either semi- or fully unstructured. On the other hand, I personally have concerns about the fact that text mining software performs semantic analyses that often can only leverage a subsection of the broader spectrum of knowledge. This is to say that the results produced by these technologies can certainly be valid; however, they could be limited by the inputs and related pre-coded semantics, hence potentially translating into ambiguous outputs.

There is a chance that the HTML below will not render the notebook; if so, you can download it directly from my GitHub Gist.

Cumulative distribution analysis
Positive vs negative “sentiments” as the story develops

Praxis Assignment: Text Mining with Map Lemon

A/N: This post contains a lot of information about my project, Map Lemon. If you don’t want to be deeply confused about what Map Lemon is and why it is, you can head on over to my blog at https://noveldrawl.commons.gc.cuny.edu/research/, as I’m not explaining it for the sake of brevity in this post. The corpus itself is not yet publicly available, so you’ll just have to trust me on the docs I’m using for now.

I’ll start this post with a bit of introduction on my background with text mining. I’m a Linguist. I’m a Computational Linguist. And more importantly than either of those two really nebulous phrases that I’m still convinced don’t mean much, I made a corpus. I am making a corpus, rather (everything is constantly moving changing and growing). It happened by accident, out of necessity—it wasn’t my original research interest but now that I’m deep in it I love it.

My corpus, Map Lemon, is #NotLikeOtherCorpuses (I’m sorry for that joke). It’s not text mined. A LOT of linguistic corpuses are text mined these days and that gets on my nerves in a real bad way. Here’s why:

Let’s use the example that you’re text mining on Twitter to get a better idea of jargon used within a niche community, since this use-case is quite common.

  1. Text mining often takes phrases out of their contexts because of the way platforms like Twitter are structured.
  2. These aren’t phrases that, generally speaking, are used in natural speech or writing. While cataloging internet speak is important, especially to understand how it affects natural S&W, we’re not cataloging as much natural S&W as a result, and I don’t think I need to explain why that’s important.
  3. It’s not situational. You’re not going to find recipes for lemonade, or directions to a lemonade stand (yes, I’m making a joke about my own research here), on Twitter.
  4. You’re often missing unknown demographics that can affect the content of the corpus.

I chose to do this praxis assignment to debunk, or at least attempt to, all of those things. I want text mining to work for me. It probably won’t for my use-case, but I should at least be versed in doing it!

Now let’s get into it.

I decided to text mine my own corpus. Yup. I’m at a stand-still with the results I’ve been getting and need material for an abstract submission. Here we go.

So, since my data has already been cleaned before, I went ahead and just plopped it into Voyant. The following ensued:

  • Oh, rats. I forgot I can’t just do that with all the demographics and stuff in there; that would confuse Voyant.
  • Okay, copy the file. Take out everything that isn’t a response. Might as well separate the responses into their respective experiments while I’m at it.

So, the final version of what I’m analyzing is: 1) just the directions to the lemonade stand, and 2) just the recipes for lemonade. I’m not analyzing the entire corpus together, since it wouldn’t yield coherent results for this specific purpose due to the difference in terminology used for the two tasks and the lack of context for that terminology.

The results from the directions were really neat in that you can basically follow the correlations and word counts as directions being given. Here’s the map from Experiment I so you can follow along:

Here are the most common phrases in the directions given, according to analysis with Voyant:

  • “before the water fountain”; count: 4
  • “take the first left”; count: 4
  • “a carousel on your left”; count: 3
  • Some other phrases that are all count 3 and not as interesting until…
  • “at the water fountain”; count: 3
  • “between the carousel and the pond”; count: 3

Now, okay, these numbers aren’t impressive at first glance. Map Lemon only has 185 responses at present, so numbers like this maybe aren’t all that significant, but they sure are interesting. Map Lemon contains exclusively responses from North Americans, so from this we could postulate that North Americans tend to call “that thing over yonder” a water fountain or a carousel. But also from this we can see the directions Chad gets most commonly: people often send him down the first left on the street; of the group that does not, and has him cut through the park, they let him know that he should pass the carousel on the left; and the lemonade stand is just before the water fountain. All these directions are reiterated in two different ways, so it seems. That sure is neat! Not particularly helpful, but neat.
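
If you want to approximate Voyant’s phrase counts outside of Voyant, here is a minimal sketch of the same idea using simple n-gram counts; directions.txt (one response per line) is a hypothetical file name.

```python
# A minimal sketch of repeated-phrase counting in the spirit of Voyant's
# Phrases tool, using simple n-gram counts. directions.txt (one response per
# line) is a hypothetical file name.
import re
from collections import Counter

def ngram_counts(texts, n):
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

responses = open("directions.txt", encoding="utf-8").read().splitlines()

for n in (4, 5):
    for phrase, count in ngram_counts(responses, n).most_common(5):
        if count >= 3:
            print(f'"{phrase}"; count: {count}')
```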

So let’s look at those cool correlations I mentioned.

  • ‘gym’ and ‘jungle’ – correlation: 1 (strongest)
  • ‘clearing’ and ‘paved’ – correlation: 1
  • This one I’m unsure what is really meant by it, if that makes sense, but it was ‘enter’ and ‘fork’ corr. 1
  • ‘home’ and ‘passed’ – correlation: 1

These look like directions alright! Okay, of course there’s the phrase ‘jungle gym’, but we do see, okay, there’s a paved clearing. I’m sure at some point Chad has to enter a fork, although I’m a bit confused by that result, and yes, many people did have Chad pass the house. Neat!

I’m a bit skeptical of some of these correlations as well, because it’s correlating words strongly that only appear once, and that’s just not a helpful strong correlation. But that’s just how the tool works.

Looking at contexts wasn’t particularly helpful for the directions, as a lot of the contexts were for the words ‘right’ and ‘left’.

Now, here’s what was really freakin’ cool: the links. Voyant made this cool lil graphic where I can see all the most common words and their links. And it actually shows… directions! The 2/3 most common paths, all right there, distilled down. Try giving Chad directions for yourself and see what I mean, ‘cause it’ll probably look something like this:

Voyant’s link map for Map Lemon Experiment I

Okay, so the directions didn’t show anything revolutionary, but it was pretty darn cool. Let’s move onto the recipe analysis.

NOW THIS IS FREAKIN’ COOL!!! According to the phrase count tool, everybody makes lemonade about the same! Including using a lot of the same amounts of ingredients and even the same filler phrases!

Ingredients:

  • 1 cup sugar; count: 3 (the semantics of this compared to the other two is really interesting!)
  • 3 cups of water; count: 3
  • 4 lemons; count: 3

Filler phrases:

  • “a lot of”; count: 5
  • “make sure you have”; count: 5
  • “kind of”; count: 4 (context for this one is tricky)

Perhaps that’s a recipe I could try!

Now, okay. If we’re skinning cats, there’s not a lot of ways to skin this one, actually. We specifically chose lemonade for this experiment because it’s ubiquitous in North America and really easy. And the correlations feature wasn’t really helpful this time around for that exact reason (lack of distinguishing words). But look at this cool link map thingy!!

Voyant’s link map for Experiment II

Very helpful! You either squeeze, cut, juice, or halve your lemons—not all four (and two of those are only different out of semantics). Add several cups of water to a pitcher, and stir in (or add—again, semantics) at least a cup of sugar. Boom! There’s our lemonade recipe. This was so cool to synthesize!

At the end of this little project, I am still as annoyed with the inability to use demographics with text mining as I was before. This means that demographics for text mining for my purposes would have to be very carefully limited to control for individual linguistic variants. However, I also see the benefit in this tech; although, I definitely think it’d be better for either larger datasets or datasets that have a very controlled demographic (so that small numbers like 3 and 4 are actually statistically significant). Mostly, for now, it seems that the results just make me say “cool!”. I love the visual links a lot; it’s a really good example of graphics that are both informative and useful. I think it would be a fun side project to try and synthesize a true “All-American Man” using text mining like this. (Side note, that exact sentence reminds me that in my Computational Linguistics class in undergrad, for our final project about half the class teamed up and scraped our professor’s Twitter account and made a bot that wrote Tweets like he did. It actually wasn’t all that bad!)

I think this could potentially be valuable in future applications of my research, but again, I think I need to really narrow down the demographics and amount of data I feed Voyant. I’m going to keep working with this, and maybe also try Google Ngram Viewer.