Praxis: Topic Modeling of Historical Newspapers

“What is Distant Reading?”, a New York Times article by Kathryn Schulz, offers one of the simplest ways to understand the term: “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.” One might wonder how distant reading can be put to use. In this praxis assignment, I use topic modeling, a distant reading approach, to analyze historical newspapers.

Newspapers that have survived the course of time are among the most valuable sources available to academics researching the civilizations and cultures of the past. Virtually all significant dialogues and disputes in society and culture were brought to light in newspapers. This is because, as early as the mid-nineteenth century, almost every town, regardless of its size, saw the establishment of at least one newspaper. A newspaper covers every facet of people’s social and daily life, with articles on political debate, the promotion of goods and services, and much more. No other single repository brings together scholarly editorials on controversial political issues, advertisements for fashionable clothing, news of major sporting events, and poetry by a local poet. In a nutshell, newspapers record the entire spectrum of human experience better than any other source, giving contemporary historians access to the past unlike any other medium.

However, despite their importance, newspapers have long remained one of the most underused historical resources. Historians have found it daunting, and sometimes impossible, to study historical newspapers page by page for a specific research topic because of the enormous amount and range of material they contain. For instance, to address a single research topic, a historian might have to go through hundreds of thousands of newspaper articles. Even after all this effort, there is still no guarantee of finding the required information.

In this praxis, I attempt to uncover the most interesting and potentially important topics from a period of time by applying topic modeling to the Papers Past database from DigitalNZ. Rather than relying on historians’ knowledge at the outset to find relevant topics by going through an abundance of newspapers, this praxis takes the reverse approach. A computational approach is used to cluster the data into topics, which are then evaluated from a historian’s point of view to identify specific patterns in the dataset.

The Experiment


The Papers Past collection on DigitalNZ comprises the following four sections:

  1. Newspapers

This section contains digitized newspaper issues from the eighteenth, nineteenth, and twentieth centuries from the New Zealand and Pacific regions. Each newspaper has a page giving details about the publication, such as the range of dates available online. There is also an “Explore all newspapers” page where one can find the URL of every newspaper. Papers Past contains only a small sample of New Zealand’s total newspaper output during the period covered by the site, but it is more than sufficient for the intended term paper.

In 2015, the National Library of New Zealand added to its collection a compilation of historical newspapers published between 1842 and 1935 and aimed predominantly at a Māori audience. The work drew on the digital Niupepa Archive of the University of Waikato’s Computer Science Department, which was created and made accessible by the New Zealand Digital Library Project in 2000.

  2. Magazines and Journals
  3. Letters and Diaries
  4. Parliamentary Papers

Newspaper articles are the focus of this praxis. More specifically, the praxis builds a topic model from newspaper articles published between 1830 and 1845. This timeframe was selected because New Zealand announced its declaration of independence in 1835, and this praxis aims to find the topics that emerged in society in the periods before and after the declaration. Papers Past provides an excellent API that is open to all. Using the API, I gathered 12,596 newspaper articles available in the Papers Past database. The data was loaded into a pandas DataFrame for further processing and for building the topic model.
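As a rough illustration of the selection step described above, the sketch below filters a handful of records down to the 1830-1845 window and labels each one as falling before or after the 1835 declaration. It is only a sketch under stated assumptions: the records and the field names `title`, `date`, and `text` are hypothetical and do not reflect the actual Papers Past API schema, and plain Python is used in place of pandas so that it runs without dependencies.

```python
from datetime import date

# Hypothetical records: the field names and values are invented for
# illustration and are not the actual Papers Past API schema.
records = [
    {"title": "Shipping Intelligence", "date": "1841-03-12",
     "text": "The brig sailed for Auckland with passengers and freight."},
    {"title": "Land Sale", "date": "1844-07-02",
     "text": "Town sections and country acres offered for sale."},
    {"title": "Later Article", "date": "1851-01-20",
     "text": "An article from outside the study period."},
]

START_YEAR, END_YEAR = 1830, 1845  # study period stated in the text
DECLARATION_YEAR = 1835            # year of the declaration of independence


def publication_year(record):
    """Parse the ISO-formatted date string and return its year."""
    return date.fromisoformat(record["date"]).year


def period_label(record):
    """Label a record as published before the declaration year or from it onward."""
    if publication_year(record) < DECLARATION_YEAR:
        return "pre-declaration"
    return "post-declaration"


# Keep only the articles published within the study period.
corpus = [r for r in records if START_YEAR <= publication_year(r) <= END_YEAR]

for r in corpus:
    print(r["date"], period_label(r), r["title"])
```

The same filter could be expressed as a boolean mask on a DataFrame column; the list comprehension here simply keeps the example self-contained.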

I will not discuss the nitty-gritty of model building and other technical details in this article. Instead, I will focus on evaluation and discussion.

Topic Visualization

The visualization is interactive. To explore it, please follow the URL below.

Evaluation and Discussion

The topic model results in this praxis were evaluated by browsing online sources on the history of New Zealand between 1830 and 1845, combined with general intuition. A historian with specialist knowledge of New Zealand’s history might have judged them better. Some of the topic groups that emerged from the models are listed, with explanations, in Table 1 for the gensim LDA model and Table 2 for the gensim Mallet model.

| Topic keywords | Interpretation |
| --- | --- |
| “gun”, “heavy”, “colonial_secretary”, “news”, “urge”, “tax”, “thank”, “mail”, “night” | Implying political movement and communication during the pre-independence declaration period. |
| “bill”, “payment”, “say”, “issue”, “sum”, “notice”, “pay”, “deed”, “amount”, “person” | Business-related affairs after the independence declaration. |
| “distance”, “iron”, “firm”, “dress”, “black”, “mill”, “cloth”, “box”, “wool”, “bar” | Representing industrial affairs mostly related to garments. |
| “vessel”, “day”, “take”, “place”, “leave”, “fire”, “ship”, “native”, “water”, “captain” | Representing maritime activities or war from a port city like Wellington. |
| “land”, “acre”, “company”, “town”, “sale”, “road”, “country”, “plan”, “district”, “section” | Representing real-estate-related activities. |
| “year”, “make”, “receive”, “take”, “last”, “state”, “new”, “colony”, “great”, “give” | No clear association. |
| “sail”, “master”, “day”, “passage”, “auckland”, “port”, “brig”, “passenger”, “agent”, “freight” | Representing shipping activities related to Auckland port. |
| “say”, “go”, “court”, “take”, “kill”, “prisoner”, “try”, “come”, “witness”, “give” | Representing judicial activities and crime news. |
| “boy”, “pull”, “flag_staff”, “mount_albert”, “white_pendant”, “descriptive_signal”, “lip”, “battle”, “bride”, “signals_use” | Representing traditional stories about Māori myths and legends regarding Mount Albert. |

Table 1: Some of the topics and explanations from the gensim LDA model

| Topic keywords | Interpretation |
| --- | --- |
| “land”, “company”, “purchase”, “colony”, “claim”, “price”, “acre”, “make”, “system”, “title” | Representing real-estate-related activities. |
| “native”, “man”, “fire”, “captain”, “leave”, “place”, “officer”, “arrive”, “chief”, “make” | Representing news regarding New Zealand War. |
| “government”, “native”, “country”, “settler”, “colony”, “man”, “act”, “people”, “law” | Representing news about the sovereignty treaty signed in 1835. |
| “mile”, “water”, “river”, “vessel”, “foot”, “island”, “native”, “side”, “boat”, “harbour” | Representing maritime activities from a port city like Wellington. |
| “settlement”, “company”, “make”, “war”, “place”, “port_nicholson”, “settler”, “state”, “colonist”, “colony” | Representing news about Port Nicholson during the war in Wellington 1839. |

Table 2: Some of the topics and explanations from the gensim Mallet model
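Keyword lists like those in the two tables are typically obtained by ranking each topic’s words by weight and keeping the top few. The sketch below shows that ranking step in plain Python. It is an assumption about how such lists are produced in general, not a record of the exact calls used in this praxis, and the weights are invented for illustration.

```python
# Hypothetical topic-word weights; the numbers are invented for
# illustration and do not come from the models discussed above.
topic_word_weights = {
    0: {"land": 0.08, "acre": 0.05, "company": 0.04, "sale": 0.03, "day": 0.01},
    1: {"ship": 0.07, "captain": 0.06, "port": 0.05, "day": 0.02, "land": 0.01},
}


def top_words(weights, n=10):
    """Return the n highest-weighted words of one topic, heaviest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


for topic_id, weights in topic_word_weights.items():
    print(topic_id, top_words(weights, n=3))
```

Note that a word such as “day” can carry some weight in several topics at once, which is one reason a list like the “No clear association” row in Table 1 can be hard to interpret.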