Text Analysis with NLTK Workshop

I attended the Introduction to NLTK (Text Analysis) with Python workshop last Friday. The workshop was very useful overall, and it helps you get started with the right suite of tools for conducting basic exploratory analysis on text data.

The workshop asked users to download Jupyter Notebook. I already had this installed as part of Anaconda Navigator and had used it in a previous class. I find Jupyter Notebook to be a great, user-friendly tool for working with and learning Python!

We imported the NLTK library and matplotlib (for data visualization) in the notebook. Using these libraries and Jupyter's magic functions, the instructor showed us how to do some basic plotting and text analysis, such as calculating lexical density, building frequency distributions of particular words, and drawing dispersion plots.

Some of these commands are case sensitive, so the instructor taught us how to make all words lowercase to allow for proper counts. We then went through the process of cleaning the data: lemmatization and stemming, as well as removing stop words.

The steps we used are shown in the code below. For this blog post, I decided to redo the workshop exercise using The Book of Genesis instead of Moby Dick, which we used in the workshop.

In [1]:

import nltk

In [2]:

import matplotlib   # needed for the dispersion plot
%matplotlib inline  # magic function telling Jupyter to display graphs inside the notebook

In [3]:

nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

Out[3]:

True

In [4]:

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [5]:

type(text3)

Out[5]:

nltk.text.Text

In [6]:

text3.concordance("whale")
no matches

In [7]:

text3.similar("life")
house name brother cattle blood son fowls money thigh god heaven it
good land fruit fowl beast lord sight knowledge

In [8]:

text3.similar("love")
went drank earth darkness morning se them give nig hath man had thus
not took keep die call sle woman

In [9]:

text3.similar("queer")
No matches

In [10]:

text3.similar("death")
face place cattle image host generations sight father mother wife eyes
voice presence head children hand way brother sheep flock

In [11]:

text3.dispersion_plot(["life","love","loss","death"]) # the dispersion plot shows where each word occurs, based on its word offset (position) from the start of the text

In [12]:

text3.count("Love")

Out[12]:

0

In [13]:

text3.count("love")

Out[13]:

4

In [14]:

text3_lower = []
for t in text3:
    if t.isalpha():
        t = t.lower()
        text3_lower.append(t)

In [15]:

from nltk.corpus import stopwords

In [16]:

stops = stopwords.words('english')

In [17]:

print(stops)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [18]:

text3_stops = []
for t in text3_lower:
    if t not in stops:
        text3_stops.append(t)

In [19]:

print(text3_stops[:30])
['beginning', 'god', 'created', 'heaven', 'earth', 'earth', 'without', 'form', 'void', 'darkness', 'upon', 'face', 'deep', 'spirit', 'god', 'moved', 'upon', 'face', 'waters', 'god', 'said', 'let', 'light', 'light', 'god', 'saw', 'light', 'good', 'god', 'divided']

In [20]:

len(set(text3_stops)) # use set() to see how many unique words remain after removing stop words

Out[20]:

2495

In [21]:

# lemmatization reduces each word to its dictionary root (lemma)
from nltk.stem import WordNetLemmatizer

In [22]:

wordnet_lemmatizer = WordNetLemmatizer()

In [23]:

wordnet_lemmatizer.lemmatize("waters")

Out[23]:

'water'

In [24]:

# create clean lemmatized list of text 3
text3_clean = []
for t in text3_stops:
    t_lem = wordnet_lemmatizer.lemmatize(t)
    text3_clean.append(t_lem)

In [25]:

print(len(text3_clean))
18335

In [27]:

#lexical density
len(set(text3_clean)) / len(text3_clean)

Out[27]:

0.12838832833378783

In [28]:

#sort the unique 'cleaned' words of the text and show the first 30
sorted(set(text3_clean))[:30]

Out[28]:

['abated',
 'abel',
 'abelmizraim',
 'abidah',
 'abide',
 'abimael',
 'abimelech',
 'able',
 'abode',
 'abomination',
 'abr',
 'abrah',
 'abraham',
 'abram',
 'abroad',
 'absent',
 'abundantly',
 'accad',
 'accept',
 'accepted',
 'according',
 'achbor',
 'acknowledged',
 'activity',
 'adah',
 'adam',
 'adbeel',
 'add',
 'adder',
 'admah']

In [30]:

# Stemming reduces a word to its stem by stripping affixes (prefixes and suffixes); unlike a lemma, the stem is not always a dictionary word.
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer() 

In [31]:

print(porter_stemmer.stem('accept'))
print(porter_stemmer.stem('accepted'))
accept
accept

In [32]:

t3_porter = []
for t in text3_clean:
    t_stemmed = porter_stemmer.stem(t)
    t3_porter.append(t_stemmed)

In [33]:

print(len(set(t3_porter)))
print(sorted(set(t3_porter))[:30])
2113
['abat', 'abel', 'abelmizraim', 'abid', 'abidah', 'abimael', 'abimelech', 'abl', 'abod', 'abomin', 'abr', 'abrah', 'abraham', 'abram', 'abroad', 'absent', 'abundantli', 'accad', 'accept', 'accord', 'achbor', 'acknowledg', 'activ', 'adah', 'adam', 'adbeel', 'add', 'adder', 'admah', 'adullamit']

In [34]:

my_dist = FreqDist(text3_clean)
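# FreqDist counts how often each token occurs in the list; if it is not already in scope, run "from nltk import FreqDist" first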

In [35]:

type(my_dist)

Out[35]:

nltk.probability.FreqDist

In [36]:

my_dist.plot(20)

Out[36]:

<AxesSubplot:xlabel='Samples', ylabel='Counts'>

In [37]:

my_dist.most_common(20)

Out[37]:

[('unto', 598),
 ('said', 477),
 ('son', 294),
 ('thou', 284),
 ('thy', 278),
 ('shall', 259),
 ('thee', 257),
 ('god', 236),
 ('lord', 208),
 ('father', 207),
 ('land', 189),
 ('jacob', 179),
 ('came', 177),
 ('brother', 171),
 ('joseph', 157),
 ('upon', 140),
 ('day', 133),
 ('abraham', 129),
 ('wife', 126),
 ('behold', 118)]


GCDI Digital Fellows: Text Analysis with NLTK

On Friday, the GCDI Digital Fellows hosted a workshop on how to use libraries and programming languages for text analysis purposes. The session revolved around Python and the Natural Language Toolkit (NLTK), a library built specifically for working with human language data. NLTK allows the conversion of texts, such as books, essays and documents, into digitally consumable data, which implies a transformation of qualitative content into quantitative, and therefore decomposable and countable, items. The workshop did not require prior experience with NLTK and was an extremely effective session.

I thought of different ways of summarising what was covered during the seminar in a way that other people would find useful, and decided to write a step-by-step guide that encapsulates the main aspects discussed, plus some additional pieces of information that can be helpful when approaching Python and Jupyter Notebook from scratch.

Required downloads/installations

  • A text editor (i.e. where the user writes their code) of your choice; I opted for Visual Studio Code, but Notepad would also do – FYI: Google Docs & Microsoft Word will not work, since they are word processors, not text editors;
  • Anaconda;
  • Git for Windows, for Windows users only. This is optional, but I would still recommend it.

About Python

  • There is extensive documentation available, including the official Beginner’s Guides.
  • It is an interpreted language, so code is executed directly, statement by statement, doing what the user tells it to without a separate compilation step.
  • It is object-oriented: Python organises data into 'classes' of objects, and almost everything in the language is an object.
  • Python is designed to be readable.

Jupyter Notebook

A Jupyter notebook is structured data that represents the user's code, metadata, content, and outputs. When saved to disk, the notebook uses a JSON structure and the extension .ipynb. Jupyter Notebook and its interface extend the notebook beyond code to visualisation, multimedia, collaboration, and more. In addition to running code, it stores code and output, together with markdown notes, in an editable document called a notebook. When you save, the notebook is sent from your browser to the notebook server, which in turn writes it to disk as a JSON file with the .ipynb extension.
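As a quick illustration of that JSON structure, the sketch below opens a saved notebook as ordinary JSON and inspects it (example.ipynb is just a placeholder filename for a notebook assumed to exist in the current directory):

import json

# open a saved notebook as plain JSON (example.ipynb is a hypothetical filename)
with open("example.ipynb", encoding="utf-8") as f:
    nb = json.load(f)

print(list(nb.keys()))               # typically: cells, metadata, nbformat, nbformat_minor
print(nb["cells"][0]["cell_type"])   # e.g. 'code' or 'markdown' (assuming at least one cell)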

Remember to save each new file/notebook with a name; otherwise it will be saved under a default name (something like "Untitled"). You can type Jupyter in your search bar to see whether it is already installed on your computer; otherwise, click Home Page > Select or create a notebook.

When ready, you can try the steps below in your Jupyter Notebook and see if they produce the expected result(s). To add commentary in Jupyter, you can use the dropdown menu and select "markdown" instead of "code". Alternatively, within a code cell you can start a line with one or two # signs to mark it as a comment.

In Jupyter, every cell can contain code or a comment, and each cell can be run independently – cells can also be amended and rerun.

To start, type these commands in your Jupyter Notebook (commands are marked by a bullet point, and comments follow as nested bullets):

  • import nltk
    • NLTK stands for Natural Language Toolkit
  • from nltk.book import *
    • the star means “import everything”
  • text1[:20]
    • for your reference: text1 is Moby Dick
    • this command returns the first 20 tokens of the text
  • text1.concordance('whale')
    • Concordance is a method that shows the first 25 occurrences of the word whale together with some words before and after – so the user can understand the context – this method is not case sensitive
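Put together, the opening steps look roughly like this as a single runnable sketch (assuming NLTK is installed; the commented download line is only needed the first time, to fetch the book corpora):

import nltk
# nltk.download('book')        # run once if the example texts are not yet downloaded
from nltk.book import *        # the star means "import everything": loads text1 ... text9

text1[:20]                     # first 20 tokens of Moby Dick
text1.concordance('whale')     # occurrences of 'whale' with surrounding context (not case sensitive)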

Before proceeding, let’s briefly look at matplotlib.

What is %matplotlib inline in Python?

IPython provides a collection of predefined functions called magic functions, which are invoked with a command-line-style syntax. There are two types of magic functions: line-oriented and cell-oriented. For the purpose of text analysis, let's look at %matplotlib, which is a line-oriented magic function. Line-oriented magic functions (also called line magics) start with a percentage sign (%) followed by the arguments on the rest of the line, without any quotes or parentheses. These functions return results, so their output can be stored by writing the call on the right-hand side of an assignment statement. Some line magics: %alias, %autowait, %colors, %conda, %config, %debug, %env, %load, %macro, %matplotlib, %notebook, etc.
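As a minimal sketch of storing a line magic's result, here is an example using %pwd, a line magic that returns the current working directory as a string:

current_dir = %pwd    # assign the value returned by a line magic
print(current_dir)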

  • import matplotlib
  • %matplotlib inline

Remember that Jupyter notebooks can handle images, hence plots and graphs will be shown right in the notebook, after the command.

  • text1.dispersion_plot(['whale', 'monster'])
  • text1.count('whale')
    • this command counts the requested word (or string) and is case sensitive – so searching for Whale or whale DOES make a difference.

As the count command is case sensitive, it is important to lowercase all the words first and then count them accordingly. Below, we build a new list made of lowercased words; we can call the new list text1_tokens.

  • text1_tokens = []
  • for word in text1:
    • if word.isalpha():
      • text1_tokens.append(word.lower())
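As one runnable snippet (and, equivalently, as the kind of list comprehension used for text2 further down), the loop above can be sketched like this:

text1_tokens = []
for word in text1:
    if word.isalpha():                      # keep only alphabetic tokens (drops punctuation and numbers)
        text1_tokens.append(word.lower())

# the same thing as a one-line list comprehension
text1_tokens = [word.lower() for word in text1 if word.isalpha()]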

It is now possible to quickly check the new list by using the following command.

  • text1_tokens[:10]

The next step would be to count again; this time the result will include 'whale' + 'Whale' + any other capitalisation of the word whale (e.g. 'WhaLE').

  • text1_tokens.count('whale')

Further to the above, it is possible to calculate the text's lexical density. Lexical density represents the number of unique (or distinct) words over the total number of words. How to calculate it: len(set(list)) / len(list), where the function set() returns a collection containing only the distinct items. To exemplify:

  • numbers = [1, 3, 2, 5, 2, 1, 4, 3, 1]
  • set(numbers)

The output should look like the following: {1, 2, 3, 4, 5}

Let's use this on text1 now. The first step is to check the unique words in the newly created 'lowered' list. Then, the next command asks the computer to output only the number of unique words, rather than the full list. Finally, it is possible to compute the ratio of unique words to the total number of words.

  • set(text1_tokens)
  • len(set(text1_tokens))
  • len(set(text1_tokens)) / len(text1_tokens)

Let's slice it now and create a list of the first 10,000 words. This allows us to compare, for example, the ratio for text1 to the ratio for text2. Remember, it is very dangerous to draw conclusions based on such a simplified comparison exercise. It should instead be taken as a starting point to generate questions and provide an insightful base for more complex text analysis.

  • t1_slice = text1_tokens[:10000]
  • len(t1_slice)
  • len(set(t1_slice))/10000
  • t2 = [word.lower() for word in text2 if word.isalpha()]
  • t2_slice = t2[:10000]
  • len(set(t2_slice))/10000
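Putting the comparison together, here is a minimal end-to-end sketch (assuming text1 and text2 are already loaded via from nltk.book import *):

# lowercase, alphabetic-only token lists for both texts
t1 = [word.lower() for word in text1 if word.isalpha()]
t2 = [word.lower() for word in text2 if word.isalpha()]

# compare lexical density on equal-sized slices of the first 10,000 tokens
t1_slice = t1[:10000]
t2_slice = t2[:10000]

print(len(set(t1_slice)) / 10000)   # lexical density of the Moby Dick slice
print(len(set(t2_slice)) / 10000)   # lexical density of the Sense and Sensibility slice

As noted above, treat numbers like these as prompts for further questions rather than as conclusions.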

Minimal Computing

Every sector publishes information on websites and online portals to inform their particular audience about problems they may encounter and present pathways to potential solutions. For instance, in the medical/health sector there are sites like WebMD and Mayo Clinic, and in the legal/justice sector there are sites like LawHelpNY and Crime Victims Legal Help. These sites speak to a range of individuals, from advocates who use them to help their clients to people in need who have vastly different reading levels and internet access. This brings me to the article Introduction: The Questions of Minimal Computing, where we're warned that "defining minimal computing is as quixotic a task as defining digital humanities itself," but that it can generally be considered to mean "a mode of thinking about digital humanities praxis that resists the idea that 'innovation' is defined by newness, scale, or scope," in response to, or in consideration of, constraints such as the "lack of access to hardware or software, network capacity, technical education, or even a reliable power grid."

I'm considering the possibility of exploring this topic more broadly for my final paper, but will focus my notes for this post on minimal computing as it relates to digital humanities projects. First, the authors recommend that, when considering the constraints of developing a digital humanities project, we ask four constituent questions: 1) "what do we need?"; 2) "what do we have?"; 3) "what must we prioritize?"; and 4) "what are we willing to give up?" As someone who has project-managed product development in the nonprofit sector over the past few years, I find this a good framework for projects beyond the confines of digital humanities. A north star which the authors point to, and which resonates with me, is "sometimes — perhaps often — when we pause to consider what we actually need to complete a project, the answer isn't access to the latest and greatest but the tried and true."

To implement minimal computing in digital humanities projects, we must sit with the following tensions: the impulse towards larger, faster, always-on forms of computing, the consideration of the range of computer literacy of the intended audience, and the tension between choice and necessity driven by the dearth of funding and resources.

Workshop: Finding Data for Research Projects

Early in the semester we had an assignment that involved finding a data set and offering analysis. There are several ways to go about doing this including a simple internet search for “data sets for research projects” or identifying an area of interest and looking for related data. Another great option I discovered is the Graduate Center Mina Rees Library’s Finding Data portal. The following is a brief overview.

The home page of the portal offers general information along with pathways for 5 categories of data:

  1. Demography & Populations
  2. Education & Housing
  3. Labor & Economics
  4. Health & Environment
  5. Law, Politics & Conflict

There are also guides for analyzing and visualizing data and mapping data. If it all seems a bit overwhelming, a “where to start” section on the home page offers examples of local data such as NYC OpenData and Neighborhood Data Portal, national data such as American FactFinder, and international data such as UNdata. 

There are some limitations, however. I was interested in the Law, Politics & Conflict data sets, specifically data related to two areas: access to civil justice and American democracy. There were a few data sets related to criminal justice but none for civil justice, which is admittedly not as widely studied, and there were no data sets for American democracy. Still, if you don't have a specific area of interest and are looking for a place to discover data sets you can use for projects, the Mina Rees Library's Finding Data portal is a great resource.

Reflections on a Workshop in Interactive Storytelling

I attended a workshop about interactive storytelling with the program Twine, and I want to share some of my takeaways.

Twine is a free and open-source tool for making interactive fiction in the form of web pages. You don't need to write any code to create a story with Twine, but you can extend your stories with variables, conditional logic, images, CSS, and JavaScript. The goal of the workshop was to get familiar with the program and try to create your own interactive story. The setup was that each person participated online from their own computer and was connected to an online program that took you through the different steps of using Twine. Twine works as a grid of passages, where at each step the reader has to pick their next path.

This is an example of a storyline. 

The workshop started with a video that explained the basic tools, followed by time for each person to start creating their own stories. Every 15 minutes you would then be matched up with another participant to try their game, give feedback, and return to your own story. I think the idea of the setup was great. Unfortunately, the workshop encountered many technical issues, which meant I never got to try another participant's game. Nevertheless, I still learned the basics of the program and had some takeaways.

  • It is an easy program to get started with and uses very simple coding tools. I quickly got a good idea of the program, learned to use a few very simple tools, and started creating a story. It can be used to mock up an idea for an interactive game quickly, and stories can be built around many different focuses: ethical dilemmas, informative questions, etc. I could see myself using it both to create content for teaching students and for students themselves to create stories.
  • I started creating my story without any specific idea, but I was surprised at how the program sparked creativity and how fast it was to create a story by very simple means.
  • Even though the setup did not work for me because of technical issues, I found the format interesting. Being online and getting partnered up with somebody, giving them feedback, but never actually "meeting" them seemed to work for other people in the workshop.

Besides creating my own story, I have since explored the many different interactive stories created by other users. If you want to explore for yourself, here is a list of games made with Twine: https://itch.io/games/made-with-twine

Responsible Trauma-informed Archiving

This week's readings particularly moved me, as I have come across questions of responsible archiving throughout the past few years. Jessica Marie Johnson's "Markup Bodies" articulated a festering discomfort I have sometimes found myself feeling in archival spaces (ranging from curated museums to viral social media posts) with regards to the commodification and spectacle-making of trauma. "The brutality of black codes, the rise of Atlantic slaving, and everyday violence in the lives of the enslaved created a devastating archive. Left unattended, these devastations reproduce themselves in digital architecture, even when and where digital humanists believe they advocate for social justice" – Johnson points to a lack of critical engagement that can occur under the guise of "archiving" and that ultimately leads to a desensitized and disconnected consumption of trauma through media, which continues to replicate and give power to it. As we engage with traumatic archives, it's important to question and think critically about the desired impacts of archival engagement, potential unintended consequences, and the emotional labor of archivists, and to imagine creative, responsible ways to archive that do not make light of the deep brutality experienced by real people. Looking for more on responsible trauma-informed archiving, I found two really interesting readings that might be good for folks who want to go further: Love (and Loss) in the Time of COVID-19: Translating Trauma into an Archives of Embodied Immediacy, and Safety, Collaboration, and Empowerment.

Highlighting the Humane

Reflecting on the readings for this week, the point that kept coming up in my mind as I was reading the articles was how utterly humane the writers' discussions were. This was surprising by omission, in the sense that it was something I had missed before their discussion. When thinking of an archive or a collection or a museum, the concept of humanness somehow didn't cross my mind. Being aware of the horrors of slavery, I hadn't thought of the point that an archive could perpetuate them. Reading Johnson's Markup Bodies, I became very aware of questions that need to be addressed while coming into contact with an archive: questions of data stability, of data usage, and, of course, of objectivity – something I was aware of previously. Data, as Johnson states, "… has been central to the architecture of slavery studies and digital humanistic study." Taking into account that data is not neutral allows me to have a critical take on what I'm looking at, very similar to the last set of readings, which involved mapping. Neither can be taken as is. Even first-hand experiential testimony can carry a slanted point of view. Johnson notes, "… an exploration of the world of the enslaved from their own perspective- served to further obscure the social and political realities of black diasporic life under slavery." The tradition of data as 'given' or neutral has continued for quite a long time. Even up until recently, the concept of humanizing data (I think of Drucker here) has been resisted, as Johnson mentions regarding the conference at the College of William and Mary. Her aim is to contextualize the data and to bring it home, so to speak: to open up the database and allow it to be used by the community for the community's well-being, whether to remember, to pass on, or to help heal. It is this idea of the archive as a tool to heal that resonated with me, and where I saw the archive being used by people, normal people rather than academics. Real-world application to help real-world hurts. Rather than just having numbers, which can take away a sliver of humanity and obfuscate reality, we have a tool to do the converse of that.

The September 11 Digital Archive also showed healing as a component of its purpose. Brier and Brown showed tremendous insight in their approach toward the archive when they had the epiphany of being archivist-historians, the updated version of a historian: historian 2.0. Working very rapidly to collect as much input from as many sources as possible was an excellent idea for recording things while they were fresh in people's minds. Creating webs by reaching out to other communities, whether ones with different languages, different locations, or different geographies, was an enlightened approach. Giving ordinary people an input into the archive and allowing them to be represented gives people the opening to view and use the archive to remember, to heal. This is a major reason it's one of the most visited sites.

Christen and Anderson articulate the view of an archive as a place to 'decolonize processes', as a 'slow archive' where data is preserved but, more importantly, where 'we have to keep pushing forward to save the culture.' By 'pushing forward' we can find solace in the wrongs of the past and, again, heal. A slow archive is, as others have noted, a place to pick up life experiences and oral stories and, in view of the Mukurtu CMS, to have real healing, whether by recognizing tribal traditions, respecting tribal customs, or allowing 'temporal sovereignty.' A slow archive is a recognition of the relationship between the archive and what the community needs.

This enlightened view of an archive is seen in different places. One such place is Pan Dulce: Breaking Bread with the Past. In creating the archive, Cotera produced not only an archive but also a template that others can tailor to their own unique purposes. For her, creating a feminist Chicana archive, and the basic template for others to use, rested on five principles:

  1. Research grounded in a specific topic. For her own, it was Feminist Chicana History.
  2. Participants are co-creators and retain intellectual rights to their input.
  3. Relies on relationships of reciprocity, vulnerability and researcher reflexivity.
  4. Places lived experience at the center.
  5. Provides a potential space for healing.

Healing, again. So, from the readings, the insight that I come back to over and over again is one of use, of use to heal. How powerful is that?! These are not places only to be academic, nor places to be dissected and analyzed, but places to learn and to heal, active places to be proactive. A place where one can "keep pushing forward."

In essence, an archive must not just record data, but bear witness, and be a place to help heal.

Digital Archive

I really enjoyed this unit and how digital archives can create connections. Recording, sharing, evaluating, and archiving past and present history digitally allows for an accelerated and vast reach. The readings, video, and sites from this week demonstrated not only how important it is to ensure that the history be correct and representative of the people whose story it tells, but also how imperative it is to collaborate with the history makers. I found the TK (Traditional Knowledge) labels site eye-opening, and after reading about the different labels, it made me really think about how important it is to share stories digitally with permission and accuracy. Reading the CUNY Distance Learning Archive allowed me to step into the shoes of the students and faculty during the recent pandemic. I was working during that time, and I wish we had recorded and digitally archived the experience, both to preserve the history and to allow others to understand the impact on my company and coworkers. I was living in California at the time of September 11, and although I knew people who lived in NY and experienced that time, reading the articles and looking at the photos I had never seen gave me pause about what so many people in NYC and other areas went through.

My husband's family is from the island of Kefalonia in Greece, and on a visit many years ago he recorded conversations he had with his grandmother. He is trying to locate the recordings for me, and I hope to be able to translate and watch the videos and, along with the old photographs, put together a digital archive of his family's history. During World War II, in 1941, the island was occupied by Italian troops allied with the Germans. In 1943 Italy capitulated, but its troops refused to leave Kefalonia. As a punishment, the German forces killed more than 5,000 Italian soldiers, a historic fact described in the famous book Captain Corelli's Mandolin, written by Louis de Bernières. Then, in August of 1953, a huge earthquake destroyed the largest part of Kefalonia and demolished most villages on the island. This past weekend, I spent some time looking through the family photo albums that his grandmother kept and thought it would be fun to share some with you all (the photos are over 50 years old). Gotta love the creativity of parents back then!

By FY 2024, NARA will digitize 500 million pages of records and make them available online to the public through the National Archives Catalog. https://www.archives.gov/files/about/plans-reports/strategic-plan/2018/strategic-plan-2018-2022.pdf

Alive in the Archive

Response to "Toward Slow Archives"

"Toward Slow Archives" by Kimberly Christen and Jane Anderson, which discusses ways to name and address the colonial and harmful approach to digitizing and documenting Native cultures, was a fascinating discussion of how the digital intersects with the holding of Native cultural expression. Given the authors' attention to cultural preservation, I was surprised that there was no mention of the Native American Graves Protection and Repatriation Act of 1990 (NAGPRA) or the American Indian Religious Freedom Act of 1978 (AIRFA). The former requires that museums and other institutions repatriate to tribes the items stolen from their burial sites, including bones and sacred burial items, and the latter protects Native American religious and cultural practices as defined by practitioners (yes, it is shocking that these were passed only in 1978 and 1990). Both of these acts necessarily modified the relationship between tribes and institutions focused on documenting them through a colonial lens. In many cases, the colonial mindset of the institutions required to repatriate items was undeniably exposed through their response to Native peoples empowered by the new legal landscape these acts created.

For many years, my mother was the Director of the Department of Tribal Preservation and tribal historian of my own tribe, the Mashpee Wampanoag. As part of her work, empowered by NAGPRA, she worked directly with museums, archeologists, and universities, including Harvard University and the Peabody Essex Museum, to repatriate artifacts, including the bones and burial items of our ancestors. Some of these institutions created struggles that took years. They simply would not give the items back, using every technique they could to stall or thwart the repatriation process, preferring that items remain in storage, tagged and hidden away. Most upsetting for them, I believe, was our practice of reinterment—returning the items to the Earth where our ancestors placed them. Much like Fewkes's insistence on cataloging what he believed to be disappearing cultural practices, these museums felt that somehow simply possessing the bones and items of our ancestors equated to a kind of possession or acquisition of knowledge that provided them some true insight into our culture. The most fascinating irony of this situation is that the very stealing of the items and the cutting off or gatekeeping of access to them represented the exact negation of the knowledge they sought. To Christen and Anderson's point, oftentimes Native peoples understand history to be a living knowledge with which we interact and within which we evolve. This freezing in time, this pinning of our customs on a board like butterflies, negates our cosmology and our expression of what it means to be human. We are living beings carrying and sharing the part of the story we hold, and our culture is a living expression of that.

Non-indigenous digital archives, even while claiming to promote access, often just reinforce the gatekeeping of informative moments in time. Locked behind inscrutable interfaces, in unfamiliar and potentially unwelcoming settings, described through a non-indigenous lens, the true inheritors of this information are cut off from receiving the full message of the captured and cataloged moments. Packaged for non-indigenous reception, our culture performs but it doesn't breathe. We, by extension, become artifact, not being. The conscious effort to free this knowledge from the constraints of perspectives and tools that don't serve our customs creates added labor and struggle.

This behavior of freezing and naming our culture discussed in the essay has contributed to so many old and tired stereotypes — specifically images of the Noble Savage, both revered and infantilized as naive and childlike — and gone. Most people don't realize there are still Native peoples in New England and across the nation outside of reservations. The assumption is that we are gone, and that referencing the original inhabitants is enough to signal some awareness of these long-lost, disenfranchised peoples. Unfortunately, Federal Recognition is often the substantiating badge of existence, among Native peoples as well.

This drive to pinpoint a culture in a linear way — to create equations of understanding — stems, in my view, from fear of the unknowable. When you strip down the complex and try to neatly organize what you do not understand into units that can be dressed in familiar terms, you remove the threat of looking at truths beyond your worldview and the smallness of your place in the world at large.

Historical Queer Data (Or, the Lack Thereof)

I began my reading for this week with my post from our last meeting and my presentation from the Text as Data conference fresh in my mind, so naturally I was thinking about Queer data. My own project is currently archiving Queer, naturally occurring writing in a way that has never been done before. Now, in my last post I came to the conclusion that for queer data to be ethical, it must be collected by Queer people (or at the very least, close close close allies who deeply understand Queer Theory).

In Markup Bodies, Johnson writes: "Black digital practice is the revelation that black subjects have themselves taken up science, data, and coding, in other words, have commodified themselves and digitized and mediated their own black freedom dreams, in order to hack their way into systems (whether modernity, science, or the West), thus living where they were 'never meant to survive.'" It is in this exact way that I believe Queer data currently exists. Unlike this historical legacy of Black people, Queer people do not really have a history of data at all, not even data that is incorrect and twisted.

Most of modernity has been spent trying to be rid of the idea of Queer people to begin with. Especially in recent years, and, as far as I understand it, due to a smear campaign in the 1950s and '60s, Queer people have been painted as pedophiles (as in 'Watch Out for Homosexuals' and 'Boys Beware', both of which are available on YouTube). Naturally, since pedophiles are seen as the scum of the Earth, people would like to rid themselves of all records of them. That's something you want to forget. And not to mention the fact that, as recently as the early 00s, articles in Evangelical publications were still asserting this as fact (and I'm sure they continue to do so). Prior to that, we were subject to eugenics at such young ages that the records simply don't exist: from the systematic killing of Gay people during the Holocaust to the AIDS epidemic, where our health was ignored to protect straight people from us. And so, there is no historical Queer data.

But I posit that we must gather data, for we need to prove that we survive in a world that cisgender heterosexual society wants us gone from. We must live where we were never meant to survive, so that future generations know that we were here. We have been systematically killed by plagues like the AIDS crisis and have lost many of our community elders. A lot of Queer stories are strictly oral stories, because often our lives are at stake if we dare put these tales to paper. So now, in our more free world (as much as it can be at present), we must take advantage of this and archive, archive, archive.

PS: I will henceforth be capitalizing the word Queer when in reference to the community, in an effort to divorce ourselves from the notion of queerness and establish ourselves as a distinct group. I highly encourage everyone else to do the same.