I attended the Introduction to NLTK (Text Analysis) with Python workshop last Friday. The workshop was very useful overall, and it gets you started with the right suite of tools for conducting some basic exploratory analysis on text data.
The workshop asked users to download Jupyter Notebook. I already had it installed as part of Anaconda Navigator and had used it in a previous class. I find Jupyter Notebook to be a great, user-friendly tool for working with and learning Python!
In the notebook, we imported the NLTK library and matplotlib for data visualization. Using the capabilities of these libraries and Jupyter's magic functions, the instructor showed us how to do some basic plotting and text analyses, such as calculating lexical density, building frequency distributions of certain words, and creating dispersion plots.
Some of these commands are case-sensitive, so the instructor taught us how to convert all words to lowercase to get accurate counts. We then went through the process of cleaning the data with lemmatization and stemming, as well as removing stop words.
The steps we used are shown in the code below. For this blog post, I decided to redo the workshop assignment using The Book of Genesis instead of Moby Dick, which we used in the workshop.
import nltk
In [2]:
# for the dispersion plot
import matplotlib
# tell Jupyter Notebook to display the graph inside the notebook
%matplotlib inline
In [3]:
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[3]:
True
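If you would rather skip the interactive downloader window that nltk.download() opens, NLTK can also fetch specific packages by name. A minimal sketch of the downloads this walkthrough relies on (run once, requires an internet connection):

import nltk
nltk.download('book')       # the sample texts behind nltk.book
nltk.download('stopwords')  # English stop word list used later
nltk.download('wordnet')    # data needed by WordNetLemmatizer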
In [4]:
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
In [5]:
type(text3)
Out[5]:
nltk.text.Text
In [6]:
text3.concordance("whale")
no matches
In [7]:
text3.similar("life")
house name brother cattle blood son fowls money thigh god heaven it good land fruit fowl beast lord sight knowledge
In [8]:
text3.similar("love")
went drank earth darkness morning se them give nig hath man had thus not took keep die call sle woman
In [9]:
text3.similar("queer")
No matches
In [10]:
text3.similar("death")
face place cattle image host generations sight father mother wife eyes voice presence head children hand way brother sheep flock
In [11]:
# the dispersion plot shows where each word appears, measured by how many
# words into the text each occurrence falls
text3.dispersion_plot(["life", "love", "loss", "death"])
In [12]:
text3.count("Love")
Out[12]:
0
In [13]:
text3.count("love")
Out[13]:
4
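Counts are case-sensitive, so "Love" and "love" are tallied separately. Before lowercasing the whole text, one quick check (a small sketch of my own, not from the workshop) is to fold case on the fly with a generator expression:

# case-insensitive count of 'love' without building a new list yet
sum(1 for t in text3 if t.lower() == "love")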
In [14]:
text3_lower = []
for t in text3:
    if t.isalpha():
        t = t.lower()
        text3_lower.append(t)
In [15]:
from nltk.corpus import stopwords
In [16]:
stops = stopwords.words('english')
In [17]:
print(stops)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [18]:
text3_stops = []
for t in text3_lower:
    if t not in stops:
        text3_stops.append(t)
In [19]:
print(text3_stops[:30])
['beginning', 'god', 'created', 'heaven', 'earth', 'earth', 'without', 'form', 'void', 'darkness', 'upon', 'face', 'deep', 'spirit', 'god', 'moved', 'upon', 'face', 'waters', 'god', 'said', 'let', 'light', 'light', 'god', 'saw', 'light', 'good', 'god', 'divided']
In [20]:
# use set() to see how many unique words remain after removing stop words
len(set(text3_stops))
Out[20]:
2495
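It is also worth checking how much the stop word filter actually removed; a quick sketch using the two lists built above:

# compare token counts before and after removing stop words
print(len(text3_lower), len(text3_stops))
print(len(text3_lower) - len(text3_stops))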
In [21]:
# lemmatization reduces each word to its root form
from nltk.stem import WordNetLemmatizer
In [22]:
wordnet_lemmatizer = WordNetLemmatizer()
In [23]:
wordnet_lemmatizer.lemmatize("waters")
Out[23]:
'water'
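One thing worth knowing: WordNetLemmatizer treats words as nouns by default, so verbs keep their inflection unless you pass a part of speech. A small sketch using the lemmatizer created above:

# the lemmatizer defaults to treating words as nouns; pass pos="v" for verbs
print(wordnet_lemmatizer.lemmatize("waters"))            # noun (the default)
print(wordnet_lemmatizer.lemmatize("created", pos="v"))  # verb form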
In [24]:
# create a clean, lemmatized list of text3
text3_clean = []
for t in text3_stops:
    t_lem = wordnet_lemmatizer.lemmatize(t)
    text3_clean.append(t_lem)
In [25]:
print(len(text3_clean))
18335
In [27]:
# lexical density: unique words divided by total words
len(set(text3_clean)) / len(text3_clean)
Out[27]:
0.12838832833378783
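So roughly 12.8% of the cleaned tokens are unique. A small helper (my own naming, not from the workshop) makes it easy to run the same comparison on other token lists:

# helper for lexical density: unique tokens / total tokens
def lexical_density(tokens):
    return len(set(tokens)) / len(tokens)

lexical_density(text3_clean)  # the cleaned Genesis list built above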
In [28]:
# sort and show the first 30 unique 'cleaned' words of the text
sorted(set(text3_clean))[:30]
Out[28]:
['abated', 'abel', 'abelmizraim', 'abidah', 'abide', 'abimael', 'abimelech', 'able', 'abode', 'abomination', 'abr', 'abrah', 'abraham', 'abram', 'abroad', 'absent', 'abundantly', 'accad', 'accept', 'accepted', 'according', 'achbor', 'acknowledged', 'activity', 'adah', 'adam', 'adbeel', 'add', 'adder', 'admah']
In [30]:
# Stemming reduces a word to its stem by stripping suffixes and prefixes;
# unlike lemmatization, the result is not always a dictionary word.
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
In [31]:
print(porter_stemmer.stem('accept'))
print(porter_stemmer.stem('accepted'))
accept
accept
In [32]:
t3_porter = []
for t in text3_clean:
    t_stemmed = porter_stemmer.stem(t)
    t3_porter.append(t_stemmed)
In [33]:
print(len(set(t3_porter)))
print(sorted(set(t3_porter))[:30])
2113
['abat', 'abel', 'abelmizraim', 'abid', 'abidah', 'abimael', 'abimelech', 'abl', 'abod', 'abomin', 'abr', 'abrah', 'abraham', 'abram', 'abroad', 'absent', 'abundantli', 'accad', 'accept', 'accord', 'achbor', 'acknowledg', 'activ', 'adah', 'adam', 'adbeel', 'add', 'adder', 'admah', 'adullamit']
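To see how stemming and lemmatization differ in practice, here is a quick side-by-side sketch on a few words from the text, reusing the porter_stemmer and wordnet_lemmatizer objects created above:

# the stemmer just strips affixes, so it can produce non-dictionary forms,
# while the lemmatizer returns dictionary words
for word in ["abundantly", "activity", "waters"]:
    print(word, porter_stemmer.stem(word), wordnet_lemmatizer.lemmatize(word))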
In [34]:
my_dist = FreqDist(text3_clean)
In [35]:
type(my_dist)
Out[35]:
nltk.probability.FreqDist
In [36]:
my_dist.plot(20)
Out[36]:
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
In [37]:
my_dist.most_common(20)
Out[37]:
[('unto', 598), ('said', 477), ('son', 294), ('thou', 284), ('thy', 278), ('shall', 259), ('thee', 257), ('god', 236), ('lord', 208), ('father', 207), ('land', 189), ('jacob', 179), ('came', 177), ('brother', 171), ('joseph', 157), ('upon', 140), ('day', 133), ('abraham', 129), ('wife', 126), ('behold', 118)]
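FreqDist supports a few other lookups worth knowing about; for instance, a quick sketch using the my_dist object built above:

# look up the count for a single word, and count the hapaxes
# (words that appear exactly once in the cleaned list)
print(my_dist['joseph'])
print(len(my_dist.hapaxes()))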