I recently had the opportunity to attend the Internet Archive’s workshop on web archiving in Los Angeles. Firstly, before I get into a rundown of what we learned, I just wanna say it was AWESOME. I met some great people there, including fellow DHers! This was their first workshop, and more are forthcoming, so don’t be sad if you missed this one! Please feel free to ask me questions in the comments or in an email if you want more information or clarification on any of these points. I’d recommend reading this post in order, as it moves from basics to advanced topics. Now onto the good stuff…
What is the Internet Archive?
- Contains primarily, though not exclusively, 20th-21st century records of human interaction across all possible media (newspapers to fiction to government info to art, etc.).
- Constant change and capture.
- Every country in the world included.
- Fit for both macro- and micro-level research questions.
- Fit to archive anywhere from hundreds to millions of documents.
- Known for the Wayback Machine, which takes snapshots of websites at different points in time and shows you those snapshots along with information about when they were taken (a quick sketch of querying it follows this list).
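To make the snapshot idea concrete, here’s a minimal sketch (my own illustration, not workshop material) of asking the Wayback Machine’s public availability API whether it holds a capture of a URL; the example URL and timestamp are placeholders.

```python
import requests

# Ask the Wayback Machine's availability API for the snapshot closest to a
# given timestamp (YYYYMMDDhhmmss). The URL and timestamp are illustrative.
resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com", "timestamp": "20200101"},
)
closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["url"], closest["timestamp"])  # link to the archived capture
else:
    print("No snapshot found")
```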
What is a web archive?
- Web archives are collections of archived URLs that contain as much of the original web content as possible, documenting change over time and attempting to preserve the experience a user would’ve had of the site on the day it was archived.
Challenges of web archiving
- Trying to hit the balance between access to billions of bits of information and actual usability of that information.
- Content is relational and self-describing.
- It’s difficult to subset relevant collections, and to store and compute over all of it.
- So many methods and tools to choose from.
Glossary
- Crawler – software that collects data from the web.
- Seed – a unique item within the archive.
- Seed URL – the starting point/access point for the crawler.
- Document – here, any file with a unique URL.
- Scope – how much stuff the crawler will collect.
- WARC – file type for downloaded archived websites (see the sketch just after this list).
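To give a feel for what’s inside a WARC, here’s a minimal sketch using the warcio Python library; this is my own example, not something from the workshop, and the filename is hypothetical.

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over the records in a (hypothetical) WARC file and print the
# target URL of each archived HTTP response it contains.
with open("my_crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```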
Example workflow from a project on COVID response in the Niagara Falls region
- Close reading with SolrWayback – search the collection and examine individual items.
- Distant reading with Google Colab – sentiment analysis, summary statistics, data visualization (see the sketch after this list).
- Data subsetting with ARCH – full-text dataset extraction from the Internet Archive’s collections.
- As an outcome, the project helped the City of Niagara Falls formulate a better FAQ for common questions they weren’t answering.
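For the distant-reading step, here’s a minimal sketch of the kind of thing you might run in Colab; the CSV filename and column names (url, text) are assumptions, and VADER sentiment from NLTK stands in for whatever the project actually used.

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # lexicon VADER needs for scoring

# Hypothetical CSV of extracted page text, one row per archived page.
df = pd.read_csv("niagara_covid_pages.csv")  # assumed columns: url, text

sia = SentimentIntensityAnalyzer()
df["sentiment"] = df["text"].fillna("").map(lambda t: sia.polarity_scores(t)["compound"])

print(df["sentiment"].describe())           # summary statistics
print(df.nsmallest(5, "sentiment")["url"])  # the most negative pages
```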
Other methods
- Web scraping – creating a program that takes data from websites directly.
- Topic modeling – assess recurring concepts at scale (understanding which words cluster together to form a topic).
- Network analysis – computationally assessing URL linking patterns to determine relationships between websites (see the sketch after this list).
- Image visualization – extracting thousands of images and grouping them by features.
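As a toy illustration of the web scraping and network analysis ideas above, here’s a sketch that fetches a couple of pages, collects the domains they link to, and counts inbound links; the seed URLs are placeholders, and this stands in for (rather than reproduces) any tool shown at the workshop.

```python
import requests
import networkx as nx
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

seeds = ["https://example.com", "https://example.org"]  # placeholder seeds
graph = nx.DiGraph()

for seed in seeds:
    # Fetch the page and pull out every outbound link.
    html = requests.get(seed, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urlparse(urljoin(seed, a["href"])).netloc
        if target:
            graph.add_edge(urlparse(seed).netloc, target)

# Domains receiving the most links from the seed pages.
print(sorted(graph.in_degree(), key=lambda x: x[1], reverse=True)[:10])
```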
Web archiving tools
- Conifer (Rhizome)
- Webrecorder
- DocNow
- Web Curator Tool
- NetArchive suite
- HTTrack
- Wayback Machine – access tool for viewing pages, surf web as it was.
- Archive-It
- WARC – ISO standard for storing web archives.
- Heritrix – web crawler that captures web pages and creates WARC files.
- Brozzler – web crawler plus browser-based recording technology.
- Elasticsearch & Solr – full-text search indexing & metadata search engine software.
Intro to Web Archiving
- The average web page only lasts ~90-100 days before changing, moving, or disappearing.
- Often used to document subject areas or events, capture and preserve web history as mandated, take one-time snapshots, and support research use.
Particular challenges
- Social media is always changing policies, UI, and content.
- Dynamic content, stuff that changes a lot.
- Databases and forms that require user interaction; alternatives include sitemaps or direct links to content.
- Password protected and paywalled content.
- Archive-It can only crawl the public web, unless you have your own credentials.
- Some sites, like Facebook, explicitly block crawlers. Instagram blocks them but has workarounds.
How to Use the Internet Archive (It’s SO EASY)
- Browse to web.archive.org/save – enter the URL of the site you want to archive and it creates a snapshot. Boom! (A sketch of doing this from a script follows this list.)
- You can also go to archive-it.org – create a collection (of sites) and add seeds (URLs).
- Two types of seeds: with a trailing / (slash) and without. Without the trailing slash, the crawl includes all the subdomains – e.g., if I seed my Commons blog noveldrawl.commons.gc.cuny.edu, it’ll give me ALL the Commons blogs, everything before the ‘.commons’. If I seed noveldrawl.commons.gc.cuny.edu/, it’ll give me just the stuff on my blog after the slash, like noveldrawl.commons.gc.cuny.edu/coursework.
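If you want to trigger that web.archive.org/save step from a script instead of the browser, here’s a minimal sketch; the target URL is a placeholder, and the Content-Location header is only sometimes returned, so treat that last line as an assumption.

```python
import requests

# Ask the Wayback Machine's "Save Page Now" endpoint to capture a page.
target = "https://example.com"  # placeholder URL
resp = requests.get(f"https://web.archive.org/save/{target}", timeout=120)

print(resp.status_code)                          # 200 if the save request went through
print(resp.headers.get("Content-Location", ""))  # path of the new snapshot, when provided
```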
I archived the website data… now what do I do with it?!… Some Tools to use with your WARC files:
- Palladio – create virtual galleries
- Voyant – explore texts and links (a sketch of pulling plain text out of a WARC for Voyant follows this list)
- RawGraphs – create graphs
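Most of these tools want plain text rather than a raw WARC, so here’s a minimal sketch of extracting the visible text of each archived HTML page into .txt files you could load into something like Voyant; warcio and BeautifulSoup are my choice of libraries here, and the filename is hypothetical.

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

# Pull the visible text out of every HTML response in a (hypothetical) WARC
# and write one .txt file per page.
with open("my_crawl.warc.gz", "rb") as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "html" not in content_type:
            continue
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        with open(f"page_{i}.txt", "w", encoding="utf-8") as out:
            out.write(soup.get_text(separator=" ", strip=True))
```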
ARCH (Archives Research Compute Hub)
- ARCH is not publicly available until Q1 2023; workshop participants are being given beta access and can publish experiment results using it.
- Currently it can only use existing Archive-It collections; after release, user-uploaded collections will be supported.
- Archive-It collections require a membership to use.
- Owned by a non-profit; the Internet Archive is decentralized and not limited to being a government or corporate tool.
- Supports computational analysis of collections, reducing the technical knowledge needed to analyze sites, and allows complex collections to be analyzed at large scale.
- Integrates with the Internet Archive, and has the same interface as Archive-It.
- Can extract domain frequency (relationships between websites), plain text, text file info, audio files, images, PDFs, and PowerPoints. It can also graph these relationships in the browser. There’s even more it can do than this; if you need it, it can probably do it. All data is downloadable and can be previewed before download (a sketch of working with a downloaded dataset follows this list).
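Once you’ve downloaded a dataset from ARCH, working with it is ordinary data wrangling; here’s a minimal sketch using pandas and matplotlib, where the filename and column names (domain, count) are assumptions about what a domain-frequency export might look like.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a (hypothetical) domain-frequency CSV downloaded from ARCH and
# plot the ten most frequent domains. Column names are assumptions.
df = pd.read_csv("arch_domain_frequency.csv")  # assumed columns: domain, count
top = df.sort_values("count", ascending=False).head(10)

top.plot.barh(x="domain", y="count", legend=False)
plt.xlabel("captures")
plt.tight_layout()
plt.show()
```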
Observations
- I noticed that most of the people present had faced some sort of cultural erasure, threatened or realized, modern or historical, that brought them to their interest in archiving.
- From experience using these tools, I’d say Wayback is great if you need to just archive one site, perhaps for personal use, whereas Archive-It is great if you have many sites in a particular research area that you’re trying to archive and keep all in one place.
Links of interest
- https://archive-it.org/collections/11913 – Schomburg Center for Research in Black Culture, Black Collecting, and Literary Initiatives (67 GB, 23+ websites since March 2019; contains blog posts, articles, bookstore lists, etc)
- https://archive-it.org/collections/19848 – Ivy Plus Libraries Confederation, LGBTQ+ Communities of the Former Soviet Union & Eastern Europe (30 GB, 70+ websites, since Aug ’22; contains news, events, issues, etc)
Further Resources