Archiving the AWOL Index
Ryan Bauman
The AWOL Index is a new experimental project to extract structured data from AWOL - The Ancient World Online, which has published links to material about the ancient world since 2009.
As a practical experiment, I thought it might be interesting to check which URLs in the index are already in web archives, and try to archive those which are not. To do this, I downloaded the AWOL index JSON, unzipped it, and extracted unique linked URLs with:
find . -name '*.json' -exec grep '"url":' {} \; | \
sed -e 's/^.*"url": "//' -e 's/".*$//' | \
sort -u > urls-clean-uniq.txt

This gave me 52,020 unique URLs.
Initially, I thought it would be best to check whether the URLs were in any web archive, rather than just one. To do this, I used the mementoweb.org Time Travel "Find" API, which checks URL availability across a wide range of archives. Unfortunately, this proved to be a relatively slow process.
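A minimal sketch of that kind of per-URL check, assuming the Time Travel JSON endpoint format /api/json/{datetime}/{url} and a 404 response when no memento is found (the output file names here are just illustrative):

# Sketch: look up each URL via the Time Travel "Find" service.
# The endpoint format and the 404-on-miss behavior are assumptions.
while read url; do
  if curl -s --fail "http://timetravel.mementoweb.org/api/json/2015/${url}" > /dev/null; then
    echo "$url" >> timetravel-found.txt
  else
    echo "$url" >> timetravel-missing.txt
  fi
done < urls-clean-uniq.txt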
In order to speed things up, I decided to try checking and using just one web archive: the Internet Archive Wayback Machine. Using some hand-picked URLs that showed as “missing” from the truncated mementoweb.org process, I checked the Wayback Machine Availability API to see what sort of results I got.
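Spot-checking a single URL against the Availability API is just a quick curl; for example (example.com is only a placeholder here):

curl -s "http://archive.org/wayback/available?url=example.com"
# Returns JSON; an empty "archived_snapshots" object means no snapshot was reported.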
Interestingly, this led to the realization that certain URLs which show no availability in the JSON API do show availability in the CDX API. So, I decided to check URL availability using the relatively fast CDX API for the most accurate results:
while read url; do \
if [ -n "$(curl -s "http://web.archive.org/cdx/search/cdx?url=${url}")" ]; \
then echo "$url">> cdx-success.txt; \
else echo "$url">> cdx-failure.txt; \
fi; \
done < urls-clean-uniq.txt

After this process finished running, I had 34,832 URLs showing as already successfully archived (or about 67%). For the remainder, I wanted to submit them to the Wayback Machine for archiving, which I did with:
while read url; do \
echo "$url"; \
curl -L -o /dev/null -s "http://web.archive.org/save/$url"; \
done < cdx-failure.txt

So, any live, savable URLs which weren't already in the archive at the time this process was run should be added to it.
After this process finished, I did an initial pass at checking the submitted URLs for presence in the CDX index, and found 10,823 hits for the 17,188 URLs submitted (a 63% success rate). I also noticed that the CDX server can occasionally give false negatives (i.e. return no results for something that is in the index), so I did another pass against the 6,365 "missing" URLs to see whether they were actually available; this turned up only 5 false negatives from the initial run.
So, after running these processes, it seemed the Wayback Machine now had at least one snapshot for 45,660 of our 52,020 URLs (about 88%). Spot-checking the remaining 6,360 URLs showed that some returned no snapshots via either the JSON or CDX APIs but do show snapshots in the web interface. One such example also shows up in the mementoweb.org API, so I decided to try checking the Wayback Memento API by hitting http://web.archive.org/web/{URI-R}:
while read url; do echo "$url"; \
if curl -s --fail -I "http://web.archive.org/web/$url"; \
then echo "$url" >> memento-success.txt; \
fi; \
done < cdx-missing-combined.txt

This revealed that 4,606 of our 6,360 "missing" URLs were, in fact, successfully archived (so 50,266 of our 52,020 original URLs, or about 97%, now have at least one snapshot in the Wayback Machine). Looking at the remaining 1,754 missing URLs, we can triage these further and see what currently returns a "live" response code with:
while read url; do \
if curl -s --fail -L -I "$url"; \
then echo "$url" >> cdx-missing-live-success.txt; \
else echo "$url" >> cdx-missing-live-failure.txt; \
fi; \
done < cdx-memento-missing.txt

This gave us 431 URLs with no snapshots that currently return an HTTP error (less than 1% of our total URL count).
I plan on doing one more archive run for the remaining 1,323 missing URLs, just in case some temporary server issues cropped up during the initial run.
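That follow-up run can reuse the same save loop from above, simply pointed at the live-but-unarchived list produced by the triage step:

# Re-submit the live-but-unarchived URLs to the Wayback Machine's save endpoint.
while read url; do
  echo "$url"
  curl -L -s -o /dev/null "http://web.archive.org/save/$url"
done < cdx-missing-live-success.txt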
Going forward, it might be helpful to automate this process to check and archive new URLs in the AWOL Index on a periodic basis. There are probably much more interesting things that can be done with mining and analyzing the AWOL Index, but the foundation of some of these activities will rely on the simple availability of the linked content.
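As a rough sketch of what such a periodic job might look like (the script layout, file paths, and schedule here are purely hypothetical):

#!/bin/bash
# Hypothetical periodic job: re-extract URLs from a fresh AWOL Index dump,
# find any we have not seen before, and submit them for archiving.
set -e
touch urls-previous.txt    # first run starts from an empty list
find awol-index -name '*.json' -exec grep '"url":' {} \; \
  | sed -e 's/^.*"url": "//' -e 's/".*$//' \
  | sort -u > urls-current.txt
comm -13 urls-previous.txt urls-current.txt > urls-new.txt   # URLs new since the last run
while read url; do
  curl -L -s -o /dev/null "http://web.archive.org/save/$url"
done < urls-new.txt
mv urls-current.txt urls-previous.txt
# e.g. run monthly from cron: 0 3 1 * * /path/to/archive-awol.sh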