Editor’s Note: This article first appeared in the Fall 2018 issue of Broadside, the quarterly magazine of the Library of Virginia.
How do you eat an elephant? One bite at a time.
For the past seven years, that’s how we’ve been tackling the task of processing the 1.5 million e-mails transferred to the Library of Virginia in 2010 as part of the electronic records of outgoing Governor Tim Kaine.
When Kaine announced his candidacy for the U.S. Senate in 2011, the Library challenged itself to make the Kaine administration’s e-mail records available for research in time for the 2012 election. What did that entail? Basically, we had to figure out how to separate whatever portion of those 1.5 million e-mails shouldn’t be included in our online collection—either because they aren’t records of enduring value (think e-mails announcing doughnuts in the break room) or because they contain sensitive materials such as attorney-client privileged communications, privacy-protected information, or operational security details.
When we set our sights on 2012, we knew of no good way to get to our goal other than to roll up our sleeves and start reading the e-mails. It did not take long to realize that we had bitten off more than we could chew. Kaine was entering his second year in the Senate before we could announce even a partial victory. In January 2014, we debuted our Kaine E-mail Project @ LVA, which contained 66,422 vetted e-mails. Releases of successive batches of processed e-mails followed in May 2014, September 2014, May 2016, and November 2016. Our manual and laborious review process—done one e-mail at a time—remained the same even as the collection got its 15 minutes of fame with presidential candidate Hillary Clinton’s July 2016 announcement of Tim Kaine as her running mate. We paused to cheer when the fruits of our labor appeared in the New York Times, Washington Post, and Politico as part of the media’s efforts to profile the man running for the number two spot in the government. Then we got back to work, continuing to chip away at the shrinking but still significant pile of unprocessed Kaine e-mail.
And all the while, bite after bite (or byte after byte, since this elephant is electronic), we’ve tried to ignore another elephant in the room—the knowledge that there would be more e-mails transferred from Kaine’s gubernatorial successors. Many more. The administration of Robert McDonnell transferred more than 7 million e-mails in 2014. These were followed by more than 8 million e-mails from the administration of Terry McAuliffe in 2018.
Like many state agencies still feeling the pain of repeated layoffs dating back to 2002, we hope to rebuild our staff in the coming years. Right now, our State Records department has only four archivists. But even with additional staffing, we will not be able to keep up with the rapid growth in digital materials without a new approach to processing electronic records. Fortunately, just such an approach is near at hand.
With the help of two professors in the David R. Cheriton School of Computer Science at the University of Waterloo in Ontario, Canada, the Library has been experimenting with the use of artificial intelligence (AI) to process electronic records more efficiently. The specific technology we have been testing, developed by Gordon V. Cormack and Maura R. Grossman, is known as Continuous Active Learning™ (CAL). A CAL system uses algorithms to make predictions about which documents are most likely to be relevant—in our case, which e-mails are most likely to be archival records that are open to the public. It presents its best guess to a human reviewer—in our case, the Library’s exhausted senior state governors’ records archivist. Based on a yes or no response from the reviewer, the tool improves its understanding of which e-mails the reviewer wants to find. The process continues until the tool no longer finds any e-mails that are likely to be of interest.
Cormack and Grossman liken the CAL process to popping a bag of popcorn. It takes a little while for the bag to warm up, but once it does, the kernels start popping rapidly. For the human reviewer, this means that some of the e-mails first presented by the tool for review are not relevant. But as the tool gets smarter, its predictions get more and more accurate until most of what it offers gets a thumbs-up from the human reviewer. And just like with popping popcorn, once the output switches back to mostly irrelevant items, you can assume that substantially all of the desired archival e-mails have been found and it’s time to pull the bag out of the microwave.
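For readers who like to see the moving parts, the review loop described above can be sketched in a few lines of Python. This is only an illustration and not Cormack and Grossman’s actual system: the word-count scorer, the `seed_query`, the `batch_size` and `patience` settings, and the sample messages are all assumptions made up for this example. The idea it shows is the same, though: rank the unreviewed e-mails, ask the reviewer for a yes or no on the top few, learn from the answers, and stop once the batches come back cold.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def score(text, pos_counts, neg_counts):
    # Toy relevance score: words seen in "yes" e-mails push the score up,
    # words seen in "no" e-mails push it down. (Assumed scorer, for illustration.)
    return sum(
        math.log((pos_counts[t] + 1) / (neg_counts[t] + 1))
        for t in set(tokenize(text))
    )

def cal_review(emails, ask_reviewer, seed_query, batch_size=2, patience=2):
    pos_counts = Counter(tokenize(seed_query))  # initial hint about what we want
    neg_counts = Counter()
    unreviewed = dict(enumerate(emails))
    found, reviewed, cold_batches = [], 0, 0

    # Keep going until several batches in a row contain nothing of interest
    # (the "popping has stopped" signal) or nothing is left to review.
    while unreviewed and cold_batches < patience:
        ranked = sorted(
            unreviewed,
            key=lambda i: score(unreviewed[i], pos_counts, neg_counts),
            reverse=True,
        )
        hits = 0
        for i in ranked[:batch_size]:
            text = unreviewed.pop(i)
            reviewed += 1
            if ask_reviewer(text):          # human says "yes, archival"
                found.append(i)
                pos_counts.update(set(tokenize(text)))
                hits += 1
            else:                           # human says "no"
                neg_counts.update(set(tokenize(text)))
        cold_batches = 0 if hits else cold_batches + 1

    return found, reviewed

# Hypothetical sample data; a real reviewer would be a person, not a keyword test.
emails = [
    "budget proposal for school funding",
    "doughnuts in the break room",
    "school funding policy memo",
    "parking lot closed friday",
    "teacher salary budget memo",
    "lunch order for friday",
    "break room fridge cleanup",
    "office party friday lunch",
    "printer out of toner",
    "happy birthday to the intern",
    "carpool signup sheet",
    "holiday hours reminder",
]
ARCHIVAL_WORDS = {"budget", "funding", "policy", "salary"}

def reviewer(text):
    return bool(ARCHIVAL_WORDS & set(tokenize(text)))

found, reviewed = cal_review(emails, reviewer, seed_query="school budget")
print(f"Found {len(found)} archival e-mails after reviewing {reviewed} of {len(emails)}")
```

In this toy run the three archival messages surface in the first two batches, and the loop stops before every message has been read. A production system would use a stronger statistical model and a more careful stopping rule, but the division of labor between tool and reviewer is the same.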
The Library recently released its first batch of e-mails processed using Cormack and Grossman’s CAL system. With artificial intelligence leading the way, a single archivist was able to identify roughly 27,000 archival e-mails by looking at only 45,000 e-mails in the total pile of 175,000 e-mails transferred from Kaine’s Secretary of Education. This is only a first step, and much work remains to develop workflows for implementing AI solutions for the McDonnell and McAuliffe administrations’ elephants still in the room. But we’re hopeful and excited about the possibilities. Because, let’s face it—we’d rather pop popcorn than eat elephants any day.
–Susan Gray Eakin Page, Digital Archives Coordinator