
Throughout the ten-plus years of our crowdsourcing project, Making History: Transcribe, we have had many conversations with our volunteers about emerging technologies and whether they could be useful in making our digitized records more accessible. Outside of the tech tools embedded in the transcription software FromThePage, which are always evolving and advancing, we sometimes use computer applications such as OpenRefine to ensure that we present the best possible data, and digital humanities programs to help us explore the information gleaned from full-text transcriptions and indices.

Recently, the focus on AI has brought renewed interest in how machine learning might be used to transcribe handwritten historical documents, as well as in other library- and archives-related work. We have used Optical Character Recognition (OCR) for many years. It is because of OCR that we can easily search scanned items with typewritten characters, such as the newspapers on Virginia Chronicle. But as anyone who has joined us for a Transcribe-a-thon knows, even OCR is not foolproof. Although typewritten and printed words are more uniform than handwriting, there are still instances where variations can confuse OCR.

That being the case, imagine how much more work is needed to create an HCR, or Handwriting Character Recognition, program. The variety in handwriting is seemingly endless, with people sometimes writing the same character differently depending on its place in a word. English handwriting styles have not stayed consistent through the years. Transcribers are often stumped when they encounter the “long s” for the first time because our modern eyes see it as the letter “f” instead. We have all heard about the enormous amounts of training data that large language models need in order to predict an accurate natural-language response to a prompt. How many more examples of human handwriting might a model need to achieve a reasonable level of accuracy on any English-language historical document?

That doesn’t mean people aren’t trying, but the best results are often achieved only if you have a big enough collection that you can train the model to be an expert on that one specific person’s handwriting. We might get there some day, but for now and for Making History: Transcribe, using people to crowdsource transcription is still the most efficient way.

Using Text-to-Image Prompts to “Handwrite Virginia” Often Yields Interesting Results

Our volunteers have also inquired about using speech-to-text software so they can read the document out loud as they decipher the text. We have never said no to this approach, but we find that, most of the time, it isn’t as helpful as volunteers have envisioned. FromThePage does not allow speech-to-text dictation directly into the text boxes on the website, but it is possible to cut and paste from another document. Other obstacles, such as navigating the fields in a form-based transcription or not knowing how to pronounce some of the Latin phrases common in many old court documents, can also slow down the process.

There are also problems inherent in the software itself and in whatever AI model it may be using. Transcription software needs to be trained on large collections of audio, and its accuracy depends on the model used. Recently, several news outlets reported on a transcription service used heavily in the medical field that had been trained largely on a readily accessible and free collection of audio: YouTube. Users discovered that when the program could not decipher the speech it heard, it would fall back on predicting the most common phrases in its corpus, which, given its training, included phrases such as “like and subscribe” and “thank you for watching.”

Just a Few of the Different World War II Separation Notices We Are Transcribing

We recently had an interesting peek into how such software might deal with historical documents. While reviewing Army World War II Separation Notices, we noticed a few weird occurrences. At first there was some confusion over some of the transcriber’s choices. We ask volunteers to recreate the original text as closely as possible, including misspellings and abbreviations, so we were surprised to see abbreviations in the original, such as “co.” and “va,” written out in full in the transcription. Even more confusing, the first word of many sentences seemed to be missing in the transcribed version. It wasn’t until we came across a description of the veteran’s civilian occupation that we realized what was going on.

When asked what his job was prior to military service, the secretary had typed the veteran’s response as:

“…worked at Mack’s shoe repair. Periodically 5 years.”

Yet the transcription provided by the volunteer transcriber read,

“…worked at Mack’s shoe repair. 4 or 5 years.”

The software is primed to interpret the spoken word “period” as an audio command, meaning that the speaker has come to the end of a sentence and punctuation is needed before a new one begins. Presumably the “ically” was lost, just as the first word of so many other sentences was; the software simply could not keep up with the user.
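For readers curious about the mechanics, here is a minimal, purely illustrative sketch of how a dictation tool might swap spoken punctuation commands for symbols. It is not the actual software our volunteers used (whose inner workings we do not know), and the command table and behavior are hypothetical; it only shows why a word like “periodically,” once split apart by the recognizer, can leave behind nothing but a punctuation mark.

```python
# Purely illustrative: a hypothetical dictation-style command handler.
# It assumes the speech recognizer has already produced a list of word tokens.

SPOKEN_COMMANDS = {
    "period": ".",
    "comma": ",",
}

def apply_voice_commands(tokens):
    """Replace spoken punctuation commands with the symbols they stand for."""
    output = []
    for token in tokens:
        symbol = SPOKEN_COMMANDS.get(token.lower())
        if symbol is not None:
            # Attach punctuation to the previous word rather than adding a new word.
            if output:
                output[-1] += symbol
            else:
                output.append(symbol)
        else:
            output.append(token)
    return " ".join(output)

# If the recognizer hears "periodically" but registers only the command word
# "period" and drops the fragment it cannot place, just the punctuation survives:
heard = ["worked", "at", "Mack's", "shoe", "repair", "period", "5", "years"]
print(apply_voice_commands(heard))
# -> worked at Mack's shoe repair. 5 years
```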

The software also seemed to have issues with the speaker’s pronunciation of certain words. Most noticeably, the word “enemy” (which occurs quite frequently given the nature of these documents) was consistently transcribed as “anime.” “Corps,” either through mistaken pronunciation or confusion by the AI, was sometimes rendered as “corpse.”

But most interesting were the moments when the machine learning revealed itself through the choices the AI model made when it was confused by the information it was given. One veteran’s military job was recorded by the military secretary as:

TELEPHONE ENGINEER (0-17.01): Army Signal Corps, Trinidad Base Command, Trinidad, B.W.I. Designed telephone outside plant and supervised construction to insure [sic] contractors following of specifications. Collected data and prepared mortality records of poles and equipment. Draftsman, Electrical (0-48-.110) Army Signal Corps, Trinidad Base Command. Made detailed drawings and sketches, of working plans or wiring diagrams for the erection, installation and wiring of electrical machinery and equipment.

The recorded transcription read:

TELEPHONE ENGINEER (0-17.01): Army signal corpse, Trinidad Base Command, Trinidad, B.. Design telephone outside plant and supervised construction to ensure contractors following the specifications. Collected data and prepared mortality records of Poles and equipment. Draftsman, electrical (0-48-.110) army signal corps, Trinidad base command. Made detailed drawings and sketches, of working plans or wiring diagrams for the erection, installation and wiring of electrical machinery and equipment.

You may immediately spot some of the mistakes: “B.W.I.” becomes “B..” because words and characters were dropped after the audio command “period”; the mispronunciation of “corps” becomes “corpse”; and “ensure” is spelled correctly in the transcription while the original actually reads “insure.” But what really stood out to us was the capitalization of “Poles.” The phrase “mortality records of poles” is a very uncommon way of saying that the telephone poles were deteriorating. It appears that the use of “mortality” (a word usually applied to humans), the mispronunciation that produced “corpse,” and perhaps the added context of “Army” led the AI model to decide that the “poles” in this context were Polish people. We do not know the program used or what it was trained on, but such a small capitalization change drastically alters the meaning of this paragraph.

So, we aren’t quite ready to let AI do all the work for us, even in this small area. Even though we review every transcription before we ingest it for use in our digital collections, it’s always good to check your own work, and AI’s.

Jessi Bennett

Digital Collections Specialist
