By Meghan Arends

Technological advancements present new opportunities within education, both in classrooms and in public settings. The world of virtual reality introduces a unique method of interaction with history, something that has inspired Greenhouse Studios at the University of Connecticut. One of their most recent projects, temporarily titled “Courtroom 600,” is an educational experience that combines video game technology and archival sources to give players a firsthand look at the Holocaust and the Nuremberg Trials. By using actual evidence and testimony documented from the case, Greenhouse Studios aims to present the historical truth to students, museumgoers, and gamers in an effort to keep collective memory of the Holocaust alive at a time when doing so is paramount.

I began working with the Courtroom 600 team as a graduate archives intern this past summer. Since the internship was remote, my work focused on digital archives rather than on materials within a traditional repository. I previously had very little experience in the field of digital archives, but with the help of my supervisor, I became well-versed in metadata, Optical Character Recognition (OCR), and HOCR. These skills mattered because the game uses documents from the Thomas J. Dodd Papers, held at the UConn Archives & Special Collections, to narrate the story of the Holocaust. Players will discover these documents as clues in the game, building the story as they go. The documents will need to be collected and interpreted for the sake of gameplay, and the program will link players to each document’s original listing on the digital repository. My task, therefore, was to edit the data behind these documents so that it was usable and searchable, preparing them for use within Courtroom 600.

A comparison between the OCR generated by Tesseract and the corrected product. German names were often difficult for Tesseract to recognize, and inconsistencies in the typewriter’s printing created numerous errors. This image comes from a presentation I gave at the CTDA Annual Meeting in June.

OCR is essentially a transcript of what is on a document. The University of Connecticut Archives & Special Collections uses a program called Tesseract, which reads each page of an archived document and generates a read-out of the text. This transcript is then visible to users of the digital repository within the book viewer of the document. Meanwhile, HOCR is an HTML version of OCR used to make a document searchable. HOCR doesn’t just read what’s on the document; it maps out the content, generating bounding boxes around each word so that search results can be highlighted. As we learned over time, OCR and HOCR are both generated from the original object, but separately. We were unable to find a way to intertwine the editing processes, so I had to use an XML editor to display and edit the HOCR. The issue with automatically generating OCR and HOCR is that they’re not always accurate, especially when the source documents were created on older technology like the typewriters used during the trials. Quirks typical of typewriters, such as letters typed over other letters to correct typos, and mechanical issues create errors in the transcripts.
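To make the idea of word-level bounding boxes concrete, here is a minimal Python sketch that reads a small HOCR fragment and looks up the box for a searched word. It is an illustration only, not the tooling used on the project: the sample fragment and its coordinates are invented, and the names `WordBoxParser` and `find_boxes` are hypothetical. The class names `ocr_line` and `ocrx_word` and the `bbox x0 y0 x1 y1` property follow the general HOCR conventions that Tesseract emits.

```python
import re
from html.parser import HTMLParser

# Invented sample; real HOCR for a page is much longer.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 2550 3300'>
  <span class='ocr_line' title='bbox 300 400 1200 460'>
    <span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 91'>Nuremberg</span>
    <span class='ocrx_word' title='bbox 540 400 700 460; x_wconf 88'>Trials</span>
  </span>
</div>
"""

# The four numbers after "bbox" are left, top, right, bottom in pixels.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collects (word, (x0, y0, x1, y1)) pairs from word-level spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current_box = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._current_box = tuple(int(n) for n in match.groups())

    def handle_data(self, data):
        # Only text directly inside a word span is recorded.
        if self._current_box is not None and data.strip():
            self.words.append((data.strip(), self._current_box))
            self._current_box = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_box = None


def find_boxes(words, query):
    """Return the boxes of every word matching the query, ignoring case."""
    return [box for text, box in words if text.lower() == query.lower()]


parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
print(find_boxes(parser.words, "trials"))
```

A search feature can only highlight a word if the text inside the span matches what the user types, which is why a misread name in the HOCR makes that word effectively unfindable.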

Unlike computers, we can recognize the actions and errors of other humans. Approximately half of my time during this internship was spent correcting the OCR. This was done within Islandora, the program used to build the digital repository, and was relatively simple. It was tedious at times, especially when the same error recurred throughout the original generation, but that time was mostly spent correcting typos and keeping line spacing and characters consistent. This was a learning process, as we were still just experimenting within the program. The other half of my time was spent updating the HOCR, and not just its typos. The bounding box coordinates had to be checked and adjusted to create an accurate map of the page. I also helped update descriptive metadata for individual pages, once again aiding their searchability, but the main purpose of this internship was to help develop and streamline the editing process so that it could be passed on to future interns and assistants with near-flawless execution.

An example of HOCR being edited in Notepad++, an XML editor. The numbers following “bbox” correspond to the word’s position on the page, creating a box to highlight each word during a search.
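Checking bounding boxes by eye is slow, so a simple sanity check can flag the most obvious problems before a person reviews the page. The following Python sketch is one possible approach under stated assumptions, not the process used during the internship: the function name `bbox_problems`, the page size, and the sample coordinates are all hypothetical, and boxes are assumed to be in the HOCR order of left, top, right, bottom.

```python
def bbox_problems(box, page_box):
    """Return a list of plain-language problems found with one word box.

    Both arguments are (x0, y0, x1, y1) tuples: left, top, right, bottom.
    """
    x0, y0, x1, y1 = box
    px0, py0, px1, py1 = page_box
    problems = []
    # A usable box needs its right edge past its left, and bottom past top.
    if x0 >= x1 or y0 >= y1:
        problems.append("empty or inverted box")
    # A word cannot sit outside the page it was scanned from.
    if x0 < px0 or y0 < py0 or x1 > px1 or y1 > py1:
        problems.append("outside page")
    return problems


# Hypothetical page size and word boxes, for illustration only.
PAGE = (0, 0, 2550, 3300)
SAMPLE_WORDS = [
    ("Nuremberg", (300, 400, 520, 460)),   # fine
    ("Trials", (700, 400, 540, 460)),      # left and right edges swapped
    ("Tribunal", (2500, 400, 2600, 460)),  # runs off the right margin
]

for text, box in SAMPLE_WORDS:
    for problem in bbox_problems(box, PAGE):
        print(f"{text}: {problem}")
```

A check like this cannot tell whether a box actually surrounds the right word on the scan; that still takes a person comparing the highlighted region against the page image.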

OCR and HOCR exist for an important reason: to make archival sources fully accessible to researchers. These documents are supposed to provide an enriching experience to players, who are meant to interpret the sources to fully understand the gravity of the situation. The HOCR has to be accurate to what is on each page to make it searchable. If it isn’t edited and is instead left with the numerous errors that come from computer generation, the search feature becomes inoperative and pointless. Leaving inaccurate OCR just creates confusion and inconsistency. As I learned, the necessity of accuracy holds greater significance than just providing tools for use. With this internship, I witnessed the practical application of archives and the significance they can provide. One of the primary purposes of archives is to preserve historical truth for future generations to use. Archives are essential keepers of public memory and can be vital tools for communities, educators, and activists. The job of the archivist is to preserve and document these truths and to aid researchers in their discovery, or rediscovery. To provide any information to the public that is not truly accurate to the original source is to provide false information.

There is, unfortunately, a deeply rooted history of denial when it comes to the Holocaust. Despite best efforts to punish those who committed such atrocities, anti-Semitic sentiments linger. When I ponder the importance of Courtroom 600 and these documents, I think of swastikas spray-painted on walls, anti-Semitic conspiracy theories, and the tragic vandalism of monuments meant for mourning. These documents are actual evidence used during the trials. They are representations of the truth that historians and archivists continuously work to keep in the light, countering misinformation and hate. Editing metadata and transcriptions is necessary; to provide anything less than full accuracy is a misrepresentation of the truth. Even if it’s out of pure ignorance, it is an alteration of the facts. My job was certainly to make research easier for users of the repository and to prepare a polished product for use in the game, but my job was also to ensure that, no matter what public users look at or what feature they use, the truth of what occurred during that period would be visible. I learned during this internship that within archives, there’s a responsibility to the moral significance that comes with accessioning and processing historical artifacts. Each step of the archival process must consider what each artifact represents to specific groups. Doing so only helps to serve the communities we help preserve.

Upper left: an example of the documents I worked with. Upper right: an example of what OCR looks like from an editor’s point of view. Lower left: Search results using HOCR mapping. Lower right: Text transcript of document generated by OCR. Images courtesy of the Connecticut Digital Archive.