Tag: metadata

Internship: Archives, Virtual Reality, and the Preservation of Memory
Posted on November 9, 2021 | Author: Heidi Gengenbach


By Meghan Arends

Technological advancements present new opportunities within education, both in classrooms and in public settings. The world of Virtual Reality introduces a unique method of interaction with history, something that has inspired Greenhouse Studios at the University of Connecticut. One of their most recent projects, temporarily titled “Courtroom 600,” is an educational experience that combines video game technology and archival sources to give players a firsthand look at the Holocaust and the Nuremberg Trials. By using actual evidence and testimony documented from the case, Greenhouse Studios aims to present the historical truth to students, museumgoers, and gamers in an effort to keep collective memory of the Holocaust alive at a time when it matters most.

I began working with the Courtroom 600 team as a graduate archives intern this past summer. Since the internship was remote, my work focused on digital archives rather than on materials within a traditional repository. I previously had very little experience in the field of digital archives, but with the help of my supervisor, I became well-versed in metadata, Optical Character Recognition (OCR), and HOCR. These skills mattered because the game uses documents from the Thomas J. Dodd Papers, held at the UConn Archives & Special Collections, to narrate the story of the Holocaust. Players will discover these documents as clues in the game, building the story as they go. These documents will need to be collected and interpreted for the sake of gameplay, and the program will link players to each document’s original listing on the digital repository. My task, therefore, was to edit the data behind these documents so that it would be usable and searchable within Courtroom 600.

A comparison between the OCR generated by Tesseract and the corrected product. German names were often difficult for Tesseract to recognize, and inconsistencies in the typewriter’s printing created numerous errors. This image comes from a presentation I gave at the CTDA Annual Meeting in June.

OCR is essentially a transcript of what is on a document. The University of Connecticut Archives & Special Collections uses a program called Tesseract, which reads each page of an archived document and generates a read-out of the text. This transcript is then visible to users of the digital repository within the book viewer of the document. Meanwhile, HOCR is an HTML version of OCR used to make a document searchable. HOCR doesn’t just read what’s on the document; it maps out the content, generating bounding boxes around each word to highlight searches. As we learned over time, OCR and HOCR are both generated from the original object, but separately. We were unable to find a way to intertwine the editing processes, so I had to use an XML editor to display and edit the HOCR. The issue with automatically generating OCR and HOCR is that they’re not always accurate, especially when originating from documents created on older technology like the typewriters used during the trials. Errors typical to typewriters (such as letters typed over letters to correct typos, and mechanical issues) create errors in the transcripts.
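
To make the idea of a word-level map concrete, here is a minimal sketch of reading word bounding boxes out of HOCR with Python's standard library. It assumes the common Tesseract-style markup, in which each word is a `span` with class `ocrx_word` and a `title` attribute holding `bbox x0 y0 x1 y1`. The sample fragment, its words, and its coordinates are invented for illustration and do not come from the Dodd Papers or the repository's files.

```python
from html.parser import HTMLParser

# Hypothetical HOCR fragment; words and coordinates are invented for illustration.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 100 200 620 240'>
  <span class='ocrx_word' title='bbox 100 200 260 240; x_wconf 91'>Nuremberg</span>
  <span class='ocrx_word' title='bbox 280 200 420 240; x_wconf 62'>Tribunal</span>
</span>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an HOCR title attribute, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # Only keep text that appears inside a word span with a usable bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def extract_words(hocr):
    """Parse an HOCR string and return a list of (text, bbox) pairs."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


for text, bbox in extract_words(SAMPLE_HOCR):
    print(text, bbox)
```

Each pair ties a piece of text to a rectangle on the page image, which is what lets a viewer highlight search hits. Correcting a typo changes only the text, while a misplaced box has to be fixed separately in the coordinates.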

Unlike computers, we can recognize the actions and errors of other humans. Approximately half of my time during this internship was spent correcting the OCR. This was done within Islandora, the program used to build the digital repository, and was relatively simple. It was tedious at times, especially when the same error had been made repeatedly during the original generation, but that time was mostly spent correcting typos and focusing on consistency for line spacing and the characters used. This was a learning process, as we were still just experimenting within the program. The other half of my time was then spent updating the HOCR, though not just the typos. The bounding box coordinates had to be checked and adjusted to create an accurate map of the page. I also helped update descriptive metadata for individual pages, once again aiding their searchability, but the main purpose of this internship was to help develop and streamline the editing process so that it could be passed on to future interns and assistants with near-flawless execution.
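
Part of checking bounding boxes can be automated before a person reviews the page. The sketch below is an illustration under stated assumptions, not the process used during the internship: the function name, the sample words, and the page dimensions are all hypothetical. It flags boxes that have no area or that extend past the page edges.

```python
def find_bbox_problems(words, page_width, page_height):
    """Return (text, problem) pairs for boxes that cannot be correct.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs in pixel coordinates,
    with (x0, y0) the top-left corner and (x1, y1) the bottom-right corner.
    """
    problems = []
    for text, (x0, y0, x1, y1) in words:
        if not text.strip():
            problems.append((text, "empty text"))
        if x0 >= x1 or y0 >= y1:
            problems.append((text, "box has no area"))
        elif x0 < 0 or y0 < 0 or x1 > page_width or y1 > page_height:
            problems.append((text, "box extends past the page"))
    return problems


# Hypothetical words on a hypothetical 2550 x 3300 pixel page scan.
sample_words = [
    ("Göring", (100, 200, 220, 240)),   # plausible box
    ("Keitel", (300, 200, 280, 240)),   # right edge is left of the left edge
    ("Dodd", (2400, 200, 2600, 240)),   # runs off the right side of the page
]

for text, problem in find_bbox_problems(sample_words, 2550, 3300):
    print(f"{text}: {problem}")
```

A check like this only catches boxes that are impossible. Whether a plausible box actually sits on the right word still has to be confirmed by looking at the page.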

An example of HOCR being edited in Notepad++, an XML editor. The numbers following “bbox” correspond to the word’s position on the page, creating a box to highlight each word during a search.

OCR and HOCR exist for an important reason: to make archival sources fully accessible to researchers. These documents are supposed to provide an enriching experience to game players, who are meant to interpret the sources to fully understand the gravity of the situation. The HOCR has to be accurate to what is on each page to make it searchable. If it isn’t edited and is instead left with the numerous errors that come from computer generation, the search feature becomes inoperative and pointless. Leaving inaccurate OCR just creates confusion and inconsistency. As I learned, the necessity of accuracy holds greater significance than just providing tools for use. With this internship, I witnessed the practical application of archives and the significance they can provide. One of the primary purposes of archives is to preserve historical truth for future generations to utilize. Archives are essential keepers of public memory and can be vital tools for communities, educators, and activists. The job of the archivist is to preserve and document these truths and to aid researchers in their discovery, or rediscovery. To provide any information to the public that is not truly accurate to the original source is to provide false information.

There is, unfortunately, a deeply rooted history of denial when it comes to the Holocaust. Despite best efforts to punish those who committed such atrocities, anti-Semitic sentiments always linger. When I ponder the importance of Courtroom 600 and these documents, I think of swastikas spray-painted on walls, anti-Semitic conspiracy theories, and the tragic vandalism of monuments meant for mourning. These documents are actual evidence used during the trials. They are representations of the truth that historians and archivists are continuously working to keep in the light to counter misinformation and hate. Editing metadata and transcriptions is necessary; to provide anything less than full accuracy is a misrepresentation of the truth. Even if it’s out of pure ignorance, it is an alteration of the facts. My job was certainly to make research easier for users of the repository and to prepare a polished product for use in the game, but my job was also to ensure that, no matter what public users look at or what feature they use, the truth of what occurred during that period would be visible. I learned during this internship that within archives, there’s a responsibility to the moral significance that comes with accessioning and processing historical artifacts. Each step of the archival process must consider what each artifact represents to specific groups. Doing so only helps to serve the communities we help preserve.

Upper left: an example of the documents I worked with. Upper right: an example of what OCR looks like from an editor’s point of view. Lower left: Search results using HOCR mapping. Lower right: Text transcript of document generated by OCR. Images courtesy of the Connecticut Digital Archive.

Tagged Archives internship, Archives students, Digital Archives, featured, HOCR, Internship, metadata, OCR, virtual reality
Internship: “Adventures in Metadata”
Posted on September 2, 2020 | Author: Heidi Gengenbach


By Christopher Brown, Archives Track

*Note: Effective September 1, 2020, WGBH is known as “GBH” and the Media Library and Archives (MLA) as the “GBH Archives”. The current terminology is used in this blog, though the internship occurred while prior names were in use.*

Located in Boston, GBH is one of the largest public broadcasting stations in the country, offering a variety of TV and radio programs aimed at fostering education, culture, and a diversity of viewpoints. As PBS’s flagship station, GBH produces a substantial amount of all national content, including programs such as Antiques Roadshow, Nova, Frontline, and American Experience. As someone who is passionate about history, culture, and media, I deeply respect and believe in GBH’s mission. The network is my ultimate career goal, offering an opportunity to merge my BA in Film with my graduate studies in History. As such, I am grateful to have interned at GBH last summer as a volunteer, and this summer in a formal capacity.

Last year, my work centered around promotion for the AAPB (American Archive of Public Broadcasting), a repository of video and audio from around the country, including material from over 100 PBS affiliates. This summer, I worked at the GBH Archives, the official repository of internally generated content. The GBH Archives’ focus on preservation and access makes materials available for research, education, and production use. From audio/visual content such as photographs, footage, and full episodes, to paper records such as press kits and production documents, the archive contains a rich collection of the network’s history and programming.

Interning during Covid presented unanticipated challenges. The office was tentatively scheduled to reopen by summer but unfortunately, this was not the case. As such, the lack of on-site access to systems and materials was a hindrance. It is to my manager’s credit that she came up with a work plan on the fly which provided a meaningful and enjoyable internship experience.

My duties centered around the classic GBH series Masterpiece (originally known as Masterpiece Theatre). First aired in 1971, the program offers sophisticated and acclaimed dramas, including period pieces and adaptations of classic literature. More specifically, my work was a deep dive into archival metadata.

GBH recently received an NEH Challenge Grant to support reformatting of its most at-risk programs and development of infrastructure to support long-term digital preservation and access to the archive. The grant was supported by a matching donation from a viewer and a fan of Masterpiece. The donor intended this generous sum for digitization of the program’s first 20 seasons (1971 – 1992), specifically those hosted by the estimable Alistair Cooke. This process will result in searchable program metadata records, with digitized program clips presented on the GBH Archives’ “Open Vault,” including the introduction and conclusion monologues delivered by Mr. Cooke for each episode. Open Vault is an online platform where archival content can be accessed, viewed, and searched.

The Linda and Andrew Egendorf Masterpiece Theatre Alistair Cooke Collection on GBH’s Open Vault website. Credit: Courtesy of GBH Archives.

Unfortunately, the metadata pertaining to Masterpiece assets was both voluminous and messy, having been entered over many years, utilizing different standards at different times, and input by various parties such as prior interns. The data needed substantial vetting and editing to accompany this important project, coinciding with the series’ 50th anniversary in 2021. My work would establish reliable and robust metadata for these newly digitized programs, both to accompany clips on Open Vault and for internal reference.

Without access to the GBH Archives’ internal systems due to Covid, metadata was uploaded into a spreadsheet on Google Drive to be edited and then fed back into the database. This spreadsheet was my primary workspace. Encompassing approximately 850 line items, each corresponding to a miniseries or episode record, the data included fields such as air dates, display titles, episode descriptions, asset types, and internal reference numbers. In all, there were 6,000+ lines of data to be reviewed, edited, and in many cases, populated from scratch.

To validate the accuracy of existing data, sources of various types were used. To start, my manager provided a book published by GBH on the 20th anniversary of the series, listing information for each season such as air dates, cast, and in some instances, episode titles. This proved to be a valuable research tool but it presented challenges. For example, only a span of air dates was provided for each miniseries while I needed to verify exact dates for all 850+ episodes. Another challenge was missing or inconsistent episode titles. External sources such as Internet Movie Database were helpful but often created more confusion due to conflicting information, such as BBC air dates instead of those from PBS. Conversely, in some instances it was determined the book was incorrect. Research skills and critical thinking were crucial during this process.
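
The cross-check against the book's date spans lends itself to a small script. This is a minimal sketch, assuming the episode records have been exported from the spreadsheet as title and air-date pairs. The function name, titles, and dates are placeholders invented for illustration, not actual Masterpiece data.

```python
from datetime import date


def dates_outside_span(episodes, span_start, span_end):
    """Return titles of episodes whose air date is missing or outside the span.

    `episodes` is a list of (title, air_date) pairs, where air_date is a
    datetime.date or None when the spreadsheet cell is empty.
    """
    flagged = []
    for title, air_date in episodes:
        if air_date is None or not (span_start <= air_date <= span_end):
            flagged.append(title)
    return flagged


# Placeholder records for one hypothetical miniseries.
episodes = [
    ("Episode 1", date(1975, 1, 5)),
    ("Episode 2", date(1975, 1, 12)),
    ("Episode 3", None),               # air date not yet populated
    ("Episode 4", date(1974, 11, 3)),  # earlier than the span given in the book
]

# Placeholder span of air dates, as a reference book might list for the miniseries.
needs_review = dates_outside_span(episodes, date(1975, 1, 5), date(1975, 2, 9))
print(needs_review)
```

A flagged record is not necessarily wrong. As noted above, the book itself was sometimes incorrect, so the output is a list of items to research rather than a list of errors.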

Page example from Masterpiece Theatre 20th Anniversary book. Credit: Courtesy of GBH Archives.

Though much of the data was cleaned up using these sources, numerous unresolved items remained. At this point, we turned to internal documents. Had the office been open as initially planned, these primary sources would have been utilized earlier in the process. Due to Covid, they became a last resort. Thankfully, some of these documents had been digitized and were shared in Google Drive, while others were paper files obtained from the office which my manager boxed and I retrieved from the lobby. These documents offered a fascinating look into each production, such as the original Alistair Cooke scripts, production notes, press kits, and photographs. Though much of the material was not relevant to my work, certain key documents helped resolve most of the remaining discrepancies. For example, several miniseries had two episodes aired on the same night, which was reflected neither in the book nor on most websites. These primary sources helped to reliably vet the metadata and resolve these issues.

Internal documents used for research. Credit: Courtesy of GBH Archives.

As the work progressed line by line, data was steadily vetted, corrected, and restored. Several programs were missing from the spreadsheet altogether and these were fully populated. Chronological order of episodes was properly established, with air dates and season numbers reliably entered. Asset types (miniseries vs. episode records) were correctly labeled and internal coding numbers applied to each. One particular challenge involved descriptions, which were needed for each miniseries and episode record. Most of these were populated but many had minor typos such as misspellings or grammatical errors. Others were missing or had been merely copied from the miniseries level to each episode. I read each of the 800+ existing descriptions, word by word, to make corrections, then populated those which were missing. Some of these came from internal sources, such as press kits, while others were obtained externally from sites like Internet Movie Database. However, though the latter had been a prior practice, it was determined that potential copyright issues rendered it risky and only internal materials should be used. Our procedure was changed to reflect this.
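
Two of the description problems mentioned above, missing text and text copied down from the miniseries level, can be surfaced automatically so that the word-by-word reading can focus on the rest. The following is a minimal sketch with hypothetical function names and invented sample descriptions. It compares text after normalizing case and whitespace, so near copies with small wording changes would still need a human eye.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def flag_descriptions(miniseries_description, episode_descriptions):
    """Map episode name -> reason, for descriptions that need attention.

    `episode_descriptions` maps an episode name to its description string,
    or to None or an empty string when the field is blank.
    """
    flags = {}
    parent = normalize(miniseries_description)
    for episode, text in episode_descriptions.items():
        if not text or not text.strip():
            flags[episode] = "missing"
        elif normalize(text) == parent:
            flags[episode] = "copied from miniseries"
    return flags


# Invented sample data for one hypothetical miniseries.
miniseries = "A family saga set in Victorian London."
episodes = {
    "Episode 1": "The family arrives in London and meets a rival.",
    "Episode 2": "A family saga set in  Victorian London.",  # copied, extra space
    "Episode 3": "",
}

print(flag_descriptions(miniseries, episodes))
```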

Though the work may sound tedious, I found it both interesting and a good fit for my detail-oriented and organizational mind (attributes which led me to consider the archives profession to begin with). It also offered opportunities for analytical thought as I worked with my manager to dismantle and improve old naming conventions, program number formats, and asset hierarchies. For example, as many programs were licensed from the BBC, basic terms like “Series” and “Season” carried differing meanings and were inconsistently applied over time. The new, official hierarchy we proposed involved multiple layers of asset records and terminology, used for organization and official naming of seasons, series, episodes, parts, etc. We also created a more coherent convention for program numbers, which eliminated the potential for duplicates that had existed under the previous format. These changes were discussed with management from the Masterpiece side of the house and eventually integrated as official practice.

In all, I found my second summer at GBH to be as enjoyable and satisfying as the first, the only downsides being Covid restrictions and the lack of personal connections due to remote work arrangements. This was remedied somewhat by online staff meetings and the fact that I had met most coworkers last summer, having kept in touch with several of them. In contrast to my day job, it was satisfying to simply be working on subject matter that fits my interests and passions, reminding me how deeply I wish to work at GBH someday. The internship offered an opportunity to put the archival skills and knowledge I’ve learned to practical use. The matter of external descriptions even provided a brief foray into copyright concerns, a subject I studied independently last semester. I look forward to seeing the results of my work as the project comes to fruition on Open Vault and I wish my colleagues in the GBH Archives the best in their future endeavors.


Tagged Archives internship, Archives students, featured, GBH, Internship, metadata, WGBH
