Transcribe Us: the Evolution of a Library Transcription Project

When I started my internship with Digital Programs last fall, my colleagues encouraged me to find a long-term project that would contribute to my professional portfolio and let me explore how my interests could incorporate or enhance the resources of the library and college. I took some time to observe the workings of the library, to understand its various departments and working groups, and to explore the services it offers. I have a personal and professional interest in accessibility in general, and in accessibility in librarianship in particular, so one of the (many) things on my mind was how accessibility can be built into the infrastructure of libraries, whether in the physical or digital environment.

I had been interacting with our digital repository, Amherst College Digital Collections (ACDC), for various projects and discussing the site with my supervisors and colleagues. They have been working on revamping the repository with Islandora and, in doing so, will have opportunities to add new features to increase searchability and usability. Through these discussions, I landed on the idea of enhancing the accessibility of ACDC by figuring out how we could provide transcriptions of the handwritten documents housed in the repository. With transcriptions, the archival materials in ACDC could be accessible to individuals who use screen readers, and it would be less arduous for any member of the community to engage with these documents. The experience of viewing the original document and observing its tangible aesthetic qualities is unique, and sometimes manually transcribing can be a helpful step in the research process, but providing plaintext transcriptions can expedite this process for any community member who wishes to engage with the content of the documents. Having plaintext available can also eventually increase the searchability of the repository, as we can use encoding languages to make the text itself searchable (currently only the metadata is searchable).

I discussed the logistics of this project with staff members from various departments in the library (Technical Services, Research and Instruction, and Archives) to see what obstacles we might face, to learn how transcriptions could be beneficial, and to get general input on what strategy I should use. They gave me invaluable advice and guidance, and encouraged me to take steps to ensure that the project would be sustainable. I also researched existing transcription projects at other institutions to get an idea of which software would be a good fit for us. I discovered that a wide variety of strategies are employed, from the Smithsonian Transcription Center and the Library of Congress’s By The People, which have the resources to build custom crowdsourcing platforms, to Worcester Polytechnic Institute, which has a crowdsourced project and provides transcriptions for documents within its repository.

In searching for software that would suit us, I considered the scope of our repository, price, technical support, and compatibility with Islandora. I tested a program called Tpen, but discovered that it had been abandoned a few years ago and was no longer maintained. I ended up choosing a program called Transkribus, which is free to use, open source, very well-maintained, has proven successful for a variety of institutions, and has the ability to learn individual handwriting styles in order to automatically transcribe large collections. There is a downloadable program for preparing and transcribing documents, as well as a simplified web interface and e-learning tools.

A screenshot of the Transkribus interface with the Tools sidebar open — The Transkribus program interface

This project started as just one intern experimenting and slowly chipping away at a massive project. The items in ACDC represent only a fraction of the items stored within the physical Archives and Special Collections. I surveyed the digital repository and found that some collections contain close to 10,000 pages of material that could be transcribed. I had to make peace with the fact that the goal of this project would not be represented as a finish line, but as a perpetual effort.

Just before the start of winter break, I was able to add a couple of student workers to the project. Transkribus requires a five-step process: Import, Layout Analysis (creating a structure within the document so that paragraphs, lines, and words can be distinguished), Transcription, Review, and Export. I had imported a single folder of the Edward and Orra Hitchcock Collection and was working on the Layout Analysis, and I had the student workers begin transcribing the pages that were ready. I was very impressed by how quickly the students worked and how enthusiastic they were, but I also felt a bit cynical about what we could accomplish with such a huge undertaking and such a small team.
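For anyone coordinating a similar effort, the five steps above lend themselves to simple progress tracking. What follows is only a rough sketch in Python, not a feature of Transkribus and not the system we actually used: the `Stage` names mirror the five steps, while the page identifiers and the `advance` and `summarize` helpers are hypothetical names made up for illustration. Each page is recorded with the last step it has completed.

```python
from collections import Counter
from enum import IntEnum


class Stage(IntEnum):
    """The five workflow steps named above, in order."""
    IMPORT = 1
    LAYOUT_ANALYSIS = 2
    TRANSCRIPTION = 3
    REVIEW = 4
    EXPORT = 5


def advance(stage: Stage) -> Stage:
    """Return the next step; a page that has reached EXPORT stays there."""
    return Stage(min(stage + 1, Stage.EXPORT))


def summarize(pages: dict[str, Stage]) -> dict[str, int]:
    """Count how many pages have most recently completed each step."""
    counts = Counter(stage.name for stage in pages.values())
    return {stage.name: counts.get(stage.name, 0) for stage in Stage}


# Hypothetical page identifiers, for illustration only.
pages = {
    "page_001": Stage.LAYOUT_ANALYSIS,
    "page_002": Stage.TRANSCRIPTION,
}
pages["page_001"] = advance(pages["page_001"])  # page_001 is now transcribed
print(summarize(pages))
```

A shared spreadsheet with one column per step would serve the same purpose; the point is only that each page moves through the steps in a fixed order, so a small team can see at a glance which pages are ready for transcription or review.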

This is about where we were in the process when the pandemic hit, and we moved to working from home. With a tremendous amount of support from my colleagues, we were able to quickly expand the capacity of the project and onboard more library staff members, then student workers from Access Services, and a few staff members from various departments on campus. I have been absolutely blown away by the enthusiasm of the folks who have contributed, my own ability to adapt, and the incredible support of the Digital Programs team. We shifted our focus to the Justin Perkins Papers, which was newly digitized and a much smaller and more manageable collection than the Hitchcock Collection. We’ve been able to streamline and document our process so that the project remains organized and cohesive, and we have about half of the collection finished at this point. It has been amazing to see this project become a lifeline during this crisis, as it is so suitable for remote work.

Things have definitely slowed down, as everyone prepares for the re-opening of the library. I plan to take this as an opportunity to make some plans for how the project can move forward, now that we have a better idea of what we are capable of. Our next step is to figure out how these transcriptions can be uploaded to Islandora to be accessed by the community.

I’m so grateful to everyone who has participated in our transcription project over the last few months, and especially grateful to Sarah Walden-McGowan and Este Pope, for all their support and guidance.

Behold, the fruits of our labor:

John O. Mead letter to Justin Perkins, 1863 December 20, Digitized — John O. Mead letter to Justin Perkins, 1863 December 20

An image of a plaintext file exported from Transkribus — John O. Mead letter to Justin Perkins, 1863 December 20, Transcribed and exported from Transkribus

You can also download and view the Accessible Plaintext File!

Hallie Twiss

Digital Programs Graduate Intern