5 unexpected insights from automatic speech recognition
Pop Up Archive has been hard at work implementing new speech recognition software for our partners at organizations like NPR, StoryCorps, and the Hoover Institution. The result is better auto-transcripts, and better auto-transcripts mean better access to the hours upon hours of spoken content locked in digital audio.
Along the way, we’ve learned some surprising things about the state of automatic speech recognition. Here’s our crash course in the workings of speech-to-text software:
1. Speech-to-text software learns language like people do.
All automatic speech recognition software learns from whatever data it’s given. So, like a person, the more “well-read” your software is in a particular area, the more it will understand.
2. The human standard for perfect transcription is being questioned.
The gold standard for transcripts has always been human transcription. But as machine learning improves, a human transcriber won’t necessarily transcribe unfamiliar dialects more accurately than a computer will. Speech-to-text software is trained on many voices, so it can interpret dialects from all over the world. Check out this 2011 Google Tech Talk on “Superhuman speech recognition.”
3. Speaking clearly can make you harder to understand.
Since most speech software is trained on naturalistic pronunciations (that is, how you would say a word in a real conversation), speakers who over-articulate may not be understood correctly. For example, crisply pronouncing the “t”s in “butter” goes against the typical Standard American English pronunciation, which is closer to a “d” sound.
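To make the “butter” example concrete, here is a toy pronunciation lexicon in the style of ARPAbet dictionaries. The entries and the helper function are illustrative only, not a piece of any real recognizer:

```python
# Illustrative pronunciation lexicon using ARPAbet-style phones (hypothetical entries).
# Recognizers store multiple variants per word so that the flapped "d" sound of
# casual American speech still maps back to the written word "butter".
LEXICON = {
    "butter": [
        ["B", "AH", "T", "ER"],   # careful, over-articulated pronunciation
        ["B", "AH", "DX", "ER"],  # naturalistic flap (sounds like a quick "d")
    ],
}

def words_matching(phones):
    """Return every word whose listed pronunciation variants include this phone sequence."""
    return [word for word, variants in LEXICON.items() if phones in variants]
```

Because both variants are listed, either pronunciation resolves to the same written word; a lexicon that only listed the careful variant would miss the naturalistic one.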
4. Not all vocabularies are created equal.
When you create a language model, it’s not just the number of words in the model that contributes to accuracy; it’s how well their distribution matches that of the content.
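As a toy illustration (a sketch, not Pop Up Archive’s actual method), you can score how well a candidate vocabulary’s word distribution predicts a piece of content using smoothed cross-entropy; the lower the score, the better the match:

```python
import math
from collections import Counter

def cross_entropy(model_counts, content_words, alpha=1.0):
    """Average negative log-probability the model assigns to the content,
    with add-alpha smoothing so unseen words don't produce log(0)."""
    vocab = set(model_counts) | set(content_words)
    total = sum(model_counts.values()) + alpha * len(vocab)
    def prob(word):
        return (model_counts.get(word, 0) + alpha) / total
    return -sum(math.log2(prob(w)) for w in content_words) / len(content_words)

# Two hypothetical vocabularies with the same size but different distributions.
news_model = Counter(["senate", "budget", "vote", "senate", "bill"])
chat_model = Counter(["lol", "selfie", "meme", "lol", "brb"])
broadcast = ["senate", "vote", "budget"]

# The vocabulary whose distribution matches the content scores lower (better).
assert cross_entropy(news_model, broadcast) < cross_entropy(chat_model, broadcast)
```

Both models contain five word tokens, but only the one whose distribution resembles the broadcast predicts it well, which is the point of item 4.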
5. We’ve only scratched the surface.
Speaker recognition. Accurate punctuation. Comprehensive geographical and biographical knowledge…
All of these features are not only possible in automatic speech recognition, but will soon be on their way into your own Pop Up Archive auto-transcripts. As we integrate the new software into Pop Up Archive over the next few months, you’ll see major improvements to our automatic transcription and editing tools. We’ll keep you posted as our new features become available!
Bust out your speakers, sprawl out on a picnic blanket and enjoy these Fourth of July picks from the archive:
1. I’ll have the gospel bird with a side of rabbit fries, please. Finding America through its food. America Eats: A Hidden Archive - The Kitchen Sisters
During the 1930s, the WPA sent dozens of journalists, including Zora Neale Hurston and Eudora Welty, throughout the country to document how America’s immigrant communities shaped local culinary traditions. The program, entitled “America Eats,” was shut down at the outbreak of World War II, but in this piece The Kitchen Sisters continue its grand legacy of national food reporting.
2. Independence Day on the eve of America’s entry into WWII. FDR’s Fourth Of July Address (1941) - WWII Broadcasts
President Franklin Roosevelt gives a Fourth of July address in Hyde Park, New York just months prior to America’s entry into WWII. Evidently already ramping up for U.S. participation, Roosevelt proclaims that “the fundamentals of [freedom established in] 1776 are being struck down abroad.” This is also the recording in which Roosevelt famously said:
… the United States will never survive as a happy and fertile oasis of liberty surrounded by a cruel desert of dictatorship.
3. Everybody has their own American dream, but some of us have to work a lot harder than others to enjoy our piece of it. Coming to America - Snap Judgment
Host Glynn Washington invites you to “put on your sunglasses and open up the fire hydrants for Snap Judgment’s Fourth of July special; amazing stories about people making America their home.” One highlight: a second-generation Chinese American growing up in rural Virginia starts receiving threatening letters from the KKK, signed “the Wizard.” After her non-English-speaking mother suggests she write back, she adopts “The Wizard” as her whimsical pen pal, hoping to swap stickers and playground stories.
4. How the Declaration of Independence inspired Lincoln’s Gettysburg address. Our Secret Constitution: How Lincoln Redefined American Democracy - Illinois Public Media: Focus 580
In this interview with Focus 580, Columbia Law professor George P. Fletcher claims Lincoln was more inspired by the Declaration of Independence than the Constitution, which he felt only preserved the rights of the propertied white male elite. You know that “four score and seven years ago” line in the Gettysburg Address? It doesn’t date back to the historic document you would expect.
Interview from the Detroit Sound Conservancy’s Greystone collection
Why is it hard to find audio on the web? Audio isn’t text. That means it doesn’t get indexed by search engines.
Don’t worry: Pop Up Archive is taking care of that. We’ve developed a WordPress plugin that lets you quickly add audio and automatic tags straight to blog posts. No more annoying manual tag entry — and no more digging through old file folders buried in your hard drive. You can access your audio and tags right from inside WordPress.
Ready to check it out? Install the plugin and get started today.
Need help setting it up? Don’t hesitate to contact us with support questions.
Final pro-tip: Even if you don’t use WordPress, you can easily embed our audio player into the HTML of any other site, including Tumblr, by simply clicking the “embed” button on any Pop Up Archive item page.
Guest post by Roger Macdonald about The TV News Archive, an inspiring project from the Internet Archive that lets users search and share clips of U.S. TV news programs by repurposing closed captioning text. Via archive.org.
UI / UX Advances in Freeing Information Enslaved by an Ancient Egyptian Model Or… Why Video Scrolling is so Last Millennium
In creating an open digital research library of television news, we have been challenged by being unable to reference a current user experience model for searching video. Conventional video search requires users to start at the beginning of video and proceed at the pace and sequencing dictated by content creators. Our service has vaulted over the confines of the linear video storytelling framework by helping users jump into content at points directly pertaining to their search. But by doing so, we have left some of our prospective users adrift, without a conceptual template to rely on. That is until this April, with the release of a new user interface.
Treating video as infinitely addressable data is enabling us to do an increasingly better job of getting researchers right to their points of interest. While revolutionary in its application to television news at this scale, the approach has an antecedent in a prior media revolution: the transition from the age of scrolls to printed books. Gutenberg used movable type to print identical bibles in the mid-1400s. It took a hundred more years before detailed indexes started appearing at the end of books. The repurposing of closed captioning to facilitate deep search of video is, in some ways, as significant for television as the evolution from parchment and papyrus rolls to page-numbered and indexed books.
The value of most major innovations can only be realized when people adapt their conceptual models to understand and use them. Our interface design challenge included helping users make a perceptual leap from a video experience akin to ancient Egyptians unfurling scrolls to that of library-literate modern readers, or the even more recent experience of being able to find specific Web “pages” via search engines.
Our latest interface version helps users cross the cognitive bridge from video “scrolling” through television programs to accessing them instead as digitally indexed “books” with each page comprised of 60-second video segments. We convey this visually by joining the video segments with filmstrip sprocket border graphics. Linear, like film, but also “paginated” for leaping from one search-related segment to another.
When searching inside individual broadcasts, the new interface reinforces that metaphor of content hopping by truncating presentation of interleaving media irrelevant to the search query. We present the search-relevant video segments, while still conveying the relative “distance” between each jump — again referencing the less efficient linear “scroll” experience that most still find more familiar.
The new UI has another revolutionary aspect that also hearkens back to one of the great byproducts of the library index model: serendipitous discovery of adjacent knowledge. Dan Cohen, founding Executive Director of the Digital Public Library of America, recently recounted, “I know a professor who was hit on the head by a book falling off a shelf as he reached for a different one; that book ended up being a key part of his future work.”
When using the new “search within” a single program feature, the browser dynamically refines the results with each character typed. As typing proceeds towards the final search term, unexpected 60-second segments and phrases arise, providing serendipitous, yet systematic choices, even while options narrow towards the intended results. These surprising occurrences suggest the diverse opportunities for inquiry afforded by the unique research library and encourage some playful exploration.
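The per-keystroke refinement described above can be sketched in a few lines (an illustration of the idea, not the TV News Archive’s actual implementation; the segment data is hypothetical):

```python
def refine(segments, query):
    """Return the 60-second segments whose caption text contains the
    (possibly partial) query, recomputed on every keystroke."""
    q = query.lower()
    return [s for s in segments if q in s["caption"].lower()]

# Hypothetical captioned segments from a single broadcast.
segments = [
    {"start": 0,   "caption": "The senator discussed the economy"},
    {"start": 60,  "caption": "Economic forecasts for the election"},
    {"start": 120, "caption": "Sports highlights from the weekend"},
]

# Typing "eco" already surfaces both economy- and economic-related segments;
# completing the word "economy" narrows the results to one.
assert len(refine(segments, "eco")) == 2
assert len(refine(segments, "economy")) == 1
```

The intermediate result sets produced on the way to the full query are exactly the “serendipitous, yet systematic choices” the paragraph describes.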
The Internet Archive is still in the early stages of helping guide online television out of its imprisonment in ancient conceptual frameworks. A bright future awaits knowledge seekers and content creators alike when digital video is optimized for systematic discovery of even short segments. New audiences and new use cases will be joined with media that has been languishing for too long in digital tombs, mostly unseen and unheard.
At its heart, the Internet Archive is an invitation to explore and collaborate. Please, join us in evolving digital opportunities to open knowledge for the benefit of all.
Start by giving our service a whirl, find something important and quote it. I just did - https://twitter.com/r_macdonald/status/463492832867516416
We are thrilled to announce the incredible slate of partners working with us to build custom speech-to-text software for news organizations, historical audio collections, and religious institutions: CUNY Television, the Hoover Institution at Stanford University, Illinois Public Media, KCRW Los Angeles, KQED San Francisco, NPR, the Presbyterian Historical Society, the Princeton Theological Seminary, Snap Judgment, and StoryCorps.
Curious where this idea came from? We started with some big problems:
Transcription is time-consuming. There are no human hands fast enough to transcribe the amount of recorded sound we process. And even if there were…
Transcription is expensive. Enough said.
Out-of-the-box automatic transcription services are inaccurate. “India” becomes “ninja,” “quitter” becomes “Twitter,” and a meaningful broadcast or oral history can end up reading like tech gibberish.
Over the next two months, we’re creating unique speech-to-text vocabularies tailored specifically to our partners’ content: contemporary news broadcasts, oral histories, archival recordings, religious lecture series and sermons. We’ll be blogging about our partners, their amazing audio, and the speech-to-text customization process as it unfolds, so check back for updates.
The vocabularies are built directly from words and phrases found in our partners’ content. The transcripts won’t be 100% accurate, but these special vocabularies enable our speech-to-text software to gauge the likelihood that sounds in certain contexts correspond to particular words or phrases, so that, for example, when someone recording an oral history for StoryCorps says “quitter,” it doesn’t get transcribed as “Twitter.” Unless, of course, the person actually said “Twitter,” which our software can accurately guess from the placement of the word and the other words around it in the sentence.
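The “quitter” vs. “Twitter” disambiguation works roughly like this toy bigram model (the counts and helper are hypothetical, for illustration only; real systems use much richer language models):

```python
# Toy bigram counts (hypothetical, not Pop Up Archive's actual model).
# Each key is (preceding word, candidate); the value is how often that pair
# appeared in the training transcripts.
BIGRAMS = {
    ("a", "quitter"): 40,
    ("a", "twitter"): 1,
    ("on", "twitter"): 50,
    ("on", "quitter"): 1,
}

def pick(prev_word, candidates):
    """Choose the candidate the language model finds most likely after prev_word."""
    return max(candidates, key=lambda w: BIGRAMS.get((prev_word, w), 0))

# "never be a ___" favors "quitter"; "follow us on ___" favors "twitter".
assert pick("a", ["quitter", "twitter"]) == "quitter"
assert pick("on", ["quitter", "twitter"]) == "twitter"
```

The same acoustics can map to either word; what tips the balance is the surrounding context, which is exactly why the custom vocabularies are built from each partner’s own content.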
Our software was initially trained on a subset of audio and transcripts from NPR, StoryCorps, the Washington Post, the Broadcasting Board of Governors, and numerous independent producers, reporters, and radio stations. If you want to learn more about advancements in speech-to-text, watch this short Google Tech Talk.
We can’t wait to bring cutting edge speech recognition methods to organizations that would otherwise never benefit from this technology. Want to be a part of the custom speech-to-text magic? Just let us know. We’ll be onboarding more organizations in the coming weeks, and yours could be one of them.
Advised by the British Broadcasting Corp. R&D team and partnered with the Public Radio Exchange, Pop Up Archive is supported by the Knight Foundation, the National Endowment for the Humanities, and 500 Startups.