ABC Uses Machine Learning To Improve Redesigned Search Results
The Australian Broadcasting Corporation uses machine learning to extract metadata from text, podcasts and other forms of media, making it easier to find through a new search engine.
Machine learning engineer Gareth Seneque told the YOW! Data 2019 conference that the ABC rolled out a new search engine in February this year, built on technology from US startup Algolia (which also powers search for Twitch and Stripe).
The search site still carries a beta label, but it is fully in production use.
“There are reasons for [the url] behind the scenes – stuff involving CMS migrations and the like that I won’t go into – but we’re really in the scaling and releasing phase of things,” Seneque said.
ABC’s existing search indexes approximately 600,000 pieces of content from the past decade, including 230,000 articles, 270,000 audio tracks and 85,000 videos.
But Seneque said user feedback on the search was poor.
“Specifically, content types were not supported, indexing was slow, so items would take a while to appear in the index, and the relevance of the results was poor,” he said.
“The public also found it difficult to find content for accessibility reasons.”
The new Algolia-based search is expected to roll out across all of ABC’s digital properties, including iView and the Listen app.
But before that happens, Seneque’s team is working to improve the metadata recorded against individual pieces of content – especially podcasts and videos – to make them more searchable.
“Our challenges are twofold: how do we get people to use our search engine? And once they use it, how do we deliver the most relevant content?” Seneque said.
Seneque noted the challenge every owner of a digital property faces with search – making it work as well as Google.
“People use Google every day,” he said.
“It sets the benchmark for what a search engine can do and conditions people’s expectations accordingly.
“We have a public feedback form and see this stuff all the time. We get comments like ‘Why doesn’t the search answer my question’ or ‘Why don’t you provide results that include presenter biographies’.”
The second challenge is delivering relevant results to searchers – and, in most cases, that means improving the metadata recorded for each piece of content.
Seneque said the obvious way to improve metadata is to automate keyword and tag selection as much as possible, instead of relying on reporters to do it manually.
“We have a number of different content teams, each with their own standards for generating metadata, all doing this over many years, which means people and processes change,” he said.
“These people are busy creating content, so it makes sense to use automated systems to help them.
“Our team is adjacent to the content development pipeline. We retrieve content after it’s created and published, and turn it into a searchable state.
“Clearly, if we are to deliver meaningful results, we need metadata consistency and coverage on as many attributes as possible for all of the objects in our index.
“If, along the way, we can create a system that, say, the CMS can hook into to suggest metadata for teams to include if they want, why not?”
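To make the idea of automated keyword suggestion concrete, here is a minimal sketch that ranks candidate keywords by how often they appear in a piece of text. It is an illustration only, not the ABC's method: the function name `suggest_keywords`, the tiny stopword list and the frequency-based ranking are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is", "are",
    "was", "were", "it", "that", "this", "with", "as", "by", "at", "from",
    "be", "has", "have",
}


def suggest_keywords(text, limit=5):
    """Return up to `limit` candidate keywords, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore stopwords and very short tokens.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending count, then alphabetically so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:limit]]
```

Suggestions like these could be offered to content teams as optional tags rather than applied automatically, which matches the "include if they want" framing in the quote.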
But ABC faced a separate problem with content like podcasts.
These were rarely transcribed or converted to text, and therefore “had very little metadata” associated with the files, “but were a key type of content for our audience,” Seneque said.
The answer was to use some form of speech-to-text to transcribe podcast content.
This would provide keywords to make them searchable. Transcriptions could also be a useful accompaniment to audio files, provided they are accurate enough.
“The answer – but maybe not the solution – was obvious: get machines to do it cheaply,” Seneque said.
“I emphasise cheap, because as I’m sure many of you have experienced trying to implement completely new things in an organization, it’s hard to get big budgets to test unproven ideas.”
This led Seneque to initially experiment with Mozilla’s open source implementation of Baidu’s Deep Speech system, touted as an accurate and non-proprietary speech-to-text tool.
“It’s available for free on GitHub with pre-trained models, so that seemed like a reasonable starting point for our experiments,” he said.
One of the drawbacks of the model, however, was the “massive amount of memory” required for operations, meaning that the experiments were performed on short audio clips.
The pre-trained model also struggled with portions of a 30-second clip from Radio National’s Rear Vision podcast.
“Generating words for dates is to be expected, but producing ‘guly’ instead of ‘probably’ and ‘hearth’ instead of ‘Earth’, as well as scrambling the words together, clearly indicates more fundamental problems or challenges that would require training our own models,” Seneque said.
“That would be on top of building a data pipeline that slices up, and maintains mappings for, podcasts that run 90 minutes or longer.
“So the engineering required to build such a system is obviously not trivial, and we had to prove our idea without that kind of investment in resources.”
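The slicing-and-mapping step Seneque describes can be sketched roughly as follows. This is a hypothetical illustration, not the pipeline the team considered: the function names, the fixed 30-second window and the use of plain second offsets are all assumptions. It only plans the chunk boundaries and maps a time inside a chunk back to the full episode; cutting the audio itself is left out.

```python
def plan_chunks(duration_s, chunk_s=30.0):
    """Split an episode into consecutive (start, end) windows, in seconds."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        # The final window is shortened so it never runs past the episode end.
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def to_episode_time(chunks, chunk_index, offset_s):
    """Map a timestamp inside one chunk back to a position in the full episode."""
    start, end = chunks[chunk_index]
    if not 0 <= offset_s <= end - start:
        raise ValueError("offset falls outside the chunk")
    return start + offset_s
```

Under these assumptions a 90-minute episode becomes 180 windows, and a word heard 12.5 seconds into the third window sits 72.5 seconds into the episode.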
Seneque changed course and switched to machine learning as a service through AWS Transcribe.
“Long story short, we have found the service to be very efficient at generating sufficiently good transcripts at a relatively low price,” he said.
“By that, I mean, good enough for metadata, but not good enough to necessarily present to our audience – something you might consider an obvious next step.
“There are specific requirements [for audience presentation] around accessibility – things like identifying sounds in audio content, which Transcribe is not yet capable of, although I have full confidence in Amazon to figure it out in due course.”
Seneque found AWS Transcribe capable of handling domain-specific names and terms with a higher degree of accuracy.
“We still get dates as words, but overall the transcription is much better. The names of the individuals are close, a couple being correct and a couple not,” Seneque said.
“We can also see that the name of the program is incorrectly transcribed, but AWS Transcribe offers features to improve these types of issues, including custom vocabularies and speaker identification.
“We will explore the use of these in the future.”
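As a rough illustration of how those two features are switched on, the sketch below builds the arguments for a Transcribe job with a custom vocabulary and speaker labels. The parameter names follow the `start_transcription_job` call in boto3's Transcribe client; the helper function itself, the MP3 format, the `en-AU` language code and any job, bucket or vocabulary names passed in are assumptions for the example, not details from the talk.

```python
def build_transcribe_request(job_name, media_uri, vocabulary_name=None, max_speakers=None):
    """Build keyword arguments for boto3's Transcribe `start_transcription_job`."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",      # assumed audio format
        "LanguageCode": "en-AU",   # assumed language variant
    }
    settings = {}
    if vocabulary_name:
        # A custom vocabulary created beforehand, e.g. with program and presenter names.
        settings["VocabularyName"] = vocabulary_name
    if max_speakers:
        # Speaker identification requires both of these settings together.
        settings["ShowSpeakerLabels"] = True
        settings["MaxSpeakerLabels"] = max_speakers
    if settings:
        request["Settings"] = settings
    return request
```

With AWS credentials configured, the resulting dictionary could be submitted with `boto3.client("transcribe").start_transcription_job(**request)`.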
Encouraging first results
Seneque’s small team has now built an automated metadata platform from AWS components.
This is used to extract metadata from the content and feed it into the new search engine to improve search results, and the early results are promising. It also includes a serverless process that fetches podcasts, sends them for transcription, retrieves the results, and pushes them into the search index.
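The fetch, transcribe and index flow described above can be sketched as a single function that wires the three steps together. This is a simplified, hypothetical outline rather than the team's platform: the three steps are passed in as plain callables so the sketch stays independent of any AWS or Algolia client, and the record fields other than `objectID` (the identifier attribute Algolia records use) are invented for illustration.

```python
def process_episode(episode, fetch_audio, transcribe, push_to_index):
    """Run one podcast episode through fetch -> transcribe -> index.

    `fetch_audio`, `transcribe` and `push_to_index` are stand-ins for the
    real services; each is a callable supplied by the caller.
    """
    audio = fetch_audio(episode["audio_url"])
    transcript = transcribe(audio)
    record = {
        "objectID": episode["id"],
        "title": episode["title"],
        "transcript": transcript,
    }
    push_to_index(record)
    return record
```

In a serverless setup each callable would typically wrap a separate function or service call, and the transcription step would usually be asynchronous, with the indexing step triggered once the result is ready.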
“For the top articles on the news site, we’re seeing an average 280% increase in the number of keywords associated with the relevant search object,” he said, while noting that some auto-applied keywords were less useful than others.
“We’re also seeing a 22% increase in the number of results returned for popular search terms involving audio content.
“This is of course a pretty fuzzy metric. But when search is integrated into our Listen app, where 91% of our podcast audience is located, we’ll have more hard data.”
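For readers unsure how to read those percentages, a 280% increase means the new figure is 3.8 times the old one, not 2.8 times. The short sketch below shows the arithmetic; the baseline numbers in the usage note are made up for illustration and do not come from the ABC.

```python
def apply_percent_increase(baseline, percent):
    """Return the value after increasing `baseline` by `percent` per cent."""
    return baseline * (100 + percent) / 100


def percent_increase(before, after):
    """Return the percentage increase going from `before` to `after`."""
    return (after - before) * 100 / before
```

For example, an article that started with a hypothetical 10 keywords would have 38 after a 280% increase, and a query that returned a hypothetical 50 results would return 61 after a 22% increase.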