At the beginning of September, FirstAgenda’s two speech experts, Morten Højfeldt Rasmussen and Jesper Lisby Højvang, went to Hyderabad, India, to join the world’s largest gathering of speech researchers and specialists at Interspeech 2018.
This highly influential conference is the place to be if you want to be a leader in speech science, and if you want to know which direction the field is heading.
Because Interspeech brings together so many of the world’s speech experts, many of the findings presented in this article are complex for the average reader, but don’t worry: we break them down.
If you would like a sneak peek at the key takeaways from Interspeech 2018, read on. You’ll find out how speech technology might affect you in the near future.
TIP! You can also watch Jesper and Morten talk about the key takeaways on video.
P.S. Jump down to the bottom to see how speech recognition can be used to transcribe or select keywords from a meeting.
What is speech technology?
In short, speech technology is about analyzing and processing speech, for example to extract interesting information and data from it. One intriguing subdiscipline is speech recognition, which is the process of recognizing what is being said. This is the technology behind voice assistants like Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana and Google Assistant.
Other fascinating information you can extract using speech technology includes personal characteristics, medical conditions, the language spoken, the speaker’s emotional state of mind and much more.
One of the main topics at Interspeech 2018 was speaker verification and identification. While speech recognition is about extracting and understanding what is being said, speaker identification is identifying the person behind the voice.
Speaker verification is the slightly different problem of making sure a person is who they claim to be. This can be compared to fingerprint or face identification, where the person is identified based on their individual physical traits. Like a fingerprint, every voice is unique.
Speaker verification has several uses. Imagine unlocking your front door, starting your car or opening your safe with just your voice. The technology is still young, and there are obviously many security issues with speaker verification. For instance, how do you prevent someone from using a recording of your voice to trick the verification system?
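To make the difference between identification and verification concrete, here is a minimal Python sketch. It assumes each voice has already been turned into a short list of numbers (a "voiceprint"); the names, the numbers and the 0.95 threshold are invented for illustration and do not come from any system presented at Interspeech. Note that it does nothing to guard against the recorded-voice trick mentioned above.

```python
import math

def cosine_similarity(a, b):
    """Return how closely two voiceprints point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical enrolled voiceprints; a real system would derive these from audio.
ENROLLED = {
    "morten": [0.9, 0.1, 0.3],
    "jesper": [0.2, 0.8, 0.5],
}

def identify(voiceprint):
    """Identification: which enrolled speaker sounds most like this voice?"""
    return max(ENROLLED, key=lambda name: cosine_similarity(voiceprint, ENROLLED[name]))

def verify(voiceprint, claimed_name, threshold=0.95):
    """Verification: is this voice close enough to the claimed speaker's voiceprint?"""
    return cosine_similarity(voiceprint, ENROLLED[claimed_name]) >= threshold

sample = [0.85, 0.15, 0.35]  # made-up voiceprint of an unknown speaker
print(identify(sample))
print(verify(sample, "morten"))
print(verify(sample, "jesper"))
```

Identification always returns the closest match among the enrolled voices, while verification answers yes or no to a single claimed identity, which is why it needs a threshold.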
Microsoft is one of the players on the market experimenting with identification through voice. Test it for yourself and discover how you can power applications with only your voice.
Another brand-new possibility in speech technology is recognizing personal characteristics like gender, age and even mood. This has many uses. For instance, your company might have a customer support bot to answer phone calls, but if the customer gets upset, you might want a real person to take over.
Detecting the mood of the speaker might also be useful in a sales or interview situation to analyze the customer’s feelings about the product or topic.
Detecting gender and age from speech is a shortcut to improving the speech recognizer. Morten explains:
“If you can teach the speech recognizer to know the gender or age, then it can choose the speech recognition technique that best fits this specific gender or age to record and analyze this voice. That will lead to a higher-quality outcome,” Morten says.
To understand and analyze what is being said, the speech recognizer needs to master the spoken language. It is therefore quite challenging when the speaker switches between languages, or when two people in a conversation each speak their own language. In linguistic terms, this is called code-switching. Dealing with code-switching was a prominent topic at Interspeech 2018. Jesper explains:
“India has 20 languages that are spoken by more than 1 million people, and often they switch between several of the languages in just one sentence. A single conversation might include words from three or four languages. Without precautions, the typical speech recognizer gets in real trouble when it doesn’t know which language is spoken. At Interspeech we heard about the possibility of the speech recognizer being able to switch over when the language changes within one sentence. That’s really interesting,” Jesper says.
Morten and Jesper especially paid attention to this, because they see similar challenges in their work at FirstAgenda.
“When using the speech recognition in our meeting tool, at the moment you have to pre-set the language before you start recording. Maybe you don’t have to in the future, because we will be able to identify the language you speak – even if you speak different languages in one meeting situation,” Morten says.
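As a rough illustration of what switching over within one sentence could look like, here is a toy Python sketch. It works on already-written words rather than audio, and the tiny English and Danish word lists are invented for the example; real systems presented at Interspeech work on the speech signal itself and are far more sophisticated.

```python
from itertools import groupby

# Hypothetical mini-lexicons; a real recognizer would use trained language models.
LEXICONS = {
    "en": {"the", "meeting", "is", "at", "tomorrow", "please", "send", "agenda"},
    "da": {"mødet", "er", "i", "morgen", "tak", "og", "dagsorden"},
}

def guess_language(word, previous):
    """Return the language whose lexicon contains the word, else keep the previous one."""
    for lang, words in LEXICONS.items():
        if word.lower() in words:
            return lang
    return previous  # unknown word: assume the speaker has not switched

def segment_by_language(sentence, default="en"):
    """Split a sentence into consecutive same-language chunks."""
    tagged = []
    current = default
    for word in sentence.split():
        current = guess_language(word, current)
        tagged.append((current, word))
    return [
        (lang, " ".join(word for _, word in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]

print(segment_by_language("please send agenda og tak"))
```

Each chunk could then be handed to the recognizer for that language, so the user would not need to pre-set a single language for the whole recording.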
At the risk of losing you in the land of complex technology talk, the next key finding is quite technical, but really interesting. Morten reports with great enthusiasm on Microsoft’s presentation about unmixing speech streams.
At the moment, analyzing speech involves mixing the streams of each speaker and detecting the best soundbites from the mix. Microsoft presented a different approach at Interspeech 2018, as Jesper explains:
“Microsoft separates the speech streams, meaning in the output you can actually hear what each speaker says, even though they spoke at once. This is a huge opportunity in regard to meetings where people often speak at the same time, making it difficult to recognize the speech,” says Jesper.
Another fascinating technology we saw at Interspeech 2018 was converting whispered speech to voiced speech. It might be used for medical challenges: people who are only able to whisper, for instance because of laryngeal cancer, can make their voice sound more natural using this technology. In this way, speech technology can be used to analyze and raise the speech level.
Morten and Jesper are working with speech recognition to improve meetings every day, so they especially paid attention to this topic. Big players on the market, including Microsoft, Google and Amazon, struggle to crack the code of using speech technology in meetings.
Meetings are challenging for speech recognizers because the meeting participants often speak at once and in ungrammatical sentences. As Morten explains, this is why it makes no sense to transcribe the whole meeting conversation. He elaborates:
“When we’re having a conversation in meetings, we don’t usually speak in grammatically correct sentences, and we say a lot of unnecessary words and sounds. That’s why only a few sentences would actually make sense if we printed the full transcription from the meeting,” Morten says.
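To give a feel for why keywords can be more useful than a full transcription, here is a small Python sketch that counts words after throwing away fillers. The sample sentence and the filler list are invented for illustration, and simple word counting is only a stand-in; it is not a description of how FirstAgenda’s tool selects keywords.

```python
import re
from collections import Counter

# Hypothetical list of fillers and very common words to ignore.
FILLERS = {
    "um", "uh", "so", "yeah", "like", "you", "know",
    "the", "we", "to", "at", "and", "for", "is",
}

def extract_keywords(transcript, top_n=2):
    """Return the most frequent words that are not fillers."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(word for word in words if word not in FILLERS)
    return [word for word, _ in counts.most_common(top_n)]

# Made-up example of how spoken meeting language might look when transcribed.
transcript = (
    "um so yeah the budget, uh, we need to, like, look at the budget again "
    "and, you know, the deadline, the deadline for the budget is, um, friday"
)

print(extract_keywords(transcript))
```

The full transcript is hard to read, while the two most frequent remaining words already hint at what the conversation was about.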
To show you the difference between keywords and a full transcription, here are the two versions. Both are outputs based on the speech from the interview behind this article.
Do you want to try speech recognition on your meetings? Start your free trial now.