Written by Ditte Ydegaard
on October 18, 2018
A report on the key takeaways from the world’s largest conference on speech science and technology, Interspeech 2018.

 

At the beginning of September, FirstAgenda’s two speech experts, Morten Højfeldt Rasmussen and Jesper Lisby Højvang, went to Hyderabad, India, to join the world’s largest gathering of speech researchers and specialists at Interspeech 2018.

This highly influential conference is the place to be if you want to be a leader in speech science – and if you want to know where the field is heading.

Since Interspeech is the world’s largest gathering of speech experts, many of the findings presented in this article are complex for the average Joe – but don’t worry, we break them down.

If you would like a sneak peek at the interesting key takeaways from Interspeech 2018, read on. You’ll find out how speech technology might affect you in the near future.

TIP! You can also watch Jesper and Morten talk about the key takeaways on video.

 

P.S. Jump down to the bottom to see how speech recognition can be used to transcribe or select keywords from a meeting.

 

Key takeaways

1: Verifying and identifying speakers is a hot research topic

2: It is possible to define gender, age, and mood from speech

3: Speech technology can keep up with language switching within one sentence

4: Separating overlapping speech is very promising

5: You can analyze and change the level of speech. For instance, no more whispering

6: Analyzing meeting recordings is one of the biggest challenges

 

What is speech technology? 

In short, speech technology is the analysis and/or processing of speech – for instance, extracting interesting information and data from it. One intriguing subdiscipline is speech recognition, which is the process of recognizing what is being said. This is the technology behind voice assistants like Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana and Google Assistant.

Other fascinating information you can extract using speech technology includes personal characteristics, medical conditions, the language spoken, the speaker’s emotional state of mind and much more. 

Dig deeper into the magic of speech recognition.

 

Key takeaway 1: Verifying and identifying speakers is a hot research topic 

One of the main topics at Interspeech 2018 was speaker verification and identification. While speech recognition is about extracting and understanding what is being said, speaker identification is identifying the person behind the voice.

Speaker verification is the slightly different problem of making sure a person is who he or she claims to be. This can be compared to fingerprint or face identification, where the person is identified based on their individual appearance. Like a fingerprint, every voice is unique.

Speaker verification has several uses. Imagine unlocking your front door, starting your car or opening your safe with just your voice. The technology is still young, and there are obviously a lot of security issues in regard to speaker verification. For instance, how do you prevent someone from using a recording of your voice to trick the verifier?
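The core idea behind speaker verification can be sketched in a few lines. In modern systems, each voice is boiled down to a numerical “voiceprint” (an embedding vector produced by a trained neural network), and verification compares a new recording’s voiceprint to the enrolled one, for example by cosine similarity. The embedding values below are made up purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_speaker(enrolled, attempt, threshold=0.85):
    """Accept the claimed identity only if the voiceprints are similar enough."""
    return cosine_similarity(enrolled, attempt) >= threshold

# Toy voiceprints (real systems use hundreds of dimensions from a neural net).
enrolled_voice = [0.9, 0.1, 0.4, 0.7]
same_person = [0.88, 0.12, 0.42, 0.69]
impostor = [0.1, 0.9, 0.2, 0.1]

print(verify_speaker(enrolled_voice, same_person))  # True
print(verify_speaker(enrolled_voice, impostor))     # False
```

The hard research problems live in producing robust voiceprints and choosing the threshold, not in this comparison step – and, as noted above, in defending against replayed recordings.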

Microsoft is one of the players on the market experimenting with identification through voice. Test it for yourself and discover how you can power applications with only your voice.

 

Key takeaway 1 from Interspeech: Verifying and identifying speakers is a hot research topic from FirstAgenda on Vimeo.

 

Key takeaway 2: It is possible to determine gender, age and mood from speech

Another brand-new capability of speech technology is recognizing personal characteristics like gender, age and even mood. There’s quite a lot of use for this. For instance, your company might have a customer support bot to answer phone calls, but if the customer gets upset, you might want a real person to take over.

Determining the mood of the speaker might also be useful in a sales or interview situation to analyze the customer’s feelings about the product or topic.

Determining gender and age from speech is a shortcut to improving the speech recognizer. Morten explains:

 

“If you can teach the speech recognizer to know the gender or age, then it can choose the best speech recognizing technique that fits this specific gender or age to record and analyze this voice. That will lead to a higher quality outcome,” Morten says.

 

Key takeaway 2 from Interspeech: It is possible to define gender, age, and mood from speech from FirstAgenda on Vimeo.

 

Key takeaway 3: Speech technology can keep up with language switching within one sentence

To understand and analyze what is being said, the speech recognizer needs to master the spoken language. Thus, it’s quite challenging when the speaker switches between languages, or when two people converse each in their own language. In linguistic terms, this is called code switching. Dealing with code switching was a prominent topic at Interspeech 2018. Jesper explains:

 

“India has 20 languages that are spoken by more than 1 million people, and often they switch between several of the languages in just one sentence. A single conversation might include words from three or four languages. Without precautions, the typical speech recognizer gets in real trouble when it doesn’t know which language is spoken. At Interspeech we heard about the possibility of the speech recognizer being able to switch over when the language changes within one sentence – that’s really interesting,” Jesper says.

 

Morten and Jesper especially paid attention to this, because they see similar challenges in their work at FirstAgenda.

 

“When using speech recognition in our meeting tool, at the moment you have to pre-set the language before you start recording. Maybe you won’t have to in the future, because we will be able to identify the language you speak – even if you speak different languages in one meeting situation,” Morten says.
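To make the idea concrete, here is a deliberately simplified sketch of per-word language tagging. The word lists are invented for illustration – real code-switching systems use trained acoustic and language-identification models, not fixed vocabularies – but the principle is the same: decide the language segment by segment so the recognizer can switch mid-sentence.

```python
# Toy per-language word lists; real systems use trained acoustic and
# language-identification models instead of fixed vocabularies.
VOCAB = {
    "english": {"the", "meeting", "starts", "we", "will", "discuss"},
    "danish": {"mødet", "starter", "klokken", "vi", "skal", "diskutere"},
}

def tag_languages(sentence):
    """Label each word with the language whose vocabulary contains it."""
    tags = []
    for word in sentence.lower().split():
        lang = next(
            (name for name, words in VOCAB.items() if word in words), "unknown"
        )
        tags.append((word, lang))
    return tags

# A code-switched sentence mixing English and Danish:
print(tag_languages("the meeting starter klokken"))
# → [('the', 'english'), ('meeting', 'english'),
#    ('starter', 'danish'), ('klokken', 'danish')]
```

Once each stretch of speech is tagged with a language, the recognizer can route it to the matching language model instead of forcing everything through one pre-set language.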

 

Key takeaway 3 from Interspeech: Speech technology can keep up with language switching within one sentence from FirstAgenda on Vimeo.

 

Key takeaway 4: Separating overlapping speech is very promising

At the risk of losing you in the land of complex technology talk, the next key finding is quite technical, but really interesting. Morten reports with great enthusiasm on Microsoft’s presentation about unmixing speech streams.

At the moment, analyzing speech involves mixing the streams of each speaker and detecting the best soundbites from the mix. Microsoft presented a different approach at Interspeech 2018, which Jesper explains:


“Microsoft separates the speech streams, meaning in the output you can actually hear what each speaker says, even though they spoke at once. This is a huge opportunity in regard to meetings where people often speak at the same time, making it difficult to recognize the speech,” says Jesper.

 

Key takeaway 5: You can analyze and change the level of speech. For instance, no more whispering!

Another fascinating technology we saw at Interspeech 2018 was converting whispered speech to voiced speech. It could be used for medical challenges: people who are only able to whisper, for instance because of laryngeal cancer, can make their voice sound more natural using this technology. In this way, speech technology can be used to analyze and raise the level of speech.

 

Key takeaway 6: Analyzing meeting recordings is one of the biggest challenges

Morten and Jesper work with speech recognition to improve meetings every day, so they paid special attention to this topic. Big players on the market, including Microsoft, Google and Amazon, struggle to crack the code of using speech technology in meetings.

Meetings are challenging for speech recognizers because the participants often speak at the same time and in ungrammatical sentences. As Morten explains, this is why it makes little sense to transcribe the whole meeting conversation. He elaborates:

 

“When we’re having a conversation in meetings, we don’t usually speak grammatically correctly, and we say a lot of unnecessary words and sounds. That’s why only a few sentences would actually make sense if we printed the full transcription of the meeting,” Morten says.

 

Instead, Morten and Jesper came up with smart keywords. These are the most important words from the meeting selected by intelligent algorithms.
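FirstAgenda hasn’t published the algorithm behind smart keywords, but a classic baseline for picking the important words from a transcript is TF-IDF: score each word by how often it appears in this meeting, weighted down if it is also common across other meetings. A minimal sketch, with made-up example transcripts:

```python
import math
from collections import Counter

# Small hand-picked stopword list for the toy example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "at", "we"}

def tfidf_keywords(transcript, background_docs, top_n=3):
    """Rank words by frequency in this meeting, weighted down if they
    are also common across the background meetings (TF-IDF baseline)."""
    words = [w for w in transcript.lower().split() if w not in STOPWORDS]
    tf = Counter(words)
    n_docs = len(background_docs)
    scores = {}
    for word, count in tf.items():
        doc_freq = sum(1 for doc in background_docs if word in doc.lower().split())
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / len(words)) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

meeting = "budget budget roadmap hiring the the the budget"
other_meetings = ["the weekly standup", "the quarterly review the"]
print(tfidf_keywords(meeting, other_meetings))  # → ['budget', 'roadmap', 'hiring']
```

A production system would work on recognized speech rather than clean text and likely use far more sophisticated models, but this illustrates why “budget” surfaces as a keyword while filler words don’t.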

 

Key takeaway 6 from Interspeech: Analyzing meeting recordings is one of the biggest challenges from FirstAgenda on Vimeo.

 

To show you the difference between keywords and a full transcription, here are the two versions. Both are outputs based on the speech from the interview behind this article.

 

Automatic keywords in Assistant versus full transcription from FirstAgenda on Vimeo.

 

Do you want to try speech recognition on your meetings? Start your free trial now.


 

 
