Recognize and synthesize speech Flashcards

1
Q

Microsoft Azure offers both speech recognition and speech synthesis capabilities through the Speech cognitive service, which includes the following application programming interfaces (APIs):

A

The Speech-to-Text API
The Text-to-Speech API

2
Q

Azure resources for the Speech service

A
  1. A Speech resource - choose this resource type if you only plan to use the Speech service, or if you want to manage access and billing for the resource separately from other services.
  2. A Cognitive Services resource - choose this resource type if you plan to use the Speech service in combination with other cognitive services, and you want to manage access and billing for these services together.
3
Q

The speech-to-text API

A

You can use the speech-to-text API to perform real-time or batch transcription of audio into a text format. The audio source for transcription can be a real-time audio stream from a microphone or an audio file.

The model used by the speech-to-text API is based on the Universal Language Model, which was trained by Microsoft.

The model is optimized for two scenarios:
1. Conversational
2. Dictation
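
As a rough sketch, a call to the speech-to-text REST endpoint for short audio might be prepared like this. The code only builds the request and never sends it. The region, key, and audio bytes are placeholders, and the endpoint path and header names are assumptions based on the short-audio REST API, so check them against the current documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stt_request(region: str, key: str, wav_bytes: bytes,
                      language: str = "en-US") -> Request:
    """Build (but do not send) a short-audio speech-to-text request."""
    # Assumed endpoint shape; the "conversation" segment is the recognition mode.
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1?"
        + urlencode({"language": language})
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # key of the Speech resource
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return Request(url, data=wav_bytes, headers=headers, method="POST")


# Placeholder values: replace with a real region, key, and WAV content.
request = build_stt_request("westus", "<your-key>", b"RIFF...")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. With a valid key and audio, the service is expected to return the transcription as JSON.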

4
Q

Real-time transcription

A

Real-time speech-to-text allows you to transcribe speech in audio streams into text. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.

5
Q

Batch transcription

A

Not all speech-to-text scenarios are real time. You may have audio recordings stored on a file share, a remote server, or even on Azure storage.

Run batch transcription asynchronously, because batch jobs are scheduled on a best-effort basis.

6
Q

The text-to-speech API

A

The text-to-speech API enables you to convert text input to audible speech, which can either be played directly through a computer speaker or written to an audio file.
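
A minimal sketch of preparing a text-to-speech REST request could look like the following. It builds the request without sending it. The region, key, and voice name are placeholders, and the endpoint, header names, and output format string are assumptions to verify against the current REST documentation.

```python
from urllib.request import Request


def build_tts_request(region: str, key: str, ssml: str) -> Request:
    """Build (but do not send) a text-to-speech request carrying SSML."""
    # Assumed endpoint shape for the text-to-speech REST API.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        # Requests WAV (RIFF) audio; other formats would be set here.
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "flashcard-demo",  # hypothetical application name
    }
    return Request(url, data=ssml.encode("utf-8"), headers=headers, method="POST")


# Placeholder SSML; the voice name is only an example.
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>Hello</voice>"
    "</speak>"
)
tts_request = build_tts_request("westus", "<your-key>", ssml)
print(tts_request.get_method(), tts_request.full_url)
```

If the request were sent, the response body would be the audio bytes, which an application could write to a `.wav` file or hand to an audio player.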

7
Q

Speech synthesis voices

A

When you use the text-to-speech API, you can specify the voice to be used to vocalize the text. This capability offers you the flexibility to personalize your speech synthesis solution and give it a specific character.
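
One common way to specify a voice is through Speech Synthesis Markup Language (SSML). The sketch below builds an SSML document with a `voice` element. The helper name `build_ssml` is hypothetical, and the voice names are examples only, so pick one from the voices available to your resource.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               lang: str = "en-US") -> str:
    """Return an SSML string that asks for `text` to be spoken by `voice`."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    # The voice element selects which voice vocalizes the enclosed text.
    voice_element = ET.SubElement(speak, "voice", {"name": voice})
    voice_element.text = text  # ElementTree escapes &, <, and > on output
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Welcome aboard."))
print(build_ssml("Tickets & passes", voice="en-US-AriaNeural"))
```

Swapping the `voice` argument changes the character of the spoken output without changing the text itself.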

8
Q

For artificial intelligence (AI) solutions to accept vocal commands and provide spoken responses, the AI system must support two capabilities:

A
  1. Speech recognition - the ability to detect and interpret spoken input.
  2. Speech synthesis - the ability to generate spoken output.
9
Q

Speech recognition

A

Speech recognition is concerned with taking the spoken word and converting it into data that can be processed - often by transcribing it into a text representation.

The spoken words can be in the form of a recorded voice in an audio file, or live audio from a microphone.

Speech patterns are analyzed in the audio to determine recognizable patterns that are mapped to words. To accomplish this feat, the software typically uses multiple types of models, including:

  1. An acoustic model that converts the audio signal into phonemes (representations of specific sounds).
  2. A language model that maps phonemes to words, usually using a statistical algorithm that predicts the most probable sequence of words based on the phonemes.
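
The language model step can be illustrated with a toy example. It assumes the acoustic model has already produced phonemes grouped per word, then picks the most probable word sequence using bigram probabilities. The lexicon and all probabilities are invented for illustration, and real systems use far larger models and more efficient search.

```python
from itertools import product

# Hypothetical lexicon: each word-sized phoneme group maps to candidate words.
LEXICON = {
    ("AY",): ["I", "eye"],
    ("S", "IY"): ["see", "sea"],
    ("DH", "AH"): ["the"],
}

# Made-up bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAMS = {
    ("<s>", "I"): 0.6, ("<s>", "eye"): 0.1,
    ("I", "see"): 0.5, ("I", "sea"): 0.01,
    ("see", "the"): 0.4, ("sea", "the"): 0.1,
    ("the", "sea"): 0.3, ("the", "see"): 0.01,
}
UNSEEN = 0.001  # fallback probability for word pairs missing from the table


def sequence_probability(words):
    """Multiply the bigram probabilities along one candidate sentence."""
    probability = 1.0
    previous = "<s>"
    for word in words:
        probability *= BIGRAMS.get((previous, word), UNSEEN)
        previous = word
    return probability


def decode(phoneme_groups):
    """Try every combination of candidate words and keep the most probable."""
    candidates = [LEXICON[group] for group in phoneme_groups]
    best = max(product(*candidates), key=sequence_probability)
    return " ".join(best)


phonemes = [("AY",), ("S", "IY"), ("DH", "AH"), ("S", "IY")]
print(decode(phonemes))
```

The same phoneme group `("S", "IY")` resolves to "see" in one position and "sea" in another, because the surrounding words change which candidate is more probable.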
10
Q

In speech recognition, the recognized words are typically converted to text, which you can use for various purposes, such as:

A
  1. Providing closed captions for recorded or live videos
  2. Creating a transcript of a phone call or meeting
  3. Automated note dictation
  4. Determining intended user input for further processing
11
Q

Speech synthesis

A

Speech synthesis is in many respects the reverse of speech recognition.

It is concerned with vocalizing data, usually by converting text to speech. A speech synthesis solution typically requires the following information:

  1. The text to be spoken.
  2. The voice to be used to vocalize the speech.

To synthesize speech, the system typically tokenizes the text to break it down into individual words and assigns phonetic sounds to each word.
These phonemes are then synthesized as audio by applying a voice, which determines parameters such as pitch and timbre, and by generating an audio waveform that can be output to a speaker or written to a file.
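
The stages described above can be mimicked with a deliberately simplified toy. It tokenizes text, looks up phonemes in a tiny invented lexicon, and renders each phoneme as a plain sine tone whose pitch depends on the chosen "voice". Nothing here resembles a real voice model. The lexicon, voices, and tone mapping are all assumptions made only to show the flow from text to waveform.

```python
import io
import math
import re
import struct
import wave

# Hypothetical mini-lexicon: word -> phonemes.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
# Hypothetical "voices": each one is just a base pitch in Hz.
VOICES = {"low": 110.0, "high": 220.0}
SAMPLE_RATE = 16000


def tokenize(text):
    """Break the text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def to_phonemes(words):
    """Assign phonemes to each word; unknown words are skipped."""
    return [p for word in words for p in LEXICON.get(word, [])]


def synthesize(text, voice="low", seconds_per_phoneme=0.08):
    """Return WAV bytes with one tone per phoneme (a stand-in for real audio)."""
    base = VOICES[voice]
    frames_per_phoneme = int(SAMPLE_RATE * seconds_per_phoneme)
    samples = []
    for phoneme in to_phonemes(tokenize(text)):
        # Arbitrary mapping from phoneme to frequency, scaled by the voice pitch.
        frequency = base * (1 + (sum(map(ord, phoneme)) % 8) / 8)
        samples.extend(
            int(12000 * math.sin(2 * math.pi * frequency * i / SAMPLE_RATE))
            for i in range(frames_per_phoneme)
        )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


audio = synthesize("Hello, world!", voice="high")
print(len(audio), "bytes of WAV data")
```

Writing `audio` to a file with a `.wav` extension produces something playable, which corresponds to the "written to a file" output path.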

12
Q

Uses for the output of speech synthesis

A
  1. Generating spoken responses to user input.
  2. Creating voice menus for telephone systems.
  3. Reading email or text messages aloud in hands-free scenarios.
  4. Broadcasting announcements in public locations, such as railway stations or airports.
13
Q

Speech cognitive service

A

Microsoft Azure offers both speech recognition and speech synthesis capabilities through the Speech cognitive service, which includes the following application programming interfaces (APIs):

The Speech-to-Text API
The Text-to-Speech API

14
Q

Azure resources for the Speech service

A
  1. A Speech resource - choose this resource type if you only plan to use the Speech service, or if you want to manage access and billing for the resource separately from other services.
  2. A Cognitive Services resource - choose this resource type if you plan to use the Speech service in combination with other cognitive services, and you want to manage access and billing for these services together.
15
Q

The speech-to-text API

A

You can use the speech-to-text API to perform real-time or batch transcription of audio into a text format. The audio source for transcription can be a real-time audio stream from a microphone or an audio file.

The model used by the speech-to-text API is based on the Universal Language Model, which was trained by Microsoft.

16
Q

Real-time transcription

A

Real-time speech-to-text allows you to transcribe speech in audio streams into text. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.

For real-time transcription to work, your application needs to listen for incoming audio from a microphone or another audio input source, such as an audio file. Your application code streams the audio to the service, which returns the transcribed text.
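
The streaming side of this can be sketched as a generator that reads audio in small chunks, the way an application might feed a service as sound arrives. The example uses an in-memory WAV file of silence instead of a microphone. The helper names and the 100 ms chunk size are assumptions, and the point where a chunk would be sent to the service is only marked with a comment.

```python
import io
import wave


def stream_chunks(wav_source, chunk_ms=100):
    """Yield raw PCM chunks of about chunk_ms milliseconds each."""
    with wave.open(wav_source, "rb") as wav:
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


def make_silence(num_frames, rate=16000):
    """Create an in-memory 16-bit mono WAV file containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * num_frames)
    buffer.seek(0)
    return buffer


for index, chunk in enumerate(stream_chunks(make_silence(5600))):
    # A real application would push each chunk to the speech service here.
    print(f"chunk {index}: {len(chunk)} bytes")
```

Smaller chunks lower the delay before the service can start returning text, at the cost of more round trips.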

17
Q

Batch transcription

A

Not all speech-to-text scenarios are real time. You may have audio recordings stored on a file share, a remote server, or even on Azure storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.

Run batch transcription asynchronously, because batch jobs are scheduled on a best-effort basis. Normally, a job starts executing within minutes of the request, but there is no estimate for when a job changes into the running state.
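
Because there is no guaranteed start time, a client typically submits the job and then polls for its status. The sketch below shows one way to shape that. The JSON field names and status strings are assumptions based on the batch transcription REST API, the SAS URI is a placeholder, and `get_status` stands in for whatever call actually fetches the job state.

```python
import json
import time


def build_batch_job(sas_uris, locale="en-US", name="flashcard batch"):
    """Return the JSON body for a batch transcription job (assumed fields)."""
    return json.dumps({
        "contentUrls": list(sas_uris),  # audio files addressed by SAS URI
        "locale": locale,
        "displayName": name,
    })


def wait_for_job(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a final state or the poll budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError("batch transcription job did not finish in time")


body = build_batch_job(["https://example.invalid/audio/call1.wav?<sas-token>"])
print(body)

# Simulated status sequence in place of real HTTP calls; no real waiting.
fake_statuses = iter(["NotStarted", "NotStarted", "Running", "Succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)
```

Injecting `get_status` and `sleep` keeps the polling logic testable without network access or real delays.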
