Model adaptation
Speechly Conformer RNN-T models can be further optimized for your specific use case or domain using Text adaptation and Audio adaptation. Model adaptation can significantly increase model accuracy.
Overview
The model adaptation process looks like this:
- You provide training data (text or audio)
- The model is trained using this data
- Upon successful training an adapted model is created
The application status changes during model adaptation and is visible in both Speechly Dashboard and Speechly CLI.
Text adaptation
Text adaptation is done by providing example utterances of things your users might say. This teaches the speech recognition system the vocabulary that is relevant in your application. For example, your application may require the use of uncommon words, obscure brand names or specialist jargon that must explicitly be taught to the speech recognition model.
Viewing text training data
- Log in to Speechly Dashboard
- Open the Application you wish to inspect
- Go to the Training data tab
- Find the Text adaptation section
When using file-based configuration, the text training data is specified using the templates option.
Writing example utterances
Example utterances should cover every functionality of your application. In general, the more you can provide, the better the model will be.
Since Speechly is a spoken language understanding system, it is important to use example utterances that reflect how users talk. An example utterance is probably good if it sounds natural when spoken aloud.
For example, a simple customer service application might have a configuration like this:
*_ hi this is john smith, what can i do for you today?
*_ hi, i have a question regarding my recent order, as it hasn't yet arrived
*_ i’m so sorry to hear that
*_ if you give me your full name and order number, i can check in on that order?
*_ ok, so my name is james bond and my order number is zero zero seven
...
Speechly expects every example utterance to have an intent assigned to it. As we are not interested in intent detection in this example, we are using *_ (an empty intent).
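If your example utterances live in a plain list (for instance, collected from support transcripts), the empty-intent prefix can be added programmatically. The following is a minimal sketch: the helper name to_templates is hypothetical and not part of any Speechly tool, and it only assumes the "*_ utterance" line format shown above.

```python
def to_templates(utterances, intent="_"):
    """Turn raw utterances into example lines of the form '*<intent> <utterance>'.

    Hypothetical helper for illustration; it is not part of Speechly's tooling.
    Blank entries and exact duplicates are skipped.
    """
    lines = []
    seen = set()
    for raw in utterances:
        # Collapse repeated whitespace and trim the ends.
        text = " ".join(raw.split())
        if not text or text in seen:
            continue
        seen.add(text)
        lines.append(f"*{intent} {text}")
    return "\n".join(lines)


examples = [
    "hi this is john smith, what can i do for you today?",
    "  i'm so sorry to hear that ",
    "",
    "i'm so sorry to hear that",
]
print(to_templates(examples))
```

The resulting text can then be pasted into the Text adaptation section or placed under the templates option.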
Leverage user data
As your application gets more usage, you can see how your users are talking to your application and use this information to improve and expand your text adaptation. This data is also valuable when used for evaluating the accuracy of your application.
To view your user data:
- Log in to Speechly Dashboard
- Open the Application you wish to inspect
- Go to the User data tab
If you are using Speechly CLI, you can also use the utterances command.
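One common way to quantify speech recognition accuracy from user data is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the recognized text into the correct transcript, divided by the number of words in the correct transcript. The sketch below is a generic illustration of that metric, not Speechly's own evaluation tooling, and the function name word_error_rate is an assumption made for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length.

    Generic WER sketch for illustration; Speechly's evaluation may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


correct = "my order number is zero zero seven"
recognized = "my order number is zero seven"
print(f"WER: {word_error_rate(correct, recognized):.3f}")
```

Comparing this figure before and after adaptation, on the same set of utterances, gives a rough indication of whether the adapted model has improved.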
Natural Language Understanding
Speechly has Natural Language Understanding (NLU) features built into it and is capable of detecting intents and entities, among other things. Intent and entity detection is useful if you want to perform custom downstream actions in your application. By default NLU features are disabled and Speechly operates in speech-to-text mode.
Audio adaptation
Audio adaptation is done by providing an audio adaptation package, which is a set of user audio files together with correct transcripts and annotations. This data is used during model training to teach our speech recognition system the vocabulary that is relevant in your application and how it is said.
In general, audio adaptation results in higher accuracy than text adaptation alone. We recommend using both audio and text adaptation to achieve the highest possible accuracy.
Audio adaptation and Annotation service are available on Enterprise plans.
Viewing audio training data
- Log in to Speechly Dashboard
- Open the Application you wish to edit
- Go to the Training data tab
- Find the Audio adaptation section
When using file-based configuration, the audio training data is specified using the adaptation_audio_package option.
Annotation service
Annotating audio files is a time-consuming task. That's why we offer an annotation service, where our Natural Language Specialists make sure that whatever your users are saying gets transcribed and annotated correctly. You can use this curated data to further improve audio adaptation.