Intent and entity detection

The intent and entity detection feature lets you extract information from live audio in real time.


By default, intent and entity detection is disabled and Speechly operates in speech-to-text mode. To enable it, you’ll need to provide a configuration for your application.

Once enabled, your application can use:

  • Intent detection: returns the meaning of the speech segment
  • Entity detection and classification: returns the keywords and their types in the speech segment

Intent and entity detection is only available for Speechly Cloud. We are working on making it available for Speechly On-device and On-premise in the near future.

Configuring your application

In general it is necessary to design the utterances for each application separately. With Speechly, the configuration serves two equally important purposes:

  1. Teaching our speech recognition system the vocabulary that is relevant in your application. An application may require the use of uncommon words (e.g. obscure brand names or specialist jargon) that must explicitly be taught to our speech recognition model.
  2. Defining the information (intents and entities) that should be extracted from users' utterances. It is difficult to provide ready-made configurations that would sufficiently suit a variety of use cases, because the set of intents and entities is tightly coupled with the workings of each specific application.

What is a configuration?

A configuration contains training data for machine learning models. It describes a number of example utterances that your users might be saying, and from which an intent and possibly a number of entities should be parsed.

Example utterances are written in a Markdown-like syntax called Speechly Annotation Language (SAL):

*search do you have [blue](color) [jackets](product)

The above example defines the user utterance "do you have blue jackets", assigns it the intent search, and defines two entities named color and product, with the values blue and jackets.
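To make the structure of an annotated utterance concrete, here is a minimal Python sketch that splits the example line into its intent, plain text, and entities. It is only an illustration and not Speechly's parser: the function name and the returned dictionary shape are assumptions, and it handles only the simple `*intent text [value](entity)` form shown above.

```python
import re

# Matches one inline entity annotation of the form [value](entity_name).
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated_utterance(line):
    """Split a simple annotated example into (intent, text, entities).

    Hypothetical helper for illustration only; templates, variables,
    and other SAL features are deliberately not handled.
    """
    intent_token, _, rest = line.strip().partition(" ")
    if not intent_token.startswith("*"):
        raise ValueError("expected the line to start with '*intent'")
    intent = intent_token[1:]
    entities = [
        {"type": match.group(2), "value": match.group(1)}
        for match in ENTITY_PATTERN.finditer(rest)
    ]
    # Replace each annotation with its bare value to recover the spoken text.
    text = ENTITY_PATTERN.sub(r"\1", rest)
    return intent, text, entities


intent, text, entities = parse_annotated_utterance(
    "*search do you have [blue](color) [jackets](product)"
)
print(intent)    # search
print(text)      # do you have blue jackets
print(entities)
```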

The intent and entities are then returned to your application, and based on these your application can carry out the correct action. In this case the application should update a search result view to show only blue jackets.
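On the application side, acting on the result could look roughly like the following Python sketch. The data shape (an intent string plus a list of `type`/`value` dictionaries) and the function name are assumptions made for illustration; the actual structure depends on the client library or API you use.

```python
def apply_search_result(intent, entities, filters=None):
    """Return updated search filters for a 'search' intent.

    Hypothetical handler: any other intent leaves the filters unchanged.
    """
    updated = dict(filters or {})
    if intent != "search":
        return updated
    for entity in entities:
        # Only the entity types defined in the example configuration are used.
        if entity["type"] in ("color", "product"):
            updated[entity["type"]] = entity["value"]
    return updated


filters = apply_search_result(
    "search",
    [
        {"type": "color", "value": "blue"},
        {"type": "product", "value": "jackets"},
    ],
)
print(filters)
```

The search result view would then be refreshed using the returned filters.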

How many example utterances do I need to provide?

A configuration must contain at least a few example utterances for every functionality of your application. In general, the more example utterances you can provide, the better the model will be.

Typing out many example utterances can be a tedious task. To make things easier, SAL syntax allows configurations to be written using compact templates that are then expanded into a large set of example utterances during model training.

For example, the configuration:

product = [t shirts | hoodies | jackets | jeans | slacks | shorts | sneakers | sandals]
color = [black | white | blue | red | green | yellow | purple | brown | gray]
*search do you have $color(color) $product(product)

declares two variables, product and color, and assigns a list of relevant values to each. The third line defines a template that generates 72 example utterances (9 colors × 8 products). Each starts with "do you have", followed by a color entity and a product entity, with their values taken from the respective lists. Among them are:

*search do you have [black](color) [t shirts](product)
*search do you have [white](color) [t shirts](product)
*search do you have [blue](color) [t shirts](product)
*search do you have [gray](color) [sandals](product)

All of these 72 example utterances are compactly defined just by the three lines of "code" above.
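The expansion can be reproduced with a short Python sketch that takes the Cartesian product of the two lists. This only mimics the effect of the template for this one example; it is not how Speechly expands SAL internally, and the function name is hypothetical.

```python
from itertools import product as cartesian_product

products = ["t shirts", "hoodies", "jackets", "jeans",
            "slacks", "shorts", "sneakers", "sandals"]
colors = ["black", "white", "blue", "red", "green",
          "yellow", "purple", "brown", "gray"]


def expand_search_template(colors, products):
    """Generate one annotated utterance per (product, color) combination."""
    return [
        f"*search do you have [{color}](color) [{item}](product)"
        for item, color in cartesian_product(products, colors)
    ]


utterances = expand_search_template(colors, products)
print(len(utterances))   # 72
print(utterances[0])     # *search do you have [black](color) [t shirts](product)
print(utterances[-1])    # *search do you have [gray](color) [sandals](product)
```

Adding one more color to the list would yield 80 utterances, which is why templates scale much better than typing examples by hand.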

It is useful to think of preparing the example utterances as the task of "programming" a data generator. You can learn more about how this is done from the SAL reference.


You can see the example utterances that are generated from the templates using either the Show sample utterances button in Speechly Dashboard or the sample command in Speechly CLI.

What is a well-designed example utterance?

Since Speechly is a spoken language understanding system, it is important to use example utterances that reflect as precisely as possible how users talk. An example utterance is probably good if it sounds natural when spoken aloud.

Notice that how something is spoken can depend on the context. For example, the number 16500 could be either the price of a car or a US ZIP code. However, it is spoken quite differently depending on the context:

sixteen thousand five hundred → price
one six five zero zero → ZIP code

A good configuration takes such contextual details into account.
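When example utterances are generated programmatically, the two spoken forms can be produced from the same number. The Python sketch below is one possible way to do this; the helper names are hypothetical, the cardinal spelling covers only 0 to 999,999, and it follows a simple convention without "and" or hyphens.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def spell_digits(number_text):
    """Spell a digit string one digit at a time, e.g. for a ZIP code."""
    return " ".join(ONES[int(digit)] for digit in number_text)


def _spell_below_thousand(n):
    """Return the words for 1..999 as a list (empty list for 0)."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return words


def spell_cardinal(n):
    """Spell an integer in 0..999999 as a quantity, e.g. for a price."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        words += _spell_below_thousand(thousands) + ["thousand"]
    words += _spell_below_thousand(rest)
    return " ".join(words)


print(spell_cardinal(16500))   # sixteen thousand five hundred
print(spell_digits("16500"))   # one six five zero zero
```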


To preview your application, use the Preview tab in Speechly Dashboard. See Previewing your application to learn more.

How do intents and entities appear in my application?

Our spoken language understanding system extracts intents and entities from the user’s speech input and returns these to your application.

You can use the Speechly Streaming API directly or one of our client libraries to obtain this information and perform the desired downstream tasks in your application.

Also, check out Using React client for React-specific usage documentation.

Alter the application behavior

To change the behavior of your application and the NLU engine, the following settings are available:

  1. Silence triggered segmentation: duration of silence in milliseconds that creates a new segment
  2. Faster intent detection: controls intent detection speed
  3. Rule-based NLU: controls NLU tagging accuracy
  4. Non-streaming NLU: controls when intents and entities are returned by the system

You can enable them in Speechly Dashboard by going to Application → Overview → Preferences, or by adding them to your config.yaml.