API Reference

API reference for the Speechly API

Speechly gRPC API Reference

The following lists the most important APIs used for spoken language understanding. This document is based on the API definitions on GitHub: https://github.com/speechly/api.

APIs:

speechly.slu.v1.SLU

Service that implements Speechly SLU (Spoken Language Understanding) API.

To use this service you MUST use an access token from the Speechly Identity API. The token MUST be passed in gRPC metadata under the Authorization key, with Bearer $ACCESS_TOKEN as the value, e.g. in Go:

ctx := context.Background()
ctx = metadata.AppendToOutgoingContext(ctx, "Authorization", "Bearer "+accessToken)
stream, err := speechlySLUClient.Stream(ctx)

Methods

name | request | response | description
Stream | SLURequest stream | SLUResponse stream | Performs bidirectional streaming speech recognition: receive results while sending audio.

First request MUST be an SLUConfig message with the configuration that describes the audio format being sent.

This RPC can handle multiple logical audio segments with the use of SLUEvent_START and SLUEvent_STOP messages,
which are used to indicate the beginning and the end of a segment.

A typical call timeline will look like this:

1. Client starts the RPC.
2. Client sends SLUConfig message with audio configuration.
3. Client sends SLUEvent.START.
4. Client sends audio and receives responses from the server.
5. Client sends SLUEvent.STOP.
6. Client sends SLUEvent.START.
7. Client sends audio and receives responses from the server.
8. Client sends SLUEvent.STOP.
9. Client closes the stream and receives responses from the server until EOF is received.

NB: the client does not have to wait until the server acknowledges the start and stop events;
acknowledgement happens asynchronously. The client can deduplicate responses based on the audio context ID,
which is present in every response message.
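
The call timeline above can be sketched as a function that produces the ordered client messages for one RPC. This is only an illustration: the request type and the buildCall function are local, hypothetical stand-ins for the generated gRPC messages, not part of any Speechly client library. A real client would send each message on the stream returned by Stream.

```go
package main

import "fmt"

// request is a local stand-in for SLURequest (hypothetical type):
// each message carries a config, an event, or a chunk of audio.
type request struct {
	kind  string // "config", "start", "stop" or "audio"
	audio []byte // set only when kind == "audio"
}

// buildCall returns the ordered client messages for one RPC that
// carries the given audio contexts, each already split into chunks.
func buildCall(contexts [][][]byte) []request {
	// The first message must always be the audio configuration.
	msgs := []request{{kind: "config"}}
	for _, chunks := range contexts {
		// Every audio context is wrapped in a start/stop event pair.
		msgs = append(msgs, request{kind: "start"})
		for _, c := range chunks {
			msgs = append(msgs, request{kind: "audio", audio: c})
		}
		msgs = append(msgs, request{kind: "stop"})
	}
	return msgs
}

func main() {
	call := buildCall([][][]byte{
		{{0x01, 0x02}, {0x03, 0x04}}, // first audio context, two chunks
		{{0x05, 0x06}},               // second audio context, one chunk
	})
	for i, m := range call {
		fmt.Printf("%d: %s\n", i+1, m.kind)
	}
}
```

After the last stop message the client would close its side of the stream and keep reading responses until EOF.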

speechly.slu.v1.WLU

Service that implements Speechly WLU (Written Language Understanding).

To use this service you MUST use an access token from the Speechly Identity API. The token MUST be passed in gRPC metadata under the Authorization key, with Bearer $ACCESS_TOKEN as the value, e.g. in Go:

ctx := context.Background()
ctx = metadata.AppendToOutgoingContext(ctx, "Authorization", "Bearer "+accessToken)
res, err := speechlyWLUClient.Text(ctx, req)

Methods

name | request | response | description
Text | WLURequest | WLUResponse | Performs recognition of a text in the specified language.

Messages

SLUConfig

Describes the configuration of the audio sent by the client. Currently the API only supports single-channel Linear PCM with sample rate of 16 kHz.

Fields

name | type | description
encoding | Encoding | The encoding of the audio data sent in the stream. Required.
channels | int32 | The number of channels in the input audio data. Required.
sample_rate_hertz | int32 | Sample rate in Hertz of the audio data sent in the stream. Required.
language_code | string | The language of the audio sent in the stream as a BCP-47 language tag (e.g. “en-US”). Defaults to the target application language.
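
Because the API currently accepts only single-channel audio sampled at 16 kHz, a client may want to check its configuration before opening the stream. The following is a minimal sketch under that assumption. The sluConfig struct and its validate method are hypothetical local names, not the generated protobuf type, and the encoding field is left out because its enum values are not listed in this document.

```go
package main

import "fmt"

// sluConfig mirrors some of the SLUConfig fields described above as a
// plain local struct (hypothetical; not the generated protobuf type).
type sluConfig struct {
	Channels        int32
	SampleRateHertz int32
	LanguageCode    string // optional; empty means the application default
}

// validate rejects configurations that the API is documented not to
// support: anything other than single-channel audio sampled at 16 kHz.
func (c sluConfig) validate() error {
	if c.Channels != 1 {
		return fmt.Errorf("channels = %d, only single-channel audio is supported", c.Channels)
	}
	if c.SampleRateHertz != 16000 {
		return fmt.Errorf("sample_rate_hertz = %d, only 16000 is supported", c.SampleRateHertz)
	}
	return nil
}

func main() {
	configs := []sluConfig{
		{Channels: 1, SampleRateHertz: 16000, LanguageCode: "en-US"},
		{Channels: 2, SampleRateHertz: 44100},
	}
	for _, c := range configs {
		if err := c.validate(); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", c.Channels, "channel(s) at", c.SampleRateHertz, "Hz")
	}
}
```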

SLUEntity

Describes an SLU entity.

An entity is a specific object in the phrase that falls into some kind of category. For example, in the SAL example “*book book a [burger restaurant](restaurant_type) for [tomorrow](date)”, “burger restaurant” is an entity of type restaurant_type and “tomorrow” is an entity of type date.

An entity has start and end indices which map to the indices of words in SLUTranscript messages. For example, in the phrase “book a burger restaurant for tomorrow” (word indices 0 to 5) they would be:

  • Entity “burger restaurant” - start_position = 2, end_position = 4
  • Entity “tomorrow” - start_position = 5, end_position = 6

The start index is inclusive, but the end index is exclusive, i.e. the interval is [start_position, end_position).
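
With a half-open interval [start_position, end_position), the words covered by an entity can be recovered by slicing the list of transcript words. The sketch below assumes zero-based word indices, so “burger restaurant” occupies indices 2 and 3 and its interval is [2, 4). The entity struct and the entityWords function are hypothetical local helpers, not generated API types.

```go
package main

import (
	"fmt"
	"strings"
)

// entity is a local stand-in for SLUEntity (hypothetical Go struct).
type entity struct {
	Type          string
	StartPosition int // inclusive word index
	EndPosition   int // exclusive word index
}

// entityWords returns the words covered by e, treating its positions
// as the half-open interval [StartPosition, EndPosition).
func entityWords(words []string, e entity) (string, error) {
	if e.StartPosition < 0 || e.EndPosition > len(words) || e.StartPosition >= e.EndPosition {
		return "", fmt.Errorf("invalid span [%d, %d) for %d words",
			e.StartPosition, e.EndPosition, len(words))
	}
	return strings.Join(words[e.StartPosition:e.EndPosition], " "), nil
}

func main() {
	words := strings.Fields("book a burger restaurant for tomorrow")
	entities := []entity{
		{Type: "restaurant_type", StartPosition: 2, EndPosition: 4},
		{Type: "date", StartPosition: 5, EndPosition: 6},
	}
	for _, e := range entities {
		value, err := entityWords(words, e)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s = %q\n", e.Type, value)
	}
}
```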

Fields

name | type | description
entity | string | The type of the entity, e.g. restaurant_type or date.
value | string | The value of the entity, e.g. burger restaurant or tomorrow.
start_position | int32 | The starting index of the entity in the phrase, maps to the index field in SLUTranscript. Inclusive.
end_position | int32 | The finishing index of the entity in the phrase, maps to the index field in SLUTranscript. Exclusive.

SLUError

Describes the error that happened when processing an audio context. DEPRECATED: Will not be returned. Any errors are returned as gRPC status codes with detail messages.

Fields

name | type | description
code | string | Error code (refer to documentation for specific codes).
message | string | Error message.

SLUEvent

Indicates the beginning and the end of a logical audio segment (audio context in Speechly terms).

Fields

name | type | description
event | Event | The event type being sent. Required.
app_id | string | The appId for the utterance. Required in the START event if the authorization token is project based. The given application must be part of the project set in the token. Not required if the authorization token is application based.

SLUFinished

Indicates that the API has stopped processing current audio context. It guarantees that no new messages for that context will be sent by the server.

Fields

name | type | description
error | SLUError | DEPRECATED. An error which has happened when processing the context, if any.

SLUIntent

Describes an SLU intent. There can be only one intent per SLU segment.

Fields

name | type | description
intent | string | The value of the intent, as defined in SAL.

SLURequest

Top-level message sent by the client for the Stream method.

Fields

name | type | description
config | SLUConfig | Describes the configuration of the audio sent by the client. MUST be the first message sent to the stream.
event | SLUEvent | Indicates the beginning and the end of a logical audio segment (audio context in Speechly terms). A context MUST be preceded by a start event and concluded with a stop event, otherwise the server WILL terminate the stream with an error.
audio | bytes | Contains a chunk of the audio being streamed.

SLUResponse

Top-level message sent by the server for the Stream method.

Fields

name | type | description
audio_context | string | The ID of the audio context that this response belongs to.
segment_id | int32 | The ID of the SLU segment that this response belongs to. This will be 0 for SLUStarted and SLUFinished responses.
transcript | SLUTranscript | Final SLU transcript.
entity | SLUEntity | Final SLU entity.
intent | SLUIntent | Final SLU intent.
segment_end | SLUSegmentEnd | A special marker message that indicates that the segment with specified segment_id has been finalised and no new responses belonging to that segment will be sent. The client is expected to discard any tentative responses in this segment.
tentative_transcript | SLUTentativeTranscript | Tentative SLU transcript.
tentative_entities | SLUTentativeEntities | Tentative SLU entities.
tentative_intent | SLUIntent | Tentative SLU intent.
started | SLUStarted | A special marker message that indicates that the audio context with specified audio_context id has been started by the API and all audio data sent by the client will be processed in that context. This message is an asynchronous acknowledgement for client-side SLUEvent_START message.
finished | SLUFinished | A special marker message that indicates that the audio context with specified audio_context id has been stopped by the API and no new responses for that context will be sent. The client is expected to discard any non-finalised segments. This message is an asynchronous acknowledgement for client-side SLUEvent_STOP message.
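
One way a client might combine final and tentative responses is to keep a small state object per segment, keyed by audio context and segment ID. The sketch below is an assumption about client-side bookkeeping, not a description of how any Speechly client library works. All types and the apply function are hypothetical and heavily simplified: only transcripts and the segment_end marker are modelled.

```go
package main

import "fmt"

// segmentKey identifies a segment within an audio context.
type segmentKey struct {
	AudioContext string
	SegmentID    int32
}

// response is a simplified local stand-in for SLUResponse (hypothetical
// type): at most one of the payload fields below is set per message.
type response struct {
	AudioContext   string
	SegmentID      int32
	Word           string   // final transcript word, if set
	TentativeWords []string // aggregated tentative words, if set
	SegmentEnd     bool     // true for the segment_end marker
}

// segmentState accumulates what is known about one segment.
type segmentState struct {
	Final     []string
	Tentative []string
	Done      bool
}

// apply folds one response into the per-segment state.
func apply(state map[segmentKey]*segmentState, r response) {
	key := segmentKey{r.AudioContext, r.SegmentID}
	s, ok := state[key]
	if !ok {
		s = &segmentState{}
		state[key] = s
	}
	switch {
	case s.Done:
		// Messages for an already finalised segment are discarded.
	case r.SegmentEnd:
		s.Done = true
		s.Tentative = nil // tentative results are dropped on segment end
	case r.Word != "":
		s.Final = append(s.Final, r.Word)
	case r.TentativeWords != nil:
		// Tentative transcripts are aggregated, so the latest one
		// replaces the previous one instead of being appended.
		s.Tentative = r.TentativeWords
	}
}

func main() {
	state := map[segmentKey]*segmentState{}
	for _, r := range []response{
		{AudioContext: "ctx-1", SegmentID: 1, TentativeWords: []string{"find", "me", "a", "tea"}},
		{AudioContext: "ctx-1", SegmentID: 1, Word: "find"},
		{AudioContext: "ctx-1", SegmentID: 1, Word: "me"},
		{AudioContext: "ctx-1", SegmentID: 1, SegmentEnd: true},
	} {
		apply(state, r)
	}
	s := state[segmentKey{"ctx-1", 1}]
	fmt.Println(s.Final, s.Tentative, s.Done)
}
```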

SLUSegmentEnd

Indicates the end of the segment. Upon receiving this, the segment should be finalised and all future messages for that segment (if any) discarded.

Fields

This message has no fields.

SLUStarted

Indicates that the API has started processing the portion of audio as new audio context. This does not guarantee that the server will not send any more messages for the previous audio context.

Fields

This message has no fields.

SLUTentativeEntities

Describes tentative entities.

Fields

name | type | description
tentative_entities | SLUEntity | A list of entities, which must be treated as tentative.

This is not an aggregate of all entities in the audio,
but rather it ONLY contains entities that have not been finalised yet.

e.g. if at the start there are two tentatively recognised entities - [“burger restaurant”, “tomorrow”]
but then the API marks “burger restaurant” as final and recognises a new tentative entity “for two”,
this will contain [“tomorrow”, “for two”].

SLUTentativeTranscript

Describes a tentative transcript.

Tentative transcript is an interim recognition result, which may change over time, e.g. a phrase “find me a red t-shirt” can be tentatively recognised as “find me a tea”, until the API processes the audio completely.

Fields

name | type | description
tentative_transcript | string | Aggregated tentative transcript from the beginning of the audio until the current moment in time. Each consecutive message extends this value, e.g. if in the first message it’s “find me”, in the next it may be “find me a t-shirt”.
tentative_words | SLUTranscript | A list of individual words which compose tentative_transcript. All words must be considered tentative.

SLUTranscript

Describes an SLU transcript. A transcript is a speech-to-text element of the phrase, i.e. a word recognised from the audio.

Fields

name | type | description
word | string | The word recognised from the audio.
index | int32 | The position of the word in the whole phrase, zero-based.
start_time | int32 | The start time of the word in the audio, in milliseconds from the beginning of the audio.
end_time | int32 | The end time of the word in the audio, in milliseconds from the beginning of the audio.

WLUEntity

Describes a single entity in a segment.

An entity is a specific object in the phrase that falls into some kind of category. For example, in the SAL example “*book book a [burger restaurant](restaurant_type) for [tomorrow](date)”, “burger restaurant” is an entity of type restaurant_type and “tomorrow” is an entity of type date.

An entity has start and end indices which map to the indices of words in WLUToken messages. For example, in the phrase “book a burger restaurant for tomorrow” (word indices 0 to 5) they would be:

  • Entity “burger restaurant” - start_position = 2, end_position = 4
  • Entity “tomorrow” - start_position = 5, end_position = 6

The start index is inclusive, but the end index is exclusive, i.e. the interval is [start_position, end_position).

Fields

name | type | description
entity | string | The type of the entity, e.g. restaurant_type or date.
value | string | The value of the entity, e.g. burger restaurant or tomorrow.
start_position | int32 | The starting index of the entity in the phrase, maps to the index field in WLUToken. Inclusive.
end_position | int32 | The finishing index of the entity in the phrase, maps to the index field in WLUToken. Exclusive.

WLUIntent

Describes the intent of a segment. There can only be one intent per segment.

Fields

name | type | description
intent | string | The value of the intent, as defined in SAL.

WLURequest

Top-level message sent by the client for the Text method.

Fields

name | type | description
language_code | string | The language of the text sent in the request as a BCP-47 language tag (e.g. “en-US”). Required.
text | string | The text to recognise. Required.

WLUResponse

Top-level message sent by the server for the Text method.

Fields

name | type | description
segments | WLUSegment | A list of WLU segments.

WLUSegment

Describes a WLU segment. A segment is a logical portion of text denoted by its intent, e.g. in a phrase “book me a flight and rent a car” there would be a segment for “book me a flight” and another for “rent a car”.

Fields

name | type | description
text | string | The portion of text that contains this segment.
tokens | WLUToken | The list of word tokens which are contained in this segment.
entities | WLUEntity | The list of entities which are contained in this segment.
intent | WLUIntent | The intent that defines this segment.

WLUToken

Describes a single word token in a segment.

Fields

name | type | description
word | string | The value of the word.
index | int32 | Position of the token in the text.

speechly.identity.v2.IdentityAPI

Speechly Identity API is used for creating access tokens for the Speechly APIs.

Methods

name | request | response | description
Login | LoginRequest | LoginResponse | Performs a login of a specific Speechly application. Returns an access token which can be used to access the Speechly API.

Messages

ApplicationScope

Used as the scope in LoginRequest when the access is for a single Speechly application.

Fields

name | type | description
app_id | string | Speechly application ID. The defined application can be accessed with the returned token. Required.
config_id | string | Define a specific model configuration to use. Defaults to the application’s latest configuration.

LoginRequest

Top-level message sent by the client for the Login method.

Fields

name | type | description
device_id | string | A unique end-user device identifier. Must be a UUID. Required.
application | ApplicationScope | Login scope application: use the given application context for all utterances.
project | ProjectScope | Login scope project: define the target application per utterance. The target applications must be located in the same project.
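
Since device_id must be a UUID, a client needs some way to produce one. The sketch below generates a random (version 4) UUID using only the Go standard library. The choice of version 4 is an assumption, because this document does not say which UUID version is expected, and newDeviceID is a hypothetical helper name. As the field identifies an end-user device, a client would presumably generate the value once and store it rather than create a new one for every login.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newDeviceID returns a random (version 4) UUID in its canonical
// 8-4-4-4-12 hexadecimal text form.
func newDeviceID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newDeviceID()
	if err != nil {
		fmt.Println("could not generate device ID:", err)
		return
	}
	fmt.Println("device_id:", id)
}
```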

LoginResponse

Top-level message returned by the server for the Login method.

Fields

name | type | description
token | string | Access token which can be used for the Speechly API. The token is a JSON Web Token and includes all standard claims, as well as custom ones. The token expires, so you should check whether it has expired before using it. It is safe to cache the token for future use until its expiration date.
valid_for_s | uint32 | Number of seconds the returned token is valid for.
expires_at_epoch | uint64 | Token expiration time in seconds since 1970-01-01 (“unix time”).
expires_at | string | ISO-formatted UTC timestamp of the expiration time of the returned token.
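
The caching advice for the token can be sketched as a small check against expires_at_epoch. The cachedToken struct, the usableAt method and the 30-second safety margin are all hypothetical choices made for this example and are not part of the API.

```go
package main

import (
	"fmt"
	"time"
)

// cachedToken keeps the LoginResponse fields needed for caching
// (a local stand-in, not the generated protobuf type).
type cachedToken struct {
	Token          string
	ExpiresAtEpoch uint64 // seconds since 1970-01-01 UTC
}

// usableAt reports whether the token can still be used at nowEpoch,
// keeping marginSeconds in reserve so that it does not expire in the
// middle of a request.
func (t cachedToken) usableAt(nowEpoch, marginSeconds uint64) bool {
	return t.Token != "" && nowEpoch+marginSeconds < t.ExpiresAtEpoch
}

func main() {
	now := uint64(time.Now().Unix())
	// Pretend the server returned a token that is valid for one hour.
	tok := cachedToken{Token: "example-token", ExpiresAtEpoch: now + 3600}

	if tok.usableAt(now, 30) {
		fmt.Println("reusing cached token")
	} else {
		fmt.Println("token missing or expired, calling Login again")
	}
}
```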

ProjectScope

Used as the scope in LoginRequest when access is required for every application in a Speechly project.

Fields

name | type | description
project_id | string | Speechly project ID. Every application in the same project is accessible with the same token. Required.


Last updated by Markus Lång on April 13, 2021 at 10:21 +0300
