ASR Streaming
Service: wss://asr.api.yating.tw/ws/v1/
With streaming speech recognition you can stream audio to the service and receive real-time recognition results as the audio is processed. You can also use a custom language model to enhance recognition accuracy.
To connect to the ASR streaming service through a websocket, you must first generate a one-time token.
There are two steps:
1. Get a one-time token with your API key; see "Generate a one-time token" for more information.
2. Build a websocket connection with the token.
Generate a one-time token
The token can only be used once and has an expiration time of 60 seconds.
Request
URL: https://asr.api.yating.tw/v1/token
Method: POST
Header

| Name | Type | Info |
| --- | --- | --- |
| *key | String | Your API key |
| *Content-Type | String | |
Body

| Name | Type | Info |
| --- | --- | --- |
| *pipeline | String | Put language code here. See language codes. |
| options | | |
Response

| Name | Type | Info |
| --- | --- | --- |
| auth_token | String | |

Example

| Status | Description |
| --- | --- |
| Success | HTTP status code: 201 |
| Failed | HTTP status code: 401 (the key does not exist or the key is not available for the service); 5xx (please contact our CS team to solve this issue) |
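For reference, a minimal Python sketch of the token request might look like the following. The header and body field names come from the tables above; the application/json Content-Type and the placeholder API key are assumptions.

```python
# Minimal sketch of requesting a one-time token (expires after 60 seconds).
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your API key

resp = requests.post(
    "https://asr.api.yating.tw/v1/token",
    headers={"key": API_KEY, "Content-Type": "application/json"},
    json={"pipeline": "asr-zh-tw-std"},  # language code, see "Language codes"
)
resp.raise_for_status()  # a successful request returns HTTP 201
token = resp.json()["auth_token"]
print(token)
```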
Create a websocket connection
URL: wss://asr.api.yating.tw/ws/v1/
Query: token=[API Token]
Example: wss://asr.api.yating.tw/ws/v1/?token=YOUR_TOKEN
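A minimal sketch of opening the connection with the Python websockets library, assuming you already hold a valid one-time token:

```python
# Minimal sketch: open the streaming connection with a one-time token.
# Uses the third-party "websockets" asyncio library.
import asyncio
import websockets

async def connect(token: str):
    url = f"wss://asr.api.yating.tw/ws/v1/?token={token}"
    async with websockets.connect(url) as ws:
        # The first server message should report status "ok"
        # (see "Receive responses" below).
        print(await ws.recv())

asyncio.run(connect("YOUR_TOKEN"))
```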
Send audio data
Chunked transfer encoding is a streaming data transfer mechanism. In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks". The chunks are sent and received independently of each other. Send each chunk as a binary frame; the chunk size should be 2000 bytes, roughly 1/16 of a second of audio.
Send audio data with binary frames
Audio data format: 16 kHz, mono, 16 bits per sample, PCM
Data rate: 16000 samples/s x 1 channel x 16 bits / 8 = 32000 bytes/s ~= 32 KB/s
Each chunk: 2000 bytes, roughly 1/16 second
Client to Server websocket transmission example:
[PCM 16bit binary audio chunk]
[PCM 16bit binary audio chunk]
...
[EOF: empty audio chunk] # (optional) send a chunk with 0 length to end a sentence.
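As a sketch of the transmission above, the following streams a raw 16 kHz, mono, 16-bit PCM file in 2000-byte binary chunks over an already-open connection. The file name and the real-time pacing are assumptions; the trailing empty chunk is the optional end-of-sentence marker.

```python
# Minimal sketch: stream a raw PCM file (16 kHz, mono, 16-bit) in 2000-byte chunks.
import asyncio

CHUNK_SIZE = 2000  # bytes, roughly 1/16 second of audio

async def stream_audio(ws, path: str = "speech.pcm"):  # hypothetical file name
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            await ws.send(chunk)         # binary frame
            await asyncio.sleep(1 / 16)  # pace the stream at roughly real time
    await ws.send(b"")                   # optional: empty chunk to end the sentence
```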
Receive responses
ASR divides the streaming input audio into segments based on whether human speech is detected. The recognition result for a segment is finalized once that segment ends. The response records the state of recognition; the following table lists all possible keys:
| Key | Explanation |
| --- | --- |
| status | ok: the websocket connection is built successfully. error: an error occurred and the connection will be disconnected within 3 seconds (detail: detailed error message). |
| asr_state | first_chunk_received: the first chunk has been received. utterance_begin: a sentence starts. utterance_end: a sentence ends. |
| asr_sentence | The ASR result. It might change while "asr_final" is not yet true. |
| asr_confidence | Sentence confidence score. |
| asr_final | The sentence ends when this is true. |
| asr_begin_time | Time period between the websocket connection start time and the sentence start time. |
| asr_end_time | Time period between the websocket connection start time and the sentence end time. |
| asr_word_time_stamp | Word-level timestamps. |
| asr_eof | No audio frames remain in the ASR buffer. |
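A sketch of a receive loop, assuming each server message is a JSON object carrying the keys listed above (the exact message envelope is not specified here, so the field access below is an assumption):

```python
# Minimal sketch: read server messages and print finalized sentences.
# Assumes each message is a JSON object with the keys from the table above.
import json

async def receive_results(ws):
    async for message in ws:
        data = json.loads(message)
        if data.get("asr_final"):
            # Once asr_final is true, the sentence will not change.
            print(data.get("asr_sentence"), data.get("asr_confidence"))
        if data.get("asr_eof"):
            break  # no audio frames remain in the ASR buffer
```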
Example:

| Time | Condition | From server side to client side |
| --- | --- | --- |
| T | | |
| T+1 | After receiving a chunk | |
| T+2 | The ASR model recognizes the first speech | |
| T+4 | The value in the sentence may change before the sentence is ended | |
| T+5 | | |
| T+6 | | |
| T+7 | | |
| … | … | … |
| T+N | When final == true, the sentence is fixed and will not change | |
| T+N+1 | | |
Language codes

| Language code | Info | Language |
| --- | --- | --- |
| asr-zh-en-std | Use it when speakers speak Chinese more than English | Mandarin and English |
| asr-zh-tw-std | Use it when speakers speak Chinese and Taiwanese | Mandarin and Taiwanese |
| asr-en-std | English | English |
| asr-jp-std | Japanese | Japanese |