Documentation

ASR Streaming

Service: wss://asr.api.yating.tw/ws/v1/
With streaming speech recognition you can stream audio to the service and receive recognition results in real time as the audio is processed. You can also use a custom language model to enhance recognition accuracy.
To simplify your integration, we recommend using the ASR Python SDK instead.
If you prefer to connect to the ASR streaming service through a websocket, you must first generate a one-time token. There are two steps:
1. Get a one-time token with your API key; see "Generate a one-time token" for more information.
2. Build a websocket connection with the token.

Generate a one-time token

The token can only be used once and has an expiration time of 60 seconds.
Request
URL: https://asr.api.yating.tw/v1/token
Method: POST
Header
Name           Type    Info
*key           String  Your API key
*Content-Type  String  application/json
Example:
{
  "key": "Put your API-Key here",
  "Content-Type": "application/json"
}
Body
Name       Type    Info
*pipeline  String  Language code; see "Language codes" below
options    Object  s3CusModelKey: your custom language model ID
Example:
{
   "pipeline":"asr-zh-en-std",
   "options":{
      "s3CusModelKey":"custom_language_model_id"
   }
}
Response
Name        Type    Info
auth_token  String  One-time token for the websocket connection
Example:
{
  "success": true,
  "auth_token": "4e87d0db167271245234e6a80522a33e81c3eb90"
}
Status   Description
Success  HTTP status code 201
Failed   HTTP status code 401: the key does not exist or is not available for this service
         HTTP status code 5xx: please contact our CS team to resolve the issue
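The token request above can be sketched in Python using only the standard library; the function names here are our own, not part of the official SDK:

```python
import json
import urllib.request

TOKEN_URL = "https://asr.api.yating.tw/v1/token"

def build_token_request(api_key, pipeline, custom_model_id=None):
    """Build the headers and JSON body described in the tables above."""
    headers = {"key": api_key, "Content-Type": "application/json"}
    body = {"pipeline": pipeline}
    if custom_model_id:
        body["options"] = {"s3CusModelKey": custom_model_id}
    return headers, body

def get_token(api_key, pipeline, custom_model_id=None):
    """POST to the token endpoint; a 201 response carries the one-time token."""
    headers, body = build_token_request(api_key, pipeline, custom_model_id)
    request = urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["auth_token"]
```

Remember that the returned token can be used only once and expires after 60 seconds, so fetch it right before opening the websocket.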

Create a websocket connection

More information about websockets: RFC 6455 (https://tools.ietf.org/html/rfc6455)
URL: wss://asr.api.yating.tw/ws/v1/
Query: token=[API Token]
Example: wss://asr.api.yating.tw/ws/v1/?token=YOUR_TOKEN
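Building the connection URL can be sketched as follows; the helper name is ours, and the commented usage assumes the third-party websocket-client package:

```python
from urllib.parse import urlencode

WS_BASE_URL = "wss://asr.api.yating.tw/ws/v1/"

def websocket_url(token):
    """Append the one-time token as the `token` query parameter."""
    return WS_BASE_URL + "?" + urlencode({"token": token})

# With the websocket-client package, the URL can then be opened as:
#   ws = websocket.create_connection(websocket_url(token))
```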

Send audio data

The audio stream is divided into a series of non-overlapping chunks, which are sent and received independently of each other. Send each chunk as a binary frame; the chunk size should be 2000 bytes, roughly 1/16 of a second of audio.
Send audio data with binary frame
Audio data format: 16 kHz, mono, 16 bits per sample, PCM
Data rate: 16000 samples/s x 1 channel x 2 bytes/sample = 32000 bytes/s, about 32 KB/s
Each chunk: 2000 bytes, about 1/16 s
Client to Server websocket transmission example:
[PCM 16bit binary audio chunk]
[PCM 16bit binary audio chunk]
...
[EOF: empty audio chunk] # (optional) send a chunk with 0 length to end a sentence.
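The chunking scheme above can be sketched as follows; the function names are our own, and send_binary stands in for whatever call your websocket library uses to send a binary frame:

```python
CHUNK_SIZE = 2000  # bytes: ~1/16 s of 16 kHz mono 16-bit PCM (32000 bytes/s)

def iter_chunks(pcm_bytes, chunk_size=CHUNK_SIZE):
    """Split raw PCM audio into non-overlapping fixed-size chunks."""
    for offset in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[offset:offset + chunk_size]

def stream_audio(send_binary, pcm_bytes, chunk_size=CHUNK_SIZE):
    """Send every chunk as a binary frame, then a zero-length chunk as EOF."""
    for chunk in iter_chunks(pcm_bytes, chunk_size):
        send_binary(chunk)
    send_binary(b"")  # optional: a zero-length chunk ends the sentence
```

For live microphone input, pace the sends at real time (about one 2000-byte chunk every 1/16 s) rather than bursting the whole buffer at once.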

Receive responses

ASR divides the streaming input audio into segments based on whether human speech is detected. The recognition result of a segment is finalized once the segment ends. The following list shows all keys that may appear in a response message:
Key
Explanation
status
ok: the websocket connection was built successfully.
error: an error occurred; the connection will be closed within 3 seconds. detail: detailed error message.
asr_state
first_chunk_received: the first chunk was received.
utterance_begin: a sentence starts.
utterance_end: a sentence ends.
asr_sentence
The ASR result. It may change while asr_final is not true.
asr_confidence
Sentence confidence score.
asr_final
When true, the sentence is final and will not change.
asr_begin_time
Time in seconds between the websocket connection start and the sentence start.
asr_end_time
Time in seconds between the websocket connection start and the sentence end.
asr_word_time_stamp
Begin and end times for each recognized word, in seconds.
asr_eof
No audio frames remain in the ASR buffer.
Example:
Time
Condition
Message (from server to client)
T
"status":"ok"
T+1
After receiving a chunk
"pipe":{
   "asr_state":"first_chunk_received"
}
T+2
The ASR model recognizes the first speech
"pipe":{
   "asr_state":"utterance_begin"
}
T+4
The sentence value may change until the sentence ends.
"pipe":{
   "asr_sentence":"金"
}
T+5
"pipe":{
   "asr_sentence":"今天天"
}
T+6
"pipe":{
   "asr_sentence":"今天天"
}
T+7
"pipe":{
   "asr_sentence":"今天天氣"
}
T+N
When asr_final is true, the sentence is fixed and will not change.
"pipe":{
   "asr_sentence":"今天天氣很好",
   "asr_final":true,
   "asr_begin_time":4.38600015640259,
   "asr_end_time":24.5459995269775,
   "asr_word_time_stamp":[
      {
         "word":"今天",
         "begin_time":4.38600015640259,
         "end_time":5.55634234543425
      },
      {
         "word":"天氣",
         "begin_time":5.86554,
         "end_time":6.2434
      },
      {
         "word":"很",
         "begin_time":6.7543,
         "end_time":8.2453234
      },
      {
         "word":"好",
         "begin_time":8.4567654,
         "end_time":9.01324324
      }
   ]
}
T+N+1
"pipe":{
   "asr_state":"utterance_end"
}

Language codes

Language code  Info                                               Language
asr-zh-en-std  Use when speakers speak Chinese more than English  Mandarin and English
asr-zh-tw-std  Use when speakers speak Chinese and Taiwanese      Mandarin and Taiwanese
asr-en-std     English                                            English
asr-jp-std     Japanese                                           Japanese

Samples