ASR Streaming
Service: wss://asr.api.yating.tw/ws/v1/
With streaming speech recognition you can stream audio to the service and receive real-time recognition results as the audio is processed. You can also use a custom language model to enhance recognition accuracy.
To connect to the ASR streaming service through a websocket, you must first generate a one-time token.
There are two steps:
1. Get a one-time token with your API key; see "Generate a one-time token" for more information.
2. Build a websocket connection with the token.
Generate a one-time token
The token can only be used once and has an expiration time of 60 seconds.
Request
URL: https://asr.api.yating.tw/v1/token
Method: POST
Header

| Name | Type | Info |
| --- | --- | --- |
| *key | String | Your API key |
| *Content-Type | String | |
Body

| Name | Type | Info |
| --- | --- | --- |
| *pipeline | String | Put language code here. See language codes. |
| options | | |
Response

| Name | Type | Info |
| --- | --- | --- |
| auth_token | String | |

Example

| Status | Description |
| --- | --- |
| Success | HTTP status code: 201 |
| Failed | HTTP status code: 401 (the key does not exist or the key is not available for the service); 5xx (please contact our CS team to solve this issue) |
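For reference, a minimal Python sketch of the token request might look like the following. The header and body field names come from the tables above; the application/json Content-Type and the placeholder API key are assumptions.

```python
# Minimal sketch of requesting a one-time token (expires after 60 seconds).
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your API key

resp = requests.post(
    "https://asr.api.yating.tw/v1/token",
    headers={"key": API_KEY, "Content-Type": "application/json"},
    json={"pipeline": "asr-zh-tw-std"},  # language code, see "Language codes"
)
resp.raise_for_status()  # a successful request returns HTTP 201
token = resp.json()["auth_token"]
print(token)
```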
Create a websocket connection
URL: wss://asr.api.yating.tw/ws/v1/
Query: token=[API Token]
Example: wss://asr.api.yating.tw/ws/v1/?token=YOUR_TOKEN
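A minimal sketch of opening the connection with the Python websockets library, assuming you already hold a valid one-time token:

```python
# Minimal sketch: open the streaming connection with a one-time token.
# Uses the third-party "websockets" asyncio library.
import asyncio
import websockets

async def connect(token: str):
    url = f"wss://asr.api.yating.tw/ws/v1/?token={token}"
    async with websockets.connect(url) as ws:
        # The first server message should report status "ok"
        # (see "Receive responses" below).
        print(await ws.recv())

asyncio.run(connect("YOUR_TOKEN"))
```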
Send audio data
Chunked transfer encoding is a streaming data transfer mechanism. In chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks". The chunks are sent and received independently of each other. Send each chunk as a binary frame; the chunk size should be 2000 bytes, roughly 1/16 of a second of audio.
Send audio data with binary frames
Audio data format: 16 kHz, mono, 16 bits per sample, PCM
Data rate: 16000 samples/s x 1 channel x 16 bits / 8 = 32000 bytes/s ~= 32 KB/s
Each chunk: 2000 bytes, roughly 1/16 second
Client to Server websocket transmission example:
[PCM 16bit binary audio chunk]
[PCM 16bit binary audio chunk]
...
[EOF: empty audio chunk] # (optional) send a chunk with 0 length to end a sentence.
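As a sketch of the transmission above, the following streams a raw 16 kHz, mono, 16-bit PCM file in 2000-byte binary chunks over an already-open connection. The file name and the real-time pacing are assumptions; the trailing empty chunk is the optional end-of-sentence marker.

```python
# Minimal sketch: stream a raw PCM file (16 kHz, mono, 16-bit) in 2000-byte chunks.
import asyncio

CHUNK_SIZE = 2000  # bytes, roughly 1/16 second of audio

async def stream_audio(ws, path: str = "speech.pcm"):  # hypothetical file name
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            await ws.send(chunk)         # binary frame
            await asyncio.sleep(1 / 16)  # pace the stream at roughly real time
    await ws.send(b"")                   # optional: empty chunk to end the sentence
```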
Receive responses
ASR divides the streaming input audio into segments based on whether human speech is detected. The recognition result for a segment is finalized once that segment ends. The response records the state of recognition; the following table lists all possible keys:
| Key | Explanation |
| --- | --- |
| status | ok: the websocket connection is built successfully. error: an error occurred and the connection will be disconnected within 3 seconds (detail: detailed error message). |
| asr_state | first_chunk_received: the first chunk has been received. utterance_begin: a sentence starts. utterance_end: a sentence ends. |
| asr_sentence | The ASR result. It might change while "asr_final" is not yet true. |
| asr_confidence | Sentence confidence score. |
| asr_final | The sentence ends when this is true. |
| asr_begin_time | Time period between the websocket connection start time and the sentence start time. |
| asr_end_time | Time period between the websocket connection start time and the sentence end time. |
| asr_word_time_stamp | Word-level timestamps. |
| asr_eof | No audio frames remain in the ASR buffer. |
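A sketch of a receive loop, assuming each server message is a JSON object carrying the keys listed above (the exact message envelope is not specified here, so the field access below is an assumption):

```python
# Minimal sketch: read server messages and print finalized sentences.
# Assumes each message is a JSON object with the keys from the table above.
import json

async def receive_results(ws):
    async for message in ws:
        data = json.loads(message)
        if data.get("asr_final"):
            # Once asr_final is true, the sentence will not change.
            print(data.get("asr_sentence"), data.get("asr_confidence"))
        if data.get("asr_eof"):
            break  # no audio frames remain in the ASR buffer
```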
Example:

| Time | Condition | From server side to client side |
| --- | --- | --- |
| T | | |
| T+1 | After receiving a chunk | |
| T+2 | The ASR model recognizes the first speech | |
| T+4 | The value in the sentence may change before the sentence is ended | |
| T+5 | | |
| T+6 | | |
| T+7 | | |
| … | … | … |
| T+N | When final == true, the sentence is fixed and will not change | |
| T+N+1 | | |
Language codes

| Language code | Info | Language |
| --- | --- | --- |
| asr-zh-en-std | Use it when speakers speak Chinese more than English | Mandarin and English |
| asr-zh-tw-std | Use it when speakers speak Chinese and Taiwanese | Mandarin and Taiwanese |
| asr-en-std | English | English |
| asr-jp-std | Japanese | Japanese |