Audio Transcriptions

Streaming

This API provides streaming speech-to-text transcriptions using WebSockets.

Endpoint: wss://api.sully.ai/v1/audio/transcriptions/stream?account_id=1234567890&api_token=1234567890&sample_rate=16000&language=en

X-API-KEY (string)

The API key to use for authentication. Required if X-API-TOKEN is not provided.

X-API-TOKEN (string)

The API token to use for authentication. Required if X-API-KEY is not provided.

X-ACCOUNT-ID (string)

The account ID to use for authentication.
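
The header rules above can be sketched as a small helper. This is only an illustration: `build_auth_headers` is a hypothetical name, not part of the API, and it assumes the account ID is sent as a header rather than as the `account_id` query parameter.

```python
from typing import Optional


def build_auth_headers(
    account_id: str,
    api_key: Optional[str] = None,
    api_token: Optional[str] = None,
) -> dict[str, str]:
    """Build the authentication headers for the WebSocket handshake.

    Hypothetical helper: one of api_key or api_token must be supplied,
    mirroring the "required if the other is not provided" rule.
    """
    if not api_key and not api_token:
        raise ValueError("Provide either an API key or an API token")

    headers = {"X-ACCOUNT-ID": account_id}
    if api_key:
        headers["X-API-KEY"] = api_key
    elif api_token:
        headers["X-API-TOKEN"] = api_token
    return headers
```

The returned dictionary can be passed to whichever WebSocket client you use, provided it supports custom handshake headers.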

Query parameters

language (string, default: "en")

The language of your submitted audio. See the Supported Languages documentation for a complete list of language options.

sample_rate (string)

The sample rate of your submitted audio in Hz (for example, 16000).

encoding (string)

Specifies the encoding format of the audio being sent.

Important: This parameter is required when transmitting raw, unstructured audio packets without headers. If the audio data is encapsulated within a container format, this parameter should be omitted.

Supported formats:

  • linear16: 16-bit, little-endian PCM audio

  • flac: Free Lossless Audio Codec (FLAC)

  • mulaw: Mu-law encoded WAV

  • amr-nb: Adaptive Multi-Rate, narrowband

  • amr-wb: Adaptive Multi-Rate, wideband

  • opus: Ogg Opus codec

  • speex: Speex codec

  • g729: G729 codec (usable with raw or containerized audio)
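
For raw linear16 audio, the chunk size in bytes follows directly from the sample rate: each sample is 2 bytes, so one second of mono audio at 16000 Hz is 32000 bytes. The sketch below splits a PCM buffer into fixed-duration chunks. The function name, the mono assumption, and the 100 ms default are illustrative choices, not requirements stated by the API.

```python
def chunk_linear16(pcm: bytes, sample_rate: int, chunk_ms: int = 100) -> list[bytes]:
    """Split mono 16-bit PCM into chunks of roughly chunk_ms milliseconds.

    Assumes mono audio; the 100 ms default is an arbitrary illustration.
    """
    bytes_per_sample = 2  # linear16 = 16 bits per sample
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    # Keep every chunk aligned to a whole number of samples.
    chunk_size -= chunk_size % bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("sample_rate and chunk_ms must yield at least one sample")
    return [pcm[i:i + chunk_size] for i in range(0, len(pcm), chunk_size)]
```

The last chunk may be shorter than the others when the buffer length is not a multiple of the chunk size.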

account_id (string)

The account ID to use for authentication. Required if X-ACCOUNT-ID is not provided.

api_token (string)

A temporary authentication token. Required if X-API-KEY is not provided.
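
As a minimal sketch, the query parameters above can be assembled into the endpoint URL with the standard library. `build_stream_url` is a hypothetical helper name, and the default values simply mirror the example endpoint at the top of this page.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://api.sully.ai/v1/audio/transcriptions/stream"


def build_stream_url(
    account_id: str,
    api_token: str,
    sample_rate: int = 16000,
    language: str = "en",
    encoding: Optional[str] = None,
) -> str:
    """Build the streaming endpoint URL from its query parameters."""
    params = {
        "account_id": account_id,
        "api_token": api_token,
        "sample_rate": str(sample_rate),
        "language": language,
    }
    # Only include encoding for raw, headerless audio; omit it for
    # audio wrapped in a container format.
    if encoding is not None:
        params["encoding"] = encoding
    return f"{BASE_URL}?{urlencode(params)}"
```

`urlencode` also percent-encodes any reserved characters in the token or account ID, which avoids malformed URLs.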

The Speech-to-Text WebSocket API is designed to generate text from partial audio input. It is well suited to scenarios where the input audio is streamed or generated in chunks.

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.

Connection Status Messages

Upon successful connection, the server sends a status message containing "status": "connected".

The server sends another status message when the connection closes.

Important: Wait for the "status": "connected" message before sending audio data. This ensures the server is ready to process your stream.
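
One way to apply this rule is to gate the first audio send on a small check of each incoming message. The sketch below assumes only that the status message is a JSON object with a "status" key set to "connected"; any other fields it may carry are not relied on, and `is_connected` is a hypothetical helper name.

```python
import json


def is_connected(raw_message: str) -> bool:
    """Return True if raw_message is the server's "connected" status message."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        # Not valid JSON, so it cannot be the status message.
        return False
    return isinstance(message, dict) and message.get("status") == "connected"
```

A client would read messages from the socket until `is_connected` returns True, and only then start streaming audio.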

Streaming input audio

The client can send messages with audio input to the server. The messages can contain the following fields:

audio (string, required)

A generated partial audio chunk encoded as a base64 string.

Browser MediaRecorder Notice: When using Chrome’s MediaRecorder API, the first audio chunk contains critical header information. Always send this first chunk for proper audio processing. Failing to include header information may result in transcription errors or complete failure.
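
A minimal sketch of building an input message from a raw audio chunk follows. It uses only the `audio` field described above; `build_audio_message` is a hypothetical helper name.

```python
import base64
import json


def build_audio_message(chunk: bytes) -> str:
    """Wrap a raw audio chunk in the JSON message the server expects."""
    # Base64-encode the bytes and decode to str so the value is JSON-serializable.
    encoded = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"audio": encoded})
```

Each string returned by this function can be sent as one text frame on the open WebSocket connection.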

Streaming output audio

The server will always respond with a message containing the following fields:

type (string)

The type of response. This is "transcript" for transcription results.

audio_start (number)

Start time of the audio segment in seconds.

audio_end (number)

End time of the audio segment in seconds.

duration (number)

Duration of the audio segment in seconds.

text (string)

The processed text sequence.

isFinal (boolean, deprecated)

Indicates if the generation is complete. Deprecated: Use is_final instead.

is_final (boolean)

Indicates if the generation is complete.

words (array)

Array of word objects with text content and timing information:

  • word: The raw word as recognized

  • start: Start time of the word in seconds

  • end: End time of the word in seconds

  • confidence: Confidence score between 0 and 1 for the word recognition

  • punctuated_word: The word with proper capitalization and punctuation

timestamp (string)

ISO-formatted timestamp when the response was generated.
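
The response fields above can be read with a small parser. The following is a sketch under stated assumptions: the `Transcript` class and `parse_transcript` function are hypothetical names, only a subset of the fields is kept, and missing fields fall back to defaults chosen here for illustration. It prefers `is_final` and falls back to the deprecated `isFinal`.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Transcript:
    text: str
    is_final: bool
    audio_start: float
    audio_end: float
    words: list[dict[str, Any]] = field(default_factory=list)


def parse_transcript(raw_message: str) -> Optional[Transcript]:
    """Parse a server message; return None if it is not a transcript."""
    message = json.loads(raw_message)
    if not isinstance(message, dict) or message.get("type") != "transcript":
        return None
    # Prefer is_final; fall back to the deprecated isFinal field.
    is_final = message.get("is_final", message.get("isFinal", False))
    return Transcript(
        text=message.get("text", ""),
        is_final=bool(is_final),
        audio_start=float(message.get("audio_start", 0.0)),
        audio_end=float(message.get("audio_end", 0.0)),
        words=list(message.get("words", [])),
    )
```

Returning None for non-transcript messages lets the same receive loop handle status messages separately.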

Test this API out here: https://docs.sully.ai/api-reference/audio-transcriptions/streaming
