Voice Streaming
Stream audio directly into voice channels via WebSocket. Music bots, TTS, live audio relay — no WebRTC needed.
How It Works
Argon bots don't use WebRTC. Instead, they stream audio directly to a WebSocket ingress endpoint. The server handles mixing and distribution to all participants in the voice channel.
Three WebSocket endpoints are available depending on your use case:
Endpoints Overview
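| Endpoint | Direction | Typical use |
|---|---|---|
| /audio/connect | Publish only | Music bots, TTS |
| /audio/subscribe | Subscribe only | Receiving one participant's audio, e.g. transcription |
| /audio/duplex | Publish and subscribe | Voice bots that speak and listen |
You cannot use a single token for both /audio/connect and /audio/subscribe simultaneously. If you need both directions, use /audio/duplex.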
/audio/connect — Publish
Publish-only endpoint. Send binary WebSocket messages containing Opus frames — each message is one Opus packet. The server publishes them as an audio track in the Argon Voice Channel.
ws://ingress.argon.gl:12880/audio/connect?token=JWT&stereo=false&frame_duration_ms=20
Query Parameters
| Parameter | Required | Description |
|---|---|---|
| token | Yes | JWT with room grant |
| stereo | No | true for stereo. Default: false |
| channels | No | Alternative to stereo: 2 = stereo, 1 = mono |
| frame_duration_ms | No | Opus frame duration: 2.5, 5, 10, 20 (default), 40, 60 |
| track_name | No | Display name for the published track (default: audio) |
| track_source | No | Source type: microphone (default), screen_share_audio |
| metadata | No | Arbitrary participant metadata (JSON string) |
WS messages (bot → server): Binary only. Each binary message = one raw Opus packet.
WS messages (server → bot): Text only. {"status":"ready","session_id":"AWS_xxx"} — track published, safe to start sending.
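The query parameters above can be assembled with a small helper. This is a minimal sketch: `buildConnectUrl` and `PublishOptions` are hypothetical names, not part of any Argon SDK, and the sketch assumes the server accepts URL-encoded values for every parameter.

```typescript
// Options mirror the query parameter table for /audio/connect.
interface PublishOptions {
  token: string;
  stereo?: boolean;
  frameDurationMs?: 2.5 | 5 | 10 | 20 | 40 | 60;
  trackName?: string;
  trackSource?: "microphone" | "screen_share_audio";
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: builds the publish URL from the ingress base URL.
function buildConnectUrl(ingressUrl: string, opts: PublishOptions): string {
  const params: Array<[string, string]> = [["token", opts.token]];
  if (opts.stereo !== undefined) params.push(["stereo", String(opts.stereo)]);
  if (opts.frameDurationMs !== undefined) {
    params.push(["frame_duration_ms", String(opts.frameDurationMs)]);
  }
  if (opts.trackName !== undefined) params.push(["track_name", opts.trackName]);
  if (opts.trackSource !== undefined) params.push(["track_source", opts.trackSource]);
  // The metadata parameter is documented as a JSON string.
  if (opts.metadata !== undefined) params.push(["metadata", JSON.stringify(opts.metadata)]);

  const query = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  // Drop any trailing slash so the path is not doubled.
  return `${ingressUrl.replace(/\/+$/, "")}/audio/connect?${query}`;
}
```

Only parameters that are explicitly set are emitted, so the server defaults from the table apply to everything else.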
/audio/subscribe — Subscribe
Subscribe-only endpoint. Receive a specific participant's audio as raw Opus frames. The server also sends JSON status messages as text frames.
ws://ingress.argon.gl:12880/audio/subscribe?token=JWT&target_identity=user-guid&target_track_source=microphone
Query Parameters
| Parameter | Required | Description |
|---|---|---|
| token | Yes | JWT with room grant |
| target_identity | Yes | Identity of the participant to listen to |
| target_track_source | No | microphone (default) or screen_share_audio |
WS Messages (server → bot)
| Type | Format | Description |
|---|---|---|
| Binary | Raw Opus frame | One Opus packet per message (RTP payload) |
| Text | {"status":"waiting","session_id":"...","participant_identity":"..."} | Target participant not yet in room |
| Text | {"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."} | Subscribed to target's audio, frames incoming |
| Text | {"status":"target_left","session_id":"...","participant_identity":"..."} | Target left the room, WS closes |
| Text | {"status":"error","message":"..."} | Error occurred, WS closes after |
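Since binary and text messages arrive on the same socket, a subscriber has to tell them apart. The sketch below types the status payloads from the table and flags the ones after which the server closes the connection. `classifyMessage` and the type names are illustrative, and the sketch assumes text frames always contain one of the documented JSON shapes.

```typescript
// Status payloads as listed in the table above.
type SubscribeStatus =
  | { status: "waiting"; session_id: string; participant_identity: string }
  | { status: "subscribed"; session_id: string; track_sid: string; participant_identity: string }
  | { status: "target_left"; session_id: string; participant_identity: string }
  | { status: "error"; message: string };

type SubscribeEvent =
  | { kind: "audio"; packet: Uint8Array }
  | { kind: "status"; status: SubscribeStatus; closes: boolean };

// Hypothetical helper: binary data is one Opus packet, text is a status message.
function classifyMessage(data: string | ArrayBuffer | Uint8Array): SubscribeEvent {
  if (typeof data !== "string") {
    const packet = data instanceof Uint8Array ? data : new Uint8Array(data);
    return { kind: "audio", packet };
  }
  const status = JSON.parse(data) as SubscribeStatus;
  // Per the table, the WebSocket closes after "target_left" and "error".
  const closes = status.status === "target_left" || status.status === "error";
  return { kind: "status", status, closes };
}
```

A caller would typically forward `audio` events to a decoder and stop reading once an event with `closes` set arrives.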
/audio/duplex — Bidirectional
Full-duplex endpoint — publish and subscribe on a single WebSocket connection. One identity, one room join. This is the recommended endpoint for voice bots that need to both speak and listen.
ws://ingress.argon.gl:12880/audio/duplex?token=JWT&target_identity=user-guid&target_track_source=microphone&stereo=false&frame_duration_ms=20
Query Parameters
| Parameter | Required | Description |
|---|---|---|
| token | Yes | JWT with room grant |
| target_identity | Yes | Identity of the participant to listen to |
| target_track_source | No | microphone (default) or screen_share_audio |
| track_name | No | Name of the published track (default: audio) |
| track_source | No | Published track source type (default: microphone) |
| stereo | No | true for stereo. Default: false |
| channels | No | Alternative to stereo: 2 = stereo, 1 = mono |
| frame_duration_ms | No | Opus frame duration: 2.5, 5, 10, 20 (default), 40, 60 |
| metadata | No | Participant metadata (JSON string) |
WS Message Flow
| Direction | Type | Description |
|---|---|---|
| Bot → Room | Binary | Opus frames published as an audio track |
| Room → Bot | Binary | Target participant's Opus frames |
| Room → Bot | Text | {"status":"ready","session_id":"AWS_xxx"} — track published, safe to start sending |
| Room → Bot | Text | {"status":"waiting","session_id":"...","participant_identity":"..."} — target not yet in room |
| Room → Bot | Text | {"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."} — receiving target's audio |
| Room → Bot | Text | {"status":"target_left","session_id":"...","participant_identity":"..."} — target left the room |
| Room → Bot | Text | {"status":"error","message":"..."} — error, WS closes after |
Recommended for call bots. The ICalls/Accept endpoint returns audioBaseUrl and callerId — use them directly to build the duplex URL.
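On a duplex connection the status messages drive two independent conditions: whether the bot may send, and whether it is receiving the target's audio. One way to keep track is a small reducer, sketched below. The `DuplexState` shape and `nextDuplexState` are hypothetical, and the sketch assumes that only `error` ends the session, because the duplex table does not say the socket closes on `target_left`.

```typescript
interface DuplexState {
  canSend: boolean;   // true once "ready" has been received
  receiving: boolean; // true while subscribed to the target's audio
  closed: boolean;    // true after an "error" status
}

const initialDuplexState: DuplexState = { canSend: false, receiving: false, closed: false };

// Hypothetical reducer: apply one text message from the server to the state.
function nextDuplexState(state: DuplexState, text: string): DuplexState {
  const { status } = JSON.parse(text) as { status?: string };
  switch (status) {
    case "ready":
      return { ...state, canSend: true };
    case "subscribed":
      return { ...state, receiving: true };
    case "waiting":
    case "target_left":
      return { ...state, receiving: false };
    case "error":
      return { canSend: false, receiving: false, closed: true };
    default:
      return state; // ignore statuses this sketch does not know about
  }
}
```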
Step-by-Step Flow
Request a Stream Token
Call the StreamToken endpoint with the target space and channel:
POST /IVoice/v1/StreamToken
Authorization: Bot YOUR_TOKEN
Content-Type: application/json
{
  "spaceId": "550e8400-e29b-41d4-a716-446655440000",
  "channelId": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
}

Response:
{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "ingressUrl": "ws://ingress.argon.gl:12880",
  "roomName": "550e8400-.../6ba7b810-..."
}

Connect to the WebSocket Endpoint
Open a WebSocket connection to the endpoint that matches your use case. For publish-only (music bots, TTS), use /audio/connect. For voice bots that speak and listen, use /audio/duplex.
ws://ingress.argon.gl:12880/audio/connect?token=TOKEN&stereo=true&frame_duration_ms=20
ws://ingress.argon.gl:12880/audio/duplex?token=TOKEN&target_identity=caller-guid&target_track_source=microphone
See the endpoint sections above for full parameter tables.
Wait for Ready Status
Before sending any audio, wait for the server's {"status":"ready"} text message. This confirms the track is published and the room is joined. Do not send Opus frames before receiving this.
// Server sends:
{"status": "ready", "session_id": "AWS_xxxxx"}

Stream Opus Audio Frames
Send binary WebSocket frames containing Opus-encoded audio data. Each frame should be a single Opus packet (max 3,825 bytes).
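The last two steps can be combined into a small gate that holds packets until the ready status arrives. This is a sketch under assumptions: `ReadyGate` is a hypothetical class, `send` stands in for your WebSocket's binary send, and queued packets are flushed in one burst, which may or may not suit your pacing strategy.

```typescript
const MAX_OPUS_PACKET_BYTES = 3825; // limit stated in the step above

// Hypothetical helper: buffers Opus packets until the server reports "ready".
class ReadyGate {
  private ready = false;
  private queue: Uint8Array[] = [];
  private readonly send: (packet: Uint8Array) => void;

  constructor(send: (packet: Uint8Array) => void) {
    this.send = send;
  }

  // Call with every text message received from the server.
  onText(message: string): void {
    const parsed = JSON.parse(message) as { status?: string };
    if (parsed.status !== "ready" || this.ready) return;
    this.ready = true;
    for (const packet of this.queue) this.send(packet);
    this.queue = [];
  }

  // Call with every encoded Opus packet, one packet per call.
  push(packet: Uint8Array): void {
    if (packet.length > MAX_OPUS_PACKET_BYTES) {
      throw new RangeError(`Opus packet of ${packet.length} bytes exceeds the limit`);
    }
    if (this.ready) this.send(packet);
    else this.queue.push(packet);
  }
}
```

With a real connection, `send` would be something like `(p) => ws.send(p)` and `onText` would be called from the socket's message handler.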
Interactive Demo: Echo Bot Flow
Listen to a real echo bot session and see how the six phases map to the audio timeline. Hit play and watch the phases light up — this is exactly what happens over the WebSocket connection during a call.
Bot plays a pre-recorded greeting. Audio is sent as Opus frames over the WebSocket. The bot is not yet listening to the caller.
Audio samples on this page are proprietary assets of Argon Inc. LLC and are provided for demonstration purposes only. They may not be reused or redistributed without permission.
Audio Requirements
| Parameter | Value |
|---|---|
| Codec | Opus |
| Sample Rate | 48,000 Hz |
| Channels | Mono (1) or Stereo (2) — configured via stereo or channels query param |
| Frame Duration | 2.5, 5, 10, 20 (default), 40, or 60 ms — configured via frame_duration_ms query param |
| Max Frame Size | 3,825 bytes per Opus packet |
| Transport | Binary WebSocket frames (one Opus packet per frame) |
Do not send raw PCM. The ingress expects Opus-encoded packets. Use a tool like opusenc or ffmpeg, or your language's Opus bindings, to encode before sending.
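When you drive an Opus encoder yourself, each encode call needs exactly one frame of PCM. The arithmetic below derives that size from the 48 kHz sample rate and the chosen frame duration. It assumes 16-bit interleaved PCM as encoder input, which is common for Opus bindings but is not something the ingress prescribes; the function names are illustrative.

```typescript
const SAMPLE_RATE_HZ = 48000;

// Samples per channel in one Opus frame of the given duration.
function samplesPerFrame(frameDurationMs: number): number {
  return (SAMPLE_RATE_HZ * frameDurationMs) / 1000;
}

// Bytes of 16-bit interleaved PCM the encoder consumes per frame (assumed input format).
function pcmBytesPerFrame(frameDurationMs: number, channels: 1 | 2): number {
  const bytesPerSample = 2;
  return samplesPerFrame(frameDurationMs) * channels * bytesPerSample;
}

// Default settings: 20 ms at 48 kHz is 960 samples per channel,
// so a stereo frame is 960 * 2 channels * 2 bytes = 3840 bytes of PCM.
const defaultStereoFrameBytes = pcmBytesPerFrame(20, 2);
```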
Frame Validation
The server validates every incoming Opus frame by parsing the TOC byte (RFC 6716 §3.1). If a frame's declared structure doesn't match its actual size, the server immediately closes the WebSocket with the error:
opus frame too small for declared structure
This typically happens when:
- Sending a truncated or corrupted Opus packet
- Re-publishing DTX/comfort noise frames received from a subscribed participant (see below)
- Accidentally sending an OGG page header instead of a raw Opus frame
- Sending an empty or 1-byte binary message where the TOC byte declares a multi-frame packet
| TOC Frame Code | Meaning | Minimum Packet Size |
|---|---|---|
| 0 | 1 frame | 2 bytes (TOC + at least 1 byte payload) |
| 1 | 2 equal-size CBR frames | 3 bytes (TOC + even payload ≥ 2) |
| 2 | 2 different-size VBR frames | 3 bytes (TOC + size field + data) |
| 3 | Arbitrary N frames (CBR/VBR) | 3+ bytes (TOC + count byte + padding/sizes + data) |
If you need to send silence, use 0xf8 0xff 0xfe (a valid Opus frame that decodes to silence).
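Because a rejected frame closes the whole session, it can be worth checking packets on the client before sending. The sketch below applies only the minimum sizes from the table above plus the 3,825-byte limit. It is not the server's actual validation logic, which may be stricter, and `checkOpusPacket` is a hypothetical name.

```typescript
const MAX_PACKET_BYTES = 3825;

// Returns null when the packet passes the table's minimum-size rules,
// otherwise a short reason string.
function checkOpusPacket(packet: Uint8Array): string | null {
  if (packet.length === 0) return "empty packet";
  if (packet.length > MAX_PACKET_BYTES) return "packet exceeds 3,825 bytes";

  const frameCode = packet[0] & 0x03; // low two bits of the TOC byte
  const payloadBytes = packet.length - 1;

  switch (frameCode) {
    case 0: // one frame
      return payloadBytes >= 1 ? null : "code 0 needs at least 1 payload byte";
    case 1: // two equal-size frames, so the payload must split evenly
      return payloadBytes >= 2 && payloadBytes % 2 === 0
        ? null
        : "code 1 needs an even payload of at least 2 bytes";
    default: // codes 2 and 3 need at least a size or count byte plus data
      return payloadBytes >= 2 ? null : `code ${frameCode} needs at least 3 bytes`;
  }
}
```

The silence frame 0xf8 0xff 0xfe passes this check (code 0, two payload bytes), while a 1-byte packet such as 0xdc fails it.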
DTX & Comfort Noise Frames
When subscribed to a participant's audio (/audio/subscribe or /audio/duplex), WebRTC clients with DTX (Discontinuous Transmission) enabled will send very small packets during silence — typically 1-byte frames with TOC bytes like 0xdc or 0xfc.
These DTX frames are valid for a decoder to produce comfort noise, but they cannot be re-published through the ingress — the server will reject them as structurally invalid and close the WebSocket.
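An echo or relay bot therefore needs to filter what it forwards. A minimal sketch follows, assuming DTX frames can be recognized by being shorter than 2 bytes and that either dropping them or substituting the documented silence frame is acceptable for your bot. `sanitizeForRepublish` is a hypothetical helper, and the substituted silence frame is only appropriate if it matches the frame duration and channel layout you publish with.

```typescript
// Silence frame quoted in the frame validation section.
const SILENCE_FRAME = new Uint8Array([0xf8, 0xff, 0xfe]);

type DtxPolicy = "drop" | "silence";

// Returns the packet to publish, or null when the packet should be skipped.
function sanitizeForRepublish(packet: Uint8Array, policy: DtxPolicy): Uint8Array | null {
  const looksLikeDtx = packet.length < 2; // TOC byte only, or empty
  if (!looksLikeDtx) return packet;
  return policy === "silence" ? SILENCE_FRAME : null;
}
```

Dropping keeps the code independent of the publish settings; substituting silence keeps the outgoing packet cadence steady.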
Audio Normalization
If your bot plays multiple audio files (greetings, responses, background music, sound effects), their loudness levels should be normalized so listeners don't experience jarring volume jumps between tracks.
Recommended Target
| Parameter | Value | Notes |
|---|---|---|
| Integrated Loudness (I) | -23 to -14 LUFS | Streaming standard. Voice-heavy bots can aim for -16, music bots for -14. |
| True Peak (TP) | -1 dBTP | Prevents clipping after Opus encoding |
| Loudness Range (LRA) | ≤ 11 LU | Keep dynamics consistent across tracks |
Hear the Difference
Put on headphones and compare the same voice clip at three loudness levels. This is the same "Привет" ("Hello") greeting from the echo bot, normalized to different LUFS targets.
- Too quiet: barely audible. The listener has to max out the volume to hear anything.
- Too loud: clipped and distorted. Painful on headphones, wakes the neighbors.
- Normalized: clean, comfortable listening level. Consistent with other tracks.
Opus playback requires Chrome, Firefox, or Edge. Safari does not support OGG/Opus natively.
Step 1: Measure Loudness
Use ffmpeg's loudnorm filter in analysis mode to measure the loudness of each file:
ffmpeg -i input.opus -af loudnorm=print_format=json -f null /dev/null
Look at the input_i value in the JSON output — that's the integrated loudness in LUFS. Compare it across all your files.
// Voice greeting (quiet)
"input_i" : "-40.74"   // -40.7 LUFS
// Background music (too loud!)
"input_i" : "-8.40"    // -8.4 LUFS, about 32 dB louder
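If you measure many files, reading input_i by eye gets tedious. The sketch below pulls the value out of captured ffmpeg output, assuming the loudnorm JSON object is the last brace-delimited block in the text (ffmpeg prints it to stderr after processing). `integratedLoudness` and `loudnessGap` are hypothetical helpers.

```typescript
// Extracts input_i (integrated loudness, LUFS) from ffmpeg's loudnorm output.
// Assumption: the JSON object is the last {...} block in the captured text.
function integratedLoudness(ffmpegOutput: string): number {
  const start = ffmpegOutput.lastIndexOf("{");
  const end = ffmpegOutput.lastIndexOf("}");
  if (start === -1 || end < start) throw new Error("no loudnorm JSON found");

  const stats = JSON.parse(ffmpegOutput.slice(start, end + 1)) as { input_i?: string };
  const value = Number.parseFloat(stats.input_i ?? "");
  if (Number.isNaN(value)) throw new Error("input_i missing or not numeric");
  return value;
}

// Difference in loudness units between a file and the reference track.
function loudnessGap(fileLufs: number, referenceLufs: number): number {
  return fileLufs - referenceLufs;
}
```

For the two measurements above, `loudnessGap(-8.40, -40.74)` is roughly 32.3, which matches the "about 32 dB louder" note.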
Step 2: Normalize to a Reference
Pick your main voice track as the reference and normalize all other files to match its loudness. Use loudnorm with your target values:
ffmpeg -y -i loud_background.opus \
  -af "loudnorm=I=-40.7:TP=-1:LRA=11" \
  -c:a libopus -b:a 48k -ar 48000 -ac 2 -frame_duration 20 \
  normalized_background.opus
| Parameter | Description |
|---|---|
| I=<value> | Target integrated loudness in LUFS (match your reference) |
| TP=-1 | True peak limit — prevents clipping |
| LRA=11 | Loudness range limit — compresses dynamics |
| -c:a libopus | Re-encode as Opus |
| -b:a 48k | Bitrate (48–96 kbps for voice, 96–128 for music) |
| -frame_duration 20 | Match the ingress frame duration setting |
Step 3: Verify
Re-measure the output to confirm the loudness matches your target:
ffmpeg -i normalized_background.opus -af loudnorm=print_format=json -f null /dev/null
// Should output input_i close to your target (e.g. -40.7 → -41.2 is fine)
Common pitfalls:
- TP must be between -9 and 0; values outside this range cause ffmpeg errors.
- Don't mix mono and stereo files without matching the -ac flag to your ingress stereo setting.
- Always re-measure after normalization; loudnorm may undershoot by 0.5–1 dB, which is perfectly acceptable.
- If your input is an OGG/Opus file, ffmpeg decodes it to PCM and re-encodes it. This is expected, but the re-encode is lossy, so normalize from the original source when you can.
Keepalive
| Parameter | Value |
|---|---|
| Ping Interval | 30 seconds (server → bot) |
| Pong Timeout | 60 seconds — session closed if bot doesn't respond |
| Write Deadline | 10 seconds per message |
Most WebSocket libraries respond to pings automatically. If yours doesn't, make sure to reply with a pong frame within 60 seconds or the server will disconnect the session.
Room & Token Details
| Property | Details |
|---|---|
| Room Name Format | {spaceId}/{channelId} |
| Token Type | LiveKit JWT (HS256) |
| Token TTL | 2 hours — request a new one before it expires |
| Bot Permissions | Publish audio/video, subscribe to tracks, send data, update own metadata |
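For long-running bots, the 2-hour TTL means the token has to be renewed while the bot is still active. The sketch below decides when to request a new token from the time the current one was issued, and splits a room name into its two parts. Both helpers are hypothetical, and the 5-minute safety margin is an arbitrary choice, not an Argon recommendation.

```typescript
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000; // 2 hours, from the table above

// True when the token is within `marginMs` of expiring (or already expired).
function shouldRefreshToken(
  issuedAtMs: number,
  nowMs: number,
  marginMs: number = 5 * 60 * 1000,
): boolean {
  return nowMs >= issuedAtMs + TOKEN_TTL_MS - marginMs;
}

// Splits a room name of the form {spaceId}/{channelId}.
function parseRoomName(roomName: string): { spaceId: string; channelId: string } {
  const parts = roomName.split("/");
  if (parts.length !== 2 || parts[0] === "" || parts[1] === "") {
    throw new Error(`unexpected room name: ${roomName}`);
  }
  return { spaceId: parts[0], channelId: parts[1] };
}
```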
Prerequisites & Validation
The StreamToken endpoint validates the following before issuing a token:
- The bot must be a member of the space; returns 403 otherwise.
- The channel must exist; returns 404 if not found.
- The channel must be a voice channel; returns 400 for text channels.
- Rate limited to 5 requests per minute.
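A client can stay under the rate limit and turn the documented status codes into readable errors with a few lines. The limiter below is a sliding-window sketch; how the server actually counts its window is not documented here, so treat the client-side limit as a conservative approximation. `SlidingWindowLimiter` and `describeStreamTokenError` are hypothetical names.

```typescript
// Allows at most `limit` calls within any `windowMs` span, by the caller's clock.
class SlidingWindowLimiter {
  private stamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 5, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the call if a request may be made at `nowMs`.
  tryAcquire(nowMs: number): boolean {
    this.stamps = this.stamps.filter((t) => nowMs - t < this.windowMs);
    if (this.stamps.length >= this.limit) return false;
    this.stamps.push(nowMs);
    return true;
  }
}

// Maps the status codes listed above to short explanations.
function describeStreamTokenError(status: number): string {
  switch (status) {
    case 403: return "bot is not a member of the space";
    case 404: return "channel not found";
    case 400: return "channel is not a voice channel";
    default: return `unexpected status ${status}`;
  }
}
```

In practice you would call `tryAcquire(Date.now())` before each StreamToken request and delay the request when it returns false.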
Code Example
TypeScript (Bun / Node.js)
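import { spawn } from "child_process";

// 1. Get stream token
const resp = await fetch("https://gateway.argon.zone/IVoice/v1/StreamToken", {
  method: "POST",
  headers: {
    "Authorization": "Bot YOUR_TOKEN",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ spaceId: "...", channelId: "..." }),
});
if (!resp.ok) throw new Error(`StreamToken failed: ${resp.status}`);
const { token, ingressUrl } = await resp.json();

// 2. Encode audio to Opus with ffmpeg (stereo, 48 kHz, 20 ms frames).
//    stderr is inherited so ffmpeg's log output cannot fill an unread pipe.
const ffmpeg = spawn("ffmpeg", [
  "-i", "music.mp3",
  "-f", "opus", "-ar", "48000", "-ac", "2",
  "-frame_duration", "20",
  "pipe:1",
], { stdio: ["ignore", "pipe", "inherit"] });

// ffmpeg's "opus" muxer writes an Ogg container, but the ingress expects raw
// Opus packets, so walk the Ogg pages and rebuild packets from the lacing table.
const packets: Buffer[] = [];
let pending = Buffer.alloc(0);
let parts: Buffer[] = [];
let headersLeft = 2; // the first two packets are OpusHead and OpusTags
let ffmpegDone = false;
ffmpeg.stdout.on("data", (chunk: Buffer) => {
  pending = Buffer.concat([pending, chunk]);
  while (pending.length >= 27) {
    const segmentCount = pending[26];
    if (pending.length < 27 + segmentCount) break;
    const lacing = pending.subarray(27, 27 + segmentCount);
    const pageLength = 27 + segmentCount + lacing.reduce((a, b) => a + b, 0);
    if (pending.length < pageLength) break; // wait for the rest of the page
    let offset = 27 + segmentCount;
    for (const length of lacing) {
      parts.push(pending.subarray(offset, offset + length));
      offset += length;
      if (length === 255) continue; // packet continues in the next segment
      const packet = Buffer.concat(parts);
      parts = [];
      if (headersLeft > 0) headersLeft--;
      else packets.push(packet);
    }
    pending = pending.subarray(pageLength);
  }
});
ffmpeg.on("close", () => { ffmpegDone = true; });

// 3. Connect to the publish endpoint and wait for "ready" before sending
const wsUrl = `${ingressUrl}/audio/connect?token=${encodeURIComponent(token)}&stereo=true&frame_duration_ms=20&track_name=music`;
const ws = new WebSocket(wsUrl);
ws.addEventListener("message", (event) => {
  if (typeof event.data !== "string" || JSON.parse(event.data).status !== "ready") return;
  // One packet per 20 ms keeps playback at real-time speed. Timer-based pacing
  // is a simplification; the docs do not say how much the ingress buffers.
  const timer = setInterval(() => {
    const packet = packets.shift();
    if (packet) ws.send(packet);
    else if (ffmpegDone) { clearInterval(timer); ws.close(); }
  }, 20);
});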
C# (.NET)
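// Requires .NET 7 or later (Stream.ReadExactlyAsync / ReadAtLeastAsync).
using System.Diagnostics;
using System.Net.Http.Json;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

// 1. Get stream token
using var http = new HttpClient();
http.DefaultRequestHeaders.Add("Authorization", "Bot YOUR_TOKEN");
var resp = await http.PostAsJsonAsync(
    "https://gateway.argon.zone/IVoice/v1/StreamToken",
    new { spaceId = "...", channelId = "..." });
resp.EnsureSuccessStatusCode();
var data = await resp.Content.ReadFromJsonAsync<JsonElement>();
var token = data.GetProperty("token").GetString()!;
var url = data.GetProperty("ingressUrl").GetString()!;

// 2. Encode audio with ffmpeg (Ogg/Opus on stdout)
var ffmpeg = Process.Start(new ProcessStartInfo
{
    FileName = "ffmpeg",
    Arguments = "-i music.mp3 -f opus -ar 48000 -ac 2 -frame_duration 20 pipe:1",
    RedirectStandardOutput = true,
    UseShellExecute = false
})!;

// 3. Connect to the publish endpoint and wait for the "ready" status
using var ws = new ClientWebSocket();
await ws.ConnectAsync(
    new Uri($"{url}/audio/connect?token={Uri.EscapeDataString(token)}&stereo=true&frame_duration_ms=20&track_name=music"),
    CancellationToken.None);
var statusBuf = new byte[1024];
var status = await ws.ReceiveAsync(statusBuf.AsMemory(), CancellationToken.None);
var statusJson = Encoding.UTF8.GetString(statusBuf, 0, status.Count);
if (JsonDocument.Parse(statusJson).RootElement.GetProperty("status").GetString() != "ready")
    throw new InvalidOperationException($"Ingress not ready: {statusJson}");

// Keep a receive pending so ClientWebSocket can process the server's ping frames.
_ = Task.Run(async () =>
{
    var sink = new byte[1024];
    while (ws.State == WebSocketState.Open)
        await ws.ReceiveAsync(sink.AsMemory(), CancellationToken.None);
});

// 4. Unwrap Ogg pages into raw Opus packets and send one packet every 20 ms.
//    Timer-based pacing is a simplification, not a documented requirement.
var stdout = ffmpeg.StandardOutput.BaseStream;
var header = new byte[27];
using var packet = new MemoryStream();
var headersLeft = 2; // skip the OpusHead and OpusTags packets
using var pacer = new PeriodicTimer(TimeSpan.FromMilliseconds(20));
while (await stdout.ReadAtLeastAsync(header, header.Length, throwOnEndOfStream: false) == header.Length)
{
    var lacing = new byte[header[26]];
    await stdout.ReadExactlyAsync(lacing);
    foreach (var length in lacing)
    {
        var segment = new byte[length];
        await stdout.ReadExactlyAsync(segment);
        packet.Write(segment);
        if (length == 255) continue; // packet continues in the next segment
        if (headersLeft > 0) headersLeft--;
        else
        {
            await pacer.WaitForNextTickAsync();
            await ws.SendAsync(packet.ToArray().AsMemory(), WebSocketMessageType.Binary, true, CancellationToken.None);
        }
        packet.SetLength(0);
    }
}
await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);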
Use Cases
Music Bot
Stream audio from local files or your own media library. Encode to Opus and push to the channel via /audio/connect.
TTS Bot
Convert text to speech (e.g., via Google TTS or Coqui), encode to Opus, and stream the result.
Voice / Echo Bot
Use /audio/duplex to listen to a participant and respond — AI voice assistants, echo bots, call automation.
Transcription Bot
Subscribe to a participant's audio via /audio/subscribe and pipe Opus frames to a speech-to-text engine.
Live Audio Relay
Relay audio from an external source (radio stream, microphone, podcast feed) into the channel.
Audio Notifications
Play short audio clips for alerts, announcements, or sound effects triggered by events.