Voice Streaming

Stream audio directly into voice channels via WebSocket. Music bots, TTS, live audio relay — no WebRTC needed.

How It Works

Argon bots don't use WebRTC. Instead, they stream audio directly to a WebSocket ingress endpoint. The server handles mixing and distribution to all participants in the voice channel.

Your Bot → Opus over WebSocket → SFU Ingress → WebRTC → Listeners

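Three WebSocket endpoints are available depending on your use case: /audio/connect (publish only), /audio/subscribe (subscribe only), and /audio/duplex (bidirectional).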

Endpoints Overview

You cannot use a single token for both /audio/connect and /audio/subscribe simultaneously. If you need both directions, use /audio/duplex.

/audio/connect — Publish

Publish-only endpoint. Send binary WebSocket messages containing Opus frames — each message is one Opus packet. The server publishes them as an audio track in the Argon Voice Channel.

ws://ingress.argon.gl:12880/audio/connect?token=JWT&stereo=false&frame_duration_ms=20

Query Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| token | Yes | JWT with room grant |
| stereo | No | true for stereo. Default: false |
| channels | No | Alternative to stereo: 2 = stereo, 1 = mono |
| frame_duration_ms | No | Opus frame duration in ms: 2.5, 5, 10, 20 (default), 40, 60 |
| track_name | No | Display name for the published track (default: audio) |
| track_source | No | Source type: microphone (default), screen_share_audio |
| metadata | No | Arbitrary participant metadata (JSON string) |

WS messages (bot → server): Binary only. Each binary message = one raw Opus packet.

WS messages (server → bot): Text only. {"status":"ready","session_id":"AWS_xxx"} — track published, safe to start sending.

/audio/subscribe — Subscribe

Subscribe-only endpoint. Receive a specific participant's audio as raw Opus frames. The server also sends JSON status messages as text frames.

ws://ingress.argon.gl:12880/audio/subscribe?token=JWT&target_identity=user-guid&target_track_source=microphone

Query Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| token | Yes | JWT with room grant |
| target_identity | Yes | Identity of the participant to listen to |
| target_track_source | No | microphone (default) or screen_share_audio |

WS Messages (server → bot)

| Type | Format | Description |
| --- | --- | --- |
| Binary | Raw Opus frame | One Opus packet per message (RTP payload) |
| Text | {"status":"waiting","session_id":"...","participant_identity":"..."} | Target participant not yet in room |
| Text | {"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."} | Subscribed to target's audio, frames incoming |
| Text | {"status":"target_left","session_id":"...","participant_identity":"..."} | Target left the room, WS closes |
| Text | {"status":"error","message":"..."} | Error occurred, WS closes after |
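
The text messages in the table above can be told apart by their status field. The following is a minimal sketch of a parser, assuming the message shapes are exactly as documented on this page; the names IngressStatus, parseStatusMessage, and isTerminal are illustrative and not part of any Argon SDK. The ready status is included because the publish and duplex endpoints document it.

```typescript
// Status messages documented on this page (shapes assumed to be exhaustive).
type IngressStatus =
  | { status: "ready"; session_id: string }
  | { status: "waiting"; session_id: string; participant_identity: string }
  | { status: "subscribed"; session_id: string; track_sid: string; participant_identity: string }
  | { status: "target_left"; session_id: string; participant_identity: string }
  | { status: "error"; message: string };

const KNOWN_STATUSES = ["ready", "waiting", "subscribed", "target_left", "error"];

// Returns the parsed message, or null for anything that is not a known status object.
function parseStatusMessage(text: string): IngressStatus | null {
  let value: unknown;
  try {
    value = JSON.parse(text);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const status = (value as { status?: unknown }).status;
  if (typeof status !== "string" || !KNOWN_STATUSES.includes(status)) return null;
  return value as IngressStatus;
}

// True for the statuses after which the subscribe table says the WebSocket closes.
function isTerminal(msg: IngressStatus): boolean {
  return msg.status === "target_left" || msg.status === "error";
}
```

A message handler could call parseStatusMessage on every text frame and stop reading once isTerminal returns true.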

/audio/duplex — Bidirectional

Full-duplex endpoint — publish and subscribe on a single WebSocket connection. One identity, one room join. This is the recommended endpoint for voice bots that need to both speak and listen.

ws://ingress.argon.gl:12880/audio/duplex?token=JWT&target_identity=user-guid&target_track_source=microphone&stereo=false&frame_duration_ms=20

Query Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| token | Yes | JWT with room grant |
| target_identity | Yes | Identity of the participant to listen to |
| target_track_source | No | microphone (default) or screen_share_audio |
| track_name | No | Name of the published track (default: audio) |
| track_source | No | Published track source type (default: microphone) |
| stereo | No | true for stereo. Default: false |
| channels | No | Alternative to stereo: 2 = stereo, 1 = mono |
| frame_duration_ms | No | Opus frame duration in ms: 2.5, 5, 10, 20 (default), 40, 60 |
| metadata | No | Participant metadata (JSON string) |

WS Message Flow

| Direction | Type | Description |
| --- | --- | --- |
| Bot → Room | Binary | Opus frames published as an audio track |
| Room → Bot | Binary | Target participant's Opus frames |
| Room → Bot | Text | {"status":"ready","session_id":"AWS_xxx"}: track published, safe to start sending |
| Room → Bot | Text | {"status":"waiting","session_id":"...","participant_identity":"..."}: target not yet in room |
| Room → Bot | Text | {"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."}: receiving target's audio |
| Room → Bot | Text | {"status":"target_left","session_id":"...","participant_identity":"..."}: target left the room |
| Room → Bot | Text | {"status":"error","message":"..."}: error, WS closes after |

Recommended for call bots. The ICalls/Accept endpoint returns audioBaseUrl and callerId — use them directly to build the duplex URL.

Step-by-Step Flow

1. Request a Stream Token

Call the StreamToken endpoint with the target space and channel:

Request
POST /IVoice/v1/StreamToken
Authorization: Bot YOUR_TOKEN
Content-Type: application/json

{
  "spaceId": "550e8400-e29b-41d4-a716-446655440000",
  "channelId": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
}
Response
{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "ingressUrl": "ws://ingress.argon.gl:12880",
  "roomName": "550e8400-.../6ba7b810-..."
}

2. Connect to the WebSocket Endpoint

Open a WebSocket connection to the endpoint that matches your use case. For publish-only (music bots, TTS), use /audio/connect. For voice bots that speak and listen, use /audio/duplex.

Publish-only (music bot)
ws://ingress.argon.gl:12880/audio/connect?token=TOKEN&stereo=true&frame_duration_ms=20
Full-duplex (voice bot in a call)
ws://ingress.argon.gl:12880/audio/duplex?token=TOKEN&target_identity=caller-guid&target_track_source=microphone

See the endpoint sections above for full parameter tables.
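
Building these URLs by hand makes it easy to forget the /audio/... path or to leave a value unescaped. The helper below is a rough sketch of one way to assemble them, assuming the ingress URL has no path prefix of its own (as in the StreamToken response shown earlier); buildIngressUrl is a hypothetical name, not an Argon API.

```typescript
type Endpoint = "connect" | "subscribe" | "duplex";

// Joins the ingress base URL, the endpoint path, and URL-encoded query parameters.
// Assumption: ingressUrl is a bare origin such as ws://ingress.argon.gl:12880.
function buildIngressUrl(
  ingressUrl: string,
  endpoint: Endpoint,
  params: Record<string, string | number | boolean>,
): string {
  const url = new URL(`/audio/${endpoint}`, ingressUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Publish-only (music bot)
const musicUrl = buildIngressUrl("ws://ingress.argon.gl:12880", "connect", {
  token: "TOKEN",
  stereo: true,
  frame_duration_ms: 20,
});

// Full-duplex (voice bot in a call)
const callUrl = buildIngressUrl("ws://ingress.argon.gl:12880", "duplex", {
  token: "TOKEN",
  target_identity: "caller-guid",
  target_track_source: "microphone",
});
```

Using URLSearchParams means values such as JSON metadata are percent-encoded automatically.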

3. Wait for Ready Status

Before sending any audio, wait for the server's {"status":"ready"} text message. This confirms the track is published and the room is joined. Do not send Opus frames before receiving this.

// Server sends:
{"status": "ready", "session_id": "AWS_xxxxx"}

4. Stream Opus Audio Frames

Send binary WebSocket frames containing Opus-encoded audio data. Each frame should be a single Opus packet (max 3,825 bytes).

Interactive Demo: Echo Bot Flow

Listen to a real echo bot session and see how the six phases map to the audio timeline. Hit play and watch the phases light up — this is exactly what happens over the WebSocket connection during a call.

Phases: Greeting, Recording, Transition, Playback, Outro, Background

Greeting (0:00–0:16)

Bot plays a pre-recorded greeting. Audio is sent as Opus frames over the WebSocket. The bot is not yet listening to the caller.

This is real audio from the echo bot sample in the Argon Chat Echo Bot repository. The phases are driven by timestamp markers defined in a JSON file alongside each audio variant.

Audio samples on this page are proprietary assets of Argon Inc. LLC and are provided for demonstration purposes only. They may not be reused or redistributed without permission.

Audio Requirements

| Parameter | Value |
| --- | --- |
| Codec | Opus |
| Sample Rate | 48,000 Hz |
| Channels | Mono (1) or Stereo (2), configured via the stereo or channels query param |
| Frame Duration | 2.5, 5, 10, 20 (default), 40, or 60 ms, configured via the frame_duration_ms query param |
| Max Frame Size | 3,825 bytes per Opus packet |
| Transport | Binary WebSocket frames (one Opus packet per frame) |

Do not send raw PCM. The ingress expects Opus-encoded packets. Use a library like opusenc, ffmpeg, or your language's Opus bindings to encode before sending.
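
When feeding an Opus encoder yourself, the frame duration determines how many PCM samples go into each packet. The arithmetic is sketched below using the 48,000 Hz rate and the frame durations from the table above; the 16-bit sample width in pcmBytesPerFrame is an assumption about the encoder input, not something the ingress specifies.

```typescript
const SAMPLE_RATE_HZ = 48_000;
const FRAME_DURATIONS_MS = [2.5, 5, 10, 20, 40, 60];

// Samples per channel in one Opus frame of the given duration.
function samplesPerFrame(frameDurationMs: number): number {
  if (!FRAME_DURATIONS_MS.includes(frameDurationMs)) {
    throw new RangeError(`unsupported frame duration: ${frameDurationMs} ms`);
  }
  return (SAMPLE_RATE_HZ * frameDurationMs) / 1000;
}

// Bytes of interleaved PCM to hand the encoder per frame.
// Assumption: 16-bit samples (2 bytes each).
function pcmBytesPerFrame(frameDurationMs: number, channels: 1 | 2): number {
  return samplesPerFrame(frameDurationMs) * channels * 2;
}
```

For the default 20 ms frame this gives 960 samples per channel, or 3,840 bytes of 16-bit stereo PCM per encoded packet.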

Frame Validation

The server validates every incoming Opus frame by parsing the TOC byte (RFC 6716 §3.1). If a frame's declared structure doesn't match its actual size, the server immediately closes the WebSocket with the error:

opus frame too small for declared structure

This typically happens when:

  • Sending a truncated or corrupted Opus packet
  • Re-publishing DTX/comfort noise frames received from a subscribed participant (see below)
  • Accidentally sending an OGG page header instead of a raw Opus frame
  • Sending an empty or 1-byte binary message where the TOC byte declares a multi-frame packet

| TOC Frame Code | Meaning | Minimum Packet Size |
| --- | --- | --- |
| 0 | 1 frame | 2 bytes (TOC + at least 1 byte payload) |
| 1 | 2 equal-size CBR frames | 3 bytes (TOC + even payload ≥ 2) |
| 2 | 2 different-size VBR frames | 3 bytes (TOC + size field + data) |
| 3 | Arbitrary N frames (CBR/VBR) | 3+ bytes (TOC + count byte + padding/sizes + data) |

Tip: If you're echoing or relaying received audio frames, validate each packet before sending it back. Drop invalid frames or replace them with a silence packet (0xf8 0xff 0xfe — a valid Opus frame that decodes to silence).
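
A client-side pre-check along these lines might look like the sketch below. It only mirrors the minimum sizes in the table above plus the 3,825-byte limit; it is not the server's actual validator and does not parse frame-length fields, so a packet that passes here can still be rejected. The names isValidOpusPacket and sanitize are hypothetical.

```typescript
const MAX_PACKET_BYTES = 3825;
const SILENCE_PACKET = Uint8Array.of(0xf8, 0xff, 0xfe);

// Checks a packet against the minimum sizes listed per TOC frame code.
function isValidOpusPacket(packet: Uint8Array): boolean {
  if (packet.length === 0 || packet.length > MAX_PACKET_BYTES) return false;
  const frameCode = packet[0] & 0x03; // low two bits of the TOC byte
  const payloadBytes = packet.length - 1;
  switch (frameCode) {
    case 0:
      return payloadBytes >= 1; // one frame
    case 1:
      return payloadBytes >= 2 && payloadBytes % 2 === 0; // two equal-size frames
    default:
      return payloadBytes >= 2; // codes 2 and 3: size/count byte plus data
  }
}

// Replaces packets that fail the check with the silence packet from the tip above.
function sanitize(packet: Uint8Array): Uint8Array {
  return isValidOpusPacket(packet) ? packet : SILENCE_PACKET;
}
```

Substituting silence keeps the packet cadence intact, while dropping the packet is the simpler choice if timing does not matter for your bot.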

DTX & Comfort Noise Frames

When subscribed to a participant's audio (/audio/subscribe or /audio/duplex), WebRTC clients with DTX (Discontinuous Transmission) enabled will send very small packets during silence — typically 1-byte frames with TOC bytes like 0xdc or 0xfc.

These DTX frames are valid for a decoder to produce comfort noise, but they cannot be re-published through the ingress — the server will reject them as structurally invalid and close the WebSocket.

If you relay or echo received audio: discard all frames ≤ 2 bytes before recording or re-publishing. These are DTX/comfort noise packets that carry no useful audio content. A normal Opus voice frame is typically 20–160 bytes.
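
In a relay or echo bot, that rule can sit between the receive and send paths. The sketch below shows one possible shape, assuming the send callback wraps your WebSocket's binary send; createRelay and DTX_MAX_BYTES are illustrative names.

```typescript
const DTX_MAX_BYTES = 2; // packets this small are treated as DTX/comfort noise

// Forwards received packets to `send`, skipping DTX-sized ones and counting the drops.
function createRelay(send: (packet: Uint8Array) => void) {
  let dropped = 0;
  return {
    push(packet: Uint8Array): void {
      if (packet.length <= DTX_MAX_BYTES) {
        dropped++;
        return;
      }
      send(packet);
    },
    get dropped(): number {
      return dropped;
    },
  };
}
```

In a duplex bot, relay.push would be called for every binary message received, with send forwarding to the same socket.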

Audio Normalization

If your bot plays multiple audio files (greetings, responses, background music, sound effects), their loudness levels should be normalized so listeners don't experience jarring volume jumps between tracks.

Recommended Target

| Parameter | Value | Notes |
| --- | --- | --- |
| Integrated Loudness (I) | -16 to -23 LUFS | Streaming standard. Voice-heavy bots can aim for -16, music bots for -14. |
| True Peak (TP) | -1 dBTP | Prevents clipping after Opus encoding |
| Loudness Range (LRA) | ≤ 11 LU | Keep dynamics consistent across tracks |

Key principle: pick one target loudness for your bot and normalize all audio files to it. The specific value matters less than consistency — a greeting, a response, and background music should all sound the same volume to the listener.

Hear the Difference

Put on headphones and compare the same voice clip at three loudness levels. This is the same "Привет" ("Hello") greeting from the echo bot, normalized to different LUFS targets.

-40 LUFS (Too Quiet): Barely audible. Listener has to max out volume to hear anything.

-8 LUFS (Too Loud): Clipping, distorted. Painful on headphones, wakes the neighbors.

-18 LUFS (Normalized): Clean, comfortable listening level. Consistent with other tracks.

Opus playback requires Chrome, Firefox, or Edge. Safari does not support OGG/Opus natively.

Step 1: Measure Loudness

Use ffmpeg's loudnorm filter in analysis mode to measure the loudness of each file:

Measure integrated loudness (LUFS)
ffmpeg -i input.opus -af loudnorm=print_format=json -f null /dev/null

Look at the input_i value in the JSON output — that's the integrated loudness in LUFS. Compare it across all your files.

Example output
// Voice greeting (quiet)
"input_i" : "-40.74"    // -40.7 LUFS

// Background music (too loud!)
"input_i" : "-8.40"     // -8.4 LUFS — 32 dB louder

Step 2: Normalize to a Reference

Pick your main voice track as the reference and normalize all other files to match its loudness. Use loudnorm with your target values:

Normalize to match a reference file
ffmpeg -y -i loud_background.opus \
  -af "loudnorm=I=-40.7:TP=-1:LRA=11" \
  -c:a libopus -b:a 48k -ar 48000 -ac 2 -frame_duration 20 \
  normalized_background.opus

| Parameter | Description |
| --- | --- |
| I=<value> | Target integrated loudness in LUFS (match your reference) |
| TP=-1 | True peak limit; prevents clipping |
| LRA=11 | Loudness range limit; compresses dynamics |
| -c:a libopus | Re-encode as Opus |
| -b:a 48k | Bitrate (48–96 kbps for voice, 96–128 for music) |
| -frame_duration 20 | Match the ingress frame duration setting |

Step 3: Verify

Re-measure the output to confirm the loudness matches your target:

ffmpeg -i normalized_background.opus -af loudnorm=print_format=json -f null /dev/null
// Should output input_i close to your target (e.g. -40.7 → -41.2 is fine)
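
For more than a handful of files, this check could be scripted. The sketch below assumes ffmpeg prints the loudnorm statistics as the last JSON object on stderr with values encoded as strings, which matches the example output earlier; verify this against your ffmpeg build. The function names and the 1 dB tolerance are illustrative choices.

```typescript
// Extracts input_i (integrated loudness in LUFS) from captured ffmpeg stderr text.
function parseInputLoudness(ffmpegStderr: string): number {
  const start = ffmpegStderr.lastIndexOf("{");
  const end = ffmpegStderr.lastIndexOf("}");
  if (start === -1 || end < start) throw new Error("no loudnorm JSON block found");
  const stats = JSON.parse(ffmpegStderr.slice(start, end + 1)) as { input_i?: string };
  const lufs = Number(stats.input_i);
  if (!Number.isFinite(lufs)) throw new Error("input_i missing or not numeric");
  return lufs;
}

// Absolute loudness difference between two measurements, in dB (LU).
function loudnessGapDb(a: number, b: number): number {
  return Math.abs(a - b);
}

// True when a measured file is within toleranceDb of the target loudness.
function isCloseToTarget(measured: number, target: number, toleranceDb = 1): boolean {
  return loudnessGapDb(measured, target) <= toleranceDb;
}
```

With the numbers used on this page, a result of -41.2 against a -40.7 target passes, while the un-normalized background track at -8.40 is about 32.3 dB away from the -40.74 greeting.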

Common pitfalls:

  • TP must be between -9 and 0 — values outside this range cause ffmpeg errors
  • Don't mix mono and stereo files without matching the -ac flag to your ingress stereo setting
  • Always re-measure after normalization — loudnorm may undershoot by 0.5–1 dB, which is perfectly acceptable
  • If your input is an OGG/Opus file, ffmpeg decodes it to PCM and re-encodes it. This is expected, but the re-encode is lossy, so normalize from the highest-quality source you have

Keepalive

| Parameter | Value |
| --- | --- |
| Ping Interval | 30 seconds (server → bot) |
| Pong Timeout | 60 seconds; session closed if bot doesn't respond |
| Write Deadline | 10 seconds per message |

Most WebSocket libraries respond to pings automatically. If yours doesn't, make sure to reply with a pong frame within 60 seconds or the server will disconnect the session.

Room & Token Details

| Property | Details |
| --- | --- |
| Room Name Format | {spaceId}/{channelId} |
| Token Type | LiveKit JWT (HS256) |
| Token TTL | 2 hours; request a new one before it expires |
| Bot Permissions | Publish audio/video, subscribe to tracks, send data, update own metadata |
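
To request a new token before the old one expires, a bot can read the expiry from the token itself instead of tracking issue times. The sketch below assumes the token carries the standard JWT exp claim (seconds since the Unix epoch), which is usual for LiveKit tokens but not confirmed on this page. It does not verify the signature, and the 5-minute margin is an arbitrary choice.

```typescript
// Reads the exp claim from a JWT payload without verifying the signature.
// Assumption: the stream token includes a numeric `exp` claim.
function tokenExpiresAt(token: string): number {
  const payload = token.split(".")[1];
  if (!payload) throw new Error("not a JWT");
  // Convert base64url to padded base64 before decoding.
  const b64 = payload.replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  const claims = JSON.parse(atob(padded)) as { exp?: unknown };
  if (typeof claims.exp !== "number") throw new Error("token has no exp claim");
  return claims.exp;
}

// True once the token is within marginSeconds of expiring.
function needsRefresh(token: string, nowSeconds: number, marginSeconds = 300): boolean {
  return nowSeconds >= tokenExpiresAt(token) - marginSeconds;
}
```

A long-running bot might call needsRefresh(token, Date.now() / 1000) before each reconnect and request a fresh token from StreamToken when it returns true.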

Prerequisites & Validation

The StreamToken endpoint validates the following before issuing a token:

  • The bot must be a member of the space — returns 403 otherwise.
  • The channel must exist — returns 404 if not found.
  • The channel must be a voice channel — returns 400 for text channels.
  • Rate limited to 5 requests per minute.
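
A client can turn these rules into readable errors and avoid tripping the rate limit in the first place. The sketch below maps the three documented status codes and adds a sliding-window limiter; how the server actually counts its window, and what it returns when the limit is exceeded, is not stated here, so the limiter is a conservative assumption.

```typescript
// Maps the documented StreamToken failure codes to a human-readable reason.
function explainStreamTokenStatus(status: number): string {
  switch (status) {
    case 400:
      return "channel is not a voice channel";
    case 403:
      return "bot is not a member of the space";
    case 404:
      return "channel not found";
    default:
      return `unexpected status ${status}`;
  }
}

// Client-side sliding-window limiter: at most maxRequests per windowMs.
function createLimiter(maxRequests = 5, windowMs = 60_000) {
  const stamps: number[] = [];
  return function tryAcquire(nowMs: number): boolean {
    // Forget requests that have left the window.
    while (stamps.length > 0 && nowMs - stamps[0] >= windowMs) stamps.shift();
    if (stamps.length >= maxRequests) return false;
    stamps.push(nowMs);
    return true;
  };
}
```

Calling tryAcquire(Date.now()) before each StreamToken request lets the bot wait locally when it returns false.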

Code Example

TypeScript (Bun / Node.js)

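import { spawn } from "child_process";

// 1. Get stream token
const resp = await fetch("https://gateway.argon.zone/IVoice/v1/StreamToken", {
  method: "POST",
  headers: {
    "Authorization": "Bot YOUR_TOKEN",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ spaceId: "...", channelId: "..." }),
});
const { token, ingressUrl } = await resp.json();

// 2. Encode audio to Opus with ffmpeg (stereo, 48kHz).
// "-f opus" writes an Ogg container, so step 4 unpacks it into raw Opus packets.
const ffmpeg = spawn("ffmpeg", [
  "-i", "music.mp3",
  "-f", "opus", "-ar", "48000", "-ac", "2",
  "-frame_duration", "20", "pipe:1",
]);

// 3. Connect to the publish endpoint (stereo, 20ms frames)
const wsUrl = `${ingressUrl}/audio/connect?token=${encodeURIComponent(token)}` +
  "&stereo=true&frame_duration_ms=20&track_name=music";
const ws = new WebSocket(wsUrl);

// 4. Split Ogg pages into Opus packets and send one packet per binary message
let pending: Buffer = Buffer.alloc(0); // Ogg bytes not yet parsed
let parts: Buffer[] = [];              // segments of the packet being assembled
let ready = false;                     // set once the server reports "ready"
let ended = false;                     // set once ffmpeg has exited

function drain() {
  if (!ready) return; // never send audio before the ready status
  while (pending.length >= 27) { // an Ogg page starts with a 27-byte header
    const lacing = pending.subarray(27, 27 + pending[26]); // segment table
    const pageEnd = 27 + pending[26] + lacing.reduce((a, b) => a + b, 0);
    if (lacing.length < pending[26] || pending.length < pageEnd) break; // need more data
    let off = 27 + lacing.length;
    for (const len of lacing) {
      parts.push(pending.subarray(off, off + len));
      off += len;
      if (len === 255) continue; // packet continues in the next segment
      const packet = Buffer.concat(parts);
      parts = [];
      const magic = packet.subarray(0, 8).toString("latin1");
      if (magic !== "OpusHead" && magic !== "OpusTags") ws.send(packet); // skip Ogg headers
    }
    pending = pending.subarray(pageEnd);
  }
  if (ended) ws.close();
}

ffmpeg.stdout.on("data", (chunk) => { pending = Buffer.concat([pending, chunk]); drain(); });
ffmpeg.on("close", () => { ended = true; drain(); });
ws.addEventListener("message", (event) => {
  if (typeof event.data === "string" && JSON.parse(event.data).status === "ready") {
    ready = true;
    drain();
  }
});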

C# (.NET)

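using System.Diagnostics;
using System.Net.Http.Json;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

// 1. Get stream token
using var http = new HttpClient();
http.DefaultRequestHeaders.Add("Authorization", "Bot YOUR_TOKEN");

var resp = await http.PostAsJsonAsync(
    "https://gateway.argon.zone/IVoice/v1/StreamToken",
    new { spaceId = "...", channelId = "..." });
var data = await resp.Content.ReadFromJsonAsync<JsonElement>();
var token = data.GetProperty("token").GetString()!;
var url = data.GetProperty("ingressUrl").GetString()!;

// 2. Encode audio with ffmpeg ("-f opus" writes an Ogg container, unpacked in step 4)
var ffmpeg = Process.Start(new ProcessStartInfo {
    FileName = "ffmpeg",
    Arguments = "-i music.mp3 -f opus -ar 48000 -ac 2 -frame_duration 20 pipe:1",
    RedirectStandardOutput = true,
    UseShellExecute = false
})!;

// 3. Connect to the publish endpoint and wait for {"status":"ready"}
using var ws = new ClientWebSocket();
await ws.ConnectAsync(
    new Uri($"{url}/audio/connect?token={Uri.EscapeDataString(token)}&stereo=true&frame_duration_ms=20&track_name=music"),
    CancellationToken.None);
var statusBuf = new byte[1024];
var status = await ws.ReceiveAsync(new ArraySegment<byte>(statusBuf), CancellationToken.None);
var statusText = Encoding.UTF8.GetString(statusBuf, 0, status.Count);
if (JsonDocument.Parse(statusText).RootElement.GetProperty("status").GetString() != "ready")
    throw new InvalidOperationException($"Ingress not ready: {statusText}");

// 4. Split Ogg pages into Opus packets; send one packet per binary message (.NET 7+)
var stream = ffmpeg.StandardOutput.BaseStream;
var header = new byte[27];       // fixed-size Ogg page header
var packet = new MemoryStream(); // packet being assembled across segments
while (await stream.ReadAtLeastAsync(header, 27, throwOnEndOfStream: false) == 27)
{
    var lacing = new byte[header[26]]; // segment table: one length byte per segment
    await stream.ReadExactlyAsync(lacing);
    foreach (var len in lacing)
    {
        var segment = new byte[len];
        await stream.ReadExactlyAsync(segment);
        packet.Write(segment);
        if (len == 255) continue; // packet continues in the next segment
        var bytes = packet.ToArray();
        packet.SetLength(0);
        var magic = Encoding.ASCII.GetString(bytes, 0, Math.Min(8, bytes.Length));
        if (magic is "OpusHead" or "OpusTags") continue; // skip Ogg stream headers
        await ws.SendAsync(bytes.AsMemory(), WebSocketMessageType.Binary, true, CancellationToken.None);
    }
}
await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, null, CancellationToken.None);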

Use Cases

Music Bot

Stream audio from local files or your own media library. Encode to Opus and push to the channel via /audio/connect.

TTS Bot

Convert text to speech (e.g., via Google TTS or Coqui), encode to Opus, and stream the result.

Voice / Echo Bot

Use /audio/duplex to listen to a participant and respond — AI voice assistants, echo bots, call automation.

Transcription Bot

Subscribe to a participant's audio via /audio/subscribe and pipe Opus frames to a speech-to-text engine.

Live Audio Relay

Relay audio from an external source (radio stream, microphone, podcast feed) into the channel.

Audio Notifications

Play short audio clips for alerts, announcements, or sound effects triggered by events.
