Voice Streaming

Stream audio directly into voice channels via WebSocket. Music bots, TTS, live audio relay — no WebRTC needed.

How It Works #

Argon bots don't use WebRTC. Instead, they stream audio directly to a WebSocket ingress endpoint. The server handles mixing and distribution to all participants in the voice channel.

[Diagram: Your Bot → Opus over WebSocket → Ingress → SFU → WebRTC → Listeners]

Three WebSocket endpoints are available depending on your use case:

Endpoint	Mode	Use case
/audio/connect	Publish only	Stream audio into a room. Music bots, TTS, notifications.
/audio/subscribe	Subscribe only	Receive a participant's audio. Transcription, analysis.
/audio/duplex	Bidirectional	Publish + subscribe on one connection. Voice bots, echo bots.

Endpoints Overview #

You cannot use a single token for both /audio/connect and /audio/subscribe simultaneously. If you need both directions, use /audio/duplex.

`/audio/connect` — Publish #

Publish-only endpoint. Send binary WebSocket messages containing Opus frames — each message is one Opus packet. The server publishes them as an audio track in the Argon Voice Channel.

ws://ingress.argon.gl:12880/audio/connect?token=JWT&stereo=false&frame_duration_ms=20

Query Parameters

Parameter	Required	Description
token	Yes	JWT with room grant
stereo	No	`true` for stereo. Default: `false`
channels	No	Alternative to `stereo`: `2` = stereo, `1` = mono
frame_duration_ms	No	Opus frame duration: 2.5, 5, 10, 20 (default), 40, 60
track_name	No	Display name for the published track (default: `audio`)
track_source	No	Source type: `microphone` (default), `screen_share_audio`
metadata	No	Arbitrary participant metadata (JSON string)

WS messages (bot → server): Binary only. Each binary message = one raw Opus packet.

WS messages (server → bot): Text only. {"status":"ready","session_id":"AWS_xxx"} — track published, safe to start sending.

`/audio/subscribe` — Subscribe #

Subscribe-only endpoint. Receive a specific participant's audio as raw Opus frames. The server also sends JSON status messages as text frames.

ws://ingress.argon.gl:12880/audio/subscribe?token=JWT&target_identity=user-guid&target_track_source=microphone

Query Parameters

Parameter	Required	Description
token	Yes	JWT with room grant
target_identity	Yes	Identity of the participant to listen to
target_track_source	No	`microphone` (default) or `screen_share_audio`

WS Messages (server → bot)

Type	Format	Description
Binary	Raw Opus frame	One Opus packet per message (RTP payload)
Text	{"status":"waiting","session_id":"...","participant_identity":"..."}	Target participant not yet in room
Text	{"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."}	Subscribed to target's audio, frames incoming
Text	{"status":"target_left","session_id":"...","participant_identity":"..."}	Target left the room, WS closes
Text	{"status":"error","message":"..."}	Error occurred, WS closes after
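The text messages in the table above can be handled with a small parser. The following is a minimal sketch, assuming text frames always carry a JSON object with a `status` field as shown; the names `SubscribeStatus`, `parseStatusMessage`, and `closesConnection` are illustrative and not part of the Argon API.

```typescript
// Status values copied from the table above; the type names are hypothetical.
type SubscribeStatus = "waiting" | "subscribed" | "target_left" | "error";

interface StatusMessage {
  status: SubscribeStatus;
  session_id?: string;
  track_sid?: string;
  participant_identity?: string;
  message?: string;
}

const KNOWN_STATUSES: readonly string[] = ["waiting", "subscribed", "target_left", "error"];

// Returns the parsed message, or null if the text is not a recognised status message.
function parseStatusMessage(text: string): StatusMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const status = (value as { status?: unknown }).status;
  if (typeof status !== "string" || !KNOWN_STATUSES.includes(status)) return null;
  return value as StatusMessage;
}

// Per the table, the server closes the WebSocket after these two statuses.
function closesConnection(msg: StatusMessage): boolean {
  return msg.status === "target_left" || msg.status === "error";
}
```

A bot would typically call `parseStatusMessage` on every text frame and start its reconnect or cleanup logic when `closesConnection` returns true.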

`/audio/duplex` — Bidirectional #

Full-duplex endpoint — publish and subscribe on a single WebSocket connection. One identity, one room join. This is the recommended endpoint for voice bots that need to both speak and listen.

ws://ingress.argon.gl:12880/audio/duplex?token=JWT&target_identity=user-guid&target_track_source=microphone&stereo=false&frame_duration_ms=20

Query Parameters

Parameter	Required	Description
token	Yes	JWT with room grant
target_identity	Yes	Identity of the participant to listen to
target_track_source	No	`microphone` (default) or `screen_share_audio`
track_name	No	Name of the published track (default: `audio`)
track_source	No	Published track source type (default: `microphone`)
stereo	No	`true` for stereo. Default: `false`
channels	No	Alternative to `stereo`: `2` = stereo, `1` = mono
frame_duration_ms	No	Opus frame duration: 2.5, 5, 10, 20 (default), 40, 60
metadata	No	Participant metadata (JSON string)

WS Message Flow

Direction	Type	Description
Bot → Room	Binary	Opus frames published as an audio track
Room → Bot	Binary	Target participant's Opus frames
Room → Bot	Text	{"status":"ready","session_id":"AWS_xxx"} — track published, safe to start sending
Room → Bot	Text	{"status":"waiting","session_id":"...","participant_identity":"..."} — target not yet in room
Room → Bot	Text	{"status":"subscribed","session_id":"...","track_sid":"...","participant_identity":"..."} — receiving target's audio
Room → Bot	Text	{"status":"target_left","session_id":"...","participant_identity":"..."} — target left the room
Room → Bot	Text	{"status":"error","message":"..."} — error, WS closes after

Recommended for call bots. The ICalls/Accept endpoint returns audioBaseUrl and callerId — use them directly to build the duplex URL.
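As a rough illustration, the duplex URL can be assembled from those two values. This sketch assumes `audioBaseUrl` is the ingress base (for example `ws://ingress.argon.gl:12880`) without a path, and that `callerId` is used as `target_identity`; the token is passed in separately because its source for calls is not specified here. `buildDuplexUrl` and `DuplexOptions` are hypothetical names.

```typescript
interface DuplexOptions {
  audioBaseUrl: string;     // assumed: ingress base URL without the /audio/duplex path
  token: string;            // JWT with room grant
  callerId: string;         // assumed to map to target_identity
  stereo?: boolean;         // defaults to false, as in the parameter table
  frameDurationMs?: number; // defaults to 20, as in the parameter table
}

function buildDuplexUrl(o: DuplexOptions): string {
  const base = o.audioBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const params: [string, string][] = [
    ["token", o.token],
    ["target_identity", o.callerId],
    ["target_track_source", "microphone"],
    ["stereo", String(o.stereo ?? false)],
    ["frame_duration_ms", String(o.frameDurationMs ?? 20)],
  ];
  const query = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return `${base}/audio/duplex?${query}`;
}
```

Encoding each value with `encodeURIComponent` keeps identities or tokens with reserved characters from corrupting the query string.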

Step-by-Step Flow #

Request a Stream Token

Call the StreamToken endpoint with the target space and channel:

Request

POST /IVoice/v1/StreamToken
Authorization: Bot YOUR_TOKEN
Content-Type: application/json

{
  "spaceId": "550e8400-e29b-41d4-a716-446655440000",
  "channelId": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
}

Response

{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "ingressUrl": "ws://ingress.argon.gl:12880",
  "roomName": "550e8400-.../6ba7b810-..."
}

Connect to the WebSocket Endpoint

Open a WebSocket connection to the endpoint that matches your use case. For publish-only (music bots, TTS), use /audio/connect. For voice bots that speak and listen, use /audio/duplex.

Publish-only (music bot)

ws://ingress.argon.gl:12880/audio/connect?token=TOKEN&stereo=true&frame_duration_ms=20

Full-duplex (voice bot in a call)

ws://ingress.argon.gl:12880/audio/duplex?token=TOKEN&target_identity=caller-guid&target_track_source=microphone

See the endpoint sections above for full parameter tables.

Wait for Ready Status

Before sending any audio, wait for the server's {"status":"ready"} text message. This confirms the track is published and the room is joined. Do not send Opus frames before receiving this.

// Server sends:
{"status": "ready", "session_id": "AWS_xxxxx"}
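One way to enforce this rule in code is a small gate that holds outgoing packets until the ready message arrives. The sketch below is not part of any Argon SDK; `ReadyGate` is a hypothetical helper, and it assumes the ready message is a JSON text frame with `"status":"ready"` as shown above.

```typescript
// Holds Opus packets until the server reports "ready", then forwards them in order.
class ReadyGate {
  private ready = false;
  private pending: Uint8Array[] = [];
  private send: (frame: Uint8Array) => void;

  constructor(send: (frame: Uint8Array) => void) {
    this.send = send;
  }

  // Call with every text message received from the server.
  onText(text: string): void {
    let status: unknown;
    try {
      status = (JSON.parse(text) as { status?: unknown }).status;
    } catch {
      return; // not JSON (or not an object): ignore
    }
    if (status !== "ready" || this.ready) return;
    this.ready = true;
    for (const frame of this.pending) this.send(frame);
    this.pending = [];
  }

  // Call with every Opus packet the bot wants to publish.
  push(frame: Uint8Array): void {
    if (this.ready) this.send(frame);
    else this.pending.push(frame);
  }
}
```

In a real bot `send` would wrap `ws.send`. For live sources it may be preferable to drop early packets instead of queueing them, since queued audio adds latency once the gate opens.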

Stream Opus Audio Frames

Send binary WebSocket frames containing Opus-encoded audio data. Each frame should be a single Opus packet (max 3,825 bytes).

Interactive Demo: Echo Bot Flow #

Listen to a real echo bot session and see how the six phases map to the audio timeline. The session runs 1:00 and mirrors what happens over the WebSocket connection during a call.

Phases: Greeting, Recording, Transition, Playback, Outro, Background.

Greeting (0:00–0:16)

Bot plays a pre-recorded greeting. Audio is sent as Opus frames over the WebSocket. The bot is not yet listening to the caller.

This is real audio from the echo bot sample in the Argon Chat Echo Bot repository. The phases are driven by timestamp markers defined in a JSON file alongside each audio variant.

Audio samples on this page are proprietary assets of Argon Inc. LLC and are provided for demonstration purposes only. They may not be reused or redistributed without permission.

Audio Requirements #

Parameter	Value
Codec	Opus
Sample Rate	48,000 Hz
Channels	Mono (1) or Stereo (2) — configured via `stereo` or `channels` query param
Frame Duration	2.5, 5, 10, 20 (default), 40, or 60 ms — configured via `frame_duration_ms` query param
Max Frame Size	3,825 bytes per Opus packet
Transport	Binary WebSocket frames (one Opus packet per frame)

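Do not send raw PCM. The ingress expects Opus-encoded packets. Use a tool like opusenc or ffmpeg, or your language's Opus bindings, to encode before sending. Note that opusenc and ffmpeg write Ogg-encapsulated Opus, so the raw packets must be extracted from the Ogg pages before they are sent.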
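The figures in the table translate into concrete buffer sizes on the encoder side. The sketch below computes samples per frame at 48 kHz and the matching PCM input size, assuming 16-bit samples are fed to the encoder (an assumption; the ingress itself never sees PCM). The function names are illustrative.

```typescript
const SAMPLE_RATE_HZ = 48000;
const FRAME_DURATIONS_MS = [2.5, 5, 10, 20, 40, 60];
const MAX_PACKET_BYTES = 3825;

// Samples per channel in one Opus frame of the given duration.
function samplesPerFrame(frameDurationMs: number): number {
  if (!FRAME_DURATIONS_MS.includes(frameDurationMs)) {
    throw new RangeError(`unsupported frame duration: ${frameDurationMs} ms`);
  }
  return (SAMPLE_RATE_HZ * frameDurationMs) / 1000;
}

// Bytes of 16-bit PCM the encoder consumes per frame (assumes s16 input).
function pcmBytesPerFrame(frameDurationMs: number, channels: 1 | 2): number {
  return samplesPerFrame(frameDurationMs) * channels * 2;
}

// True if an encoded packet fits the documented size limit.
function fitsPacketLimit(packet: Uint8Array): boolean {
  return packet.length > 0 && packet.length <= MAX_PACKET_BYTES;
}
```

For the default 20 ms frame this gives 960 samples per channel, or 3,840 bytes of 16-bit stereo PCM per encoder call.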

Frame Validation

The server validates every incoming Opus frame by parsing the TOC byte (RFC 6716 §3.1). If a frame's declared structure doesn't match its actual size, the server immediately closes the WebSocket with the error:

opus frame too small for declared structure

This typically happens when:

Sending a truncated or corrupted Opus packet
Re-publishing DTX/comfort noise frames received from a subscribed participant (see below)
Accidentally sending an OGG page header instead of a raw Opus frame
Sending an empty or 1-byte binary message where the TOC byte declares a multi-frame packet

TOC Frame Code	Meaning	Minimum Packet Size
0	1 frame	2 bytes (TOC + at least 1 byte payload)
1	2 equal-size CBR frames	3 bytes (TOC + even payload ≥ 2)
2	2 different-size VBR frames	3 bytes (TOC + size field + data)
3	Arbitrary N frames (CBR/VBR)	3+ bytes (TOC + count byte + padding/sizes + data)

Tip: If you're echoing or relaying received audio frames, validate each packet before sending it back. Drop invalid frames or replace them with a silence packet (0xf8 0xff 0xfe — a valid Opus frame that decodes to silence).
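A client-side pre-check along the lines of the table above might look like the following. This is a sketch of the documented minimum sizes only, not a reimplementation of the server's validator or of the full RFC 6716 packing rules; `isStructurallyValid` and `sanitize` are hypothetical helpers.

```typescript
// The silence packet quoted in the tip above.
const SILENCE = Uint8Array.from([0xf8, 0xff, 0xfe]);

// Checks a packet against the minimum sizes listed per TOC frame code.
function isStructurallyValid(packet: Uint8Array): boolean {
  if (packet.length === 0 || packet.length > 3825) return false;
  const code = packet[0] & 0x03;      // frame code: low two bits of the TOC byte
  const payload = packet.length - 1;  // bytes after the TOC byte
  switch (code) {
    case 0:
      return payload >= 1;                       // one frame
    case 1:
      return payload >= 2 && payload % 2 === 0;  // two equal-size frames
    default:
      return payload >= 2;                       // codes 2 and 3: at least 3 bytes total
  }
}

// Replaces packets that fail the check with the silence packet.
function sanitize(packet: Uint8Array): Uint8Array {
  return isStructurallyValid(packet) ? packet : SILENCE;
}
```

Running `sanitize` on every relayed packet keeps one malformed frame from closing the whole session, at the cost of a brief gap of silence.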

DTX & Comfort Noise Frames

When subscribed to a participant's audio (/audio/subscribe or /audio/duplex), WebRTC clients with DTX (Discontinuous Transmission) enabled will send very small packets during silence — typically 1-byte frames with TOC bytes like 0xdc or 0xfc.

These DTX frames are valid for a decoder to produce comfort noise, but they cannot be re-published through the ingress — the server will reject them as structurally invalid and close the WebSocket.

If you relay or echo received audio: discard all frames ≤ 2 bytes before recording or re-publishing. These are DTX/comfort noise packets that carry no useful audio content. A normal Opus voice frame is typically 20–160 bytes.
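The size rule above is easy to apply as a filter before frames are recorded or re-published. A minimal sketch, with `isDtxFrame` and `dropDtx` as illustrative names:

```typescript
const DTX_MAX_BYTES = 2;

// Treats any frame of 2 bytes or fewer as DTX/comfort noise, per the rule above.
function isDtxFrame(frame: Uint8Array): boolean {
  return frame.length <= DTX_MAX_BYTES;
}

// Returns only the frames that are safe to record or re-publish.
function dropDtx(frames: Uint8Array[]): Uint8Array[] {
  return frames.filter((frame) => !isDtxFrame(frame));
}
```

An echo bot would apply `isDtxFrame` to each incoming binary message and skip the ones it flags instead of appending them to its recording buffer.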

Audio Normalization #

If your bot plays multiple audio files (greetings, responses, background music, sound effects), their loudness levels should be normalized so listeners don't experience jarring volume jumps between tracks.

Recommended Target

Parameter	Value	Notes
Integrated Loudness (I)	-23 to -14 LUFS	Streaming standard. Voice-heavy bots can aim for -16, music bots for -14.
True Peak (TP)	-1 dBTP	Prevents clipping after Opus encoding
Loudness Range (LRA)	≤ 11 LU	Keep dynamics consistent across tracks

Key principle: pick one target loudness for your bot and normalize all audio files to it. The specific value matters less than consistency — a greeting, a response, and background music should all sound the same volume to the listener.

Hear the Difference #

Put on headphones and compare the same voice clip at three loudness levels. This is the same "Привет" greeting from the echo bot — normalized to different LUFS targets.

Level	Label	Description
-40 LUFS	Too Quiet	Barely audible. Listener has to max out volume to hear anything.
-8 LUFS	Too Loud	Clipping, distorted. Painful on headphones, wakes the neighbors.
-18 LUFS	Normalized	Clean, comfortable listening level. Consistent with other tracks.

[Loudness meter: scale from -45 to 0 LUFS; -40 marked Too quiet, -18 marked Target, -8 marked Too loud]

Opus playback requires Chrome, Firefox, or Edge. Safari does not support OGG/Opus natively.

Step 1: Measure Loudness

Use ffmpeg's loudnorm filter in analysis mode to measure the loudness of each file:

Measure integrated loudness (LUFS)

ffmpeg -i input.opus -af loudnorm=print_format=json -f null /dev/null

Look at the input_i value in the JSON output — that's the integrated loudness in LUFS. Compare it across all your files.
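With many files, reading input_i by eye gets tedious. The sketch below pulls it out of captured ffmpeg output, assuming the loudnorm JSON block is the last `{ ... }` in the stderr text and that values are quoted strings, which is how ffmpeg prints them. `parseInputLoudness` and `loudnessGapDb` are hypothetical helper names.

```typescript
// Extracts input_i (integrated loudness, LUFS) from ffmpeg's loudnorm JSON report.
function parseInputLoudness(stderr: string): number {
  const start = stderr.lastIndexOf("{");
  const end = stderr.lastIndexOf("}");
  if (start < 0 || end < start) throw new Error("no loudnorm JSON block found");
  const stats = JSON.parse(stderr.slice(start, end + 1)) as { input_i?: string };
  const value = Number(stats.input_i);
  if (!Number.isFinite(value)) throw new Error("input_i missing or not a finite number");
  return value;
}

// Positive result means `other` is louder than `reference`.
function loudnessGapDb(referenceLufs: number, otherLufs: number): number {
  return otherLufs - referenceLufs;
}
```

Capture stderr when spawning ffmpeg, pass it to `parseInputLoudness` for each file, and compare the results with `loudnessGapDb` to spot outliers before normalizing.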

Example output

// Voice greeting (quiet)
"input_i" : "-40.74"    // -40.7 LUFS

// Background music (too loud!)
"input_i" : "-8.40"     // -8.4 LUFS — 32 dB louder

Step 2: Normalize to a Reference

Pick your main voice track as the reference and normalize all other files to match its loudness. Use loudnorm with your target values:

Normalize to match a reference file

ffmpeg -y -i loud_background.opus \
  -af "loudnorm=I=-40.7:TP=-1:LRA=11" \
  -c:a libopus -b:a 48k -ar 48000 -ac 2 -frame_duration 20 \
  normalized_background.opus

Parameter	Description
I=<value>	Target integrated loudness in LUFS (match your reference)
TP=-1	True peak limit — prevents clipping
LRA=11	Loudness range limit — compresses dynamics
-c:a libopus	Re-encode as Opus
-b:a 48k	Bitrate (48–96 kbps for voice, 96–128 for music)
-frame_duration 20	Match the ingress frame duration setting

Step 3: Verify

Re-measure the output to confirm the loudness matches your target:

ffmpeg -i normalized_background.opus -af loudnorm=print_format=json -f null /dev/null
# input_i should be close to your target (e.g. target -40.7, measured -41.2 is fine)

Common pitfalls:

TP must be between -9 and 0 — values outside this range cause ffmpeg errors
Don't mix mono and stereo files without matching the -ac flag to your ingress stereo setting
Always re-measure after normalization — loudnorm may undershoot by 0.5–1 dB, which is perfectly acceptable
If your input is an OGG/Opus file, ffmpeg decodes it to PCM and re-encodes it. This is expected, but the re-encode is lossy, so normalize from the highest-quality source you have

Keepalive #

Parameter	Value
Ping Interval	30 seconds (server → bot)
Pong Timeout	60 seconds — session closed if bot doesn't respond
Write Deadline	10 seconds per message

Most WebSocket libraries respond to pings automatically. If yours doesn't, make sure to reply with a pong frame within 60 seconds or the server will disconnect the session.

Room & Token Details #

Property	Details
Room Name Format	{spaceId}/{channelId}
Token Type	LiveKit JWT (HS256)
Token TTL	2 hours — request a new one before it expires
Bot Permissions	Publish audio/video, subscribe to tracks, send data, update own metadata

Prerequisites & Validation #

The StreamToken endpoint validates the following before issuing a token:

The bot must be a member of the space — returns 403 otherwise.
The channel must exist — returns 404 if not found.
The channel must be a voice channel — returns 400 for text channels.
Rate limited to 5 requests per minute.
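To stay under the rate limit, a bot can track its own StreamToken calls. The sketch below uses a sliding 60-second window of five requests; whether the server counts with a sliding or a fixed window is not stated, so this is a conservative assumption, and `StreamTokenBudget` is a hypothetical name.

```typescript
// Client-side guard: allows at most 5 StreamToken requests per rolling 60 seconds.
class StreamTokenBudget {
  private stamps: number[] = [];
  private readonly limit = 5;
  private readonly windowMs = 60000;

  // Returns true and records the request if it fits the budget at time `nowMs`.
  tryAcquire(nowMs: number): boolean {
    this.stamps = this.stamps.filter((t) => nowMs - t < this.windowMs);
    if (this.stamps.length >= this.limit) return false;
    this.stamps.push(nowMs);
    return true;
  }
}
```

Call `tryAcquire(Date.now())` before each request and delay or skip the call when it returns false. Passing the clock in as an argument keeps the class easy to test.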

Code Example #

TypeScript (Bun / Node.js)

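import { spawn } from "child_process";

// 1. Get stream token
const resp = await fetch("https://gateway.argon.zone/IVoice/v1/StreamToken", {
  method: "POST",
  headers: {
    "Authorization": "Bot YOUR_TOKEN",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ spaceId: "...", channelId: "..." }),
});
if (!resp.ok) throw new Error(`StreamToken failed: ${resp.status}`);
const { token, ingressUrl } = await resp.json();

// 2. Encode audio to Opus with ffmpeg (stereo, 48 kHz, 20 ms frames).
//    "-re" reads the input at its native rate; "-f opus" writes an Ogg container.
const ffmpeg = spawn("ffmpeg", [
  "-re", "-i", "music.mp3",
  "-c:a", "libopus", "-f", "opus", "-ar", "48000", "-ac", "2",
  "-frame_duration", "20", "pipe:1",
], { stdio: ["ignore", "pipe", "inherit"] });

// 3. Split the Ogg stream into raw Opus packets (assumes a well-formed stream).
async function* opusPackets(stream: AsyncIterable<Buffer>) {
  let buf = Buffer.alloc(0), parts: Buffer[] = [], index = 0;
  for await (const chunk of stream) {
    buf = Buffer.concat([buf, chunk]);
    while (buf.length >= 27) {                      // 27-byte Ogg page header
      const header = 27 + buf[26];                  // plus the segment (lacing) table
      if (buf.length < header) break;
      const lacing = buf.subarray(27, header);
      const bodyLen = lacing.reduce((a, b) => a + b, 0);
      if (buf.length < header + bodyLen) break;     // wait for the rest of the page
      let off = header;
      for (const len of lacing) {
        parts.push(buf.subarray(off, off + len));
        off += len;
        if (len < 255) {                            // a segment under 255 bytes ends a packet
          if (index++ >= 2) yield Buffer.concat(parts); // skip OpusHead and OpusTags
          parts = [];
        }
      }
      buf = buf.subarray(off);
    }
  }
}

// 4. Connect to /audio/connect, wait for "ready", then send one packet per message
const wsUrl = `${ingressUrl}/audio/connect?token=${encodeURIComponent(token)}&stereo=true&frame_duration_ms=20&track_name=music`;
const ws = new WebSocket(wsUrl);
let started = false;
ws.addEventListener("message", async (ev) => {
  if (started || typeof ev.data !== "string" || JSON.parse(ev.data).status !== "ready") return;
  started = true;
  for await (const packet of opusPackets(ffmpeg.stdout!)) ws.send(packet);
  ws.close();
});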

C# (.NET)

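using System.Diagnostics;
using System.Net.Http.Json;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

// 1. Get stream token
using var http = new HttpClient();
http.DefaultRequestHeaders.Add("Authorization", "Bot YOUR_TOKEN");

var resp = await http.PostAsJsonAsync(
    "https://gateway.argon.zone/IVoice/v1/StreamToken",
    new { spaceId = "...", channelId = "..." });
resp.EnsureSuccessStatusCode();
var data = await resp.Content.ReadFromJsonAsync<JsonElement>();
var token = data.GetProperty("token").GetString()!;
var url = data.GetProperty("ingressUrl").GetString()!;

// 2. Encode audio with ffmpeg ("-re" reads at native rate; "-f opus" writes an Ogg container)
var ffmpeg = Process.Start(new ProcessStartInfo {
    FileName = "ffmpeg",
    Arguments = "-re -i music.mp3 -c:a libopus -f opus -ar 48000 -ac 2 -frame_duration 20 pipe:1",
    RedirectStandardOutput = true,
    UseShellExecute = false
})!;
var ogg = ffmpeg.StandardOutput.BaseStream;

// 3. Connect to /audio/connect and wait for {"status":"ready"}
using var ws = new ClientWebSocket();
await ws.ConnectAsync(
    new Uri($"{url}/audio/connect?token={Uri.EscapeDataString(token)}&stereo=true&frame_duration_ms=20&track_name=music"),
    CancellationToken.None);
var msg = new byte[1024];
var first = await ws.ReceiveAsync(msg.AsMemory(), CancellationToken.None);
var status = JsonDocument.Parse(Encoding.UTF8.GetString(msg, 0, first.Count))
    .RootElement.GetProperty("status").GetString();
if (status != "ready") throw new InvalidOperationException($"Unexpected status: {status}");

// 4. Split Ogg pages into raw Opus packets and send one packet per message
//    (assumes a well-formed stream; ReadExactlyAsync requires .NET 7 or later)
var header = new byte[27];
var packet = new MemoryStream();
var index = 0;
while (true)
{
    try { await ogg.ReadExactlyAsync(header); } catch (EndOfStreamException) { break; }
    var lacing = new byte[header[26]];          // segment table length is byte 26
    await ogg.ReadExactlyAsync(lacing);
    foreach (var len in lacing)
    {
        var segment = new byte[len];
        await ogg.ReadExactlyAsync(segment);
        packet.Write(segment);
        if (len == 255) continue;               // packet continues in the next segment
        if (index++ >= 2)                       // skip OpusHead and OpusTags
            await ws.SendAsync(packet.ToArray().AsMemory(), WebSocketMessageType.Binary, true, CancellationToken.None);
        packet.SetLength(0);
    }
}
await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);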

Use Cases #

Music Bot

Stream audio from local files or your own media library. Encode to Opus and push to the channel via /audio/connect.

TTS Bot

Convert text to speech (e.g., via Google TTS or Coqui), encode to Opus, and stream the result.

Voice / Echo Bot

Use /audio/duplex to listen to a participant and respond — AI voice assistants, echo bots, call automation.

Transcription Bot

Subscribe to a participant's audio via /audio/subscribe and pipe Opus frames to a speech-to-text engine.

Live Audio Relay

Relay audio from an external source (radio stream, microphone, podcast feed) into the channel.

Audio Notifications

Play short audio clips for alerts, announcements, or sound effects triggered by events.

Next Steps

→ IVoice API Reference: endpoint details and types
→ Real-time Events: listen for VoiceJoin/VoiceLeave events
→ API Reference: all interfaces
