Manual mode
Bring your own LLM as the brain of a phone call — Saperly sends it text turns and executes the directives it returns, while speech-to-text, text-to-speech, and the audio all stay in-network.
Manual mode makes your LLM the brain of a phone call. The caller speaks;
the network transcribes the turn; Saperly forwards you the text; you reply
with a directive (speak, wait_for_user, hangup, transfer,
send_dtmf); the network speaks it back. Your code never touches audio.
This is the alternative to a hosted connection, which keeps the LLM in-network too — reach for manual mode when you want to own the brain. Either way, speech-to-text and text-to-speech run in-network: you see only text turns and you emit only directives.
How a turn flows
    ☎ caller                                 ☎ caller
       │ speech                          speech ▲
       ▼                                        │
 speech-to-text ──text──▶ Saperly ──────▶ text-to-speech
  (in-network)             │ text turn  ▲  (in-network)
                           ▼            │ directive
                   ┌───────────────┐    │
                   │  YOUR brain   │────┘
                   │  (your LLM)   │  speak / wait / hangup / …
                   └───────────────┘
audio never reaches your server · speech-to-text + text-to-speech stay in-network
Saperly hands your brain a sequence of events (inbound_call, then a turn
per caller utterance, then call_ended) and applies the directive you return
for each. Every frame carries a requestId you must echo so the right directive
lands on the right call.
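The event sequence above can be sketched as a single function that maps each event to a directive frame and echoes the requestId. This is a minimal sketch, not a reference implementation: the frame fields follow the shapes documented later on this page, the scripted ids and the echo-style reply are made up for illustration, and answering call_ended with a hangup directive is an assumption, since the page only asks for "a terminal acknowledgement".

```python
def answer(event: dict) -> dict:
    """Map one event frame to the directive frame that answers it."""
    kind = event["type"]
    if kind == "inbound_call":
        directive = {"type": "speak", "text": "Hi, this is Acme. How can I help?"}
    elif kind == "turn":
        # A real brain would call its LLM here; this stub repeats the transcript.
        directive = {"type": "speak", "text": f"You said: {event['userText']}"}
    elif kind == "call_ended":
        # Assumption: a hangup directive serves as the terminal acknowledgement.
        directive = {"type": "hangup"}
    else:
        raise ValueError(f"unexpected event type: {kind!r}")
    # Echo the requestId so the directive lands on the right call.
    return {"type": "directive", "requestId": event["requestId"], "directive": directive}


# A scripted call (hypothetical ids): inbound_call, one caller turn, call_ended.
scripted_call = [
    {"type": "inbound_call", "requestId": "req_1", "conversationId": "call_a",
     "callControlId": "call_a", "from": "+15555550123", "to": "+15555550199"},
    {"type": "turn", "requestId": "req_2", "conversationId": "call_a",
     "userText": "What are your opening hours?"},
    {"type": "call_ended", "requestId": "req_3", "conversationId": "call_a"},
]

for event in scripted_call:
    reply = answer(event)
    print(reply["requestId"], reply["directive"]["type"])
```

Keeping the brain a pure function of the event makes it easy to test without a live call and to reuse behind either transport.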
Switch a line between hosted and manual
Manual mode is a mode of a connection — the same
handler you bind to a number. Switching a line between hosted (an in-network
voice assistant is the brain) and manual (your agent is the brain) is a
first-class operation: you flip mode and Saperly reconciles everything for you.
Changing a connection's mode automatically:
- mints the connection's manualSecret (once, on first switch to manual);
- reconfigures the in-network voice assistant for the new mode;
- re-routes every number bound to the connection so inbound calls reach the new brain.
Switching back to hosted re-points the same numbers at the hosted assistant —
the manualSecret is retained, so flipping to manual again later reuses it.
There are three supported ways to switch, all driving the same operation:
In the dashboard: open Connections → your connection, set Mode to hosted or manual, and
save. For a manual connection, the page also reveals the manualSecret to copy
into your agent's connector.
Over the API: PATCH the connection with the new mode. The response carries the updated
connection, including the manualSecret once it's a manual line.
curl -X PATCH "https://api.saperly.com/connections/$CONNECTION_ID" \
  -H "Authorization: Bearer $SAPERLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "manual" }'
Switch back with { "mode": "hosted" }. See Connections
for the full connection shape and endpoints.
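If you would rather flip the mode from code than from a shell, the same PATCH can be sketched with Python's standard library. The URL, headers, and body mirror the curl call above; the function names are hypothetical, and the response is assumed to be a JSON object (the updated connection), as the text describes.

```python
import json
import urllib.request

API_BASE = "https://api.saperly.com"  # host taken from the curl example


def build_mode_switch(connection_id: str, api_key: str, mode: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that flips a connection's mode."""
    if mode not in ("hosted", "manual"):
        raise ValueError(f"mode must be 'hosted' or 'manual', got {mode!r}")
    return urllib.request.Request(
        url=f"{API_BASE}/connections/{connection_id}",
        data=json.dumps({"mode": mode}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def switch_mode(connection_id: str, api_key: str, mode: str) -> dict:
    """Send the request and return the updated connection as a dict."""
    request = build_mode_switch(connection_id, api_key, mode)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the first half testable offline; only switch_mode touches the network.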
The same operation is exposed as a tool over Saperly's
MCP server — point any MCP-capable agent at the endpoint with a
scoped sk_ key (it needs connections:write) and have it switch the line's mode
as a tool call.
No manual setup steps
Switching a mode is the only way to wire (or rewire) a line. You never create the assistant, register the secret, or repoint numbers by hand — the mode switch does all of it, idempotently. Re-running it never duplicates anything.
Two ways to be the brain
There are two transports. Both speak the same event/directive vocabulary — pick by whether your brain has a public URL.
Webhook: Saperly POSTs each turn to a URL you host. Set manualWebhookUrl on the
connection; every event (inbound_call, a caller turn, call_ended) arrives as a
signed POST, and you reply with a directive in the response body.
POST <your manualWebhookUrl>
The request is signed with the connection's manualSecret so you can verify it.
Use this transport when your brain runs somewhere with a public URL, such as a
serverless function or your own backend. There is no socket to hold open.
WebSocket: your agent, which has no public URL, connects out and becomes the brain. This is the agent-as-channel model: the session you already have open becomes phone-reachable.
wss://api.saperly.com/v2/manual/{connectionId}/ws
Authorization: Bearer <manualSecret>
One socket multiplexes every live call on that connection. Use this when the brain runs somewhere with no inbound URL, such as a Claude Code session, an openclaw agent, or a laptop. The ready-made connectors that implement this socket are in Voice channels.
Where to get the manualSecret
manualSecret is mc_ followed by hex, minted once per connection. Find it
in the dashboard under Connections → your manual connection (or copy it when
you create a manual connection). See Connections.
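A quick format check can catch a mis-pasted credential (for example an sk_ API key where the manualSecret belongs) before you try to connect. This is a sketch under an assumption: the page only says "mc_ followed by hex", so the pattern below accepts any non-empty run of hex digits in either case and does not enforce a length.

```python
import re

# Assumption: the hex part has no fixed length and may be upper or lower case;
# the documentation only states "mc_ followed by hex".
MANUAL_SECRET_RE = re.compile(r"mc_[0-9a-fA-F]+")


def looks_like_manual_secret(value: str) -> bool:
    """Return True if value has the documented mc_<hex> shape."""
    return MANUAL_SECRET_RE.fullmatch(value) is not None
```

This only validates the shape; whether the secret is actually valid is decided by the server when you authenticate.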
The WebSocket protocol
The websocket is the full-fidelity transport, so it is documented in detail here.
After the socket opens, the client sends one hello handshake; thereafter the
server pushes request frames (events) and the client returns directive
frames (replies).
agent → server : hello                            (once, on connect)
server → agent : inbound_call | turn | call_ended (each with requestId)
agent → server : directive                        (echoes requestId; carries one directive)
server → agent : error                            (advisory: a frame the server rejected)
Auth
Present the connection's manualSecret as a bearer on the upgrade request:
- Authorization: Bearer <manualSecret>, or
- Sec-WebSocket-Protocol: bearer.<manualSecret>, the browser-compatible escape hatch, since a native WebSocket cannot set an Authorization header (the server echoes the subprotocol on accept).
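The two auth options can be sketched as small helpers that produce what a WebSocket client needs for the upgrade request. The helper names are hypothetical; the URL template, header, and subprotocol formats are the ones given on this page. How you pass them to a client depends on the library you use, so that part is left out.

```python
from urllib.parse import quote

WS_BASE = "wss://api.saperly.com/v2/manual"


def manual_ws_url(connection_id: str) -> str:
    """Build the manual-mode socket URL for one connection."""
    return f"{WS_BASE}/{quote(connection_id, safe='')}/ws"


def header_auth(manual_secret: str) -> dict:
    """Option 1: bearer token in the Authorization header."""
    return {"Authorization": f"Bearer {manual_secret}"}


def subprotocol_auth(manual_secret: str) -> list:
    """Option 2: bearer token as a subprotocol, for clients that cannot set headers."""
    return [f"bearer.{manual_secret}"]
```

Most WebSocket clients accept either a dictionary of extra headers or a list of subprotocols, so you would normally pass one of the two results, not both.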
The handshake
The first frame the client sends:
{
  "type": "hello",
  "connectionId": "conn_123",
  "protocolVersion": 1,
  "client": "my-agent"
}
connectionId must match the path; protocolVersion is 1; client is an
optional free-form label for the event trail.
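Building the hello frame is simple enough to sketch in a few lines. The field names and the version number come from the handshake above; the function name is hypothetical, and omitting client when no label is given is a choice made here because the field is described as optional.

```python
import json
from typing import Optional

PROTOCOL_VERSION = 1  # the only version this page documents


def build_hello(connection_id: str, client: Optional[str] = None) -> str:
    """Serialize the first frame the client sends after the socket opens."""
    frame = {
        "type": "hello",
        "connectionId": connection_id,  # must match the id in the socket path
        "protocolVersion": PROTOCOL_VERSION,
    }
    if client is not None:
        frame["client"] = client  # free-form label for the event trail
    return json.dumps(frame)
```

Passing the same connection id to both the URL and build_hello is the simplest way to keep the two in sync.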
Events the server sends
Each request frame carries a requestId (echo it on your reply) and a
conversationId (the unique id for the call this turn belongs to).
| Event | Fields |
|---|---|
| inbound_call | requestId, conversationId, callControlId, from, to |
| turn | requestId, conversationId, userText |
| call_ended | requestId, conversationId, reason? |
userText is the transcript of the caller's turn. inbound_call and
call_ended expect a directive reply too — your opening line and a terminal
acknowledgement.
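One way to consume these frames is to validate the required fields up front, so that a malformed frame fails loudly instead of surfacing later inside your brain. The sketch below assumes the event name travels in the frame's type field (as it does in the inbound_call example on this page) and that anything else, such as an advisory error frame, can be skipped. The Event class and parse_event function are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Required fields per event, taken from the table above ("reason" is optional).
REQUIRED_FIELDS = {
    "inbound_call": ("requestId", "conversationId", "callControlId", "from", "to"),
    "turn": ("requestId", "conversationId", "userText"),
    "call_ended": ("requestId", "conversationId"),
}


@dataclass
class Event:
    type: str
    request_id: str
    conversation_id: str
    extra: dict = field(default_factory=dict)  # event-specific fields


def parse_event(raw: str) -> Optional[Event]:
    """Parse one server frame; return None for frames that are not events."""
    frame = json.loads(raw)
    kind = frame.get("type")
    required = REQUIRED_FIELDS.get(kind)
    if required is None:
        return None  # not one of the three events, e.g. an error frame
    missing = [name for name in required if name not in frame]
    if missing:
        raise ValueError(f"{kind} frame is missing fields: {missing}")
    extra = {
        key: value
        for key, value in frame.items()
        if key not in ("type", "requestId", "conversationId")
    }
    return Event(kind, frame["requestId"], frame["conversationId"], extra)
```

Because one socket multiplexes every live call on a connection, conversation_id is the natural key for any per-call state you keep.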
Directives the brain returns
Wrap exactly one directive in a directive frame, tagged with the requestId it
answers:
| Directive | Fields |
|---|---|
| speak | text, endCall? (boolean) |
| wait_for_user | timeoutMs? |
| hangup | reason? |
| transfer | to (E.164 or SIP URI) |
| send_dtmf | digits |
speak with endCall: true is the say-a-final-line-then-hang-up primitive:
the line plays, then the call ends — one round-trip, no separate hangup.
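The directive table translates naturally into a handful of builder functions plus one wrapper that tags a directive with its requestId. This is a sketch: the field names come from the table, the speak shape matches the wire example on this page, and using the same type-tagged shape for the other four directives is an assumption. All function names are hypothetical.

```python
import json
from typing import Optional


def speak(text: str, end_call: bool = False) -> dict:
    directive = {"type": "speak", "text": text}
    if end_call:
        directive["endCall"] = True  # say the line, then hang up
    return directive


def wait_for_user(timeout_ms: Optional[int] = None) -> dict:
    directive = {"type": "wait_for_user"}
    if timeout_ms is not None:
        directive["timeoutMs"] = timeout_ms
    return directive


def hangup(reason: Optional[str] = None) -> dict:
    directive = {"type": "hangup"}
    if reason is not None:
        directive["reason"] = reason
    return directive


def transfer(to: str) -> dict:
    return {"type": "transfer", "to": to}  # E.164 number or SIP URI


def send_dtmf(digits: str) -> dict:
    return {"type": "send_dtmf", "digits": digits}


def directive_frame(request_id: str, directive: dict) -> str:
    """Wrap exactly one directive, tagged with the requestId it answers."""
    return json.dumps(
        {"type": "directive", "requestId": request_id, "directive": directive}
    )
```

For example, directive_frame("req_e5f6", speak("All set, goodbye!", end_call=True)) produces the single-frame close-out described above.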
A round-trip on the wire
An inbound call arrives:
{
  "type": "inbound_call",
  "requestId": "req_a1b2",
  "conversationId": "v3:call_ctrl_9f...",
  "callControlId": "v3:call_ctrl_9f...",
  "from": "+15555550123",
  "to": "+15555550199"
}
Your brain greets the caller:
{
  "type": "directive",
  "requestId": "req_a1b2",
  "directive": { "type": "speak", "text": "Hi, this is Acme. How can I help?" }
}
Later, after the caller's last turn, you close the call out in one frame:
{
  "type": "directive",
  "requestId": "req_e5f6",
  "directive": { "type": "speak", "text": "All set — goodbye!", "endCall": true }
}
Timing
Reply within about 18 seconds. If your brain is silent, Saperly falls back to a short hold line at roughly 20 seconds so the caller is never left in dead air. Keep turns snappy.
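One way to stay inside that window is to give your brain its own, shorter deadline and answer with a holding line if it overruns. The sketch below is an illustration under assumptions: the 15-second budget is an arbitrary margin below the roughly 18-second limit, the holding text is made up, and replying with a speak directive that asks the caller to repeat is a design choice made here, not documented behavior.

```python
import asyncio

REPLY_BUDGET_S = 15.0  # assumption: leaves headroom under the ~18 s reply window

# Hypothetical holding line used when the brain overruns its budget.
HOLDING_DIRECTIVE = {
    "type": "speak",
    "text": "Sorry, I need a moment. Could you say that again?",
}


async def reply_within_budget(brain, event: dict, budget_s: float = REPLY_BUDGET_S) -> dict:
    """Run an async brain with a deadline and always return a directive frame."""
    try:
        directive = await asyncio.wait_for(brain(event), timeout=budget_s)
    except asyncio.TimeoutError:
        # The brain took too long; answer with the holding line instead.
        directive = dict(HOLDING_DIRECTIVE)
    return {"type": "directive", "requestId": event["requestId"], "directive": directive}
```

Here brain is any coroutine function that takes an event dict and returns a directive dict. asyncio.wait_for cancels the slow call when the deadline passes, so a stalled LLM request does not linger behind the reply.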
Text in, directives out
Manual mode only ever moves text in and directives out — there is no audio socket to your server. Speech-to-text and text-to-speech stay in-network, so no call audio ever reaches your machine. See Core concepts.
Bring your own agent: openclaw
You don't have to implement the protocol from scratch. openclaw is a worked
example of a BYO agent on a Saperly manual line: the voice channels
connector loads an openclaw extension that holds the manual-mode websocket and
makes your running openclaw agent the brain of the line — in its own context,
with its tools and memory. Each caller turn is injected as input; the agent's
reply (or its saperly_voice_reply tool) becomes a directive; the network speaks
it back.
To point an openclaw agent at a manual line:
1. Switch the line to manual (above) and copy its manualSecret.
2. Load the connector in your openclaw Gateway, configured with the connection id and secret (or an sk_ key for auto-discovery). It connects out over the websocket, so your agent needs no public URL.
3. Call the number. The turn reaches your agent in its own session; its reply is spoken back.
For the full run — config, env vars, the saperly_voice_reply tool, and
networking — see Voice channels, which walks
through the openclaw and Claude Code connectors step by step.
Not the openclaw voice-call plugin
openclaw also ships a separate voice-call plugin that streams raw call audio
to a realtime provider and holds a media socket for the call — exactly the model
Saperly avoids. The Saperly connector binds the in-network manual-mode
websocket instead: signaling, speech-to-text, and text-to-speech stay
in-network, and only text turns reach your agent.
Next steps
1. Create a manual connection and copy its manualSecret. See Connections.
2. Hold the socket. Use the ready-made Voice channels connectors (Claude Code / openclaw), or implement the protocol above directly.
- Voice channels — call a phone number and talk to your own Claude Code or openclaw agent over this websocket.
- Connections — hosted vs. manual handlers.
- Core concepts — the Saperly model and how a call flows.
- API reference — every endpoint, with a try-it playground.