Manual mode

Bring your own LLM as the brain of a phone call — Saperly sends it text turns and executes the directives it returns, while speech-to-text, text-to-speech, and the audio all stay in-network.

Manual mode makes your LLM the brain of a phone call. The caller speaks; the network transcribes the turn; Saperly forwards you the text; you reply with a directive (speak, wait_for_user, hangup, transfer, send_dtmf); the network speaks it back. Your code never touches audio.

This is the alternative to a hosted connection, which keeps the LLM in-network too — reach for manual mode when you want to own the brain. Either way, speech-to-text and text-to-speech run in-network: you see only text turns and you emit only directives.

How a turn flows

  ☎ caller                                                    ☎ caller
     │  speech                                            speech  ▲
     ▼                                                            │
  speech-to-text ──text──▶   Saperly   ──────▶ text-to-speech
   (in-network)                 │   text turn          ▲  (in-network)
                                ▼                       │ directive
                          ┌───────────────┐            │
                          │   YOUR brain  │────────────┘
                          │  (your LLM)   │   speak / wait / hangup / …
                          └───────────────┘

  audio never reaches your server  ·  speech-to-text + text-to-speech stay in-network

Saperly hands your brain a sequence of events (inbound_call, then a turn per caller utterance, then call_ended) and applies the directive you return for each. Every frame carries a requestId you must echo so the right directive lands on the right call.

Switch a line between hosted and manual

Manual mode is a mode of a connection — the same handler you bind to a number. Switching a line between hosted (an in-network voice assistant is the brain) and manual (your agent is the brain) is a first-class operation: you flip mode and Saperly reconciles everything for you.

Changing a connection's mode automatically:

mints the connection's manualSecret (once, on first switch to manual);
reconfigures the in-network voice assistant for the new mode;
re-routes every number bound to the connection so inbound calls reach the new brain.

Switching back to hosted re-points the same numbers at the hosted assistant — the manualSecret is retained, so flipping to manual again later reuses it.

There are three supported ways to switch, all driving the same operation:

In the dashboard, open Connections → your connection, set Mode to hosted or manual, and save. For a manual connection, the page also reveals the manualSecret to copy into your agent's connector.

Over the API, PATCH the connection with the new mode. The response carries the updated connection, including the manualSecret once it's a manual line.

curl -X PATCH https://api.saperly.com/connections/$CONNECTION_ID \
  -H "Authorization: Bearer $SAPERLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "manual" }'

Switch back with { "mode": "hosted" }. See Connections for the full connection shape and endpoints.
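
The same PATCH can be built from Python's standard library. This is a minimal sketch that mirrors the curl call above; the function name `build_mode_switch_request` is hypothetical, and the function only constructs the request so you can inspect it before sending.

```python
import json
import urllib.request

# Host and path are taken from the curl example above.
API_BASE = "https://api.saperly.com"


def build_mode_switch_request(connection_id: str, api_key: str, mode: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH that flips a connection's mode."""
    if mode not in ("hosted", "manual"):
        raise ValueError(f"mode must be 'hosted' or 'manual', got {mode!r}")
    body = json.dumps({"mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/connections/{connection_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` and parse the JSON response body; the exact response shape is described in Connections.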

Over MCP, the same operation is exposed as a tool on Saperly's MCP server: point any MCP-capable agent at the endpoint with a scoped sk_ key (it needs connections:write) and have it switch the line's mode as a tool call.

No manual setup steps

Switching the mode is the only way to wire (or rewire) a line. You never create the assistant, register the secret, or repoint numbers by hand: the mode switch does all of it idempotently, so re-running it never duplicates anything.

Two ways to be the brain

There are two transports. Both speak the same event/directive vocabulary — pick by whether your brain has a public URL.

With the webhook transport, Saperly POSTs each event to a URL you host. Set manualWebhookUrl on the connection; every event (inbound_call, a caller turn, call_ended) arrives as a signed POST, and you reply with a directive in the response body.

POST  <your manualWebhookUrl>

The request is signed with the connection's manualSecret so you can verify it. Use this transport when your brain runs somewhere with a public URL — a serverless function, your own backend. There is no socket to hold open.
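
This page does not spell out the signing scheme, so the following is only a sketch of what verification could look like. It ASSUMES an HMAC-SHA256 over the raw request body, hex-encoded and keyed with the manualSecret; the algorithm, the encoding, and the place the signature is delivered are all assumptions to check against the API reference. The function name `verify_signature` is hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, received_signature_hex: str, manual_secret: str) -> bool:
    """Return True if the signature matches the body.

    ASSUMPTION: HMAC-SHA256 over the raw body, hex-encoded, keyed with the
    connection's manualSecret. Confirm the real scheme before relying on this.
    """
    expected = hmac.new(
        manual_secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(
        expected.encode("utf-8"), received_signature_hex.encode("utf-8")
    )
```

Whatever the real scheme is, verify against the raw bytes of the request before parsing the JSON, and reject the request if the check fails.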

With the websocket transport, your agent, which has no public URL, connects out and becomes the brain. This is the agent-as-channel model: the session you already have open becomes phone-reachable.

wss://api.saperly.com/v2/manual/{connectionId}/ws
Authorization: Bearer <manualSecret>

One socket multiplexes every live call on that connection. Use this when the brain runs somewhere with no inbound URL — a Claude Code session, an openclaw agent, a laptop. The ready-made connectors that implement this socket are in Voice channels.

Where to get the manualSecret

manualSecret is mc_ followed by hex, minted once per connection. Find it in the dashboard under Connections → your manual connection (or copy it when you create a manual connection). See Connections.

The WebSocket protocol

The websocket is the full-fidelity transport, so it is documented in detail here. After the socket opens, the client sends one hello handshake; thereafter the server pushes request frames (events) and the client returns directive frames (replies).

agent → server : hello          (once, on connect)
server → agent : inbound_call | turn | call_ended   (each with requestId)
agent → server : directive      (echoes requestId; carries one directive)
server → agent : error          (advisory — a frame the server rejected)

Auth

Present the connection's manualSecret as a bearer token on the upgrade request, in one of two ways:

Authorization: Bearer <manualSecret>, or
Sec-WebSocket-Protocol: bearer.<manualSecret> — the browser-compatible escape hatch, since a native WebSocket cannot set an Authorization header (the server echoes the subprotocol on accept).
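
The two auth options can be sketched as small helpers that produce the socket URL and the matching upgrade header. The helper names are hypothetical; the URL pattern and header values come from this page. How you hand these to your websocket client depends on the library you use.

```python
def websocket_url(connection_id: str) -> str:
    """Manual-mode socket URL for a connection (pattern from this page)."""
    return f"wss://api.saperly.com/v2/manual/{connection_id}/ws"


def upgrade_auth_header(manual_secret: str, browser_style: bool = False) -> dict:
    """Return the single header that authenticates the upgrade request.

    browser_style=False: a plain Authorization bearer header.
    browser_style=True: the Sec-WebSocket-Protocol form, for clients that
    cannot set an Authorization header on a websocket.
    """
    if browser_style:
        return {"Sec-WebSocket-Protocol": f"bearer.{manual_secret}"}
    return {"Authorization": f"Bearer {manual_secret}"}
```

A non-browser agent would normally use the default form and keep the manualSecret out of logs, since it authenticates every call on the connection.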

The handshake

The first frame the client sends:

{
  "type": "hello",
  "connectionId": "conn_123",
  "protocolVersion": 1,
  "client": "my-agent"
}

connectionId must match the path; protocolVersion is 1; client is an optional free-form label for the event trail.
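
As a small illustration, the hello frame can be produced by a helper like the one below. The name `build_hello` is hypothetical; the field names and the version value are the ones described above, and the optional client label is left out when not given.

```python
import json
from typing import Optional

PROTOCOL_VERSION = 1  # the only version this page describes


def build_hello(connection_id: str, client: Optional[str] = None) -> str:
    """Serialize the one-time hello frame sent right after the socket opens."""
    frame = {
        "type": "hello",
        "connectionId": connection_id,  # must match the id in the socket path
        "protocolVersion": PROTOCOL_VERSION,
    }
    if client is not None:
        frame["client"] = client  # optional free-form label
    return json.dumps(frame)
```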

Events the server sends

Each request frame carries a requestId (echo it on your reply) and a conversationId (the unique id for the call this turn belongs to).

Event	Fields
`inbound_call`	`requestId`, `conversationId`, `callControlId`, `from`, `to`
`turn`	`requestId`, `conversationId`, `userText`
`call_ended`	`requestId`, `conversationId`, `reason?`

userText is the transcript of the caller's turn. inbound_call and call_ended expect a directive reply too — your opening line and a terminal acknowledgement.
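
One way to consume these frames is to check each one against the table before handing it to your brain. The sketch below is an assumption about how you might structure that check, not part of the protocol; `parse_request_frame` is a hypothetical name. It covers only the three request frames, so advisory error frames from the server would need their own handling.

```python
import json

# Required fields per event, from the table above (reason is optional).
REQUIRED_FIELDS = {
    "inbound_call": ("requestId", "conversationId", "callControlId", "from", "to"),
    "turn": ("requestId", "conversationId", "userText"),
    "call_ended": ("requestId", "conversationId"),
}


def parse_request_frame(raw: str) -> dict:
    """Decode a request frame and confirm its required fields are present."""
    frame = json.loads(raw)
    if not isinstance(frame, dict):
        raise ValueError("frame must be a JSON object")
    kind = frame.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"not a request frame: {kind!r}")
    missing = [name for name in REQUIRED_FIELDS[kind] if name not in frame]
    if missing:
        raise ValueError(f"{kind} frame is missing {missing}")
    return frame
```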

Directives the brain returns

Wrap exactly one directive in a directive frame, tagged with the requestId it answers:

Directive	Fields
`speak`	`text`, `endCall?` (boolean)
`wait_for_user`	`timeoutMs?`
`hangup`	`reason?`
`transfer`	`to` (E.164 or SIP URI)
`send_dtmf`	`digits`

speak with endCall: true is the say-a-final-line-then-hang-up primitive: the line plays, then the call ends — one round-trip, no separate hangup.
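
The directive table maps naturally onto a handful of builder functions. This is a sketch with hypothetical helper names; the field names are the ones in the table, and optional fields are omitted when not supplied.

```python
import json
from typing import Optional


def speak(text: str, end_call: bool = False) -> dict:
    directive = {"type": "speak", "text": text}
    if end_call:
        directive["endCall"] = True  # play the line, then end the call
    return directive


def wait_for_user(timeout_ms: Optional[int] = None) -> dict:
    directive = {"type": "wait_for_user"}
    if timeout_ms is not None:
        directive["timeoutMs"] = timeout_ms
    return directive


def hangup(reason: Optional[str] = None) -> dict:
    directive = {"type": "hangup"}
    if reason is not None:
        directive["reason"] = reason
    return directive


def transfer(to: str) -> dict:
    return {"type": "transfer", "to": to}  # E.164 number or SIP URI


def send_dtmf(digits: str) -> dict:
    return {"type": "send_dtmf", "digits": digits}


def directive_frame(request_id: str, directive: dict) -> str:
    """Wrap exactly one directive, tagged with the requestId it answers."""
    return json.dumps(
        {"type": "directive", "requestId": request_id, "directive": directive}
    )
```

For example, `directive_frame("req_e5f6", speak("All set, goodbye!", end_call=True))` produces a single frame that says the final line and ends the call.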

A round-trip on the wire

An inbound call arrives:

{
  "type": "inbound_call",
  "requestId": "req_a1b2",
  "conversationId": "v3:call_ctrl_9f...",
  "callControlId": "v3:call_ctrl_9f...",
  "from": "+15555550123",
  "to": "+15555550199"
}

Your brain greets the caller:

{
  "type": "directive",
  "requestId": "req_a1b2",
  "directive": { "type": "speak", "text": "Hi, this is Acme. How can I help?" }
}

Later, after the caller's last turn, you close the call out in one frame:

{
  "type": "directive",
  "requestId": "req_e5f6",
  "directive": { "type": "speak", "text": "All set — goodbye!", "endCall": true }
}
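
The round-trip above can be sketched as one function that turns any request frame into its reply, with no networking involved. `reply_to` and the `think` callback (your LLM call) are hypothetical names. Answering call_ended with a hangup directive is an ASSUMPTION: this page says a terminal acknowledgement is expected but does not say which directive to use.

```python
from typing import Callable

GREETING = "Hi, this is Acme. How can I help?"


def reply_to(frame: dict, think: Callable[[str], str]) -> dict:
    """Map one request frame to the directive frame that answers it."""
    kind = frame["type"]
    if kind == "inbound_call":
        directive = {"type": "speak", "text": GREETING}
    elif kind == "turn":
        # think() stands in for your LLM: caller text in, reply text out.
        directive = {"type": "speak", "text": think(frame["userText"])}
    elif kind == "call_ended":
        # ASSUMPTION: hangup serves as the terminal acknowledgement.
        directive = {"type": "hangup"}
    else:
        raise ValueError(f"unexpected frame type: {kind!r}")
    # Echo the requestId so the directive lands on the right call.
    return {"type": "directive", "requestId": frame["requestId"], "directive": directive}
```

Because one socket multiplexes every live call on the connection, keeping this function free of per-call global state and keying any memory by conversationId is the safer design.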

Timing

Reply within about 18 seconds. If your brain is silent, Saperly falls back to a short hold line at roughly 20 seconds so the caller is never left in dead air. Keep turns snappy.
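
One way to stay inside that window is to give your brain its own, shorter deadline and answer with a filler line if it is missed. The sketch below uses `asyncio.wait_for`; the 15-second budget, the filler text, and the function name are all assumptions chosen for illustration, not values from Saperly.

```python
import asyncio
from typing import Awaitable, Callable

REPLY_BUDGET_S = 15.0  # assumption: leaves margin under the ~18 s window
FILLER_TEXT = "One moment, please."  # assumption: your own hold line


async def speak_within_budget(
    think: Callable[[str], Awaitable[str]],
    user_text: str,
    budget_s: float = REPLY_BUDGET_S,
) -> dict:
    """Return a speak directive, falling back to a filler line on timeout."""
    try:
        text = await asyncio.wait_for(think(user_text), timeout=budget_s)
    except asyncio.TimeoutError:
        # The slow call is cancelled; its late answer is discarded.
        text = FILLER_TEXT
    return {"type": "speak", "text": text}
```

The trade-off is that a timed-out answer is thrown away, so a brain that is routinely slow is better served by a faster model or a shorter prompt than by a larger budget.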

Text in, directives out

Manual mode only ever moves text in and directives out — there is no audio socket to your server. Speech-to-text and text-to-speech stay in-network, so no call audio ever reaches your machine. See Core concepts.

Bring your own agent: openclaw

You don't have to implement the protocol from scratch. openclaw is a worked example of a BYO agent on a Saperly manual line: the voice channels connector loads an openclaw extension that holds the manual-mode websocket and makes your running openclaw agent the brain of the line — in its own context, with its tools and memory. Each caller turn is injected as input; the agent's reply (or its saperly_voice_reply tool) becomes a directive; the network speaks it back.

To point an openclaw agent at a manual line:

Switch the line to manual (above) and copy its manualSecret.

Load the connector in your openclaw Gateway, configured with the connection id and secret (or an sk_ key for auto-discovery). It connects out over the websocket — your agent needs no public URL.

Call the number. The turn reaches your agent in its own session; its reply is spoken back.

For the full run — config, env vars, the saperly_voice_reply tool, and networking — see Voice channels, which walks through the openclaw and Claude Code connectors step by step.

Not the openclaw voice-call plugin

openclaw also ships a separate voice-call plugin that streams raw call audio to a realtime provider and holds a media socket for the call — exactly the model Saperly avoids. The Saperly connector binds the in-network manual-mode websocket instead: signaling, speech-to-text, and text-to-speech stay in-network, and only text turns reach your agent.

Next steps

Create a manual connection and copy its manualSecret — see Connections.

Point a number at it so calls route to your brain — see Numbers and Voice.

Hold the socket. Use the ready-made Voice channels connectors (Claude Code / openclaw), or implement the protocol above directly.

Voice channels — call a phone number and talk to your own Claude Code or openclaw agent over this websocket.
Connections — hosted vs. manual handlers.
Core concepts — the Saperly model and how a call flows.
API reference — every endpoint, with a try-it playground.
