How we run OpenCode in the cloud
Codecloud runs OpenCode agents in the cloud. Each run often clones a private GitHub repo and needs full filesystem access. That means each run needs its own isolated environment (filesystem, processes, network, credentials), and that environment needs to be destroyed when the run ends. The best solution we found for this is E2B, which provides Firecracker-based VMs with a convenient, programmatic API.
Why E2B
E2B gives you an ephemeral sandbox instance that spins up almost instantly and lets you run commands or read / write files via their SDK. The sandboxes can stay alive for up to 24 hours and you can configure them with the resources you need (e.g. vCPUs and RAM). This made it the perfect choice for running isolated OpenCode instances with robust security guarantees.
E2B also recently started offering sandbox images with OpenCode pre-installed, which is a convenient way to get up and running quickly.
Can we use containers?
Containers are great for running trusted code in a relatively isolated environment, but they share the host kernel and don't provide hardware-level isolation. In a multi-tenant environment where we're cloning private repos and running agents, this isolation is key. If you're deploying agents just for your organization, running them inside a container could make sense. But for this use case, sandboxes are better suited.
The sandbox lifecycle primitives (create, connect, kill), the command and file APIs, and the private networking model meant we could build the control plane without managing VMs ourselves.
Anatomy of a run
When a Codecloud run starts, we:
- Create a fresh E2B sandbox
- Mint a temporary, scoped GitHub token for the user
- Clone the repo & check out the target branch
- Start `opencode serve` inside the sandbox
- Start a relay process for event streaming from the sandbox (explained below)
- Stream agent output to various destinations (our Convex database, webhooks, Linear)
- Listen for a done signal and destroy the sandbox
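The ordering of these steps, and the guarantee that the sandbox is always destroyed, can be sketched as below. This is a minimal illustration, not our production code: `RunSteps`, `RunRequest`, and `executeRun` are hypothetical names, each method stands in for one step of the list, and the sketch collapses everything into a single function for readability (output streaming is left to the relay).

```typescript
// Hypothetical dependencies: each method stands in for one step of the list
// above. None of these names come from the E2B or OpenCode SDKs.
interface RunSteps {
  createSandbox(): Promise<string>; // resolves to a sandbox id
  mintGithubToken(userId: string): Promise<string>;
  cloneRepo(sandboxId: string, repo: string, branch: string, token: string): Promise<void>;
  startOpencode(sandboxId: string): Promise<void>;
  startRelay(sandboxId: string): Promise<void>;
  waitForDone(sandboxId: string): Promise<void>;
  destroySandbox(sandboxId: string): Promise<void>;
}

interface RunRequest {
  userId: string;
  repo: string;
  branch: string;
}

async function executeRun(steps: RunSteps, req: RunRequest): Promise<void> {
  const sandboxId = await steps.createSandbox();
  try {
    const token = await steps.mintGithubToken(req.userId);
    await steps.cloneRepo(sandboxId, req.repo, req.branch, token);
    await steps.startOpencode(sandboxId);
    await steps.startRelay(sandboxId);
    await steps.waitForDone(sandboxId);
  } finally {
    // Runs on success and on failure, so a failed step never leaks a sandbox.
    await steps.destroySandbox(sandboxId);
  }
}
```

The `try`/`finally` is the important part: once a sandbox exists, every exit path goes through `destroySandbox`.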
Private networking
Sandboxes run in secure mode: processes inside the sandbox can reach the internet (GitHub, provider APIs, package registries), but all connections to the sandbox must go through E2B's proxy and require a traffic access token:
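```ts
import { createOpencodeClient } from "@opencode-ai/sdk";

const client = createOpencodeClient({
  // Host for the sandbox port, reached through E2B's proxy.
  baseUrl: `https://${sandbox.getHost(port)}`,
  // Attach the traffic access token to every request sent to the sandbox.
  fetch: (request) => {
    const headers = new Headers(request.headers);
    headers.set("e2b-traffic-access-token", sandbox.trafficAccessToken);
    return fetch(new Request(request, { headers }));
  },
});
```
Streaming events past the 10-minute Convex limit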
Our backend runs on Convex, and Convex actions have a 10-minute limit. So if we connect to the sandbox from a Convex action, we have 10 minutes to complete the agent run. For complex agent workloads, runs can take much longer than that! E2B itself allows a sandbox to stay alive for up to 24 hours.
Our first approach was to keep a long-lived Convex action connected to the sandbox and stream OpenCode events in real time. Just before the 10-minute timeout, we'd schedule a new Convex action to continue streaming. This works, but there is a non-zero chance of missing events while the new Convex action boots up, and some of those events are important (for example, `session.idle`, which signals run completion). An even worse case is when the Convex action itself crashes or gets cancelled (e.g. by a deployment) and a new one doesn't get scheduled. Then we can miss the rest of a run entirely, including its completion.
The solution was to reverse the flow of events: instead of Convex pulling events from the sandbox, a relay script inside the sandbox pushes events out. The relay subscribes to OpenCode's event stream locally and sends them to a webhook on our backend. Our backend only handles short webhook requests instead of a long-running streaming action.
Each webhook request is secured with a token only valid for the duration of the run.
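A minimal sketch of the push side, assuming an outbox that only drops events once the backend has acknowledged them. All names here (`RelayEvent`, `PostBatch`, `createRelay`) are hypothetical, and the HTTP call is injected rather than shown, since the real webhook contract and OpenCode event payloads are not part of this post.

```typescript
// Hypothetical event shape; `seq` lets the backend de-duplicate retried events.
interface RelayEvent {
  seq: number;
  type: string;
  payload: unknown;
}

// Injected transport. Resolves to true when the backend acknowledged the batch.
type PostBatch = (url: string, runToken: string, batch: RelayEvent[]) => Promise<boolean>;

function createRelay(url: string, runToken: string, post: PostBatch) {
  let outbox: RelayEvent[] = [];
  let nextSeq = 0;
  let flushing = false;

  return {
    // Called for every event read from the local OpenCode event stream.
    enqueue(type: string, payload: unknown): void {
      outbox.push({ seq: nextSeq++, type, payload });
    },

    // Sends everything still pending and returns how many events were
    // delivered. Events leave the outbox only after an acknowledgement, so a
    // failed request is retried on the next flush (at-least-once delivery).
    async flush(): Promise<number> {
      if (flushing || outbox.length === 0) return 0;
      flushing = true;
      const batch = outbox.slice();
      try {
        const acked = await post(url, runToken, batch);
        if (!acked) return 0;
        // Events enqueued during the request sit after the batch; keep them.
        outbox = outbox.slice(batch.length);
        return batch.length;
      } catch {
        return 0; // network error: keep the events for the next flush
      } finally {
        flushing = false;
      }
    },

    pending(): number {
      return outbox.length;
    },
  };
}
```

In a real relay, `post` would be a short HTTP request to the webhook carrying the per-run token (for example in an `Authorization` header, which is an assumption here), and `flush` would be called on a timer or after each event.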
Reliability lessons
Most of our debugging time went into edge cases, not architecture. Here are a few that bit us:
E2B's `sandbox.commands.run()` doesn't just wait for the shell to finish; it waits for every descendant process to exit. So if you start a background server with `nohup server &`, the `.run()` call blocks until that server dies too. To get around this, we pass the `background: true` option from the E2B SDK, which runs the sandbox command in the background.
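A rough sketch of starting a server this way and then waiting for it to become ready. The `Commands` and `CommandHandle` interfaces are minimal stand-ins for the part of `sandbox.commands` used here, not the SDK's real types, and `startServer`, `isReady`, and `wait` are hypothetical helpers introduced for illustration.

```typescript
// Stand-in for the SDK surface used below. Only the `background: true` option
// comes from the text above; the handle shape is an assumption.
interface CommandHandle {
  pid: number;
}
interface Commands {
  run(cmd: string, opts: { background: true }): Promise<CommandHandle>;
}

// Hypothetical readiness probe (e.g. an HTTP request to the server's port)
// and an injected delay function.
type IsReady = () => Promise<boolean>;
type Wait = (ms: number) => Promise<void>;

async function startServer(
  commands: Commands,
  cmd: string,
  isReady: IsReady,
  wait: Wait,
  maxAttempts = 20,
  delayMs = 250,
): Promise<CommandHandle> {
  // Resolves once the process is spawned instead of blocking until it
  // (and every descendant) exits.
  const handle = await commands.run(cmd, { background: true });

  // A background start says nothing about readiness, so poll for it.
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await isReady()) return handle;
    await wait(delayMs);
  }
  throw new Error(`"${cmd}" did not become ready after ${maxAttempts} attempts`);
}
```

Because the call returns right away, a separate readiness check is what tells you the server can actually accept connections.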
E2B sandboxes can live for a long time, so it's essential to know that the OpenCode process inside each one is still running and healthy. If it isn't, we want to kill the run as soon as possible and notify the user. To achieve this we built a watchdog that runs every 30 seconds and kills runs after a period of inactivity. The watchdog re-schedules itself to run again in 30 seconds, so we don't need a long-lived Convex action.
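The decision each watchdog tick makes can be sketched as a pure function. This is an illustration under assumptions: `RunState`, `WatchdogDecision`, and `watchdogTick` are hypothetical names, and the 20-minute inactivity limit is a placeholder, since the actual threshold isn't stated here.

```typescript
const TICK_MS = 30_000; // the watchdog runs every 30 seconds
const INACTIVITY_LIMIT_MS = 20 * 60_000; // assumed threshold, for illustration only

interface RunState {
  status: "running" | "done" | "killed";
  lastActivityAt: number; // epoch ms of the last event or heartbeat
}

type WatchdogDecision =
  | { action: "stop" } // run already finished: don't reschedule
  | { action: "kill"; idleMs: number }
  | { action: "reschedule"; delayMs: number };

function watchdogTick(
  run: RunState,
  now: number,
  limitMs: number = INACTIVITY_LIMIT_MS,
): WatchdogDecision {
  if (run.status !== "running") return { action: "stop" };

  const idleMs = now - run.lastActivityAt;
  if (idleMs > limitMs) return { action: "kill", idleMs };

  // Still active: check again in 30 seconds from a fresh, short invocation.
  return { action: "reschedule", delayMs: TICK_MS };
}
```

Each invocation is short: it reads the run state, decides, and either kills the run or schedules the next tick, so no single action has to outlive the 10-minute limit.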
For most runs, we can rely on LLM output as an activity indicator. But we saw that more complex problems can take 10-15 minutes of reasoning before the LLM produces any output. So how do we know the sandbox is still running and hasn't stalled? We frequently had issues where the watchdog would kill a run that was still making progress. Our workaround is a process monitor inside the sandbox that watches OpenCode's log output and checks that the process is still running and healthy. It sends these signals as heartbeats to our Convex webhook, so the watchdog knows the run is alive.
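One way to combine the two signals is sketched below. The rule used here (send a heartbeat only when the process is alive and the log has grown since the last check) is an assumption made for illustration, and `Probe`, `Heartbeat`, and `checkOnce` are hypothetical names.

```typescript
// What the monitor observes on each check.
interface Probe {
  processAlive: boolean; // is the OpenCode process still running?
  logBytes: number; // current size of the OpenCode log
}

// What gets sent to the webhook when the run looks healthy.
interface Heartbeat {
  sentAt: number; // epoch ms
  logBytes: number;
}

function checkOnce(
  lastLogBytes: number,
  probe: Probe,
  now: number,
): { lastLogBytes: number; heartbeat: Heartbeat | null } {
  const logGrew = probe.logBytes > lastLogBytes;

  // Assumed rule: only a live process that is still writing logs counts as
  // activity. Otherwise no heartbeat is sent and the watchdog's inactivity
  // timer keeps running.
  const heartbeat =
    probe.processAlive && logGrew ? { sentAt: now, logBytes: probe.logBytes } : null;

  return { lastLogBytes: Math.max(lastLogBytes, probe.logBytes), heartbeat };
}
```

On the backend, receiving a heartbeat would simply refresh the run's last-activity timestamp, which is what the watchdog compares against.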