Cleo has a local API now

My home agent used to live in a chat window. Now it answers to a URL - and everything behind that URL runs on a Mac mini in my house, not someone else's cloud.

For a while, Cleo - the small AI agent that lives on a Mac mini in my house - could be reached exactly one way: a chat window. To make it do something, I opened a chat and typed.

Now Cleo has an API. A real HTTPS address I can call from my phone, a script, or another app, from anywhere. The part I like most: everything behind that address runs on the Mac mini itself. No cloud model, no per-token bill, nothing leaving the house.

What Cleo can do through it

The API wraps three local models, each with one job:

Talk - a chat model, Gemma in its small "E4B" size, run through Ollama. This is the part that reads and writes.
Listen - Parakeet, a speech-to-text model. Send it a voice note, get back a transcript. (I catch myself saying "text-to-speech"; the arrow points the other way.)
Remember - EmbeddingGemma, an embedding model. It turns a piece of text into a list of numbers that captures its meaning rather than its exact words. That is the quiet engine behind search and memory: I can ask Cleo "what did I say about the robot last week," and embeddings are how it finds the right note even if I never used the word "robot." Every message, transcript, and page Cleo keeps gets one of these fingerprints, and finding something is just looking for the closest fingerprints to the question.
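To make "looking for the closest fingerprints" concrete, here is a minimal sketch of that lookup. It assumes cosine similarity as the closeness measure (a common choice, not necessarily what Cleo uses), and the three-number vectors and note texts are made up for illustration; a real embedding model emits hundreds of numbers per text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest(query_vec, notes, k=1):
    """Return the k note texts whose vectors are most similar to the query."""
    ranked = sorted(notes, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    return [n["text"] for n in ranked[:k]]

# Toy fingerprints, invented for this example.
notes = [
    {"text": "Ordered servo motors for the rover arm", "vec": [0.9, 0.1, 0.0]},
    {"text": "Dentist appointment moved to Friday", "vec": [0.0, 0.2, 0.9]},
    {"text": "Sourdough starter needs feeding", "vec": [0.1, 0.9, 0.1]},
]

# Stand-in for the embedding of "what did I say about the robot last week".
query = [0.8, 0.2, 0.1]
print(closest(query, notes))
```

The note that wins never contains the word "robot"; it wins because its vector points in nearly the same direction as the question's.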

None of these are locked in. If you are copying the idea, the chat model could be Qwen2.5 or Mistral; speech could be Whisper or Distil-Whisper; embeddings could be nomic-embed-text or mxbai-embed-large. The shape stays the same.

One door, not three

Three models is awkward if each needs its own address. So I put one small service in front of all of them - a single API that speaks the same shape as the OpenAI API and routes by path. From the outside it looks like one endpoint.
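Routing by path can be as small as a lookup table. The sketch below is an assumption about how such a front door might be laid out, not Cleo's actual code: only /v1/chat/completions appears in this post, the other two paths are the OpenAI API's usual ones for embeddings and transcription, and the backend labels are hypothetical.

```python
# Path -> backend label. The labels are invented for this example.
ROUTES = {
    "/v1/chat/completions": "chat",
    "/v1/embeddings": "embed",
    "/v1/audio/transcriptions": "transcribe",
}

def route(path):
    """Map a request path to the backend that should handle it."""
    # Ignore any query string and a trailing slash before the lookup.
    clean = path.split("?", 1)[0].rstrip("/")
    try:
        return ROUTES[clean]
    except KeyError:
        raise LookupError(f"no backend for {path!r}") from None

print(route("/v1/embeddings"))
```

Anything not in the table is refused rather than guessed at, which keeps the single door narrow.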

Then I opened exactly one door to it with Tailscale, which gives that local service an HTTPS address on the public internet without my touching the router or opening a firewall port. Every call needs a key. From my laptop it looks like this:

curl https://<cleo-host>.ts.net/v1/chat/completions \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" \
  -d '{ "model": "cleo", "messages": [
        { "role": "user", "content": "summarise this in one line" } ] }'

The real host and key stay private, for obvious reasons. The service restarts itself if the Mac reboots, so the door is open again before I wake up.
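For the "script" case, the same call might look like this in Python. This is a sketch under assumptions: build_chat_request is a name made up here, and the commented-out response handling assumes the service returns OpenAI-style JSON, which the post implies but does not show.

```python
import json
import urllib.request

def build_chat_request(host, key, prompt, model="cleo"):
    """Build (but do not send) a POST to the chat route with a bearer key."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it, with your own host and key filled in:
#   req = build_chat_request(host, key, "summarise this in one line")
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

The generous timeout is deliberate: a long prompt on a small machine can take the better part of a minute to come back.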

The one setting that made it work

This cost me an afternoon, so it is worth one paragraph. Gemma can "think" silently before it answers. With a small reply budget, it spent the entire budget thinking and handed back an empty string - the model looked broken when it was just talking to itself. One flag fixed it.
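The arithmetic behind that empty string is simple enough to sketch. The function below is a toy model, not the engine's real accounting: it assumes silent thinking is charged against the same reply budget as the visible answer, and that the flag simply switches the thinking off. The numbers are illustrative, not measurements.

```python
def visible_reply_tokens(budget, thinking_tokens, thinking_enabled=True):
    """Tokens left for the visible answer once silent reasoning is paid for."""
    spent = min(thinking_tokens, budget) if thinking_enabled else 0
    return budget - spent

# A 256-token budget against a model that wants 400 tokens of thinking.
print(visible_reply_tokens(256, 400))                          # 0
print(visible_reply_tokens(256, 400, thinking_enabled=False))  # 256
```

With thinking on, nothing is left to print, and from the outside that is indistinguishable from a broken model.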

How hard can my house push it

This is a 16 GB Mac mini, not a datacenter, and it helps to know its limits before leaning on them. Gemma can take up to 128,000 tokens of context in principle, but on a 16 GB machine it falls over long before that, so I cap each call at about 16,000 - a long article's worth. Past that point the model's working set no longer stays in the GPU's fast path, the machine starts paging it back and forth through its shared memory, and throughput collapses. I measured it on the actual box.
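One way to hold that line is to refuse oversized prompts at the door, before the model ever sees them. The sketch below is an assumption about how such a guard could work, not a description of Cleo: the four-characters-per-token estimate is a crude rule of thumb, and a real tokenizer would give exact counts.

```python
MAX_CONTEXT_TOKENS = 16_000  # the per-call cap described above
CHARS_PER_TOKEN = 4          # rough heuristic, an assumption

def estimate_tokens(text):
    """Approximate token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division

def check_prompt(text, cap=MAX_CONTEXT_TOKENS):
    """Return the estimated token count, or raise if it exceeds the cap."""
    n = estimate_tokens(text)
    if n > cap:
        raise ValueError(f"prompt is ~{n} tokens, over the {cap}-token cap")
    return n

print(check_prompt("summarise this in one line"))
```

Failing fast with a clear error is kinder than letting a request wander into the slow zone and time out.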

The honest capacity is one call at a time: quick on short prompts, about a request a minute when I hand it a full 16,000 tokens. For a house that is plenty. For a startup it would not be, and that is fine - this was never trying to be one.

I measured the whole thing, properly

Since the point is to lean on this, I wanted real numbers rather than a feeling. I ran each part on the Mac mini itself, sent unique inputs every time so nothing came back from a cache, read the timing straight out of the engine, and sampled memory while it worked. One stream at a time, because a single GPU does not do two things at once.

What	Model	Speed	Time	Memory
Chat, short prompt	gemma4:e4b	reads ~370 tok/s, writes ~30 tok/s	~6s	9.6 GB
Chat, full 16K read	gemma4:e4b	reads ~310 tok/s, writes ~28 tok/s	~50s (about one a minute)	9.6 GB
Embedding, one string	embeddinggemma	n/a	23 ms	4.9 GB
Embedding, batch of 64	embeddinggemma	287 vectors/s	0.22s	4.9 GB
Transcription	parakeet	4-5x faster than real time	12.5s for a 62s clip	0.8 GB

Four things the numbers taught me:

The clock is the prompt, not the answer. Per token, generating is actually the slower step - Gemma writes at a steady ~30 tokens a second whatever I ask. But answers are short and prompts are long, so what I wait on is the reading: 370 tokens a second on a short prompt, 310 once the context fills. In practice the latency of a question is mostly the length of the question.
Transcription is the quiet winner. Parakeet runs four to five times faster than real time and stays there whether the clip is seventeen seconds or seventeen minutes - a one-minute voice note comes back in about twelve seconds, on under a gigabyte of memory.
Memory is the real ceiling, not speed. The chat model alone sits at roughly 9.6 GB on a 16 GB machine. That is why only one model runs hot at a time, and why the context cap earns its keep: there is not much room to spare.
The first call after idle costs about 3.6 seconds while the model loads from disk into memory. Every call after that is warm.
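The first and last points above reduce to a back-of-envelope formula: load time, plus prompt tokens over read speed, plus answer tokens over write speed. The sketch below plugs in the table's full-context rates; the 100-token answer length is an assumption, and the model ignores network and scheduling overhead, so treat the result as a rough guide rather than a prediction.

```python
def estimate_seconds(prompt_tokens, answer_tokens, read_tps, write_tps, cold_start=0.0):
    """Split a call's estimated latency into load, read and write seconds."""
    read_s = prompt_tokens / read_tps
    write_s = answer_tokens / write_tps
    return {
        "load": cold_start,
        "read": read_s,
        "write": write_s,
        "total": cold_start + read_s + write_s,
    }

# Full 16K context at the measured rates, with an assumed 100-token answer.
warm = estimate_seconds(16_000, 100, read_tps=310, write_tps=28)
cold = estimate_seconds(16_000, 100, read_tps=310, write_tps=28, cold_start=3.6)

print({k: round(v, 1) for k, v in warm.items()})
print(f"reading share: {warm['read'] / warm['total']:.0%}")
print(f"cold total: {cold['total']:.1f}s")
```

With these inputs the read takes roughly 52 seconds and the write under 4, so almost all of the wait is the prompt, and the cold-start penalty is small next to either.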

None of this is datacenter throughput, and it does not need to be. It is one person's assistant, answering quickly enough, on a box that costs nothing per call.

Why I like this

Until now, Cleo only existed inside a chat window: useful to me, invisible to everything else. An API is the difference between a gadget and something other tools can build on. A shortcut on my phone, a script that files my voice notes, another app that asks Cleo a question mid-task - all of them can now tap the same local brain, and none of them cost anything per call or send my data anywhere.

Cleo is good enough at the everyday jobs - sorting, summarising, transcribing, remembering - to run them locally all day for the price of electricity. The rare, quality-sensitive job can still go to a bigger model when it earns it. The Mac mini does not have to be the best model in the world. It has to be mine, always on, and a phone call away.