
How Olsie hears

A common and very fair question: "Do we have to train it on our voices? And how would that even work — is there a GPU in the box?"

The short answer: you never train Olsie, and there is no GPU in the box because nothing in your house ever needs one. "Voice" is really three separate jobs, and each is solved in a different place.

1. Waking to its name — solved before the box ships

The wake word ("Hey Olsie", or whatever your household renames it to) is detected by a tiny model — about 36 kilobytes — running on each speaker itself. That model is deliberately speaker-independent: we train it in our build pipeline on thousands of synthetic voices saying the name, with room noise and echoes mixed in, on cloud GPUs we rent by the hour.

By the time it reaches your kitchen it already knows its name in anyone's voice — yours, the kids', a guest's, your gran's. Your voices were never part of training it, because they didn't need to be.

Renaming it works the same way: when your household votes on a new name, we train a fresh model for that name (again on synthetic voices, again in our cloud) and deliver it to your speakers as a signed update. The only thing we learn is the name you chose.
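For the curious, the "room noise and echoes mixed in" part of that training can be pictured with a few lines of Python. This is only an illustrative sketch, not our build pipeline: the function names (power, mix_noise, add_echo), the single-reflection echo, and the idea of choosing a signal-to-noise ratio in decibels are all assumptions made for the example.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_noise(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to mix in
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def add_echo(samples, delay, decay):
    """Add one delayed, attenuated copy of the signal (a single reflection)."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    out = list(samples)
    for i in range(delay, len(samples)):
        out[i] += decay * samples[i - delay]
    return out


# Hypothetical stand-ins for one synthetic utterance and one room-noise clip.
rng = random.Random(0)
utterance = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
room_noise = [rng.uniform(-1.0, 1.0) for _ in range(200)]

# One augmented training example: an echoey room with noise 10 dB below the voice.
augmented = mix_noise(add_echo(utterance, delay=20, decay=0.4), room_noise, snr_db=10.0)
```

Repeating this with many synthetic voices, noise clips, and echo settings is how a model can hear the name in rooms and voices it has never met, which is why none of your own audio is needed.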

2. Knowing who's talking — an introduction, not a training session

Olsie's per-person answers (kids get kid answers, guests get a polite minimal set) need it to recognise who spoke. This is the only part where anything derived from your actual voices is kept, and even here the distinction between enrolling a voice and training a model matters:

  • During setup, each person says three short phrases.
  • The Hub runs a pretrained voice-fingerprint model over them — a forward pass on the Hub's CPU, well under a second — and stores the averaged fingerprint.
  • From then on, every request is fingerprinted the same way and matched. No match = treated as a guest.

Nothing is "learned", no model weights change, and there is no training bar to watch. It's a handshake, not a school term. Those fingerprints are computed and stored on the Hub only — they are never synced, never uploaded, and a factory reset destroys them.
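The introduce-then-match steps above can be sketched in a few lines of Python. Treat this as a minimal illustration under stated assumptions rather than the Hub's actual code: the fingerprints are assumed to be plain lists of numbers produced by the pretrained model, the comparison is assumed to be cosine similarity, and the names (enroll, identify, MATCH_THRESHOLD) and the 0.75 threshold are hypothetical.

```python
import math

MATCH_THRESHOLD = 0.75  # hypothetical cut-off; a real value would be tuned


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def enroll(phrase_fingerprints):
    """Average the fingerprints of the setup phrases into one stored fingerprint."""
    unit = [normalize(f) for f in phrase_fingerprints]
    dim = len(unit[0])
    mean = [sum(f[i] for f in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def identify(request_fingerprint, enrolled, threshold=MATCH_THRESHOLD):
    """Return the best-matching enrolled name, or "guest" if nobody is close enough."""
    query = normalize(request_fingerprint)
    best_name, best_score = "guest", threshold
    for name, stored in enrolled.items():
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = sum(q * s for q, s in zip(query, stored))
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 3-number fingerprints; real ones would come from the pretrained model.
enrolled = {
    "alex": enroll([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.8, 0.2, 0.0]]),
    "sam": enroll([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),
}
print(identify([0.95, 0.05, 0.0], enrolled))  # alex
print(identify([0.0, 0.0, 1.0], enrolled))    # guest
```

Notice that enrolling only averages and stores numbers; nothing in the fingerprint model itself is adjusted, which is why there is no training step to wait for.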

3. Understanding the words — pretrained, in your house

Speech-to-text (and Olsie's speaking voice) run on the Hub, bound to localhost, using pretrained models. Microphone audio is processed in your house and discarded; when a request genuinely needs cloud-scale reasoning, only the text of that request is sent on.

Why there's no GPU in the unit

Because the only real training anywhere in the system — teaching a model to hear a name — happens once per name, in our cloud, on rented hardware, before deployment. Everything voice-related inside your home is lightweight inference a Raspberry Pi-class CPU handles comfortably. A GPU in the box would add cost, heat, and noise to do work that simply doesn't exist there.

                 Trained where                  Runs where     Your voice involved?
Wake word        Our cloud (synthetic voices)   Each speaker   Never
Who's talking    Pretrained                     Hub CPU        A 3-phrase introduction, kept on the Hub
Speech-to-text   Pretrained                     Hub            Processed in-house, discarded