Host architecture: events, settings, scheduler
Three decisions changed how I think about this host. None is dramatic on its own — they’re all “the thing you should do”. Together they have prevented three months of bugs I already know I won’t have.
The context
The host has three simultaneous jobs to coordinate:
- Read media (images, videos) and turn it into RGB frames at the strip’s resolution.
- Apply color transforms (gamma, kelvin, brightness) frame by frame.
- Encode and ship over serial at the rate the scheduler dictates.
On top of that, it has to be drivable by two different clients: a Unity process over stdin (legacy contract), and a web UI over WebSocket (new). Without duplicating logic across the two paths.
If all of that lives in a single thread with print() as
the reporting channel, you’re back in the same movie as
before. Three decisions change the mechanics.
The decision
1. An EventBus instead of print().
scripts/core/events.py exposes a thread-safe pub/sub
with 14 well-defined event types:
- PLAYBACK_*: started, paused, stopped, frame-sent, metrics.
- TRANSPORT_*: connected, disconnected, error, rx.
- RESOURCE_*: loaded, error.
- CACHE_*: evicted, cleared.
- SETTINGS_CHANGED.
Producers (orchestrator, scheduler, transport) emit
events. Consumers (cli/ for Unity, server/ for the
web UI) format them. Core doesn’t know who it’s
talking to — it only knows what happened. Adding a new
client doesn’t touch a single line of core.
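To make the decoupling concrete, here is a minimal sketch of a thread-safe pub/sub with two consumers formatting the same event differently. It is not the contents of scripts/core/events.py: the Event dataclass, the method names, the two member names in EventType, and both output formats are assumptions made for illustration.

```python
import json
import threading
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Callable


class EventType(Enum):
    # Hypothetical member names; only two of the 14 types are needed here.
    PLAYBACK_STARTED = auto()
    TRANSPORT_CONNECTED = auto()


@dataclass(frozen=True)
class Event:
    type: EventType
    payload: dict[str, Any] = field(default_factory=dict)


Subscriber = Callable[[Event], None]


class EventBus:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers: dict[EventType, list[Subscriber]] = defaultdict(list)

    def subscribe(self, event_type: EventType, callback: Subscriber) -> None:
        with self._lock:
            self._subscribers[event_type].append(callback)

    def publish(self, event_type: EventType, **payload: Any) -> None:
        # Copy the subscriber list under the lock, then call outside it so a
        # slow subscriber never blocks subscribe() on another thread.
        with self._lock:
            callbacks = list(self._subscribers[event_type])
        event = Event(event_type, payload)
        for callback in callbacks:
            callback(event)


bus = EventBus()
cli_lines: list[str] = []     # stands in for the lines the Unity client reads
ws_messages: list[str] = []   # stands in for WebSocket messages to the web UI

# Two consumers, two formats (both invented), one event from core.
bus.subscribe(
    EventType.PLAYBACK_STARTED,
    lambda e: cli_lines.append(f"{e.type.name} {e.payload['fps']}"),
)
bus.subscribe(
    EventType.PLAYBACK_STARTED,
    lambda e: ws_messages.append(json.dumps({"type": e.type.name, **e.payload})),
)

bus.publish(EventType.PLAYBACK_STARTED, fps=30)
```

The producer calls publish once and never learns that one listener writes text lines and the other writes JSON. A third client would be one more subscribe call.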
2. Immutable settings.
Settings is a pydantic.BaseModel with frozen=True. It
isn’t mutated — it’s replaced. SettingsStore.patch(...)
does exactly that: take the keys in the patch, merge them
with the current snapshot, re-validate the whole thing,
swap the reference atomically. Then emit
SETTINGS_CHANGED and schedule a debounced disk write
(0.5 s).
The debounce isn’t cosmetic: it’s what stops a thousand slider movements from causing a thousand disk writes. The UI sends patches at 60 Hz; the disk sees at most 2/s.
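A debounce like this can be sketched with threading.Timer. The DebouncedWriter class below is hypothetical, and it assumes the "restart the timer on every request" variant, which may not be what _schedule_write() actually does. The demo uses a 0.1 s delay instead of 0.5 s to keep it quick.

```python
import threading
import time
from typing import Callable, Optional


class DebouncedWriter:
    """Coalesce a burst of write requests into one call after a quiet period."""

    def __init__(self, write: Callable[[], None], delay_s: float = 0.5) -> None:
        self._write = write
        self._delay_s = delay_s
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def schedule(self) -> None:
        with self._lock:
            # Each new request cancels the pending timer and starts a fresh
            # one, so the write only happens once requests stop arriving.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay_s, self._write)
            self._timer.daemon = True
            self._timer.start()


writes: list[float] = []
writer = DebouncedWriter(lambda: writes.append(time.monotonic()), delay_s=0.1)

for _ in range(50):   # a burst of slider patches
    writer.schedule()

time.sleep(0.5)       # wait past the quiet period
print(len(writes))    # number of "disk writes" the burst produced
```

One caveat with this variant: while patches keep arriving faster than the delay, nothing is written at all. Starting a timer only when none is pending would instead write at a steady cadence during a long drag; both variants keep the disk at or below 2 writes per second with a 0.5 s delay.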
3. FrameScheduler on time.perf_counter.
The old code did time.sleep(1/fps) between frames. Looks
right. It isn’t: time.time() is not monotonic (it can
jump backward on NTP correction), and a fixed sleep after
every frame accumulates drift, because the time spent
producing the frame is never subtracted from the wait.
It’s invisible at first; after an hour of playback,
frame 100,000 no longer falls where it should.
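The drift is easy to put a number on with a simulated clock. The sketch below assumes a per-frame work cost of 2 ms (an invented figure) and a perfectly accurate sleep, which is generous to the naive loop. It compares when each frame starts under "sleep a full period after every frame" against "wait until an absolute deadline".

```python
FPS = 30
FRAMES = 100_000
WORK_S = 0.002  # assumed cost of decode + transform + send per frame


def naive_start_times(fps: float, frames: int, work_s: float) -> list[float]:
    """Frame start times when the loop does the work, then sleeps 1/fps."""
    t = 0.0
    starts = []
    for _ in range(frames):
        starts.append(t)
        t += work_s    # producing the frame
        t += 1 / fps   # time.sleep(1/fps), assumed exact
    return starts


def deadline_start_times(fps: float, frames: int, work_s: float) -> list[float]:
    """Frame start times when frame idx is due at idx / fps (start_t = 0)."""
    t = 0.0
    starts = []
    for idx in range(frames):
        target = idx / fps
        t = max(t, target)  # sleep until the deadline, or start late
        starts.append(t)
        t += work_s
    return starts


ideal_last = (FRAMES - 1) / FPS
naive_drift = naive_start_times(FPS, FRAMES, WORK_S)[-1] - ideal_last
deadline_drift = deadline_start_times(FPS, FRAMES, WORK_S)[-1] - ideal_last

print(f"naive loop:    last frame is {naive_drift:.3f} s late")
print(f"deadline loop: last frame is {deadline_drift:.3f} s late")
```

Under these assumptions the naive loop ends up roughly 200 s behind by the last frame (99,999 frames times 2 ms each), while the deadline loop stays on schedule because each frame's work fits inside its 1/30 s budget.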
The FrameScheduler computes
target_t = start_t + idx / fps with time.perf_counter()
(monotonic, high-resolution) and sleeps until exactly that
moment. If we arrive late — because the previous frame
took longer than the budget — we don’t sleep and we
count the frame as “dropped” in the metrics window.
Every second it emits PLAYBACK_METRICS with
measured_fps, dropped, latency_p50_ms,
latency_p99_ms. The UI charts that without knowing
anything about the scheduler.
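The pacing rule can be sketched as follows. This is a simplified stand-in for the real FrameScheduler: the method name wait_for_next_frame, the injectable clock and sleep parameters, and the FakeClock used in the demo are all assumptions, and the per-second PLAYBACK_METRICS window is left out.

```python
import time
from typing import Callable, Optional


class FrameScheduler:
    """Pace frames against absolute deadlines: target_t = start_t + idx / fps."""

    def __init__(
        self,
        fps: float,
        clock: Callable[[], float] = time.perf_counter,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fps = fps
        self._clock = clock
        self._sleep = sleep
        self._start_t: Optional[float] = None
        self._idx = 0
        self.dropped = 0

    def wait_for_next_frame(self) -> int:
        """Block until the next frame is due and return its index."""
        now = self._clock()
        if self._start_t is None:
            self._start_t = now  # frame 0 is due immediately
        idx = self._idx
        target_t = self._start_t + idx / self._fps
        if now < target_t:
            self._sleep(target_t - now)
        elif now > target_t:
            self.dropped += 1    # late: skip the sleep and record it
        self._idx += 1
        return idx


class FakeClock:
    """Deterministic clock so the demo does not depend on real timing."""

    def __init__(self) -> None:
        self.t = 100.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


clock = FakeClock()
scheduler = FrameScheduler(fps=10, clock=clock.now, sleep=clock.sleep)

# Per-frame work in seconds; the third frame blows the 0.1 s budget.
for cost in [0.02, 0.02, 0.15, 0.02]:
    scheduler.wait_for_next_frame()
    clock.t += cost
scheduler.wait_for_next_frame()

print(scheduler.dropped, round(clock.t - 100.0, 6))
```

Because deadlines are computed from start_t rather than from the previous frame, the one late frame is counted and the following frame lands back on the original grid instead of pushing everything after it.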
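In code, the Settings model and SettingsStore.patch look like this (abridged):

    from typing import Any

    from pydantic import BaseModel, ConfigDict, Field


    class Settings(BaseModel):
        # frozen: attribute assignment raises; extra="forbid": unknown keys are rejected
        model_config = ConfigDict(frozen=True, extra="forbid")
        requested_fps: int = Field(default=30, gt=0, le=120)
        gamma: float = Field(default=0.4, gt=0)
        # ...


    class SettingsStore:
        # ...
        def patch(self, partial: dict[str, Any]) -> Settings:
            with self._lock:
                merged = {**self._current.model_dump(), **partial}
                new = Settings.model_validate(merged)  # validators run
                self._current = new  # swap the snapshot, never mutate it
            self._bus.publish(EventType.SETTINGS_CHANGED, ...)
            if self._persist_path:
                self._schedule_write()  # debounced
            return new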
Why the EventBus survives a misbehaving subscriber. Subscribers run on the publisher’s thread (zero per-event thread allocations), but inside a try/except: if a subscriber raises, the exception is caught and re-emitted as TRANSPORT_ERROR. The cascade is guarded: a TRANSPORT_ERROR subscriber that also raises does not trigger another TRANSPORT_ERROR; the inner exception is dropped instead. A bad listener (a dropped WebSocket mid-message, say) can never block the other listeners from receiving their event.
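The guard can be sketched in a few lines. GuardedBus, its method signatures, the payload shape, and the PLAYBACK_FRAME_SENT member name are assumptions for illustration; only the behavior (catch, re-emit as TRANSPORT_ERROR, drop errors raised by error handlers) comes from the description above.

```python
import threading
from collections import defaultdict
from enum import Enum, auto
from typing import Any, Callable


class EventType(Enum):
    PLAYBACK_FRAME_SENT = auto()  # hypothetical member name
    TRANSPORT_ERROR = auto()


Subscriber = Callable[[EventType, dict[str, Any]], None]


class GuardedBus:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subs: dict[EventType, list[Subscriber]] = defaultdict(list)

    def subscribe(self, event_type: EventType, callback: Subscriber) -> None:
        with self._lock:
            self._subs[event_type].append(callback)

    def publish(self, event_type: EventType, payload: dict[str, Any]) -> None:
        with self._lock:
            callbacks = list(self._subs[event_type])
        # Callbacks run on the publisher's thread, outside the lock, so the
        # nested publish below cannot deadlock.
        for callback in callbacks:
            try:
                callback(event_type, payload)
            except Exception as exc:
                if event_type is EventType.TRANSPORT_ERROR:
                    continue  # guard: an error handler failed, drop it
                self.publish(EventType.TRANSPORT_ERROR, {"error": repr(exc)})


received: list[str] = []


def broken(event_type: EventType, payload: dict[str, Any]) -> None:
    raise ConnectionError("socket closed mid-message")


def healthy(event_type: EventType, payload: dict[str, Any]) -> None:
    received.append(event_type.name)


def broken_error_handler(event_type: EventType, payload: dict[str, Any]) -> None:
    raise RuntimeError("error handler failed too")


bus = GuardedBus()
bus.subscribe(EventType.PLAYBACK_FRAME_SENT, broken)
bus.subscribe(EventType.PLAYBACK_FRAME_SENT, healthy)
bus.subscribe(EventType.TRANSPORT_ERROR, broken_error_handler)
bus.subscribe(EventType.TRANSPORT_ERROR, healthy)

bus.publish(EventType.PLAYBACK_FRAME_SENT, {"idx": 0})
print(received)
```

In this run the healthy listener sees the TRANSPORT_ERROR raised on behalf of the broken subscriber and then its own PLAYBACK_FRAME_SENT, and the failing error handler produces no further event.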
What comes next
We have a structured host. But none of this would have been viable without a way to plan the work before writing it. Next post: BMAD and the discipline of one branch per story.