The Smart Classroom

What if a classroom was a multi-agent system? Not a room with computers in it — a room whose computation was distributed across cameras, code, and the people who walk in. Thirteen students, two instructors, six cameras, one event bus. Over a semester, we prototyped it. This is what one class looks like when the room is paying attention.

Read it as a clock. The chapters below trace a single Monday session, 4:45 PM to 7:50 PM. Each phase introduces the projects that activate during it. The story isn't ten projects in a row; it's one room over three hours.

The room comes online before the people. Smart Stage, the room's orchestrator, boots on a Raspberry Pi 5, polls the V-JEPA classifier sitting on a GPU machine across the network, and reads the schedule. The room is empty; V-JEPA confirms it; Smart Stage settles into a pre-arrival posture — ambient music up, captions off, sensing quiet.

Gordon's Smart Stage absorbs much of what was originally scoped as a separate room-mode project: ambience that adapts to occupancy, music that knows the difference between an empty room and a room filling up. Yuxuan's and Seren's earlier explorations of those room-mode transitions live inside the orchestrator now. The room's brain has many authors.

In the corner, Sophie's Forest in the Classroom seeds a faint generative landscape on the back projection — a visual ambience that will respond to voices later in the session. Right now it is barely there. A room reading itself.

Gus, the dog, is online but not yet projected. The Viam Rover at Kathy's apartment has booted, and its camera has completed its handshake. Gus's appearance is scheduled for break, not now. The system knows to wait.

In our first dive into world models, we found that training a V-JEPA probe was fast — with some planning. Carrie trained one first at home, on herself in the office, then carried those learnings into class, and in a short session with the students the classroom probe was trained too. See the setup readme and the clip-gathering strategy.

The obvious first move with a multi-camera classroom would have been attendance — names mapped to seats, presence logged, late arrivals flagged. The class steered away from it. Too obvious, too boring, too much surveillance. What got built reads bodies as posture, not as identity: bodies count as input for what kind of class is happening, not as a record of who showed up. We joked: more a vibe than a roster.

Three ceiling-mounted PoE cameras — left, right, and center-back of the room — produce the bird's-eye view that Gordon's overhead dashboard stitches into one canvas. YOLOv6-nano runs on the Pi; blue ArUco calibration markers anchor the geometry. As students enter, dots appear on a flat plan of the room. The room knows there are people in it now.

Calibration view of Gordon's overhead dashboard: a top-down camera frame from one of three ceiling-mounted OAK-D-POE cameras, with three blue ArUco markers detected on a long classroom table. A side panel shows the marker counts for each camera and a 100% calibration-readiness gauge. — Fig. 01 · Calibrating the overhead dashboard. Three ceiling cameras agree on the geometry through blue ArUco markers on the table; three out of three see at least two markers each. Gordon Cheng.
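The readiness gauge in the figure can be read as a simple ratio. What follows is a minimal sketch, assuming a camera counts as ready once it sees at least two markers, which is the threshold the caption implies. The function name, camera labels, and counts are hypothetical and are not taken from the so-overhead-dashboard code.

```python
def calibration_readiness(marker_counts, min_markers=2):
    """Return the percentage of cameras that see at least `min_markers` markers.

    `marker_counts` maps a camera label to the number of ArUco markers it
    currently detects. The threshold of two is an assumption drawn from the
    figure caption, not from the dashboard's source.
    """
    if not marker_counts:
        return 0.0
    ready = sum(1 for count in marker_counts.values() if count >= min_markers)
    return 100.0 * ready / len(marker_counts)


# Hypothetical counts for the three ceiling cameras.
counts = {"left": 3, "right": 2, "center_back": 2}
readiness = calibration_readiness(counts)
```

With these counts every camera clears the threshold, so the gauge would read full. Drop one camera to a single marker and it falls to two thirds.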

Music fades. V-JEPA, polled every ten seconds, shifts its prediction from empty to group work and then, as people sit, to lecture. Smart Stage confirms lecture mode. The room hasn't been told what to do; it has been told what it is.
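One way to read "confirms" is as a debounce: the orchestrator only changes posture after the classifier has held the same prediction for several polls in a row. The sketch below assumes a three-poll rule (thirty seconds at the ten-second interval). The class name, the threshold, and the label spellings are hypothetical, and Smart Stage's actual logic may differ.

```python
from dataclasses import dataclass

# Labels the V-JEPA probe is described as telling apart.
MODES = ("empty", "lecture", "group_work", "break")


@dataclass
class PostureMachine:
    confirm_after: int = 3   # assumed: 3 polls at 10 s each
    mode: str = "empty"      # confirmed room mode
    _candidate: str = "empty"
    _streak: int = 0

    def observe(self, prediction: str) -> str:
        """Feed one classifier prediction; return the confirmed mode."""
        if prediction not in MODES:
            raise ValueError(f"unknown mode: {prediction}")
        if prediction == self._candidate:
            self._streak += 1
        else:
            # A new label restarts the count.
            self._candidate = prediction
            self._streak = 1
        if self._candidate != self.mode and self._streak >= self.confirm_after:
            self.mode = self._candidate
        return self.mode


machine = PostureMachine()
polls = ["empty", "group_work", "lecture", "lecture", "lecture"]
history = [machine.observe(p) for p in polls]
```

In this run the brief group work reading never becomes the room's mode; only the third consecutive lecture prediction flips it. That is the trade: a slower response in exchange for not chasing every flicker.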

Smart Stage pivots to lecture posture. Sherpa-onnx streaming ASR turns the instructor's voice into live captions, fed to a wall display and to a Discord channel as a rolling summary. This is Smart Stage doing what it was designed for — coordinating capture, voice synthesis, and ambient state from a single state machine, ~990 lines of Python on a Raspberry Pi.

Feifey's Focus Beam stands by. When the instructor points at a region of the projected slide, MediaPipe reads the gesture and an overlay dims everything else. The beam doesn't ask for attention; it removes the friction of finding it.
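The dimming step can be sketched independently of the gesture model. This sketch assumes the fingertip has already been mapped to normalized slide coordinates (0 to 1 on each axis). The region layout, the function name, and the dim level are all hypothetical, and no MediaPipe calls are shown.

```python
# Hypothetical slide regions as (x0, y0, x1, y1) in normalized slide coordinates.
REGIONS = {
    "title": (0.0, 0.0, 1.0, 0.2),
    "left": (0.0, 0.2, 0.5, 1.0),
    "right": (0.5, 0.2, 1.0, 1.0),
}


def dim_levels(x, y, regions=REGIONS, dim=0.7):
    """Return overlay opacity per region for a fingertip at (x, y).

    The pointed-at region gets 0.0 (fully visible) and every other region
    gets `dim`. If the point falls outside all regions, nothing is dimmed.
    """
    target = next(
        (
            name
            for name, (x0, y0, x1, y1) in regions.items()
            if x0 <= x < x1 and y0 <= y < y1
        ),
        None,
    )
    if target is None:
        return {name: 0.0 for name in regions}
    return {name: 0.0 if name == target else dim for name in regions}


levels = dim_levels(0.75, 0.5)  # fingertip over the right-hand column
```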

Then I move my hand again, but a bit faster this time, and the highlight follows, slightly lagging. After a few seconds, he tilts his head and asks if I was controlling that.
— Feifey, on Jason watching Focus Beam for the first time

The interaction sits in an interesting place — foreground attention, but background initiative. The user doesn't know whether the system is responding to them or just performing. That ambiguity is the tell. Feifey hasn't yet resolved when "pointing intentionally" is distinct from "gesturing naturally"; the line between control and ambient assistance is the unfinished work.

Diagram of Focus Beam: gesture detection feeding a slide region map and a dimming overlay. — Fig. 02 · Focus Beam — speech and gesture into slide regions. Diagram by Feifey Wang.

Vid. 01 · Focus Beam — earliest working demo. The highlighted regions on the slide track Feifey's pointing gesture. Rough, but the mechanic is visible. Feifey Wang.

Behind both, Shuyang's Assignment Tracker listens to the captions and the whiteboard for anything that sounds like a deadline or a task — an ambient archivist. Smart Stage's voice assistant stays muted until invoked.

One attention tool waits, opt-in. Kevin and Mingyue's Sleep Detection is silent unless a student opts in. The whitepaper makes a strong claim that the room should have a place to hold context, but the class also made a strong claim that the room shouldn't watch people who haven't asked to be watched. Lecture is where those two principles negotiate.

The design decision behind Sleep Detection came from a real moment Kevin had in a guest talk.

I didn't want the professor to call me out. If they noticed me and said my name, it would feel embarrassing in front of everyone. At the same time, I did wish someone could have helped me in that moment. If a friend sitting next to me had quietly nudged me — I think that would have been the best kind of reminder, something small and supportive, without drawing attention.
— Kevin, on the moment that became the project

The system the team built routes the camera's read of a drowsy student not to the instructor, but to a friend in the room — a quiet ping, not a call-out. The interaction is deliberately lateral, peer to peer, and the tone of voice matters as much as the detection. When Kevin tested it on a friend, Maggie said it felt "less aggressive than calling someone out" — and then surfaced the tension that the project hasn't resolved.

Yeah… a little. Because the camera is still watching you the whole time. Even if the response is private, that part doesn't really go away.
— Maggie, when asked if it still felt like surveillance

That's the unfinished work. The response is private; the watching is not. Opt-in is the simplest answer the class arrived at, and it's the same answer NodCheck reaches by a different path. The room can offer support without imposing it — but only if the people in the room have said yes first.
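The two rules in play, opt-in first and lateral delivery second, fit in a few lines. This is a sketch under assumptions: the function name, the message wording, and the placeholder student labels are hypothetical, and the detection step itself is left out.

```python
def route_drowsy_event(student, opted_in, buddies):
    """Decide who, if anyone, hears about a drowsy reading.

    Returns (recipient, message), or None when nothing should be sent.
    The instructor is never a fallback recipient.
    """
    if student not in opted_in:
        return None  # no consent, so the reading goes nowhere
    buddy = buddies.get(student)
    if buddy is None:
        return None  # opted in but no designated friend yet
    return buddy, f"{student} could use a quiet nudge."


# Placeholder labels, not real students.
opted_in = {"student_a"}
buddies = {"student_a": "student_b", "student_c": "student_d"}

ping = route_drowsy_event("student_a", opted_in, buddies)
ignored = route_drowsy_event("student_c", opted_in, buddies)
```

Note what the sketch does not settle: the camera still has to watch in order to produce the reading at all, which is the tension Maggie named.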

Hand-drawn diagram of Sleep Detection: camera reads drowsy state, system pings a designated friend, friend taps the sleeper. — Fig. 03 · Sleep Detection — camera read, lateral notification, peer nudge. Diagram by Kevin & Mingyue.

Vid. 02 · Sleep Detection demo. Kevin & Mingyue.

Tony lives in the corner of the classroom — Ramon and Shuyang's classroom agent, an 18-channel robotic spider with YOLO running behind his eyes and Groq powering his words. He was built for a simple problem: students have questions they don't always feel comfortable asking out loud, and professors can't always pause mid-lecture to address every confusion in the room. Tony sits between those two realities and makes the distance feel smaller.

Tony turned to face her before she even finished raising her hand. That was the moment we knew — he wasn't just detecting. He was paying attention.
— Ramon & Shuyang, on Tony

Tony's world is split between the physical and the digital. In the classroom he moves: he turns, wiggles, stretches, tracks a raised hand the way a good teaching assistant would. On Discord he listens — answering questions in the General channel, archiving every answer the professor has ever given in Tony's Chat Room so the second student to ask the same question never has to wait. The point is not to make you part of Tony but to make Tony part of you. Tony's first build had the channels swapped — reds for blues — and the heat sensor reads 84°F constantly. The body of the agent is its own debug log.

Hand-drawn 7-panel storyboard: Tony scanning the room with YOLO, detecting a raised hand, turning to face the student, then mirroring the interaction across Discord channels. — Fig. 04 · Tony's interaction flow — physical attention in the room, conversational threads on Discord. Storyboard by Ramon Naula & Shuyang Tian.

Yuxuan's English Communication Coach, called Lumi, holds quiet on a desk, ready for Q&A. The original design was real-time correction; class feedback re-cast it as private, delayed, gentle. The shift came from one Monday discussion.

She said she did not want the feedback to feel like, "you made a mistake here. Fix it." She suggested that the system should sound more encouraging, like, "this is not bad, but have you considered this?"
— Yuxuan, on Kathy's redesign of the feedback layer

Lumi catches a small set of common ESL patterns — "more good" becoming "better," "informations" becoming "information" — and offers them as suggestions after the speaking moment, not during. The biggest unsolved problem is the transcript itself: browser speech recognition is unreliable, and so the prototype now treats the transcript as a draft for the student to review before analysis.
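The pattern step can be sketched as a small lookup that runs only on a transcript the student has already reviewed. The two patterns come from the examples above; the function name and the card wording are assumptions, with the wording borrowed from Kathy's suggested tone. Lumi's real matcher may work quite differently.

```python
import re

# The two patterns named in the text; a real coach would need many more.
PATTERNS = [
    (re.compile(r"\bmore good\b", re.IGNORECASE), "better"),
    (re.compile(r"\binformations\b", re.IGNORECASE), "information"),
]


def suggest(reviewed_transcript):
    """Return gentle suggestion cards for a transcript the student has reviewed."""
    cards = []
    for pattern, replacement in PATTERNS:
        for match in pattern.finditer(reviewed_transcript):
            cards.append(
                f'This is not bad, but have you considered "{replacement}" '
                f'instead of "{match.group(0)}"?'
            )
    return cards


cards = suggest("This gives more good informations about the topic.")
```

Nothing here runs during the speaking moment. The function takes finished text and returns cards for the private review pane.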

Diagram of English Communication Coach: input through review through private feedback cards. — Fig. 05 · English Communication Coach — transcript in, gentle private feedback out. Diagram by Yuxuan Chen.

Vid. 03 · English Communication Coach prototype walkthrough. Yuxuan Chen.

The room exhales. Smart Stage transitions ambience — music returns, captions stop. The break is explicit; it is scheduled. And then, on the far wall, Gus appears. Not on a screen — the team segmented the rover's video and projected the dog onto the wall directly, so Gus reads as a window, not a feed.

Gus on the wall felt less like a screen and more like a window.
— Kathy, JuJu, and Seren, retro

Classroom on break: a projection screen on the left wall begins to light up with Gus the dog while three students sit at the long center table. — Fig. 06 · Gus appears. Projection lighting up on the left wall; three students at the table mid-break.

The trio — Kathy, JuJu, Seren — built a two-location system: a Viam Rover 2 with a camera at Kathy's apartment, segmentation in the classroom, voice triggered by a phrase ("hey buddy!") that students could call out. The interaction is small on purpose: a low-stakes social beat in the middle of a class day.

What broke is worth stating plainly: the rover's Pi was already configured for Viam's network and didn't want to join IxD's Wi-Fi without a full reset. Latency was never measured end to end (rover to laptop to segmenter to projector), and the team never identified which step was the bottleneck. The treat dispenser, though planned, was not built.

Diagram: Viam Rover camera stream, segmentation pipeline, projector to wall. — Fig. 07 · The Gus pipeline. Viam Rover livestream, segmentation, projection. Diagram by JuJu Kim.

Phil's Timer activates here too, scoping the break. Magnetic ArUco markers are stuck face up to a foamcore board; Horizon, the OAK-D on a tripod near the whiteboard, reads them and sets the countdown. The interaction is tactile in a way the rest of the system is not. You walk to the board, you flip a tag.

V-JEPA's classifier trips over to group work; Smart Stage shifts ambience accordingly. The whiteboard becomes the room's center of gravity. Phil's Timer scopes the activity — this is its most natural use, and the moment it became something more than a personal tool.

Bruno is at the whiteboard. Phil lets Bruno figure out the interaction for himself using the icons taped to the ArUco tags. Bruno sets the timer by aiming the central tag's arrow at the 15-minute mark. He flips the play tag and the timer display on Phil's laptop confirms it's running. Carrie and Phil exclaim, "Yes!" — excited by the fact that the smart classroom is beginning to come to life.
— Phil, on Bruno using the timer for the first time
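The dial reading in that moment can be sketched as an angle-to-minutes mapping plus a check for the play tag. Everything specific here is an assumption made for illustration: a clock-face layout where a full turn is sixty minutes, snapping to five-minute steps, the tag id, and the rule that a visible play tag means "running". Marker detection itself is not shown.

```python
def dial_minutes(angle_deg, full_scale_minutes=60, step=5):
    """Map the central tag's rotation to a duration, snapped to `step` minutes.

    Assumes 0 degrees points at the zero mark and the angle grows clockwise.
    """
    fraction = (angle_deg % 360) / 360
    return round(fraction * full_scale_minutes / step) * step


def timer_state(visible_tag_ids, dial_angle_deg, play_tag_id=1):
    """Combine the dial reading with the play tag into one timer state."""
    return {
        "running": play_tag_id in visible_tag_ids,
        "minutes": dial_minutes(dial_angle_deg),
    }


# Arrow a quarter turn from zero, play tag (hypothetical id 1) in view.
state = timer_state({1, 7}, 92.0)
```

Snapping to a step is what makes the gesture forgiving: an arrow aimed roughly at the mark still lands on fifteen.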

The interesting thing about the timer wasn't the timer. It was that setting one became a group gesture. To set a timer is to free oneself from wondering, "are we spending too much time on this?" Phil hasn't yet resolved how the camera mounts in a way that disappears for people without losing sight of the board — the open hardware question.

Diagram of Timer: ArUco tags, OAK-D detection, state machine, web display. — Fig. 08 · Timer — ArUco tags into countdown state. Diagram by Phil Cote.

Vid. 04 · Timer in use — flipping ArUco tags to set the countdown. Phil Cote.

Darren's Inprint reads handwriting from any surface in the room and makes notes from it — the kind of capability that only matters if it is genuinely surface-agnostic. Two moments earlier in the semester pointed Darren toward it.

The first was when a workshop covered the classroom tables with paper. At first the paper was just paper. Over the weeks, the tables began to change — doodles during lectures, then notes, then diagrams drawn directly on the table during group work, sometimes by people who weren't quite aware they were doing it. As students moved around the room, doodles started interacting with other doodles. Whiteboards take setup and cleanup, and they get erased every session. Tables, when you let them, accumulate.

Those tables to this day are a living classroom artifact — notes encompassing a diverse set of insights and personal touches contributing to a growing creative ecosystem.
— Darren, on the paper-covered tables

The second was the FigJam migration. Group work that happened analog — sticky notes, markers, organic conversation, aha! moments where someone draws a line from one idea to another — only got captured digitally as a flat photograph at the end. Inprint addresses that gap directly: capture the writing continuously, so the digital archive is the work itself rather than a photograph of where it ended up.

The issue, for a while, has been portability. With Inprint, I want to bridge that gap between freeform writing and drawing and make that portable on both digital and analog mediums.
— Darren, on Inprint's purpose

Hand-drawn diagram of Inprint by Darren Chia — labeled blocks for physical writing, an overhead webcam, a processing script, and output canvases. — Fig. 09 · Inprint — diagram by Darren Chia.

Vid. 05 · Inprint — pen on paper, overhead camera, ink captured to a digital canvas. Darren Chia.

Tony, in this phase, shifts posture from observer to embodied helper. The same agent the room watched in lecture is now something you can ask a question of. Same body, different role.

Sophie's Forest, dormant in lecture, reactivates: the more voices the room produces, the more the projected forest grows. Late one Monday, alone in the room playing a recording back into it, Sophie noticed something the design hadn't intended — Forest doesn't know who is talking. The power difference that exists in the room disappears in the forest. Teacher and student each grow a tree. The work surfaced a quality through use that it hadn't been built toward.

When Sophie showed the demo to her mother over a video call — no setup, no explanation — the read came back unprompted.

Isn't it just like the little spirits in a Miyazaki forest? Like they just appeared?
— Sophie's mom

Diagram of Forest in the Classroom: voices into a generative landscape that grows tree by tree. — Fig. 10 · Forest in the Classroom — voices into a generative landscape. Diagram by Sophie Lee.

Demos and focused work share a phase. Smart Stage holds attention but lowers its own footprint — the room knows demos are not lectures, even when the speaker is at the podium. (The simulator flagged this as a place where the room can mis-infer; a body at the podium with the rest of the class seated reads identically to a lecture from the camera angle.)

Demos are a particular kind of attention test. The instructor is walking the room through how something works — how to set up Node, how a sensor pipeline fits together, how an ambient system listens without being seen — and the room either follows or doesn't. Most rooms don't say which.

Kathy's NodCheck exists because of one of those moments.

She scans the room, notices all the blank stares, and asks if it makes sense. Nobody says anything. I vigorously shake my head no. She immediately perks up and genuinely thanks me for being honest. "I had a feeling people were lost from all the blank stares!"
— Kathy, on the moment that became NodCheck

The lesson Carrie was teaching was, with quiet irony, on ambient and invisible tech — sensing systems that read the room without being seen. The room was illegible to her until one person made a single explicit gesture. That's the design problem in one frame: there should be a low-risk way for a student to say I'm lost, and the smallest possible signal — a head shake — should be enough to make it legible.

In Labs & Demos posture, NodCheck opens a short comprehension window tied to a question and accumulates yes-nods and head-shakes on the instructor's screen. The interaction is simple and physical — you don't have to interrupt, raise your hand, or admit anything in front of the room. You just nod or don't.
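The comprehension window described above amounts to a tally with an open and a closed state. The sketch below is a hypothetical shape for it: the class and method names are assumptions, and gesture recognition is taken as given, arriving as the strings "nod" or "shake".

```python
from dataclasses import dataclass


@dataclass
class ComprehensionWindow:
    question: str
    nods: int = 0
    shakes: int = 0
    is_open: bool = True

    def record(self, gesture):
        """Count one gesture; ignore anything after close or unrecognized."""
        if not self.is_open:
            return
        if gesture == "nod":
            self.nods += 1
        elif gesture == "shake":
            self.shakes += 1

    def close(self):
        """Stop counting and return the totals for the instructor's screen."""
        self.is_open = False
        return {"question": self.question, "yes": self.nods, "no": self.shakes}


window = ComprehensionWindow("Does the sensor pipeline make sense?")
for gesture in ["nod", "shake", "shake", "wave"]:
    window.record(gesture)
totals = window.close()
window.record("nod")  # arrives after the window closed, so it is not counted
```

Only totals leave the window. The sketch keeps no record of who nodded, which matches the low-risk framing, though a multi-user build would still need to decide how faces are tracked between frames.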

Vid. 06 · NodCheck — nod and head-shake registering against an open question. Kathy Choi.

What broke is the staging. The trigger button doesn't currently live anywhere useful for either students or instructor to reach. Multi-user sensing isn't built yet. The version Kathy wants — multiple students at once, totals displayed on the instructor's device, the trigger placed somewhere both parties can find — is the clearest path forward, and it's the version the Labs & Demos phase actually needs.

The session resolves. Smart Stage posts a Discord summary; Assignment Tracker posts deadlines it caught; Lumi has a small backlog of suggestions that no one will see unless someone opens the private review pane. Forest stops growing. Gus has been gone for half an hour. Person count drops, V-JEPA settles back to empty, and the room dims.

The classroom logged the class.

A dense projected forest of generative trees — Sophie's Forest at the end of a full classroom session of voices. — Fig. 11 · The forest grows by the voice. Captured by Sophie Lee.

Most multi-agent systems literature — AutoGen, CrewAI, LangGraph, MCP-based frameworks — assumes all agents are software. This classroom puts three different kinds on the same bus: perception agents that sense the room (cameras, V-JEPA), reactive agents that respond to events (the student projects, some LLM-backed, some not), and human agents — the students and professors physically present, whose goals the system exists to serve.

The interesting design problem isn't "how do we make software agents collaborate." It's how a coordination layer holds context across cameras, code, and people, and routes events — ambient, broadcast, or directed — based on what the room is doing. A fatigue alert during a five-minute break is noise. During a forty-five-minute lecture, it is signal. The same reading, different context.
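The fatigue example can be written down as a lookup keyed on both the event and the room's current mode. This is a sketch of the idea only: the event names, the table entries other than the fatigue pair, and the default of dropping unknown combinations are all assumptions, not the conductor's design.

```python
from enum import Enum


class Delivery(Enum):
    DROP = "drop"            # treated as noise in this context
    AMBIENT = "ambient"      # logged or shown passively
    BROADCAST = "broadcast"  # sent to the whole room
    DIRECTED = "directed"    # sent to one recipient


# Hypothetical routing table: (event type, room mode) -> delivery.
ROUTES = {
    ("fatigue_alert", "lecture"): Delivery.DIRECTED,
    ("fatigue_alert", "break"): Delivery.DROP,
    ("deadline_detected", "lecture"): Delivery.AMBIENT,
    ("timer_expired", "group_work"): Delivery.BROADCAST,
}


def route(event_type, room_mode):
    """Pick a delivery style for an event given what the room is doing."""
    return ROUTES.get((event_type, room_mode), Delivery.DROP)


during_lecture = route("fatigue_alert", "lecture")
during_break = route("fatigue_alert", "break")
```

The same reading produces two different outcomes, and the only thing that changed is the second half of the key.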

V-JEPA, YOLO, MediaPipe, Sherpa-onnx, Groq, and large segmentation networks do the heavy lifting. The room-specific work is thin: a V-JEPA probe that tells apart empty, lecture, group work, break; a Gus segmentation model trained on a few hundred annotated frames. Custom layers standing on the shoulders of foundational scaffolding.

A polygon-annotation interface — Smart Select mode — showing one frame of the rover stream of Gus the dog being labeled with the 'gus' class. — Fig. 12 · Training the Gus segmentation: one frame of the Viam Rover stream being polygon-labeled with the *gus* class. JuJu Kim, Kathy Choi, Seren Kim.

The room's brain isn't located in any one place. It's a society of small minds — perception agents, reactive agents, human agents — running on the same bus. No single piece knows the room; the room knows itself by their cooperation.

The cross-project conductor — an agent above the agents, routing between them by context — and the full scope of each project are still in the design/prototype stage.

Contributors

Instructors

Carrie Kengle
Bruno Kruse

Students

Darren Chia — Inprint
Feifey Wang — Focus Beam
Gordon Cheng — Smart Stage
JuJu Kim — Gus Mode
Kathy Choi — NodCheck, Gus Mode
Kevin Shi — Sleep Detection
Mingyue Zhou — Sleep Detection
Phil Cote — Timer
Ramon Naula — Tony
Seren Kim — Gus Mode
Shuyang Tian — Tony, Assignment Tracker
Sophie Lee — Forest in the Classroom
Yuxuan Chen — English Communication Coach

Source Repositories

smart-objects-cameras — class template, detectors, classroom API
Conversational Machines — Week 6 lecture and lab materials
Interactive Spaces — Week 7 lecture and lab materials
so-smart-stage — Smart Stage orchestrator
so-vjepa-probe — V-JEPA classifier server
so-overhead-dashboard — bird's-eye person tracking
NodCheck — Kathy's nod/head-shake comprehension check

Companion Artifacts

Pre-Arrival

Arrival

Lecture

Break

Group Work

Labs & Demos

Wrap