№ 014 · Apps · 2026

Hand City

Two-handed gesture-controlled 3D city editor: MediaPipe hand tracking + Three.js, zero backend.

Year

2026

What it does

A tiny "Blender-in-the-air": a procedurally generated toy city that you arrange and explore entirely with hand gestures over your webcam. One hand moves buildings, the other flies the camera, and together they zoom. Hand tracking and rendering run entirely in the browser, with no backend, no install, no depth model, and no GPU driver setup.

The city is scenery: a road grid with lane markings, varied buildings, pocket gardens with low-poly trees, cars driving the avenues, and soft shadows. Only the buildings are interactive, and the layout is seeded deterministically so it looks the same on every load.
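A deterministic layout like this usually comes down to feeding a seeded random generator instead of Math.random. The sketch below is one way that could look, not the project's actual code: the mulberry32 generator, the generateLayout name, the cell shape, and the 20% garden ratio are all assumptions made for illustration.

```javascript
// mulberry32: a small, well-known 32-bit seeded PRNG. Which generator the
// project really uses is not stated; this one is an assumption.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Hypothetical layout generator: every cell is either a garden or a
// building with a random number of floors. Same seed, same city.
function generateLayout(seed, cols = 4, rows = 4) {
  const rand = mulberry32(seed);
  const cells = [];
  for (let row = 0; row < rows; row++) {
    for (let col = 0; col < cols; col++) {
      const isGarden = rand() < 0.2; // illustrative ratio
      cells.push(
        isGarden
          ? { col, row, kind: "garden" }
          : { col, row, kind: "building", floors: 2 + Math.floor(rand() * 10) }
      );
    }
  }
  return cells;
}
```

Calling generateLayout(14) twice returns identical cells, which is what makes the city look the same on every load.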

The gesture model

Pinch is the only "button." Each hand is tracked independently, and a strict priority rule keeps the two hands from fighting.

Right hand, the manipulator. Point your index finger and a ring on the ground shows where you are aiming. Pinch (thumb + index) on a highlighted building to grab it, move your hand to slide it across the blocks, and release to drop it snapped to the nearest grid cell.
Left hand, the camera. Two schemes, switchable live in the Tuning panel. Inspect (default) orbits the selected building when you pinch. Explore orbits as you move an open hand and pans Blender-style when you pinch.
Both hands, zoom. Pinch with both hands, then spread them apart to zoom in or bring them together to zoom out. A two-hand pinch always takes priority, so zoom never gets confused with move or orbit.
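The priority rule can be pictured as a small pure function that turns the current hand state into a single mode. This is a rough sketch under stated assumptions: the resolveMode name, the shape of its argument, and the "IDLE" mode are hypothetical, and it assumes only one mode is active at a time (the real app may let the right hand move a building while the left hand orbits).

```javascript
// left / right: null when that hand is not tracked, otherwise
// { pinching: boolean }. scheme: "inspect" or "explore".
function resolveMode({ left, right, scheme }) {
  const leftPinch = Boolean(left && left.pinching);
  const rightPinch = Boolean(right && right.pinching);

  // A two-hand pinch always wins, so zoom never competes with move or orbit.
  if (leftPinch && rightPinch) return "ZOOM";

  // Right hand is the manipulator.
  if (rightPinch) return "MOVE";

  // Left hand is the camera; behavior depends on the selected scheme.
  if (scheme === "explore") {
    if (leftPinch) return "PAN"; // pinch pans, Blender-style
    if (left) return "ORBIT";    // open hand orbits
  } else if (leftPinch) {
    return "ORBIT";              // inspect: pinch orbits the selection
  }
  return "IDLE";
}
```

For example, resolveMode({ left: { pinching: true }, right: { pinching: true }, scheme: "inspect" }) yields "ZOOM" regardless of the scheme, because the two-hand check runs first.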

Every sensitivity (move, orbit, pan, zoom) is tunable live from an in-app panel, and the left-hand scheme flips on the fly with no reload.
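To make the tunable sensitivities concrete, here is one plausible way a zoom sensitivity could feed into the two-hand gesture. Everything named here is an assumption for illustration: the tuning object, the zoomedRadius helper, the radius limits, and the choice to use the sensitivity as an exponent on the hand-spread ratio.

```javascript
// Hypothetical live-tunable values, e.g. bound to sliders in a panel.
const tuning = { move: 1.0, orbit: 1.0, pan: 1.0, zoom: 1.0 };

// startRadius:   camera distance when both hands began pinching
// startSpread:   distance between the two pinch points at that moment
// currentSpread: distance between the two pinch points now
// Spreading the hands (ratio > 1) shrinks the radius, i.e. zooms in.
function zoomedRadius(startRadius, startSpread, currentSpread,
                      sensitivity = tuning.zoom, minRadius = 5, maxRadius = 200) {
  if (startSpread <= 0 || currentSpread <= 0) return startRadius;
  const ratio = currentSpread / startSpread;
  const radius = startRadius / Math.pow(ratio, sensitivity);
  return Math.min(maxRadius, Math.max(minRadius, radius));
}
```

With a sensitivity of 1, doubling the spread halves the camera distance; a higher sensitivity makes the same hand motion zoom further, and because the value is read on each call, changing it takes effect without a reload.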

How it works

Everything runs client-side, every frame:

webcam (getUserMedia)
      │
      ▼
[ MediaPipe HandLandmarker (WASM) ]  ──> 21 landmarks x up to 2 hands + handedness
      │
      ▼
[ per-hand pinch detection ]         ──> (thumb-index distance / palm size), with hysteresis
      │
      ▼
[ two-hand state machine ]           ──> both = ZOOM, right = MOVE, left = ORBIT / PAN
      │
      ▼
[ Three.js scene ]                   ──> raycast pick + grid snap, spherical orbit camera,
 (procedural city + shadows)            free-pan target, soft shadow maps

Webcam frames become 21 landmarks per hand, then a two-hand state machine drives the Three.js scene.
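The per-hand pinch step in the pipeline could be sketched roughly as follows. The landmark indices follow MediaPipe's hand model (0 wrist, 4 thumb tip, 8 index fingertip, 9 middle-finger MCP), but measuring palm size as wrist to middle MCP, the 0.25 and 0.40 thresholds, and the function names are assumptions, not values taken from the project.

```javascript
const WRIST = 0, THUMB_TIP = 4, INDEX_TIP = 8, MIDDLE_MCP = 9;

// 2D distance between two normalized landmarks ({ x, y }).
function dist(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Thumb-to-index gap divided by palm size, so the value stays roughly
// constant whether the hand is near the camera or far from it.
function pinchRatio(landmarks) {
  const palm = dist(landmarks[WRIST], landmarks[MIDDLE_MCP]);
  if (palm === 0) return Infinity;
  return dist(landmarks[THUMB_TIP], landmarks[INDEX_TIP]) / palm;
}

// Hysteresis: engage below one threshold, release only above a larger one.
// Create one detector per hand so the two hands keep separate state.
function createPinchDetector(engageAt = 0.25, releaseAt = 0.4) {
  let pinching = false;
  return function update(landmarks) {
    const ratio = pinchRatio(landmarks);
    if (pinching) {
      if (ratio > releaseAt) pinching = false;
    } else if (ratio < engageAt) {
      pinching = true;
    }
    return pinching;
  };
}
```

A ratio that hovers between the two thresholds leaves the state unchanged, which is what stops a held pinch from flickering on and off.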

A few design notes:

No depth model. The third dimension comes from deliberate controls, not from estimating distance off the webcam. Buildings move on the ground plane via a camera-to-ground raycast, and the angled view makes near and far read clearly. This keeps the app light and avoids an extra model dependency.
Pinch with hysteresis. The pinch engages at a smaller thumb-to-index gap than the one at which it releases, so it does not flicker mid-action. The gap is normalized by palm size, so the same thresholds work at any distance from the camera.
Spherical orbit. Hand-motion deltas map to azimuth and elevation around a target. Inspect mode points the target at the selected building; explore mode lets you pan that target freely.
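The ground-plane move and the spherical orbit both reduce to a few lines of geometry. The sketch below uses plain math rather than Three.js so it stands alone; in the app these steps would more likely go through Three.js's own raycasting and camera objects. The function names, the y-up ground plane at y = 0, and the elevation limits are assumptions.

```javascript
// Intersect a ray with the ground plane y = 0. Returns null when the ray
// is parallel to the ground or points away from it.
function rayToGround(origin, dir) {
  if (Math.abs(dir.y) < 1e-9) return null;
  const t = -origin.y / dir.y;
  if (t < 0) return null;
  return { x: origin.x + t * dir.x, z: origin.z + t * dir.z };
}

// Snap a ground point to the nearest grid cell center.
function snapToGrid(point, cellSize) {
  return {
    x: Math.round(point.x / cellSize) * cellSize,
    z: Math.round(point.z / cellSize) * cellSize,
  };
}

// Turn a hand-motion delta into new orbit angles, keeping the camera
// above the ground and short of straight overhead.
function applyOrbitDelta(angles, dx, dy, sensitivity) {
  const elevation = angles.elevation + dy * sensitivity;
  return {
    azimuth: angles.azimuth - dx * sensitivity,
    elevation: Math.min(Math.PI / 2 - 0.05, Math.max(0.1, elevation)),
  };
}

// Camera position on a sphere of the given radius around the target.
function orbitPosition(target, radius, azimuth, elevation) {
  const horizontal = radius * Math.cos(elevation);
  return {
    x: target.x + horizontal * Math.sin(azimuth),
    y: target.y + radius * Math.sin(elevation),
    z: target.z + horizontal * Math.cos(azimuth),
  };
}
```

In inspect mode the target passed to orbitPosition would be the selected building's position; in explore mode it would be a free point that panning moves around.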

Tech stack

MediaPipe Tasks Vision 0.10.14: HandLandmarker tracking up to two hands in the browser via WebAssembly, GPU-accelerated with automatic CPU fallback.
Three.js r160: WebGL scene, procedural geometry, and PCF soft shadow maps.
getUserMedia for direct webcam access, with all tracking on-device. The camera feed is never recorded or transmitted.
Plain HTML + ES modules, libraries loaded from a CDN. No build step.