Biometric Recognition & Object-Tracking Engagement with Universal Sensing
A real-time vision system that detects objects, tracks hands, recognizes gestures, and understands hand animations, all running simultaneously in a single pipeline.
The entire system runs on CPU at ~21 FPS. No GPU required.
A camera feed is processed by four subsystems running in parallel: YOLO-World detection, MediaPipe dual-hand tracking, learned gesture classification, and DTW-based animation recognition.
Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs in real time on CPU.
Open-vocabulary detection using YOLO-World. The operator decides what exists in the scene.
Real-time detection with user-driven search list. Classes added and removed on the fly.
Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.
BROTEUS starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists across restarts.
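The exact detection bindings aren't named above; a minimal sketch of runtime class management, assuming the Ultralytics YOLO-World API and a small JSON file for persistence (the file name search_list.json and the add_class/remove_class helpers are hypothetical), could look like this:

```python
# Sketch assuming the Ultralytics YOLO-World API; search_list.json and the
# helper functions are hypothetical stand-ins, not BROTEUS's actual code.
import json
from pathlib import Path

from ultralytics import YOLOWorld

SEARCH_LIST = Path("search_list.json")

def load_classes() -> list[str]:
    # Start with zero classes on first run; restore the saved list otherwise.
    return json.loads(SEARCH_LIST.read_text()) if SEARCH_LIST.exists() else []

def save_classes(classes: list[str]) -> None:
    SEARCH_LIST.write_text(json.dumps(classes))

model = YOLOWorld("yolov8s-world.pt")
classes = load_classes()

def add_class(name: str) -> None:
    # Operator typed a new object name in the UI: extend the open-vocabulary query.
    if name not in classes:
        classes.append(name)
        model.set_classes(classes)
        save_classes(classes)

def remove_class(name: str) -> None:
    # Single click in the UI removes the class and re-persists the list.
    if name in classes:
        classes.remove(name)
        if classes:
            model.set_classes(classes)
        save_classes(classes)
```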
Static hand pose classification with rotation-invariant 35-dimensional encoding.
Dual-hand recognition with independent left/right tracking.
BROTEUS tracks up to two hands using MediaPipe HandLandmarker, giving 21 3D keypoints per hand.
Left and right are tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.
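A rough sketch of the per-frame tracking call with the MediaPipe Tasks HandLandmarker follows; the model path and the track_hands helper are assumptions, not BROTEUS's actual module layout.

```python
# Sketch using the MediaPipe Tasks HandLandmarker in video mode; the model
# asset path and the track_hands helper are assumptions.
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.HandLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="hand_landmarker.task"),
    num_hands=2,                              # dual-hand tracking
    running_mode=vision.RunningMode.VIDEO,
)
landmarker = vision.HandLandmarker.create_from_options(options)

def track_hands(frame_rgb, timestamp_ms):
    """Return (side, 21 x (x, y, z)) tuples, one per detected hand."""
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
    result = landmarker.detect_for_video(mp_image, timestamp_ms)
    hands = []
    for handedness, landmarks in zip(result.handedness, result.hand_landmarks):
        side = handedness[0].category_name    # "Left" or "Right"
        keypoints = [(lm.x, lm.y, lm.z) for lm in landmarks]  # 21 keypoints
        hands.append((side, keypoints))
    return hands
```

Keeping left and right in separate classifiers and separate state simply means routing each (side, keypoints) pair to its own downstream pipeline.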
The palm-orientation features set this apart from typical "is the finger up or down" approaches: when a hand rotates, the curl angles barely change, but the palm normal flips completely. That signal is encoded explicitly.
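The exact 35-dimensional layout isn't spelled out here, but the two ingredients the text calls out, per-finger curl angles and the palm normal, can be sketched roughly as follows (landmark indices follow the MediaPipe hand model; the helper names are hypothetical):

```python
# Sketch of the two feature families described above: finger curl angles and
# the palm normal. This is not the full 35-dimensional BROTEUS encoding.
import numpy as np

WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17
# (MCP, PIP/IP, TIP) landmark triples per finger for a simple curl estimate.
FINGER_JOINTS = {
    "thumb":  (2, 3, 4),
    "index":  (5, 6, 8),
    "middle": (9, 10, 12),
    "ring":   (13, 14, 16),
    "pinky":  (17, 18, 20),
}

def curl_angle(a, b, c):
    # Angle at the middle joint b: ~pi for a straight finger, smaller when curled.
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def palm_normal(pts):
    # Normal of the plane spanned by wrist->index_mcp and wrist->pinky_mcp.
    # Curl angles barely move when the hand rotates, but this vector flips.
    v1 = pts[INDEX_MCP] - pts[WRIST]
    v2 = pts[PINKY_MCP] - pts[WRIST]
    n = np.cross(v1, v2)
    return n / (np.linalg.norm(n) + 1e-9)

def pose_features(keypoints):
    pts = np.asarray(keypoints, dtype=np.float32)          # shape (21, 3)
    curls = [curl_angle(pts[a], pts[b], pts[c]) for a, b, c in FINGER_JOINTS.values()]
    return np.concatenate([curls, palm_normal(pts)])        # 5 curls + 3 normal components
```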
Temporal motion recognition via Dynamic Time Warping with Sakoe-Chiba band constraint.
Recognizing a learned hand animation in real time.
Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin.
Matching motion against a stored template is harder than matching a frozen pose, because timing varies from one performance to the next. BROTEUS solves this with Dynamic Time Warping: a 12-d temporal feature vector (curls + palm normal + position + velocity) is pushed into a sliding window covering the last ~3 seconds.
DTW handles speed variation naturally: a fast wave and slow wave produce different frame counts but share the same motion shape.
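A from-scratch sketch of band-constrained DTW over those 12-d frames (not BROTEUS's actual implementation; the band width and the length normalization are assumptions):

```python
# From-scratch DTW with a Sakoe-Chiba band; a sketch, not BROTEUS's code.
# `query` and `template` are (T, 12) arrays of the temporal feature vectors.
import numpy as np

def dtw_distance(query, template, band=10):
    n, m = len(query), len(template)
    band = max(band, abs(n - m))           # guarantee a valid path when lengths differ
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        # Sakoe-Chiba band: only cells near the diagonal are visited, which
        # bounds how far the alignment can warp and keeps the match cheap.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = np.linalg.norm(query[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Normalize by combined length so fast and slow performances compare fairly.
    return cost[n, m] / (n + m)
```

With the cumulative cost normalized by sequence length, a one-second fast wave and a three-second slow wave both score low against the same template as long as their motion shapes align inside the band.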