ORION · Perception Layer

BROTEUS.

Biometric Recognition & Object-Tracking Engagement with Universal Sensing

A real-time vision system that detects objects, tracks hands, recognizes static gestures, and matches learned hand animations, all running simultaneously in a single pipeline.

The entire system runs on CPU at ~21 FPS. No GPU required.

21 FPS · CPU Throughput
87% · Confidence
42 · Hand Keypoints
35-d · Feature Vector
Project Classification: Perception Pipeline · LIVE
Role: Perception Layer
Port: localhost:8100
Detector: YOLO-World
Depth: MiDaS (mono)
Hands: MediaPipe
Stack: PyTorch · FastAPI
Built By
Swan Yi Htet
+ David Young
§ 01

Pipeline Overview

A camera feed is processed through four subsystems running in parallel: YOLO-World detection, MediaPipe dual-hand tracking, learning-based gesture classification, and DTW-based animation recognition.

Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs in real-time on CPU.
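One way to structure that per-frame fan-out is a thread pool that hands the same frame to every subsystem and merges the partial results. This is a minimal sketch with stub stages; the function names stand in for the real YOLO-World, MediaPipe, gesture, and DTW components and are not BROTEUS's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real subsystems (illustrative names only).
def detect_objects(frame):   return {"objects": []}
def track_hands(frame):      return {"hands": []}
def classify_gesture(frame): return {"gesture": None}
def match_animation(frame):  return {"animation": None}

SUBSYSTEMS = [detect_objects, track_hands, classify_gesture, match_animation]

def process_frame(frame, pool):
    """Fan one frame out to every subsystem and merge the partial results."""
    merged = {}
    for partial in pool.map(lambda fn: fn(frame), SUBSYSTEMS):
        merged.update(partial)
    return merged

with ThreadPoolExecutor(max_workers=4) as pool:
    result = process_frame(b"frame-bytes", pool)
```

Threads (rather than processes) suit this shape because the heavy stages release the GIL inside their native inference code.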

§ 02

Object Detection

Open-vocabulary detection using YOLO-World. The operator decides what exists in the scene.

Real-time detection with an operator-driven search list; classes are added and removed on the fly.

Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.

BROTEUS starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists across restarts.
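A sketch of that operator-editable, persistent search list, assuming a JSON file for storage. The path and method names are my own, not BROTEUS's; the hookup to the detector is only noted as a comment.

```python
import json
from pathlib import Path

class SearchList:
    """Operator-editable class list that survives restarts (illustrative sketch)."""

    def __init__(self, path="search_list.json"):
        self.path = Path(path)
        self.classes = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, name):
        """Add a class name once; the detector starts searching for it."""
        if name not in self.classes:
            self.classes.append(name)
            self._save()

    def remove(self, name):
        """Single-click removal of a class from the search list."""
        if name in self.classes:
            self.classes.remove(name)
            self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.classes))
        # In the real pipeline the updated list would be pushed to the
        # open-vocabulary detector here, e.g. model.set_classes(self.classes)
        # with ultralytics' YOLOWorld.
```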

Confidence: 79-87%
Throughput: 21 FPS · CPU
Tracking: IoU · Persistent IDs
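The IoU tracker with persistent IDs can be sketched as a greedy matcher between the current frame's boxes and the previous frame's tracks. The 0.3 threshold and one-frame memory are assumptions; BROTEUS's class-vote stabilization is omitted here.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

class IoUTracker:
    """Greedy IoU matcher that keeps IDs stable across frames (sketch)."""

    def __init__(self, thresh=0.3):
        self.thresh, self.next_id, self.tracks = thresh, 0, {}  # id -> last box

    def update(self, boxes):
        assigned, unmatched = {}, dict(self.tracks)
        for box in boxes:
            best_id, best = None, self.thresh
            for tid, prev in unmatched.items():
                score = iou(box, prev)
                if score > best:
                    best_id, best = tid, score
            if best_id is None:          # no overlap: start a new track
                best_id = self.next_id
                self.next_id += 1
            else:                        # claimed: can't match twice
                del unmatched[best_id]
            assigned[best_id] = box
        self.tracks = assigned
        return assigned
```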
§ 03

Gesture Recognition

Static hand pose classification with rotation-invariant 35-dimensional encoding.

Dual-hand recognition with independent left/right tracking.

BROTEUS tracks up to two hands with MediaPipe HandLandmarker, which yields 21 three-dimensional keypoints per hand (42 total).

Left and right are tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.

Feature Vector Breakdown

The 35-Dimensional Encoding

35-d · per hand pose
5 · finger curls
5 · tip-palm distances
10 · inter-tip distances
5 · z-depth ratios
4 · thumb proximity
3 · palm normal
3 · palm direction

The palm-orientation features set this encoding apart from typical "is the finger up or down" approaches. When a hand rotates, the curl angles barely change, but the palm normal flips completely; that signal is encoded explicitly.
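The breakdown above can be reconstructed as follows. The landmark indices are MediaPipe's real topology (wrist 0, fingertips 4/8/12/16/20), but the exact feature definitions and ordering are my reconstruction, not BROTEUS's code.

```python
import numpy as np
from itertools import combinations

WRIST = 0
TIPS = [4, 8, 12, 16, 20]   # thumb..pinky tips (MediaPipe indices)
PIPS = [3, 6, 10, 14, 18]   # joint at which the curl angle is measured
MCPS = [2, 5, 9, 13, 17]

def hand_features(kp):
    """kp: (21, 3) array of MediaPipe keypoints -> 35-d feature vector."""
    palm = kp[WRIST]
    scale = np.linalg.norm(kp[9] - palm) or 1.0   # normalize by palm size

    def angle(a, b, c):   # curl = angle at the middle joint b
        u, v = a - b, c - b
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
        return np.arccos(np.clip(cos, -1, 1))

    curls = [angle(kp[m], kp[p], kp[t]) for m, p, t in zip(MCPS, PIPS, TIPS)]  # 5
    tip_palm = [np.linalg.norm(kp[t] - palm) / scale for t in TIPS]            # 5
    inter_tip = [np.linalg.norm(kp[a] - kp[b]) / scale
                 for a, b in combinations(TIPS, 2)]                            # 10
    z_ratio = [(kp[t][2] - palm[2]) / scale for t in TIPS]                     # 5
    thumb_prox = [np.linalg.norm(kp[4] - kp[t]) / scale for t in TIPS[1:]]     # 4
    normal = np.cross(kp[5] - palm, kp[17] - palm)
    normal = normal / (np.linalg.norm(normal) + 1e-9)                          # 3
    direction = (kp[9] - palm) / scale                                         # 3
    return np.concatenate([curls, tip_palm, inter_tip, z_ratio,
                           thumb_prox, normal, direction])                     # 35
```

Note how the distance features are divided by a palm-size scale so the vector is invariant to how far the hand is from the camera.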

§ 04

Animation Recognition

Temporal motion recognition via Dynamic Time Warping with Sakoe-Chiba band constraint.

Recognizing a learned hand animation in real-time.

Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin.

BROTEUS recognizes these motions with Dynamic Time Warping. A 12-d temporal feature vector (curls + palm normal + position + velocity) is pushed into a sliding window covering the last ~3 seconds.

DTW handles speed variation naturally: a fast wave and slow wave produce different frame counts but share the same motion shape.
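A minimal DTW with a Sakoe-Chiba band looks like the sketch below: the band restricts each frame of one sequence to align only with nearby frames of the other, bounding both cost and pathological warps. The band half-width of 10 frames is an assumption; BROTEUS's value isn't stated.

```python
import numpy as np

def dtw_distance(a, b, band=10):
    """DTW between two sequences of feature vectors, shapes (n, d) and (m, d),
    restricted to a Sakoe-Chiba band of half-width `band` frames."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - band), min(m, i + band)   # band constraint
        for j in range(lo, hi + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

The speed-invariance claim is visible directly: a sequence and a half-speed copy of it (every frame doubled) warp onto each other at zero cost.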

§ 05

Architecture

BROTEUS Server (FastAPI · Port 8100)
├─ YOLO-World · Detection
├─ MediaPipe · Dual Hands
│  ├─ Gesture (35-d)
│  └─ Anim (12-d + DTW)
├─ MiDaS · Depth
├─ IoU Object Tracker · Persistent IDs · Class-vote stability
├─ Grasp Affordance Scorer
└─ WebSocket Frame Streaming
     │ WebSocket / JSON
     ▼
Live Dashboard (Browser · localhost:8100)