Biometric Recognition & Object-Tracking Engagement with Universal Sensing
A real-time vision system that detects objects, tracks hands, recognizes gestures, and understands hand animations, all running simultaneously in a single pipeline.
The entire system runs on CPU at ~21 FPS. No GPU required.
A camera feed is processed by four subsystems running in parallel: YOLO-World detection, MediaPipe dual-hand tracking, learned gesture classification, and DTW-based animation recognition.
Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs in real time on CPU.
Open-vocabulary detection using YOLO-World. The operator decides what exists in the scene.
Real-time detection with user-driven search list. Classes added and removed on the fly.
Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.
BROTEUS starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists across restarts.
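The exact detection bindings aren't named above; a minimal sketch of runtime class management, assuming the Ultralytics YOLO-World API and a small JSON file for persistence (the file name search_list.json and the add_class/remove_class helpers are hypothetical), could look like this:

```python
# Sketch assuming the Ultralytics YOLO-World API; search_list.json and the
# helper functions are hypothetical stand-ins, not BROTEUS's actual code.
import json
from pathlib import Path

from ultralytics import YOLOWorld

SEARCH_LIST = Path("search_list.json")

def load_classes() -> list[str]:
    # Start with zero classes on first run; restore the saved list otherwise.
    return json.loads(SEARCH_LIST.read_text()) if SEARCH_LIST.exists() else []

def save_classes(classes: list[str]) -> None:
    SEARCH_LIST.write_text(json.dumps(classes))

model = YOLOWorld("yolov8s-world.pt")
classes = load_classes()

def add_class(name: str) -> None:
    # Operator typed a new object name in the UI: extend the open-vocabulary query.
    if name not in classes:
        classes.append(name)
        model.set_classes(classes)
        save_classes(classes)

def remove_class(name: str) -> None:
    # Single click in the UI removes the class and re-persists the list.
    if name in classes:
        classes.remove(name)
        if classes:
            model.set_classes(classes)
        save_classes(classes)
```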
Static hand pose classification with rotation-invariant 35-dimensional encoding.
Dual-hand recognition with independent left/right tracking.
BROTEUS tracks up to two hands using MediaPipe HandLandmarker, giving 21 3D keypoints per hand.
Left and right are tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.
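A rough sketch of the per-frame tracking call with the MediaPipe Tasks HandLandmarker follows; the model path and the track_hands helper are assumptions, not BROTEUS's actual module layout.

```python
# Sketch using the MediaPipe Tasks HandLandmarker in video mode; the model
# asset path and the track_hands helper are assumptions.
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.HandLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="hand_landmarker.task"),
    num_hands=2,                              # dual-hand tracking
    running_mode=vision.RunningMode.VIDEO,
)
landmarker = vision.HandLandmarker.create_from_options(options)

def track_hands(frame_rgb, timestamp_ms):
    """Return (side, 21 x (x, y, z)) tuples, one per detected hand."""
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
    result = landmarker.detect_for_video(mp_image, timestamp_ms)
    hands = []
    for handedness, landmarks in zip(result.handedness, result.hand_landmarks):
        side = handedness[0].category_name    # "Left" or "Right"
        keypoints = [(lm.x, lm.y, lm.z) for lm in landmarks]  # 21 keypoints
        hands.append((side, keypoints))
    return hands
```

Keeping left and right in separate classifiers and separate state simply means routing each (side, keypoints) pair to its own downstream pipeline.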
The palm-orientation features set this apart from typical "is the finger up or down" approaches: when a hand rotates, the curl angles barely change, but the palm normal flips completely. That signal is encoded explicitly.
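The exact 35-dimensional layout isn't spelled out here, but the two ingredients the text calls out, per-finger curl angles and the palm normal, can be sketched roughly as follows (landmark indices follow the MediaPipe hand model; the helper names are hypothetical):

```python
# Sketch of the two feature families described above: finger curl angles and
# the palm normal. This is not the full 35-dimensional BROTEUS encoding.
import numpy as np

WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17
# (MCP, PIP/IP, TIP) landmark triples per finger for a simple curl estimate.
FINGER_JOINTS = {
    "thumb":  (2, 3, 4),
    "index":  (5, 6, 8),
    "middle": (9, 10, 12),
    "ring":   (13, 14, 16),
    "pinky":  (17, 18, 20),
}

def curl_angle(a, b, c):
    # Angle at the middle joint b: ~pi for a straight finger, smaller when curled.
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def palm_normal(pts):
    # Normal of the plane spanned by wrist->index_mcp and wrist->pinky_mcp.
    # Curl angles barely move when the hand rotates, but this vector flips.
    v1 = pts[INDEX_MCP] - pts[WRIST]
    v2 = pts[PINKY_MCP] - pts[WRIST]
    n = np.cross(v1, v2)
    return n / (np.linalg.norm(n) + 1e-9)

def pose_features(keypoints):
    pts = np.asarray(keypoints, dtype=np.float32)          # shape (21, 3)
    curls = [curl_angle(pts[a], pts[b], pts[c]) for a, b, c in FINGER_JOINTS.values()]
    return np.concatenate([curls, palm_normal(pts)])        # 5 curls + 3 normal components
```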
Temporal motion recognition via Dynamic Time Warping with Sakoe-Chiba band constraint.
Recognizing a learned hand animation in real time.
Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin.
Matching motion against a stored template is harder than matching a frozen pose, because timing varies from one performance to the next. BROTEUS solves this with Dynamic Time Warping: a 12-d temporal feature vector (curls + palm normal + position + velocity) is pushed into a sliding window covering the last ~3 seconds.
DTW handles speed variation naturally: a fast wave and slow wave produce different frame counts but share the same motion shape.
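A from-scratch sketch of band-constrained DTW over those 12-d frames (not BROTEUS's actual implementation; the band width and the length normalization are assumptions):

```python
# From-scratch DTW with a Sakoe-Chiba band; a sketch, not BROTEUS's code.
# `query` and `template` are (T, 12) arrays of the temporal feature vectors.
import numpy as np

def dtw_distance(query, template, band=10):
    n, m = len(query), len(template)
    band = max(band, abs(n - m))           # guarantee a valid path when lengths differ
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        # Sakoe-Chiba band: only cells near the diagonal are visited, which
        # bounds how far the alignment can warp and keeps the match cheap.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = np.linalg.norm(query[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Normalize by combined length so fast and slow performances compare fairly.
    return cost[n, m] / (n + m)
```

With the cumulative cost normalized by sequence length, a one-second fast wave and a three-second slow wave both score low against the same template as long as their motion shapes align inside the band.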