🤖 Hand Action Prediction with VITRA
Upload a landscape-orientation, egocentric (first-person) image containing one or both hands, then provide text instructions to predict future 3D hand trajectories.
📝 Steps:
- Upload a landscape-view image containing hand(s).
- Enter text instructions describing the desired task.
- Optionally configure advanced settings, then click "Generate 3D Hand Trajectory".
💡 Tips:
- Use Left/Right Hand: Select which hand(s) to predict, based on which hands are detected in the image and which you want trajectories for.
- Instruction: Provide clear, specific imperative instructions for the left and right hands separately, entering each in its corresponding field. If the results are unsatisfactory, try adding more detail (e.g., object color or orientation).
- For best inference quality, capture landscape-view images from a camera height close to that of a human head. Highly unusual or distorted hand poses/positions may cause inference failures.
- Note that each generation produces only a single action chunk starting from the current state, which does not necessarily complete the entire task. Executing an entire chunk in one step may also reduce precision.
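The chunking tip above can be sketched in code: rather than executing a whole predicted chunk open-loop, execute only a short prefix and re-predict from the resulting state. This is a minimal illustrative sketch; `toy_predictor` is a hypothetical stand-in, not the VITRA API.

```python
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float, float]  # a 3D hand position

def execute_receding_horizon(
    predict_chunk: Callable[[Waypoint], List[Waypoint]],
    start: Waypoint,
    steps_per_chunk: int,
    total_steps: int,
) -> List[Waypoint]:
    """Execute only the first `steps_per_chunk` waypoints of each
    predicted chunk, then re-predict from the new state, instead of
    running one whole chunk open-loop."""
    executed: List[Waypoint] = []
    state = start
    while len(executed) < total_steps:
        chunk = predict_chunk(state)      # one action chunk from the current state
        prefix = chunk[:steps_per_chunk]  # keep only a short prefix of the chunk
        executed.extend(prefix)
        state = prefix[-1]                # re-predict from where the hand ended up
    return executed[:total_steps]

# Toy predictor standing in for the model: moves the hand +0.1 in x per step.
def toy_predictor(state: Waypoint) -> List[Waypoint]:
    x, y, z = state
    return [(x + 0.1 * (i + 1), y, z) for i in range(15)]

traj = execute_receding_horizon(
    toy_predictor, (0.0, 0.0, 0.0), steps_per_chunk=5, total_steps=10
)
```

Re-predicting after each prefix keeps the trajectory conditioned on the latest state, which is why executing an entire chunk in one step tends to lose precision.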
📥 Input
⚙️ Prediction Settings
🖐️ Select Hands:
✍️ Instructions:
Left hand:
Right hand:
📋 Final Instruction:
Left hand: Put the trash into the garbage. Right hand: None.
🎬 Output
📚 Examples
👇 Click any example below to load the image and instruction