![Cutlery](https://scx1.b-cdn.net/csz/news/800a/2024/cutlery.jpg)
Credit: Pixabay/CC0 Public Domain
Imagine a pizza maker working in a kitchen: measuring out the flour, adding the water and yeast, kneading it into a dough, slicing the pepperoni and other toppings while they let it rise, stretching the dough before assembling the pizza and sliding it into the oven.
Most people can't walk through the steps of pizza-making with the fluency of a seasoned chef, but they can see and identify what goes on. They can watch as the chef opens a bag of flour and scoops some out, pulls pepperoni from the fridge and runs it through a slicer several times, and grates cheese on a box grater. They understand that, ultimately, flour becomes dough, which becomes pizza.
Can computer vision software make the same connections?
Notes for success
For Zhu Bin, assistant professor of computer science at SMU, the answer lies in VISOR (VIdeo Segmentations and Object Relations), a dataset that Zhu and his collaborators are working on.
By outlining specific objects such as hands, knives, flour scoops, and graters in first-person (also known as egocentric) videos and assigning them identifying labels, VISOR aims to identify individual objects, understand hand-object interactions, and infer object transformations, such as flour turning into dough or potatoes turning into french fries.
This process of outlining and labeling objects is called “annotation” and can be achieved using either “sparse masks” or “dense masks”.
“A sparse mask is an annotation that is applied to select keyframes in a video, rather than every frame in the video,” explains Professor Zhu.
“These masks are curated to outline objects at key moments or intervals in a video sequence. Dense masks are detailed continuous pixel-level annotations that cover every frame in a segment of video. In VISOR, they are often generated by interpolation between sparse masks, using computer vision algorithms to fill in the gaps.”
“Sparse masks are extremely useful for fine-grained, egocentric video understanding, such as action recognition (e.g. 'cutting a potato') and object state changes. In contrast, dense annotations enable analysis of how objects are manipulated over time, providing insights into human-object interactions that may be missed with sparse annotations alone.”
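To make the sparse-versus-dense distinction concrete, the sketch below shows one purely hypothetical way dense masks could be filled in between two sparse keyframe annotations. It is not VISOR's actual pipeline, which relies on computer vision algorithms rather than the simple linear blending used here, and it assumes each object mask is a polygon with the same number of vertices at every keyframe.

```python
# Illustrative sketch only: all names are hypothetical, and real interpolation
# must also handle occlusion and deformation, which this does not.

def interpolate_masks(sparse_masks, num_frames):
    """Produce a dense, per-frame polygon mask by blending between
    the sparse keyframe annotations that bracket each frame.

    sparse_masks: dict mapping keyframe index -> list of (x, y) vertices,
                  assuming every keyframe polygon has the same vertex order.
    """
    keyframes = sorted(sparse_masks)
    dense = {}
    for frame in range(num_frames):
        # Frames outside the annotated range just copy the nearest keyframe.
        if frame <= keyframes[0]:
            dense[frame] = sparse_masks[keyframes[0]]
            continue
        if frame >= keyframes[-1]:
            dense[frame] = sparse_masks[keyframes[-1]]
            continue
        prev = max(k for k in keyframes if k <= frame)
        nxt = min(k for k in keyframes if k >= frame)
        if prev == nxt:  # the frame is itself a keyframe
            dense[frame] = sparse_masks[prev]
            continue
        # Linearly blend vertex positions between the two keyframes.
        t = (frame - prev) / (nxt - prev)
        dense[frame] = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(sparse_masks[prev], sparse_masks[nxt])
        ]
    return dense

# Two sparse keyframe annotations for a "dough" polygon, 30 frames apart.
sparse = {0: [(10, 10), (40, 10), (25, 35)], 30: [(12, 14), (44, 12), (28, 40)]}
dense_masks = interpolate_masks(sparse, num_frames=31)
print(len(dense_masks), dense_masks[15])
```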
VISOR features over 10 million dense masks across 2.8 million images, and each annotated item has a mask that is assigned an entity class (e.g., “knife”, “fork”, “plate”, “cupboard”, “onion”, “egg”) and a macro category (e.g., “cutlery”, “appliance”, “container”, “vegetable”). For example, the entity classes “knife” and “fork” fall into the macro category “cutlery”. Overall, VISOR has 1,477 labeled entities that identify and annotate many kitchen objects.
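As a rough illustration of that two-level labeling scheme, a single annotated mask might be stored as a record carrying both labels. The field names below are hypothetical rather than VISOR's released schema; only the knife/fork-to-cutlery mapping comes from the text above.

```python
# Hypothetical annotation record; field names are illustrative, not VISOR's schema.
annotation = {
    "frame": 1042,                   # frame index in the video
    "entity_class": "knife",         # fine-grained label
    "macro_category": "cutlery",     # coarse grouping ("knife", "fork" -> "cutlery")
    "polygon": [(120, 88), (131, 90), (128, 140), (118, 138)],  # mask outline in pixels
}
```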
Besides identifying objects and annotating how objects interact with the human hand, VISOR also proposes the task of “where does this come from?”. In the case of the pizza maker, the flour is identified as coming from the flour bag. VISOR's annotations cover an average of 12 minutes of video, which is significantly longer than most existing datasets. This allows for detailed analysis and inferences about the state of objects over time, facilitating the study of persistent interactions and changes.
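The “where does this come from?” relation could similarly be expressed as a link between two annotated entities. Again, this is a hypothetical sketch of the idea, not VISOR's actual format.

```python
# Hypothetical "where does this come from?" relation for the pizza example.
source_relation = {
    "entity": "flour",
    "comes_from": "flour bag",    # the container the flour is taken from
    "first_seen_frame": 210,      # illustrative frame where the transfer happens
}
```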
Obstacles and future uses
Unlike many other datasets that focus on third-person perspectives, such as UVO (Unidentified Video Objects), VISOR's use of egocentric videos from the EPIC-KITCHENS dataset poses an additional challenge: egocentric videos are highly dynamic, with objects frequently occluded as a hand moves over them and deforming as they are handled, as in the flour-to-dough pizza example.
VISOR aims to overcome obstacles in the following ways:
- Fine-grained egocentric video understanding: VISOR provides object masks with clear object boundaries even in the presence of large deformations. This accuracy enables the development of advanced deep models for analyzing fine-grained interactions and deformations in videos, such as egocentric action recognition and object state analysis.
- Enhanced interaction understanding: Detailed annotations of how hands interact with different objects help study and model human behavior, especially in natural settings like the kitchen.
- Long-term video understanding: VISOR supports research on long-term reasoning in videos, such as long-term object tracking, by continuously annotating object actions and transformations across time (e.g., peeling and cooking an onion).
“Once the technology matures and technical challenges such as real-time processing are resolved, technologies such as VISOR can be used to develop assistive technologies to help people with disabilities and the elderly operate and manage real-world tasks more independently,” Professor Zhu told the Research Bureau.
“With the ability to understand complex object interactions and predict future actions, robots can be used in a variety of activities, including cooking, cleaning and manufacturing.”
He added, “Egocentric video understanding can also be leveraged to develop virtual reality (VR) or augmented reality (AR) based training and education tools, providing step-by-step guidance from a first-person perspective.”
Provided by Singapore Management University
Citation: Building Computer Vision in the Kitchen (May 31, 2024), retrieved May 31, 2024 from https://techxplore.com/news/2024-05-vision-kitchen.html