Multimodal Scene Understanding Framework and Its Application to Cooking Recognition
Authors: Ryosuke Kojima, Osamu Sugiyama, Kazuhiro Nakadai
Affiliation: 1. Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan; 2. Honda Research Institute Japan Co., Ltd., Saitama, Japan
Abstract: We propose a multimodal “scene understanding” framework that uses sensory and text information. Scene understanding is defined as extracting information about the surrounding environment, such as What, When, Where, Who, Why, and How. Although scene understanding has been studied, Why and How information has not been considered. We constructed a framework that extracts How information in addition to the conventional information, based on multimodality and background knowledge. The framework was applied to a cooking scene, in which How information was defined as the cooking procedure. We evaluated the framework by constructing an audio-visual multimodal cooking recognition system that uses recipes as background knowledge. The system adopts a Convolutional Neural Network (CNN) and a Hierarchical Hidden Markov Model (HHMM). Our experiments showed that the proposed framework is robust in noisy and/or occluded situations. An interactive cooking support system based on the proposed framework might suggest the next step of a cooking procedure via human–robot communication.