Similar Documents
20 similar documents found.
1.
In this paper we present an annotated audio–video corpus of multi-party meetings. The multimodal corpus provides, for each subject involved in the experimental sessions, six annotation dimensions referring to group dynamics, speech activity, and body activity. The corpus is based on 11 audio- and video-recorded sessions which took place in a lab setting appropriately equipped with cameras and microphones. Our main concern in collecting this multimodal corpus was to explore the possibility of providing feedback services to facilitate group processes and to enhance self-awareness among small groups engaged in meetings. We therefore introduce a coding scheme for annotating the relevant functional roles that appear in small-group interaction. We also discuss the reliability of the coding scheme and present first results for automatic classification.

2.
Video conferencing provides an environment for multiple users linked on a network to hold meetings. Since a large quantity of audio and video data is transferred to multiple users in real time, research into reducing the quantity of data to be transferred has been drawing attention. Such methods extract and transfer only the features of a user from video data and then reconstruct the video conference using virtual humans. The disadvantage of such an approach is that only the positions and features of hands and heads are extracted and reconstructed, whilst the other virtual body parts do not follow the user. In order to enable a virtual human to accurately mimic the entire body of the user in a 3D virtual conference, we examined which features should be extracted to express a user more clearly and how they can be reproduced by a virtual human. This 3D video conferencing estimates the user's pose by comparing predefined images with a photographed image of the user and generates a virtual human that takes the estimated pose. However, this requires predefining a diverse set of images for pose estimation, and it is difficult to define behaviors that express poses correctly. This paper proposes a framework that automatically generates both the pose images used to estimate a user's pose and the behaviors required to present the user through a virtual human in a 3D video conference. The method for applying this framework to a 3D video conference on the basis of the automatically generated data is also described. In the experiment, the proposed framework was implemented on a mobile device, and the process of generating the virtual human's poses and behaviors was verified. Finally, by applying programming by demonstration, we developed a system that can automatically collect the various data necessary for a video conference without any prior knowledge of the video conference system.
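The matching step described above, comparing a photographed image of the user against a set of predefined pose images, amounts to nearest-neighbor search over image descriptors. Below is a minimal sketch of that idea, assuming grayscale numpy arrays and a crude block-mean descriptor; the paper's actual features and matching procedure are not specified in the abstract.

```python
import numpy as np

def block_mean_descriptor(image: np.ndarray, size: int = 16) -> np.ndarray:
    """Reduce a grayscale image to a size x size grid of block means."""
    h, w = image.shape
    bh, bw = h // size, w // size
    trimmed = image[:bh * size, :bw * size].astype(float)
    return trimmed.reshape(size, bh, size, bw).mean(axis=(1, 3)).ravel()

def estimate_pose(user_image: np.ndarray,
                  pose_images: dict[str, np.ndarray]) -> str:
    """Return the label of the stored pose image nearest to the user image."""
    d = block_mean_descriptor(user_image)
    return min(pose_images, key=lambda label:
               np.linalg.norm(block_mean_descriptor(pose_images[label]) - d))
```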

3.
倪宁  卢刚  卜佳俊 《计算机仿真》2006,23(8):184-187,195
Current research on scene detection is based mainly on images and video. However, audio also carries rich scene information, audio-based analysis is computationally cheaper, and for automatic or semi-automatic scene detection, audio-based methods are also more acceptable to users. Audio analysis can therefore serve as an auxiliary means for video scene detection, yielding more accurate scene detection and segmentation. This paper presents a content-based audio analysis system that performs audio-based scene detection and segmentation on video sequences. The system effectively handles many cases in which the image changes while the actual scene does not, and its overall computational complexity is lower than that of video/image-based scene detection and segmentation systems.
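As a sketch of the general idea (not the system in the paper), audio-based scene-change detection can be reduced to comparing short-term spectral statistics of adjacent windows and flagging boundaries where the change is unusually large. The window length, feature, and threshold below are illustrative assumptions.

```python
import numpy as np

def audio_scene_boundaries(signal: np.ndarray, sr: int,
                           win_s: float = 1.0, threshold: float = 2.0) -> list[int]:
    """Flag a scene boundary where the spectra of adjacent windows differ sharply."""
    win = int(win_s * sr)
    n = len(signal) // win
    # One log-magnitude spectrum per window as a crude audio scene feature.
    feats = [np.log1p(np.abs(np.fft.rfft(signal[i * win:(i + 1) * win])))
             for i in range(n)]
    dists = [np.linalg.norm(feats[i] - feats[i - 1]) for i in range(1, n)]
    mu, sigma = np.mean(dists), np.std(dists)
    # A boundary is an adjacent-window distance well above the typical distance.
    return [i * win for i, d in enumerate(dists, start=1)
            if d > mu + threshold * sigma]
```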

4.
Automatic Speech Recognition (ASR) may increase access to spoken information captured in videos. ASR is needed especially for online academic video lectures, which are gradually replacing class lectures and traditional textbooks. This conceptual article examines how technological barriers to ASR in under-resourced languages impair accessibility to video content and demonstrates this with the empirical findings of Hebrew ASR evaluations. We compare ASR with Optical Character Recognition (OCR) as means of facilitating access to speech and textual content and show their current performance in under-resourced languages. We identify ASR of under-resourced languages as the main barrier to searching academic video lectures. We further argue that information retrieval technologies, such as smart video players that combine both ASR and OCR capacities, must come to the fore once ASR technologies have matured, and we suggest that the current state of information retrieval from video lectures in under-resourced languages is equivalent to a knowledge dam.

5.
Data streaming in telepresence environments
In this paper, we discuss data transmission in telepresence environments for collaborative virtual reality applications. We analyze data streams in the context of networked virtual environments and classify them according to their traffic characteristics. Special emphasis is put on geometry-enhanced (3D) video. We review architectures for real-time 3D video pipelines and derive theoretical bounds on the minimal system latency as a function of the transmission and processing delays. Furthermore, we discuss bandwidth issues of differential update coding for 3D video. In our telepresence system - the blue-c - we use a point-based 3D video technology which allows for differentially encoded 3D representations of human users. While we discuss the considerations which led to the design of our three-stage 3D video pipeline, we also elucidate some critical implementation details regarding the decoupling of acquisition, processing, and rendering frame rates, and audio/video synchronization. Finally, we demonstrate the communication and networking features of the blue-c system in its full deployment. We show how the system can be controlled to cope with processing or networking bottlenecks by adapting its multiple components, such as audio, application data, and 3D video.
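To illustrate the kind of latency bound the abstract refers to, a simple pipelined model can be written down. This is a generic formalization under assumed stage delays, not the bound derived in the paper.

```latex
% Generic pipeline model (illustrative assumption, not the paper's derivation):
% a k-stage 3D video pipeline with per-frame stage delays d_1,...,d_k and a
% network transmission delay d_net can never deliver a frame faster than the
% total work spent on it,
\[
  L_{\min} \;\ge\; d_{\mathrm{net}} + \sum_{i=1}^{k} d_i ,
\]
% and if the stages run decoupled at frame rates f_1,...,f_k, a frame may in
% addition wait up to one period 1/f_i at each stage boundary:
\[
  L \;\le\; d_{\mathrm{net}} + \sum_{i=1}^{k}\Bigl(d_i + \frac{1}{f_i}\Bigr).
\]
```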

6.
The MORE system is designed for observation and machine-aided analysis of social interaction in real-life situations, such as classroom teaching scenarios and business meetings. The system utilizes a multichannel approach to data collection, whereby multiple streams of data in a number of different modalities are obtained from each situation. Typically the system collects 360-degree video together with audio feeds from multiple microphones set up in the space. The system includes an advanced server backend component that is capable of performing video processing, feature extraction, and archiving operations on behalf of the user. The feature extraction services form a key part of the system and rely on advanced signal analysis techniques, such as speech processing, motion activity detection, and facial expression recognition, in order to speed up the analysis of large data sets. The provided web interface weaves the multiple streams of information together, utilizes the extracted features as metadata on the audio and video data, and lets the user dive into analyzing the recorded events. The objective of the system is to facilitate easy navigation of multimodal data and enable the analysis of the recorded situations for the purposes of, for example, behavioral studies, teacher training, and business development. A further unique feature of the system is its low setup overhead and high portability, as the lightest MORE setup requires only a laptop computer and the selected set of sensors on site.

7.
Three studies of collaborative activity were conducted as part of research in developing multimedia technology to support collaboration. One study surveyed users' opinions of their use of video conference rooms. Users indicated that the availability of the video conference rooms was too limited, audio quality needed improvement, and a shared drawing space was needed. A second study analyzed videotapes of a work group when meeting face-to-face, video conferencing, and phone conferencing. The analyses found that the noticeable audio delay in video conferencing made it difficult for the participants to manage turn-taking and coordinate eye glances. In the third study, a distributed team was observed under three conditions: using their existing collaboration tools, adding a desktop conferencing prototype (audio, video, and shared drawing tool), and subtracting the video capability from the prototype. Qualitative and quantitative data were collected by videotaping the team, interviewing the team members individually, and recording their usage of the phone, electronic mail, face-to-face meetings, and desktop conferencing. The team's use of the desktop conferencing prototype dropped significantly when the video capability was removed. Analysis of the videotape data showed how the video channel was used to help mediate their interaction and convey visual information. Desktop conferencing apparently reduced e-mail usage and was perceived to reduce the number of shorter, two-person, face-to-face meetings.

8.
Since most current laptop computers no longer include an optical drive, yet English textbooks still supply video material on CD-ROM, learning is impaired. We therefore used Unity3D with the Vuforia SDK to design and implement VBook, a mobile English audio-visual teaching application based on augmented reality. The system first builds a database of target images stored in the cloud, with each video file named after its corresponding target image; it then uses Unity3D to design and render the scene, adds a virtual video playback button to the ImageTarget object, and implements script code that accesses the target-image database and the corresponding videos; finally, it produces a mobile application that is convenient for users. The user simply points the camera at an illustration in the textbook to see a visual effect in which the virtual video is superimposed on the real page, enabling English teaching videos to be played on a mobile device. Applying augmented reality to English video teaching gives users a novel learning method and an interactive experience that blends the virtual and the real.

9.
We developed a low-cost, user-friendly multimedia delivery system to provide medical lectures, saved as multimedia content, to persons engaged in medicine. The system was created using the RealSystem package over a TCP/IP network. Users can review lectures and medical meeting presentations with video and audio through the Internet whenever convenient. The video and slides of each lecture are clearly displayed on screen, and members of medical associations or medical students can easily review the parts that interest them most. The system is being used effectively in distance learning and aids the diffusion of the latest information and technology to busy physicians and medical students.

10.
Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content, including its limitations and our own contributions, we describe a proposed method for associating audio and video information by using co-occurrence matrices, and we present experiments conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing and vice versa.
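The association step named in the abstract, a co-occurrence matrix between audio speaker clusters and video face clusters, can be sketched as follows. The per-frame labels, the -1 convention for missing detections, and the argmax assignment are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def associate_clusters(audio_labels: np.ndarray, video_labels: np.ndarray) -> dict:
    """Map each video face cluster to the audio speaker it co-occurs with most.
    Both inputs are per-frame integer label arrays; -1 marks no detection."""
    speakers = np.unique(audio_labels[audio_labels >= 0])
    faces = np.unique(video_labels[video_labels >= 0])
    cooc = np.zeros((len(faces), len(speakers)), dtype=int)
    for a, v in zip(audio_labels, video_labels):
        if a >= 0 and v >= 0:
            cooc[np.searchsorted(faces, v), np.searchsorted(speakers, a)] += 1
    # Each face cluster is assigned the speaker it overlaps with most often.
    return {int(f): int(speakers[cooc[i].argmax()]) for i, f in enumerate(faces)}
```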

11.
Because of media digitization, a large amount of information such as speech, audio, and video data is produced every day. In order to retrieve data from these databases quickly and precisely, multimedia technologies for the structuring and retrieval of speech, audio, and video data are strongly required. In this paper, we give an overview of the multimedia technologies existing today for TV news databases, such as the structuring and retrieval of speech, audio, and video data, speaker indexing, audio summarization, and cross-media retrieval. The main purpose of structuring is to produce tables of contents and indices from audio and video data automatically. In order to make these technologies feasible, processing units, such as words in audio data and shots in video data, are first extracted. In a second step, they are meaningfully integrated into topics. Furthermore, the units extracted from different types of media are integrated for higher functions. Yasuo Ariki, Ph.D.: He is a Professor in the Department of Electronics and Informatics at Ryukoku University. He received his B.E., M.E. and Ph.D. in information science from Kyoto University in 1974, 1976 and 1979, respectively. He was an Assistant in Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. His research interests are in speech and image recognition and in information retrieval and databases. He is a member of IPSJ, IEICE, ASJ, Soc. Artif. Intel. and IEEE.

12.
Most existing videos contain only monaural audio and lack the sense of space that binaural audio provides. To address this, this paper proposes a binaural audio generation method based on multimodal perception: building on an analysis of the visual information in the video, it fuses the video's spatial information with the audio content and automatically adds spatial characteristics to the original mono audio, generating binaural audio closer to a real listening experience. We first adopt an improved audio-video fusion analysis network with an encoder-decoder structure to encode the mono video, then perform multi-scale fusion of the video and audio features and analyze the two modalities jointly, so that the generated binaural audio acquires spatial information absent from the original mono audio; finally we generate the binaural audio corresponding to the video. Experimental results on public datasets show that the method outperforms existing models in binaural audio generation, with improvements on both the STFT distance and the ENV distance metrics.
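The two reported metrics, STFT distance and ENV (envelope) distance, are commonly computed in mono-to-binaural work as L2 distances on complex spectrograms and on amplitude envelopes. The sketch below follows those common definitions, which may differ in detail from the paper's.

```python
import numpy as np
from scipy.signal import stft, hilbert

def stft_distance(pred: np.ndarray, true: np.ndarray, nperseg: int = 512) -> float:
    """L2 distance between complex STFTs, summed over both channels.
    `pred` and `true` are (2, n_samples) binaural signals."""
    return float(sum(np.linalg.norm(stft(p, nperseg=nperseg)[2]
                                    - stft(t, nperseg=nperseg)[2])
                     for p, t in zip(pred, true)))

def env_distance(pred: np.ndarray, true: np.ndarray) -> float:
    """L2 distance between amplitude envelopes (magnitude of the analytic signal)."""
    return float(sum(np.linalg.norm(np.abs(hilbert(p)) - np.abs(hilbert(t)))
                     for p, t in zip(pred, true)))
```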

13.
王妍  詹雨薇  罗昕  刘萌  许信顺 《软件学报》2023,34(2):985-1006
Video moment retrieval aims to find, in a long video, the segment that best matches a natural-language query given by the user. Video contains rich visual, textual, and speech information; understanding the information the video provides as well as the textual information in the query, and aligning and exchanging information across modalities, are the core problems of this task. This paper systematically surveys recent work in video moment retrieval and divides it into two broad classes: ranking-based methods and localization-based methods. Ranking-based methods can be further subdivided into methods with predefined candidate moments and methods that generate candidate moments under guidance, while localization-based methods can be divided into one-shot localization and iterative localization. We also introduce the datasets and evaluation metrics for this task and summarize the performance of a number of models on several commonly used datasets. In addition, we cover extensions of the task, such as large-scale video moment retrieval. Finally, we discuss future directions for video moment retrieval.

14.
15.
Computer-simulated avatars and humanoid robots have an increasingly prominent place in today's world. Acceptance of these synthetic characters depends on their ability to properly and recognizably convey basic emotion states to a user population. This study presents an analysis of the interaction between emotional audio (human voice) and video (simple animation) cues. The emotional relevance of the channels is analyzed with respect to their effect on human perception and through the study of the extracted audio-visual features that contribute most prominently to human perception. As a result of the unequal level of expressivity across the two channels, the audio was shown to bias the perception of the evaluators. However, even in the presence of a strong audio bias, the video data were shown to affect human perception. The feature sets extracted from emotionally matched audio-visual displays contained both audio and video features, while the feature sets resulting from emotionally mismatched audio-visual displays contained only audio information. This result indicates that observers integrate natural audio cues and synthetic video cues only when the information expressed is congruent. It is therefore important to properly design the presentation of audio-visual cues, as incorrect design may cause observers to ignore the information conveyed in one of the channels.

16.
Recent trends toward telecommuting, mobile work, and wider distribution of the work force, combined with reduced technology costs, have made video communications more attractive as a means of supporting informal remote interaction. In the past, however, video communications never gained widespread acceptance. Here we identify possible reasons for this by examining how the spoken characteristics of video-mediated communication differ from face-to-face interaction, for a series of real meetings. We evaluate two wide-area systems. One uses readily available Integrated Services Digital Network (ISDN) lines but suffers the limitations of transmission lags, a half-duplex line, and poor-quality video. The other uses optical transmission and video-switching technology with negligible delays, full-duplex audio, and broadcast-quality video. To analyze the effects of video systems on conversation, we begin with a series of conversational characteristics that have been shown to be important in face-to-face interaction. We identify properties of the communication channel in face-to-face interaction that are necessary to support these characteristics, namely, that it has low transmission lags, it is two-way, and it uses multiple modalities. We compare these channel properties with those of the two video-conferencing systems and predict how their different channel properties will affect spoken conversation. As expected, when compared with face-to-face interaction, communication using the ISDN system was found to have longer conversational turns; fewer interruptions, overlaps, and backchannels; and increased formality when switching speakers. Communication over the system with broadcast-quality audio and video was more similar to face-to-face meetings, although it did not replicate face-to-face interaction. Contrary to our expectations, formal techniques were still used to achieve speaker switching. We suggest that these may be necessary because of the absence of certain speaker-switching cues. The results imply that the advent of high-speed multimedia networking will improve but not remove all the problems of video conferencing as an interpersonal communications tool, and we describe possible solutions to the outstanding problems.

17.
This paper presents a design for a wireless audio-video surveillance system based on an ARM hardware platform and the WinCE software platform. The design uses Samsung's S3C2440A as the main chip and combines a camera, a microphone, and a GPRS module to implement data acquisition, processing, and network transmission. The S3C2440A offers rich hardware interface resources, and H.264 is adopted as the core audio-video compression algorithm, giving a high compression ratio well suited to wireless transmission.

18.
We introduce a new paradigm for real-time conversion of a real world event into a rich multimedia database by processing data from multiple sensors observing the event. A real-time analysis of the sensor data, tightly coupled with domain knowledge, results in instant indexing of multimedia data at capture time. This yields semantic information to answer complex queries about the content and the ability to extract portions of data that correspond to complex actions performed in the real world. The power of such an instantly indexed multimedia database system, in content-based retrieval of multimedia data or in semantic analysis and visualization of the data, far exceeds that of systems which index multimedia data only after it is produced. We present LucentVision, an instantly indexed multimedia database system developed for the sport of tennis. This system analyzes video from multiple cameras in real time and captures the activity of the players and the ball in the form of motion trajectories. The system stores these trajectories in a database along with video, 3D models of the environment, scores, and other domain-specific information. LucentVision has been used to enhance live television and Internet broadcasts with game analyses and virtual replays in more than 250 international tennis matches.

19.
20.
We present a system for automatically extracting the region of interest (ROI) and controlling virtual cameras based on panoramic video. It targets applications such as classroom lectures and video conferencing. For capturing panoramic video, we use the FlyCam system, which produces high-resolution, wide-angle video by stitching video images from multiple stationary cameras. To generate conventional video, a region of interest can be cropped from the panoramic video. We propose methods for ROI detection, tracking, and virtual camera control that work in both the uncompressed and compressed domains. The ROI is located from motion and color information in the uncompressed domain and from macroblock information in the compressed domain, and is tracked using a Kalman filter. This results in virtual camera control that simulates human-controlled video recording. The system has no physical camera motion, and the virtual camera parameters are readily available for video indexing.
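The tracking step named in the abstract, a Kalman filter over the detected ROI, can be sketched as a generic constant-velocity filter over the ROI center, assuming one position measurement per frame; the authors' exact state model and noise parameters are not given in the abstract.

```python
import numpy as np

class ROIKalman:
    """Constant-velocity Kalman filter for the (x, y) center of the ROI."""
    def __init__(self, x0: float, y0: float, q: float = 1e-2, r: float = 1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])       # state: position + velocity
        self.P = np.eye(4)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0           # motion model, dt = 1 frame
        self.H = np.eye(2, 4)                        # we observe position only
        self.Q = q * np.eye(4)                       # process noise
        self.R = r * np.eye(2)                       # measurement noise

    def step(self, z: np.ndarray) -> np.ndarray:
        # Predict with the motion model, then correct with the measured center.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        K = self.P @ self.H.T @ np.linalg.inv(self.H @ self.P @ self.H.T + self.R)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                            # smoothed ROI center
```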
