Similar Documents
20 similar documents found (search time: 718 ms)
1.
The shortage of real conversational data has become the main factor limiting the performance of data-driven dialogue generation systems, especially for Chinese. To obtain a large volume of everyday conversation data, subtitle timestamp information can be used to synchronize the English subtitles of English TV series with their corresponding Chinese subtitles, producing a large set of synchronized Chinese-English bilingual subtitles. The English sentences of the synchronized bilingual subtitles are then automatically aligned with the actors' lines in the English scripts via information retrieval, so that the scene and speaker information in the scripts can be mapped onto the bilingual subtitles, finally yielding a Chinese-English bilingual everyday-conversation corpus annotated with scenes and speakers. Using this method, we automatically built CEDAC, a multi-turn conversation corpus close to human daily conversation that contains 978,109 pairs of bilingual utterance messages. Sampling analysis shows that the accuracy of scene-boundary annotation reaches 97.0%, and speaker annotation accuracy reaches 91.57%. The corpus lays a solid foundation for subsequent research on automatic speaker annotation of film and TV subtitles and on automatic multi-turn conversation generation.
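The timestamp-based synchronization step above can be sketched as pairing English and Chinese subtitle cues whose start times fall within a small tolerance. This is a minimal illustration, not the paper's actual implementation; the sample cues and the 0.5-second tolerance are assumptions.

```python
# Hedged sketch: pair English and Chinese subtitle lines whose start
# times fall within a small tolerance, in the spirit of timestamp-based
# bilingual subtitle synchronization. Data and tolerance are
# illustrative assumptions.

def sync_subtitles(en, zh, tol=0.5):
    """Pair (start_sec, text) entries from two subtitle tracks.

    en, zh: lists of (start_time_in_seconds, text), sorted by time.
    Returns (en_text, zh_text) pairs whose start times differ by <= tol.
    """
    pairs = []
    j = 0
    for t_en, text_en in en:
        # Advance past Chinese cues that start too early to match.
        while j < len(zh) and zh[j][0] < t_en - tol:
            j += 1
        if j < len(zh) and abs(zh[j][0] - t_en) <= tol:
            pairs.append((text_en, zh[j][1]))
            j += 1
    return pairs

en = [(1.0, "Hello."), (3.2, "How are you?"), (7.8, "Goodbye.")]
zh = [(1.1, "你好。"), (3.0, "你好吗?"), (9.5, "再见。")]
print(sync_subtitles(en, zh))
```

Real subtitle tracks would first need `.srt` timestamps parsed into seconds; fuzzier matching (overlap ratios rather than start-time distance) is common when the two tracks were cut differently.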

2.
We present the MATCH corpus, a unique data set of 447 dialogues in which 26 older and 24 younger adults interact with nine different spoken dialogue systems. The systems varied in the number of options presented and the confirmation strategy used. The corpus also contains information about the users’ cognitive abilities and detailed usability assessments of each dialogue system. The corpus, which was collected using a Wizard-of-Oz methodology, has been fully transcribed and annotated with dialogue acts and “Information State Update” (ISU) representations of dialogue context. Dialogue act and ISU annotations were performed semi-automatically. In addition to describing the corpus collection and annotation, we present a quantitative analysis of the interaction behaviour of older and younger users and discuss further applications of the corpus. We expect that the corpus will provide a key resource for modelling older people’s interaction with spoken dialogue systems.

3.
4.
This article compares one-dimensional and multi-dimensional dialogue act tagsets used for automatic labeling of utterances. The influence of tagset dimensionality on tagging accuracy is first discussed theoretically, then based on empirical data from human and automatic annotations of large scale resources, using four existing tagsets: damsl, swbd-damsl, icsi-mrda and maltus. The Dominant Function Approximation proposes that automatic dialogue act taggers could focus initially on finding the main dialogue function of each utterance, which is empirically acceptable and has significant practical relevance.

5.
The AMI Meeting Corpus contains 100 h of meetings captured using many synchronized recording devices, and is designed to support work in speech and video processing, language engineering, corpus linguistics, and organizational psychology. It has been transcribed orthographically, with annotated subsets for everything from named entities, dialogue acts, and summaries to simple gaze and head movement. In this written version of an LREC conference keynote address, I describe the data and how it was created. If this is “killer” data, that presupposes a platform that it will “sell”; in this case, that is the NITE XML Toolkit, which allows a distributed set of users to create, store, browse, and search annotations for the same base data that are both time-aligned against signal and related to each other structurally. This paper is an extended version of a Keynote Address presented at the Language Resources & Evaluation Conference, Genoa, May 2006.

6.
7.
This paper presents empirical results of an analysis on the role of prosody in the recognition of dialogue acts and utterance mood in a practical dialogue corpus in Mexican Spanish. The work is configured as a series of machine-learning experimental conditions in which models are created by using intonational and other data as predictors and dialogue act tagging data as targets. We show that utterance mood can be predicted from intonational information, and that this mood information can then be used to recognize the dialogue act.

8.
In recent years, with the spread of smart homes, dialogue systems have come to play an increasingly important role in daily life, and generative dialogue systems built on neural networks have attracted many researchers because of their flexibility. To improve the fluency and context relevance of generated dialogue, we propose an open-domain dialogue generation model based on multi-view adversarial learning. The generator produces a response by rewriting retrieved similar dialogues; the discriminator consists of two binary classifiers that jointly judge the generated sentence from two views, the sentence level and the dialogue level. In experiments on a Chinese dialogue corpus, the model scores higher than commonly used dialogue generation models on both human evaluation and automatic metrics. The results show that multi-view training with the dual discriminator improves both the fluency and the context relevance of generated responses.

9.
Knowledge-grounded dialogue systems aim to generate factually grounded responses using external knowledge and the dialogue context. Prior work on knowledge-grounded dialogue has paid little attention to the problem of updating such systems online. During online updating, the cost of annotating dialogue data paired with knowledge is prohibitive, leaving zero dialogue data available. For this zero-resource updating problem, this paper proposes updating the model online with pseudo data. First, for different scenarios, we analyze the causes and propose corresponding pseudo-data generation strategies. We then validate the effectiveness of the proposed method on the KdConv dataset in the zero-resource dialogue setting. Experimental results show that a model updated with pseudo data approaches a model updated online with human-annotated data in knowledge utilization and topic relevance, effectively allowing a knowledge-grounded dialogue system to complete online updating with zero-resource dialogue data.

10.
This paper presents our research on automatic annotation of a five-billion-word corpus of Japanese blogs with information on affect and sentiment. We first perform a study of emotion blog corpora and discover that no large scale emotion corpus has been available for the Japanese language. We choose the largest blog corpus for the language and annotate it using two systems for affect analysis: ML-Ask for word- and sentence-level affect analysis and CAO for detailed analysis of emoticons. The annotated information includes affective features like sentence subjectivity (emotive/non-emotive) or emotion classes (joy, sadness, etc.), useful in affect analysis. The annotations are also generalized on a two-dimensional model of affect to obtain information on sentence valence (positive/negative), useful in sentiment analysis. The annotations are evaluated in several ways: first, on a test set of a thousand sentences extracted randomly and evaluated by over forty respondents; second, by comparing the annotation statistics to other existing emotion blog corpora; finally, by applying the corpus to several tasks, such as generation of an emotion object ontology or retrieval of emotional and moral consequences of actions.
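The generalization of emotion classes onto a two-dimensional affect model to obtain sentence valence can be sketched as a simple class-to-valence mapping with a majority vote. The mapping table below is an assumption for illustration, not ML-Ask's or CAO's actual table.

```python
# Hedged sketch: mapping emotion-class annotations onto valence
# (positive/negative), as in generalizing annotations on a
# two-dimensional model of affect. The class-to-valence table is an
# illustrative assumption, not the systems' actual mapping.
VALENCE = {
    "joy": "positive", "liking": "positive", "relief": "positive",
    "sadness": "negative", "anger": "negative", "fear": "negative",
}

def sentence_valence(emotion_classes):
    """Majority vote over the valences of a sentence's emotion classes."""
    votes = [VALENCE[e] for e in emotion_classes if e in VALENCE]
    if not votes:
        return "neutral"
    pos = votes.count("positive")
    neg = votes.count("negative")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "ambiguous"

print(sentence_valence(["joy", "relief"]))
print(sentence_valence(["sadness", "anger"]))
```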

11.
Modern paraphrase research would benefit from large corpora with detailed annotations. However, currently these corpora are still thin on the ground. In this paper, we describe the development of such a corpus for Dutch, which takes the form of a parallel monolingual treebank consisting of over 2 million tokens and covering various text genres, including both parallel and comparable text. This publicly available corpus is richly annotated with alignments between syntactic nodes, which are also classified using five different semantic similarity relations. A quarter of the corpus is manually annotated, and this informs the development of an automatic tree aligner used to annotate the remainder of the corpus. We argue that this corpus is the first of this size and kind, and offers great potential for paraphrasing research.

12.
In this paper, we present a computational model of dialogue, and an underlying theory of action, which supports the representation of, reasoning about and execution of communicative and non-communicative actions. This model rests on a theory of collaborative discourse, and allows for cooperative human–machine communication in written dialogues. We show how cooperative behaviour, illustrated by the analysis of a dialogue corpus and formalized by an underlying theory of cooperation, is interpreted and produced in our model. We describe and illustrate in detail the main algorithms used to model the reasoning processes necessary for interpretation, planning, generation, as well as for determining which actions to perform and when. Finally, we present our implemented system. Our data are drawn from a corpus of human–human dialogues, selected and transcribed from a day-long recording of phone calls at a phone desk in an industrial setting (Castaing, 1993). We present an analysis of this corpus, focusing on dialogues which require, in order to succeed, helpful behaviour on the part of both the caller and the operator. The theoretical framework of our model rests on the theory of collaborative discourse developed by Grosz and Sidner (1986, 1990), Grosz and Kraus (1993, 1996), and further extended by Lochbaum (1994, 1995). An important objective guiding the design of our dialogue model was to allow the agent being modelled to interpret and manifest a type of cooperative behaviour which follows Grosz and Kraus's formalization of the commitment of each collaborative agent towards the actions of the other collaborative agents.
The model we propose extends Lochbaum's approach to discourse processing by extending her interpretation algorithm to allow for the treatment of a wider range of dialogues, and by providing an algorithm of task advancement which guides the generation process and allows for the interleaving of execution and planning, thereby facilitating cooperation among agents. The cooperative behaviour of the agent being modelled rests on the use of communicative actions allowing agents to share additional knowledge and assist each other in performing their actions.

13.
In recent years, neural network models trained on large-scale annotated corpora have greatly improved the performance of named entity recognition. However, manually annotated data for a new domain is costly to obtain, making fast, low-cost domain transfer very important. Given only unlabeled data in the target domain, this paper attempts to automatically construct a weakly annotated corpus for the target domain and to model it. First, two different methods are used to automatically annotate the unlabeled data; then, a strategy of keeping the "same" and removing the "different…

14.
Construction of a Named Entity and Entity Relation Corpus of Chinese Electronic Medical Records
Electronic medical records (EMRs) are written by medical staff to document the medical activities of individual patients, and contain a wealth of medical knowledge and patient health information. Information extraction research on EMRs, such as named entity recognition and entity relation extraction, is of great significance for clinical decision support, evidence-based medical practice, and personalized medical services, and constructing an annotated corpus of named entities and entity relations is the first prerequisite. Building on a survey of EMR named entity and entity relation corpus construction at home and abroad, and taking into account the characteristics of Chinese EMRs, we propose an annotation scheme for named entities and entity relations suited to Chinese EMRs. Under the guidance and with the participation of physicians, we formulated detailed annotation guidelines and built an annotated corpus that is complete in its annotation scheme, relatively large in scale, and highly consistent. The corpus contains 992 medical record documents; inter-annotator consistency reaches 0.922 for named entities and 0.895 for entity relations. This lays a solid foundation for subsequent research on information extraction from Chinese EMRs.
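Inter-annotator consistency figures like those reported here are commonly computed as the F1 overlap between two annotators' entity sets; whether this abstract's 0.922/0.895 use exactly this measure is an assumption, and the sample annotations below are invented for illustration.

```python
# Hedged sketch: inter-annotator consistency for entity annotation,
# computed as F1 overlap between two annotators' entity sets. The
# measure and sample data are illustrative assumptions, not the
# paper's documented procedure.

def agreement_f1(ann_a, ann_b):
    """ann_a, ann_b: sets of (start, end, label) entity annotations."""
    if not ann_a or not ann_b:
        return 0.0
    overlap = len(ann_a & ann_b)          # exact span-and-label matches
    p = overlap / len(ann_b)              # treat A as reference, B as response
    r = overlap / len(ann_a)
    return 2 * p * r / (p + r) if p + r else 0.0

a = {(0, 4, "Disease"), (10, 14, "Drug"), (20, 25, "Test")}
b = {(0, 4, "Disease"), (10, 14, "Drug"), (30, 33, "Test")}
print(round(agreement_f1(a, b), 3))
```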

15.
In this paper, we address the issue of generating in-domain language model training data when little or no real user data are available. The two-stage approach taken begins with a data induction phase whereby linguistic constructs from out-of-domain sentences are harvested and integrated with artificially constructed in-domain phrases. After some syntactic and semantic filtering, a large corpus of synthetically assembled user utterances is induced. In the second stage, two sampling methods are explored to filter the synthetic corpus to achieve a desired probability distribution of the semantic content, both on the sentence level and on the class level. The first method utilizes user simulation technology, which obtains the probability model via an interplay between a probabilistic user model and the dialogue system. The second method synthesizes novel dialogue interactions from the raw data by modelling after a small set of dialogues produced by the developers during the course of system refinement. Evaluation is conducted on recognition performance in a restaurant information domain. We show that a partial match to usage-appropriate semantic content distribution can be achieved via user simulations. Furthermore, word error rate can be reduced when limited amounts of in-domain training data are augmented with synthetic data derived by our methods.
Stephanie Seneff
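The second-stage filtering above, which samples the synthetic corpus toward a desired distribution of semantic content, can be sketched as class-stratified resampling. The class labels and target distribution below are illustrative assumptions, not the paper's actual domain model.

```python
# Hedged sketch: filter a synthetic corpus so that the retained
# sentences approximate a target distribution over semantic classes,
# in the spirit of the second-stage sampling described above.
import random

def sample_to_distribution(corpus, target, n, seed=0):
    """corpus: list of (sentence, semantic_class); target: {class: prob}.

    Draws up to n sentences so class frequencies approximate `target`.
    """
    rng = random.Random(seed)
    by_class = {}
    for sent, cls in corpus:
        by_class.setdefault(cls, []).append(sent)
    sampled = []
    for cls, prob in target.items():
        pool = by_class.get(cls, [])
        k = min(round(prob * n), len(pool))   # capped by pool size
        sampled.extend((s, cls) for s in rng.sample(pool, k))
    return sampled

corpus = [("any vegan options?", "cuisine")] * 8 + \
         [("where is it?", "location")] * 2
target = {"cuisine": 0.5, "location": 0.2}
out = sample_to_distribution(corpus, target, n=10)
print(sum(1 for _, c in out if c == "cuisine"))
```

In practice the target distribution would itself come from user simulation or from the small developer-produced dialogue set, rather than being fixed by hand.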

16.
THUUyMorph (Tsinghua University Uyghur Morphology Segmentation Corpus) is a Uyghur morphological segmentation corpus built by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University. The raw text was downloaded in 2016 from the Uyghur edition of Tianshan Net, http://uy.ts.cn/, and covers news, law, finance, daily life, and other genres. The corpus was built in the following steps: crawling; proofreading the raw text; sentence splitting; proofreading the splits; combined manual and automatic morphological segmentation; manual annotation of vowel-harmony phenomena; and manual proofreading of the segmentation and vowel-harmony annotations. The corpus contains 10,596 documents, 69,200 sentences, and 89,923 word types, with annotation at both the word level and the sentence level, and is open-sourced at http://thuuymorph.thunlp.org/. This work not only serves as a reference for building Uyghur corpora, but also provides a useful resource for research on Uyghur natural language processing.

17.
Construction of a Chinese Corpus Based on Semantic Dependency Relations
Corpora are important resources for knowledge acquisition in natural language processing. Taking sentence understanding as its starting point, this paper discusses several fundamental issues in designing and building a large-scale Chinese corpus based on semantic dependency relations, including the choice of annotation scheme, the definition of the relation set, the design of annotation tools, and quality control during annotation. The corpus is designed at a scale of one million tokens; using 70 semantic and syntactic dependency relations, it further annotates the semantic structure of sentences on top of text already labeled with semantic classes. Its distinguishing feature is the combination of the HowNet semantic relation framework with practical language applications, effectively describing the dependency relations between words in real language contexts. Once completed, it will provide stronger knowledge-base support for applications such as sentence understanding and content-based information retrieval.

18.
We present a corpus-based prosodic analysis aimed at uncovering the relationship between dialogue acts, personality and prosody, with a view to providing guidelines for the ECA Greta’s text-to-speech system. The corpus used is the SEMAINE corpus, featuring four different personalities, further annotated for dialogue acts and prosodic features. To show the importance of the choice of dialogue act taxonomy, two different taxonomies were used: the first corresponding to Searle’s taxonomy of speech acts, and the second, inspired by Bunt’s DIT++, dividing directive acts into finer categories. Our results show that finer-grained distinctions are important when choosing a taxonomy. We also show, with some preliminary results, that the prosodic correlates of dialogue acts are not always as cited in the literature and prove more complex and variable. By studying the realisation of different directive acts, we also observe differences in the ECA’s communicative strategies depending on personality, with a view to providing input to a speech system.

19.
In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre 632 “Information Structure”. These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, as well as the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access, and the creation of data archives.

20.
Pronominal anaphora resolution is an indispensable part of fully understanding spoken dialogue. Based on the characteristics of spoken language as opposed to written language and of non-nominal antecedents, and building on previous work, this paper proposes an algorithm for resolving non-nominal anaphora on raw spoken dialogue corpora. The algorithm is based on the right-frontier rule for non-nominal anaphora; it gives a method for judging whether a candidate antecedent is "linearly adjacent" or "hierarchically adjacent", along with filtering rules for candidate antecedents. Tested on the publicly released spoken dialogue corpus Trains-93, the algorithm improves resolution precision and recall, resolves a wider range of pronouns, and is applicable to raw spoken dialogue corpora.


Copyright © Beijing Qinyun Technology Development Co., Ltd.    京ICP备09084417号-23
