

Balanced corpus of contemporary written Japanese
Authors: Kikuo Maekawa, Makoto Yamazaki, Toshinobu Ogiso, Takehiko Maruyama, Hideki Ogura, Wakako Kashino, Hanae Koiso, Masaya Yamaguchi, Makiro Tanaka, Yasuharu Den
Affiliation: 1. Department of Corpus Studies, National Institute for Japanese Language and Linguistics (NINJAL), Tokyo, Japan
2. College of Letters, Ritsumeikan University, Kusatsu, Japan
3. Department of Linguistic Theory and Structure, NINJAL, Tokyo, Japan
4. Faculty of Letters, Chiba University, Chiba, Japan
Abstract: The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is Japan's first 100-million-word balanced corpus. It consists of three subcorpora (publication subcorpus, library subcorpus, and special-purpose subcorpus) and covers a wide range of text registers, including books in general, magazines, newspapers, governmental white papers, best-selling books, an internet bulletin board, a blog, school textbooks, minutes of the National Diet, publicity newsletters of local governments, laws, and poetry verses. Random sampling is used wherever possible to maximize the representativeness of the corpus. The corpus is annotated with dual POS analysis, document structure, and bibliographical information. The BCCWJ is currently accessible in three different ways, including Chunagon, a web-based interface to the dual POS analysis data. Lastly, results of a pilot evaluation of the corpus with respect to textual diversity are reported. The analyses cover POS distribution, word-class distribution, entropy of orthography, sentence length, and variation of the adjective predicate. High textual diversity is observed in all of these analyses.
This article is indexed in SpringerLink and other databases.