

基于组合特征的汉语名词词义消歧

A Study on Noun Sense Disambiguation Based on Syntagmatic Features
王惠
WANG Hui
Email: chswh@nus.edui.sg
    Abstract
    Word sense disambiguation (WSD) plays an important role in many areas of natural language processing, such as machine translation, information retrieval, sentence analysis, and speech recognition. Research on WSD has great theoretical and practical significance.  The main purposes of this study were to identify the kinds of knowledge that are useful for WSD, and to establish a new WSD model based on syntagmatic features that can effectively disambiguate noun senses in Mandarin Chinese.
    A close correlation has been found between the meaning of a word and its distribution.  According to a study in the field of cognitive science [Choueka, 1983], people often disambiguate word sense using only a few other words in a given context (frequently only one additional word).  Thus, the relationships between one word and others can be effectively used to resolve ambiguity.  Based on a descriptive study of more than 4,000 Chinese noun senses, a multi-level framework of syntagmatic analysis was designed to describe the syntactic and semantic constraints of Chinese nouns.  All of these polysemous nouns were surveyed, and it was found that different senses have different and complementary distributions at the syntactic and/or collocational levels.  This served as a foundation for establishing a WSD model that uses grammatical information and a thesaurus provided by linguists.
    The model uses the Grammatical Knowledge-base of Contemporary Chinese [Yu Shiwen et al. 2002] as one of its main machine-readable dictionaries (MRDs).  It provides rich grammatical information for disambiguating Chinese words, such as parts of speech (POS) and syntactic functions.
    Another resource of the model is the Semantic Dictionary of Contemporary Chinese [Wang Hui et al. 1998], which provides a thesaurus and semantic collocation information of more than 20,000 nouns.  They were employed to analyze 635 Chinese polysemous nouns.
    By making full use of these two MRD resources and a very large POS-tagged corpus of Mandarin Chinese, a multi-level WSD model based on syntagmatic features was developed.  The experiment described at the end of the paper verifies that the approach achieves high levels of efficiency and precision.
    Key words:  Word Sense Disambiguation, syntagmatic features, noun sense, Chinese Language Information Processing
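    1. Overview of Word Sense Disambiguation (WSD)
    Polysemy is pervasive in natural language. For a computer to analyze and understand natural language correctly, an essential prerequisite is the ability to resolve ambiguity automatically and determine the meaning of a polysemous word in a given context. This is what is usually called word sense disambiguation (WSD).
    WSD is an indispensable intermediate level for most natural language processing tasks. Sense-tagged text can improve recall and precision in information retrieval and enable concept-based retrieval. It can supply the semantic information needed to resolve ambiguities in Chinese parsing where different structures share the same sequence of word classes (类序同形), thereby supporting automatic syntactic disambiguation. In machine translation, it helps select the target-language word that appropriately expresses the word in the source sentence, which improves translation accuracy. Large sense-tagged corpora can also be used to build language models based on semantic classes, which benefit speech recognition, handwriting recognition, and phonetic-to-character conversion. WSD research therefore has considerable theoretical and practical significance in natural language processing, and it has attracted the sustained attention of computational linguists since the early 1950s [Ide, 1998].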
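    1.1 Knowledge Sources for Word Sense Disambiguation
    Early WSD systems generally relied on hand-written rules. Writing rules by hand, however, is time-consuming and labor-intensive and suffers from a severe knowledge acquisition "bottleneck": such systems can handle only a limited number of individual words and cannot cope with sense-tagging large-scale text.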
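    From the 1980s onward, dictionaries became an important source of WSD knowledge. Lesk [1986] and Luk [1995] used the definition texts in the Oxford Advanced Learner's Dictionary to determine the sense of a polysemous word in context. Dagan [1991] and Gale [1993] used bilingual dictionaries to help disambiguate polysemous words. Voorhees [1993] and Resnik [1995] explored English WSD from different angles using the hypernym-hyponym and synonymy relations in WordNet. Yarowsky [1994] proposed a WSD method based on the thesaurus Roget's International Thesaurus. The advantage of using a dictionary as the knowledge source is that a computer can automatically extract from it some of the important knowledge needed to identify each sense of a polysemous word. This approach, however, cannot predict the contexts in which a word occurs, and some of the syntagmatic features that help WSD are not fully represented in dictionaries.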
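    The gloss-overlap idea behind the dictionary-based methods can be sketched as follows. This is a simplified variant written for illustration, not the procedure of Lesk [1986] or Luk [1995]: it scores each sense by the number of content words its definition shares with the context. The glosses, the stopword list, and the names `GLOSSES`, `content_words`, and `simplified_lesk` are all assumptions; the glosses are toy text, not quotations from any dictionary.

```python
# Toy glosses written for this sketch only; not quoted from any dictionary.
GLOSSES = {
    "bank": {
        "bank/finance": "an institution that keeps money and makes loans to customers",
        "bank/river": "the sloping land along the side of a river or lake",
    }
}

# A small, hypothetical stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "that",
             "in", "on", "at", "by", "is", "was"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return (sense, overlap) for the sense whose gloss shares the most
    content words with the context. Ties go to the first sense listed."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap
```

    Calling `simplified_lesk("bank", "she deposited money at the bank")` selects `bank/finance` because "money" also occurs in that gloss. A context that shares no word with any gloss falls back to the first sense, which illustrates the limitation noted above: the dictionary does not record the contexts in which each sense actually occurs.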
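    In recent years, with the rapid growth of computer storage capacity and processing speed, computers have been able to draw on machine-readable resources and large-scale corpora to acquire dynamic collocation knowledge and its statistics automatically, filling the knowledge gaps of rule-based methods. Many corpus-based statistical methods have consequently emerged in WSD research. Gale & Church [1992, 1993] and others, for example, used bilingual corpora to train and test on English polysemous words. The main problem with bilingual corpora is that disambiguation knowledge can be obtained only if the senses of a polysemous word have different translations in the other language, and those translations must themselves be monosemous; this necessarily restricts the range of polysemous words that can be handled. In addition, bilingual corpora are limited in size and diversity, so many polysemous words, or particular senses of them, may never occur in the corpus. Moreover, because current alignment techniques for bilingual corpora are still not 100% accurate, the method remains confined to small-scale experiments.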
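    The bilingual-corpus idea can be sketched as follows: the aligned translation of each occurrence serves as a sense label, and context words are counted per label. This is a minimal sketch under stated assumptions, not the method of Gale & Church, whose statistical classifiers are replaced here by a simple count-based scorer. The training pairs are invented, and the names `ALIGNED`, `train`, and `disambiguate` are hypothetical. The example relies on English "sentence" being translated into French as "peine" (judicial sense) or "phrase" (grammatical sense).

```python
from collections import Counter, defaultdict

# Invented training data: (context words, aligned French translation of "sentence").
# The translation plays the role of the sense label.
ALIGNED = [
    (("judge", "prison"), "peine"),
    (("court", "years"), "peine"),
    (("grammar", "word"), "phrase"),
    (("word", "verb"), "phrase"),
]


def train(aligned):
    """Count how often each context word co-occurs with each translation."""
    counts = defaultdict(Counter)
    for context, translation in aligned:
        for w in context:
            counts[translation][w] += 1
    return counts


def disambiguate(counts, context):
    """Pick the translation whose training contexts best match the new context.
    Ties go to the translation seen first in training."""
    scores = {t: sum(c[w] for w in context) for t, c in counts.items()}
    return max(scores, key=scores.get)
```

    With `counts = train(ALIGNED)`, the context `("judge", "court")` is labeled "peine" and `("word", "noun")` is labeled "phrase". The sketch also makes the stated limitation concrete: if both senses of a word were rendered by the same translation, the training data would contain only one label and no distinction could be learned.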
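    In short, whether the method is rule-based, dictionary-based, or based on large-scale corpora, no WSD system can do without a source for the knowledge it uses, and the quality of the WSD knowledge base has become the key to a system's success or failure. English WSD research has a history of many years, but most work has been restricted to a small scale (from a few words to a dozen or so) for lack of sufficient sense knowledge, and sense-tagging of a large-scale English corpus has not yet been reported.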