

Date: 2019-06-19


Statistical machine translation based on multi-level linguistic knowledge


Natural Language Processing

The laboratory's research on natural language processing covers two directions: basic research and application development. Basic research addresses lexical, syntactic, and semantic analysis, including Chinese word segmentation, part-of-speech tagging, pinyin (phonetic) annotation, named entity recognition, new word detection, syntactic parsing, and word sense disambiguation. Building on these, the laboratory has developed application systems for text categorization, dialogue-based travel information query, scientific literature retrieval, and machine translation. The research is supported by the National High Technology Research and Development Program (863 Program) project "Research on Intelligent Chinese Search Engine Technology and Platform Construction".



The laboratory has achieved excellent results in domestic and international evaluations: 1st place in the named entity recognition task of the 2004 "863" evaluation on Chinese information processing and intelligent human-machine interfaces; 1st place in one task of the NIST 2005 Automatic Content Extraction (ACE) evaluation; and 1st place in one task of the SIGHAN 2006 Chinese Word Segmentation evaluation.


Chinese Lexical Analysis

Drawing on statistical machine learning theory and methods, and taking the lexical characteristics of Chinese into account, the laboratory has carried out in-depth research on Chinese lexical analysis and developed a practical, high-performance Chinese lexical analysis system.


The purpose of lexical analysis is to identify the words in a sentence and to annotate them with syntactic tags, such as part of speech, and with semantic tags.


Automatic segmentation is a key step in Chinese lexical analysis. Unlike Western languages, Chinese text is written as a string of characters with no explicit delimiters between words, so automatic segmentation gives rise to many ambiguities. For instance, in the following figure, only the segmentation marked in red is correct.


Automatic segmentation
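The ambiguity can be made concrete with a minimal sketch that enumerates every way to split a character string into dictionary words. The lexicon and the example string below are illustrative assumptions, not taken from the laboratory's system or from the figure.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon; a real system would use a large dictionary.
lexicon = {"研究", "研究生", "生命", "命", "起源"}
candidates = segmentations("研究生命起源", lexicon)
for candidate in candidates:
    print(" / ".join(candidate))
```

Both candidates are valid dictionary splits, but only 研究 / 生命 / 起源 ("study the origin of life") is the intended reading. A statistical segmenter has to score the candidates and pick one.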


Part-of-speech (POS) tagging, a form of syntactic tagging, assigns each word in a sentence its part of speech according to the word's definition and context.


POS tagging


Chinese New Word Detection


As society develops rapidly, new words appear continuously. New word detection is one of the most critical issues in Chinese word segmentation, which is itself a fundamental research topic in Chinese natural language processing.


Most of these new words are domain-specific or time-sensitive terms. From a linguistic point of view, Chinese new words can be categorized by their derivation, as shown in the diagram on the right.


Flow Chart of Chinese New Word Detection


In the statistical approach, a PAT array, a widely used indexing structure in document search, supplies frequency information for characters, words, and the longest repeated character strings. New words are then extracted according to the SCPCD (Symmetric Conditional Probability and Context Dependency) association measure.
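As a rough illustration of the symmetric conditional probability part of such a measure, the sketch below scores a candidate string by how much more often it occurs as a whole than its parts would suggest. It makes several assumptions: a brute-force substring counter stands in for the PAT array, probabilities are crudely normalized by text length, the commonly used "fair" averaging over split points is applied, and the context dependency component of SCPCD is not modeled at all.

```python
from collections import Counter


def substring_counts(text, max_len):
    """Count every substring of up to `max_len` characters (brute force)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def scp(candidate, counts, total):
    """p(w)^2 divided by the mean of p(left) * p(right) over all split points."""
    if len(candidate) < 2:
        raise ValueError("candidate must have at least two characters")

    def p(s):
        return counts[s] / total

    products = [p(candidate[:i]) * p(candidate[i:])
                for i in range(1, len(candidate))]
    mean_product = sum(products) / len(products)
    p_whole = p(candidate)
    return p_whole * p_whole / mean_product if mean_product > 0 else 0.0


# Toy text: "ab" always occurs as a unit, "bx" only once by chance.
text = "abxabyabz"
counts = substring_counts(text, max_len=2)
cohesive = scp("ab", counts, len(text))
loose = scp("bx", counts, len(text))
print(cohesive, loose)
```

A string whose characters almost always appear together scores close to 1, while an accidental sequence scores lower, so a threshold on the score can shortlist new word candidates.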


Entity Detection and Tracking


Entity detection and tracking are key technologies in natural language processing and are important in information extraction, automatic question answering, and machine translation.


Entity detection identifies the named, nominal, and pronominal mentions of entities in text, such as persons, locations, and organizations. The task is usually formulated as a classification problem: classifiers are trained to mark each entity's boundaries and type.


Flowchart of Entity Detection
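One common way to cast boundary and type marking as classification is to have the classifier assign each token a BIO label (B-TYPE begins an entity, I-TYPE continues it, O is outside). Whether the laboratory's system uses this exact scheme is an assumption. The sketch below shows only the decoding step that turns such labels into entity spans, with made-up tokens and labels.

```python
def bio_to_spans(tokens, labels):
    """Convert per-token BIO labels into (text, type, start, end) spans."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and etype == label[2:]
        if continues:
            continue
        if start is not None:
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        if label.startswith(("B-", "I-")):
            # A stray I- label without a matching open entity starts a new one.
            start, etype = i, label[2:]
    return spans


# Hypothetical classifier output for one sentence.
tokens = ["Wang", "Wei", "visited", "Peking", "University", "in", "Beijing"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
spans = bio_to_spans(tokens, labels)
print(spans)
```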


Entity tracking, also called coreference resolution, identifies the different mentions that refer to the same object. It usually consists of two steps: first, trained classifiers produce a coreference probability for mentions; second, all mentions that refer to the same entity are clustered according to these probabilities.


Diagram of Entity Detection and Tracking
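The second step, clustering by coreference probability, can be sketched as a simple greedy procedure. This is a minimal sketch under stated assumptions: the pairwise probabilities are made-up numbers standing in for a trained classifier, the 0.5 threshold is arbitrary, and the greedy best-link strategy is one common choice rather than the laboratory's documented method.

```python
def cluster_mentions(mentions, link_prob, threshold=0.5):
    """Attach each mention to the best-scoring earlier cluster, or start a new one."""
    clusters = []  # each cluster is a list of mention indices
    for i, mention in enumerate(mentions):
        best_cluster, best_p = None, threshold
        for cluster in clusters:
            # Score a cluster by its most confident link to the new mention.
            p = max(link_prob(mentions[j], mention) for j in cluster)
            if p > best_p:
                best_cluster, best_p = cluster, p
        if best_cluster is None:
            clusters.append([i])
        else:
            best_cluster.append(i)
    return [[mentions[i] for i in cluster] for cluster in clusters]


# Hypothetical pairwise scores; a real system would query a classifier.
scores = {
    frozenset(["Ms. Li", "the professor"]): 0.9,
    frozenset(["Ms. Li", "she"]): 0.6,
    frozenset(["the professor", "she"]): 0.8,
}


def toy_link_prob(a, b):
    return scores.get(frozenset([a, b]), 0.1)


mentions = ["Ms. Li", "the professor", "she", "Beijing"]
entities = cluster_mentions(mentions, toy_link_prob)
print(entities)
```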


Automatic Text Categorization


Text categorization is the task of automatically assigning one or more categories to free-text documents according to their content. It is an important component of many information management tasks and is widely applied in information retrieval, machine translation, automatic text summarization, information filtering, and mail classification.


The system framework of text categorization


Outlier Learning Based Text Categorization

Outlier samples are defined as meaningless (garbage) web pages, mislabeled web pages, web pages lying on the boundaries between several categories, and web pages whose category falls outside the predefined label set. This research uses an outlier learning method based on ensemble learning to remove such noisy samples from the training corpus, which effectively improves text categorization performance.


The diagram of system construction
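The idea of using an ensemble to find noisy training samples can be sketched as a voting filter: several classifiers are trained on different subsets of the corpus, and a sample is dropped when most of them disagree with its label. Everything concrete here is an assumption made for illustration: the nearest-centroid members, the leave-one-fold-out subsets, the majority rule, and the toy two-dimensional features. The laboratory's actual ensemble is not described in this text.

```python
def centroid_classifier(samples):
    """Train a nearest-centroid classifier and return its predict function."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [v / counts[label] for v in acc]
                 for label, acc in sums.items()}

    def predict(features):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
        return min(centroids, key=squared_distance)

    return predict


def filter_outliers(samples, n_members=3):
    """Split samples into (kept, removed) by majority vote of the ensemble."""
    members = []
    for m in range(n_members):
        # Member m is trained on everything except fold m.
        subset = [s for i, s in enumerate(samples) if i % n_members != m]
        members.append(centroid_classifier(subset))
    kept, removed = [], []
    for features, label in samples:
        disagreements = sum(predict(features) != label for predict in members)
        target = removed if disagreements > n_members / 2 else kept
        target.append((features, label))
    return kept, removed


# Toy corpus: the last sample sits in the "sports" cluster but is labeled "finance".
samples = [
    ((0.0, 0.2), "sports"), ((0.3, 0.1), "sports"),
    ((0.1, 0.4), "sports"), ((0.2, 0.0), "sports"),
    ((9.8, 10.1), "finance"), ((10.2, 9.9), "finance"),
    ((10.0, 10.3), "finance"), ((9.9, 9.7), "finance"),
    ((0.5, 0.5), "finance"),
]
kept, removed = filter_outliers(samples)
print(removed)
```

A filter like this mainly catches mislabeled samples. Out-of-category and garbage pages would likely need additional criteria, such as low confidence from every ensemble member.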


Statistical Machine Translation


Statistical Machine Translation (SMT) translates text using statistical models whose parameters are estimated from a training corpus. It has become the mainstream approach in machine translation research.


Principle of the SMT


Given a source sentence, the system uses the statistical model to select, from all possible target sentences, the one with the highest probability.


Principle of the SMT
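In conventional notation, the selection rule can be written as below. The symbols are assumptions, since the original formula is not reproduced in this text: f denotes the source sentence, e a candidate target sentence, and ê the chosen translation. The second form is the standard noisy-channel decomposition into a translation model P(f | e) and a language model P(e), which may or may not match the exact model used by the laboratory.

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```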


Parallel Training System: PGIZA

Training statistical machine translation models requires processing a large training corpus, which usually calls for a large server. Based on distributed computing, this research implemented PGIZA, a parallel model training system that runs on a computer cluster. It greatly improves training efficiency and lowers the demands on computing hardware.


Parallel Training System
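The general idea behind parallelizing this kind of training can be sketched with IBM Model 1, a simple word alignment model trained by expectation maximization. The corpus is split into shards, each shard's expected counts are computed independently, and the counts are merged before the translation table is re-estimated. This is a sketch under stated assumptions, not PGIZA's actual design: the model choice, the function names, the toy corpus, and the use of local threads in place of cluster nodes are all illustrative.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def expected_counts(shard, t):
    """E-step on one shard: fractional alignment counts under the table t."""
    pair_counts = defaultdict(float)
    target_totals = defaultdict(float)
    for source, target in shard:
        for f in source:
            norm = sum(t[(f, e)] for e in target)
            for e in target:
                c = t[(f, e)] / norm
                pair_counts[(f, e)] += c
                target_totals[e] += c
    return pair_counts, target_totals


def train_model1(corpus, n_shards=2, iterations=5):
    """Estimate t(f | e) with the E-step distributed over corpus shards."""
    source_vocab = {f for source, _ in corpus for f in source}
    uniform = 1.0 / len(source_vocab)
    t = {(f, e): uniform for source, target in corpus
         for f in source for e in target}
    shards = [corpus[i::n_shards] for i in range(n_shards)]
    for _ in range(iterations):
        # Each shard could run on a separate machine; threads stand in here.
        with ThreadPoolExecutor(max_workers=n_shards) as pool:
            partials = list(pool.map(lambda shard: expected_counts(shard, t), shards))
        # Merge the partial counts, then re-estimate the table (M-step).
        pair_counts, target_totals = defaultdict(float), defaultdict(float)
        for partial_pairs, partial_totals in partials:
            for key, value in partial_pairs.items():
                pair_counts[key] += value
            for key, value in partial_totals.items():
                target_totals[key] += value
        t = {(f, e): count / target_totals[e]
             for (f, e), count in pair_counts.items()}
    return t


# Toy parallel corpus of (source words, target words) pairs.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
sharded = train_model1(corpus, n_shards=2)
single = train_model1(corpus, n_shards=1)
print(sharded[("das", "the")])
```

Because the expected counts are plain sums over sentence pairs, merging the shard results gives the same table as single-machine training up to floating-point rounding. This is why the work can be spread over a cluster without changing the model.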


