Sponsor: Beijing University of Chinese Medicine
ISSN 1006-2157 CN 11-3574/R

Journal of Beijing University of Traditional Chinese Medicine, 2015, Vol. 38, Issue (9): 587-590. doi: 10.3969/j.issn.1006-2157.2015.09.006

• Literature Research •

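  • Corresponding author: MENG Qinggang, male, professor, doctoral supervisor; research interest: TCM information processing based on system complexity. E-mail: mqgangzy@126.com
  • About the author: MENG Hongyu, female, master's degree
  • Funding:
    *National Natural Science Foundation of China (No. 81273876, No. 81473800, No. 81072897)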

Automatic identification of TCM terminology in Shanghan Lun based on conditional random field*

MENG Hongyu1, XIE Qingyu2, CHANG Hong3, MENG Qinggang1#

  1 School of Preclinical Medicine, Beijing University of Chinese Medicine, Beijing 100029; 2 Institute of Basic Clinical Medicine, China Academy of Chinese Medical Sciences; 3 Baotou Medical College, Inner Mongolia
  • Received: 2015-04-07 Online: 2015-09-30 Published: 2015-09-30

Abstract: Objective To explore methods for the automatic identification of traditional Chinese medicine (TCM) terminology and to extend the forms of natural language processing applied to TCM texts. Methods A conditional random field (CRF) approach was applied to the automatic identification and annotation of TCM terms in the text of Shanghan Lun, including symptoms, disease names, pulse types and prescriptions. Features combining the Chinese character itself, part of speech, word boundary and term category label were used; the effect of different feature combinations on term identification was analyzed, and the most effective combination was determined. Results The identification model using the character itself, word boundary, part of speech and category label as its feature combination achieved a precision of 85.00%, a recall of 68.00% and an F-score of 75.56%. Conclusion The multi-feature model that fuses the character itself, part of speech, word boundary and term category label achieved the best identification performance among the combinations tested.

Key words: TCM terminology, conditional random fields, Shanghan Lun, automatic identification

CLC number: R222.19
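
The feature set and scores named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample segmentation, the part-of-speech tags, the category names (DISEASE, SYMPTOM, PULSE), the B/M/E/S boundary scheme, and the BIO label scheme are all assumptions made for illustration. The sketch expands word-level annotations into one row per character, in the column style commonly fed to CRF toolkits, and recomputes the F-score from the reported precision and recall.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: (word, part of speech, term category or "O").
# Segmentation, POS tags, and category names are illustrative assumptions.
Token = Tuple[str, str, str]


def char_features(tokens: List[Token]) -> List[Dict[str, str]]:
    """Expand word-level annotations into one feature row per character."""
    rows: List[Dict[str, str]] = []
    for word, pos, category in tokens:
        last = len(word) - 1
        for i, ch in enumerate(word):
            # Word boundary: S = single-character word, B = first character,
            # E = last character, M = any character in between.
            if last == 0:
                boundary = "S"
            elif i == 0:
                boundary = "B"
            elif i == last:
                boundary = "E"
            else:
                boundary = "M"
            # Term category label in BIO form; "O" marks non-term characters.
            if category == "O":
                label = "O"
            else:
                label = ("B-" if i == 0 else "I-") + category
            rows.append(
                {"char": ch, "pos": pos, "boundary": boundary, "label": label}
            )
    return rows


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F-score)."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence: List[Token] = [
        ("太阳病", "n", "DISEASE"),
        ("发热", "v", "SYMPTOM"),
        ("脉缓", "n", "PULSE"),
        ("者", "u", "O"),
    ]
    for row in char_features(sentence):
        print(row["char"], row["pos"], row["boundary"], row["label"])

    # Reported precision 85.00% and recall 68.00% give the reported F-score.
    print(f"{f_score(0.85, 0.68) * 100:.2f}")  # 75.56
```

Dropping a column from the rows (for example, part of speech) is one simple way to mimic the comparison of feature combinations described in the abstract; the F-score check confirms that the three reported figures are mutually consistent.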