Automatic Construction and Update of Language Models Based on Web Pages and PDF Documents

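Cite this article:	ZHANG Qiang, TAO Hong-cai. Automatic construction and update of language model based on Web and PDF documents[J]. Journal of Chengdu University of Information Technology, 2009, 24(5): 461-465.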

Authors:	ZHANG Qiang, TAO Hong-cai

Institution:	School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031

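Abstract:	This paper proposes using the HTMLParser and PDFBox toolkits to programmatically extract the text content of Web pages and to convert PDF documents to text, and then to process these data so that they meet the requirements of the HTK language modeling tools. Experiments show that the method achieves automatic updating of the language model reasonably well, so that the model adapts to a continually changing recognition target, while reducing the number of out-of-vocabulary words encountered during recognition and improving the performance of the language model.
Keywords:	speech recognition; language model; out-of-vocabulary words; automatic update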
Automatic construction and update of language model based on Web and PDF documents

ZHANG Qiang, TAO Hong-cai. Automatic construction and update of language model based on Web and PDF documents[J]. Journal of Chengdu University of Information Technology, 2009, 24(5): 461-465.

Authors:	ZHANG Qiang, TAO Hong-cai

Institution:	School of Information Science & Technology, SWJTU (Southwest Jiaotong University), Chengdu 610031, China

Abstract:	A method is introduced that uses the HTMLParser and PDFBox toolkits to extract the text content of Web pages and to convert PDF documents to plain text (TXT), and then processes the resulting text to meet the requirements of the LM (language model) tools in HTK (Hidden Markov Model Toolkit). The experiment shows that the method constructs and updates the LM automatically, reducing out-of-vocabulary (OOV) words and improving LM performance relative to the baseline LM, as measured by perplexity and OOV rate.

Keywords:	speech recognition; language model; out of vocabulary; automatic update
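
The pipeline summarized in the abstract (extract page text, normalize it for LM training, then measure out-of-vocabulary words) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses Python's standard-library `html.parser` as a stand-in for the Java HTMLParser toolkit, omits the PDF step entirely, and assumes an English-only, one-sentence-per-line format wrapped in `<s> ... </s>` markers, which should be checked against the HTK documentation. The names `TextExtractor`, `html_to_text`, `to_lm_sentences`, and `oov_rate` are hypothetical.

```python
# Stand-in for the Java HTMLParser toolkit named in the abstract (assumption).
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_lm_sentences(text):
    """Split on sentence-final punctuation and wrap each sentence in markers.

    Assumption: upper-cased English words only. Chinese text would need a
    word segmenter, which this sketch does not provide.
    """
    sentences = []
    for part in re.split(r"[.!?。！？]+", text):
        words = re.findall(r"[A-Za-z']+", part.upper())
        if words:
            sentences.append("<s> " + " ".join(words) + " </s>")
    return sentences


def oov_rate(sentences, vocabulary):
    """Fraction of word tokens (markers excluded) missing from the vocabulary."""
    tokens = [w for s in sentences for w in s.split() if w not in ("<s>", "</s>")]
    if not tokens:
        return 0.0
    unknown = sum(1 for w in tokens if w not in vocabulary)
    return unknown / len(tokens)
```

In this sketch, adding newly collected sentences to the training text and extending the vocabulary with their words is what would lower `oov_rate` on later input; the thresholds and update schedule used in the paper are not stated in the abstract.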
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.