Tumor feature gene selection method based on PCA and information gain
Xu Jiucheng, Huang Fangzhou, Mu Huiyu, Wang Yun, Xu Zhanwei
Abstract:
To address the low classification accuracy caused by the high dimensionality and the large number of redundant genes in tumor gene data, a tumor feature gene selection method based on PCA and information gain is proposed. First, the PCA algorithm is used to eliminate redundant genes and obtain a preselected feature gene subset. Then, the information gain algorithm is used to refine the preselected subset, yielding the final feature gene subset. Finally, simulation experiments are run on the feature gene subset with different classification models. The experimental results show that the proposed method improves the classification accuracy of gene expression profiles, which indicates that pathogenic genes are effectively selected.
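The information gain step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the median-based binarization of expression values, the function names (`entropy`, `information_gain`, `rank_genes`), and the toy data are all assumptions made for illustration, and the PCA preselection step is not covered because the abstract does not say how PCA is used to discard genes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG = H(labels) - H(labels | gene), with the gene binarized at its median.

    The median split is an assumption of this sketch; the abstract does not
    state how expression values are discretized.
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    high = [y for x, y in zip(values, labels) if x > median]
    low = [y for x, y in zip(values, labels) if x <= median]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (high, low) if part)
    return entropy(labels) - conditional


def rank_genes(expression, labels, top_k):
    """Return the indices of the top_k genes ranked by information gain.

    expression[i][j] is the expression level of gene j in sample i.
    """
    n_genes = len(expression[0])
    scores = [
        information_gain([row[j] for row in expression], labels)
        for j in range(n_genes)
    ]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:top_k]


# Toy data (made up): 4 samples, 3 genes; only gene 0 separates the classes.
toy_expression = [
    [0.1, 5.0, 2.0],
    [0.2, 1.0, 2.1],
    [0.9, 5.1, 2.0],
    [0.8, 1.1, 2.1],
]
toy_labels = ["normal", "normal", "tumor", "tumor"]
selected = rank_genes(toy_expression, toy_labels, top_k=1)
```

In a pipeline like the one the abstract outlines, `rank_genes` would be applied to the preselected subset that survives the PCA step, and the selected columns would then be passed to the classification models.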
Keywords:
gene classification; principal component analysis (PCA); information gain; feature selection
Foundation: National Natural Science Foundation of China (61370169; 60873104); Key Science and Technology Research Project of Henan Province (142102210056; 162102210261)
DOI: 10.16366/j.cnki.1000-2367.2018.02.017