Mining Knowledge from Gene Expression Data

Abstract

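The OPSM (order-preserving submatrix) model, a pattern-based biclustering method, is widely used to analyze gene data matrices. In an OPSM cluster, the member genes show a consistent expression pattern over a particular subset of conditions. Such correlated co-expression suggests that the genes are co-regulated, so biclustering of gene data matrices is biologically meaningful. By recasting OPSM mining as sequential pattern mining, the biclustering problem becomes a frequent-itemset mining problem. However, as more and more genes are discovered, gene data matrices keep growing, and current biclustering algorithms for gene expression data all suffer from low time efficiency. This makes frequent itemsets hard to find. Long frequent itemsets with low support, in particular, carry meaningful information that earlier biclustering methods struggled to discover. The Deep-OPSM problem targets exactly these long, low-support frequent patterns in a gene data matrix, and solving it would be of greater biological significance for gene data analysis. Existing biclustering models, however, degrade severely on large gene data matrices, so deeper information hidden in such matrices is hard to uncover. More efficient methods for finding OPSMs are therefore urgently needed.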
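To make the OPSM definition concrete, here is a minimal sketch that checks which genes follow a given ordering of conditions. It is an illustration only: the function names `follows_order` and `supporting_rows` and the toy values are assumptions made for this example, not part of the original work.

```python
def follows_order(row, columns):
    """Return True if the row's values strictly increase along `columns`."""
    return all(row[a] < row[b] for a, b in zip(columns, columns[1:]))


def supporting_rows(matrix, columns):
    """Indices of the rows (genes) whose values follow the column order."""
    return [g for g, row in enumerate(matrix) if follows_order(row, columns)]


# Toy matrix: 3 genes (rows) x 3 conditions (columns); values are made up.
toy = [
    [1.0, 5.0, 3.0],
    [2.0, 9.0, 4.0],
    [7.0, 1.0, 8.0],
]
genes = supporting_rows(toy, (0, 2, 1))  # genes 0 and 1 rise along 0 -> 2 -> 1
```

In this reading, the column order `(0, 2, 1)` is the pattern and the number of supporting genes is its support. A long pattern supported by only a few genes is what the text calls a Deep-OPSM.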

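Based on the OPSM model, this paper develops a fast and effective exact search method for mining the OPSM clusters scattered in a gene data matrix. The method first finds the common subsequences of every pair of rows in the matrix, then uses an STL map to count the support of each common subsequence over the whole matrix, and outputs the OPSM clusters that reach the support threshold. Experiments show that the method finds qualifying OPSM clusters quickly and that, through conditional storage, it can target long frequent patterns and mine the more biologically meaningful Deep-OPSM clusters. Conditional storage also allows the computation to run in parallel on several machines, which speeds up processing and accommodates large data matrices. Finally, the feasibility of the method is verified from a biological perspective.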
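The steps described above can be sketched roughly as follows. This is not the author's implementation, and several choices are assumptions: a Python `dict` stands in for the STL map, only one longest common subsequence is kept per pair of rows (the abstract does not say which common subsequences are retained), ties between equal expression values are broken by column index, "reaching" the threshold is read as support >= `min_support`, and `min_length` is one possible reading of selecting long patterns by length. All function names are hypothetical.

```python
from itertools import combinations


def row_to_sequence(row):
    """Column indices ordered by increasing expression value."""
    return tuple(sorted(range(len(row)), key=lambda j: row[j]))


def lcs(a, b):
    """One longest common subsequence of a and b (classic dynamic programming)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:  # walk the table to recover one LCS
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return tuple(out)


def is_subsequence(pattern, seq):
    """True if `pattern` appears in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(x in it for x in pattern)


def mine_opsm(matrix, min_support, min_length=2):
    """Map each candidate pattern to its supporting rows, keeping frequent ones."""
    seqs = [row_to_sequence(r) for r in matrix]
    support = {}  # plays the role of the STL map: pattern -> supporting rows
    for s, t in combinations(seqs, 2):
        p = lcs(s, t)
        if len(p) >= min_length and p not in support:
            support[p] = [g for g, q in enumerate(seqs) if is_subsequence(p, q)]
    return {p: rows for p, rows in support.items() if len(rows) >= min_support}


# Toy matrix: 4 genes x 4 conditions; values are made up.
toy = [
    [1, 2, 3, 4],
    [10, 20, 30, 5],
    [2, 4, 6, 1],
    [9, 7, 5, 3],
]
found = mine_opsm(toy, min_support=3)
```

On this toy input, `found` maps the pattern `(0, 1, 2)` to genes `[0, 1, 2]` and the pattern `(3, 2)` to genes `[1, 2, 3]`. Lowering `min_support` to 2 also admits the longer pattern `(3, 0, 1, 2)`, supported only by genes 1 and 2, which illustrates the long, low-support kind of pattern the text associates with Deep-OPSMs.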

Keywords: OPSM, sequential pattern, Deep-OPSM, STL map

Mining Knowledge from Gene Expression Data

Abstract

The order-preserving submatrix (OPSM) has been widely accepted as a biologically meaningful cluster model, capturing the general tendency of gene expression across a subset of experiments. In an OPSM, the expression levels of all genes induce the same linear ordering of the experiments. The OPSM problem is to discover the statistically significant OPSMs in a given data matrix. The problem is reducible to a special case of the sequential pattern mining problem, where a pattern and its supporting sequences uniquely specify an OPSM. However, as more and more genes are discovered, data sets contain more and more experiments and genes, and existing methods, being inefficient, do not scale well to massive data sets containing many experiments and hundreds of thousands of genes. This makes it difficult to discover OPSMs in massive data sets. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs in their discovery and are completely pruned off by existing methods. The Deep-OPSM problem is to discover these long patterns with few supporting sequences in a data set, and solving it would have greater biological significance for the analysis of data matrices. More efficient ways of finding OPSMs are therefore needed.
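As a small illustration of the statement that a pattern and its supporting sequences uniquely specify an OPSM, the sketch below extracts the submatrix from a pattern (an ordering of experiments) and the indices of its supporting genes. The function name `opsm_submatrix` and the toy values are assumptions made for this example.

```python
def opsm_submatrix(matrix, pattern, rows):
    """Rows `rows` restricted to the columns of `pattern`, in pattern order."""
    return [[matrix[g][c] for c in pattern] for g in rows]


# Toy matrix: 3 genes x 3 experiments; values are made up.
data = [
    [1.0, 5.0, 3.0],
    [2.0, 9.0, 4.0],
    [7.0, 1.0, 8.0],
]
sub = opsm_submatrix(data, pattern=(0, 2, 1), rows=[0, 1])
```

Every row of `sub` is increasing from left to right, which is the order-preserving property: both selected genes induce the same linear ordering of the selected experiments.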

In this paper, we propose an exact method that is fast and efficient for finding all OPSMs in a data set, including Deep-OPSMs. First, we find the common subsequences of every pair of rows in the data matrix, and then we use an STL map to count the support of every common subsequence over the whole data matrix. If the support of a common subsequence is greater than the support threshold, we have found an OPSM. Experimental results show that this method can quickly find qualified OPSMs. We can also mine just the Deep-OPSMs, which have more biological significance, by selecting the long frequent patterns according to their lengths. In addition, because of the storage conditions (the length of the common
