Data Mining: Yeast Gene Regulation Prediction Dataset
Summary:
This dataset was used in the KDD Cup 2002 data mining competition. The data describe the activity of a hidden biological system in yeast cells. Specifically, a set of yeast strains was generated, each characterized by a single gene being knocked out (i.e. disabled). Each example in the data set therefore corresponds to a single knocked-out gene, labeled with a discrete measurement of how active the hidden system in the cell is when that gene is knocked out.
Keywords:
Data mining, Biology, Yeast Gene, KDD Cup 2002
Data format:
TEXT
Data use:
The data can be used for regulation prediction, i.e. predicting the label of each knocked-out gene.
Detailed description:
Yeast Gene Regulation Prediction Dataset
∙ Description: This dataset was used in the KDD Cup 2002 data mining competition. The data describe the activity of a hidden biological system in yeast cells. Specifically, a set of yeast strains was generated, each characterized by a single gene being knocked out (i.e. disabled). Each example in the data set therefore corresponds to a single knocked-out gene, labeled with a discrete measurement of how active the hidden system in the cell is when that gene is knocked out.
There are three labels:
o "nc": Indicates that the activity of the hidden system in the given yeast strain was not significantly different from the baseline.
o "control": Indicates that the activity of the hidden system was significantly different from the baseline for the given instance, but that the activity of another hidden system (the control) was also significantly changed relative to its baseline.
o "change": Indicates that the activity of the hidden system was significantly changed, but the activity of the control system was not.
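As a rough illustration of how the three labels might be handled in code, the sketch below maps them to integer codes and counts them. Only the label strings "nc", "control" and "change" come from the description; the "gene_id,label" line layout, the gene names, the integer codes and the function names are all assumptions made for the example, not the actual file format.

```python
from collections import Counter

# The label strings come from the dataset description; the integer codes
# and the "gene_id,label" layout are assumptions for illustration only.
LABEL_CODES = {"nc": 0, "control": 1, "change": 2}


def parse_labels(lines):
    """Map each gene identifier to its integer label code."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gene_id, label = line.split(",")
        labels[gene_id.strip()] = LABEL_CODES[label.strip()]
    return labels


def is_specific_change(code):
    """True only for "change": hidden system changed, control did not."""
    return code == LABEL_CODES["change"]


# Hypothetical sample rows, not taken from the real data.
sample = ["GENE_A,nc", "GENE_B,control", "GENE_C,change", "GENE_D,nc"]
labels = parse_labels(sample)
counts = Counter(labels.values())
```

Keeping "control" as its own code, rather than merging it with either neighbour, leaves open whether a later model treats it as a positive or a negative case.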
A variety of other information accompanies the above data. These data sources include categorical features describing gene localization and function, abstracts from the scientific literature (MEDLINE), and a table of protein-protein interactions that relate the products of pairs of genes. A local copy of the file describing the whole database is available here.
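One possible way to turn a table of protein-protein interactions into per-gene features is sketched below. It assumes the table can be read as pairs of gene identifiers, which is a guess about the layout. The two features (number of interaction partners, and number of partners already labeled "change") are illustrative choices, not a method prescribed by the dataset, and the gene names are hypothetical.

```python
from collections import defaultdict


def build_interaction_map(pairs):
    """Build a symmetric adjacency map from (gene_a, gene_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def interaction_features(gene, neighbors, known_labels):
    """Return [number of partners, number of partners labeled "change"]."""
    partners = neighbors.get(gene, set())
    changed = sum(1 for p in partners if known_labels.get(p) == "change")
    return [len(partners), changed]


# Hypothetical interaction pairs and training labels.
pairs = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_A", "GENE_C")]
known = {"GENE_B": "change", "GENE_C": "nc"}
neighbors = build_interaction_map(pairs)
features_a = interaction_features("GENE_A", neighbors, known)
```

Counting labeled partners uses only training labels, so in practice the labels of held-out genes would have to be excluded from `known_labels`.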
∙ Objections: Too complex/challenging for a DME mini-project. There are three sources of data to take into account (including carrying out information extraction from the scientific abstracts). One of these sources is information on protein-protein interactions, which will have to be coded in an appropriate way for machine learning. Also, the classes are very imbalanced.
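A common way to compensate for imbalanced classes is to weight each class inversely to its frequency. The sketch below computes such weights with the n_samples / (n_classes * class_count) heuristic, the same formula scikit-learn uses for class_weight="balanced". The label counts in the example are invented for illustration and do not reflect the real class proportions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up label distribution, used only to show the effect of weighting.
labels = ["nc"] * 8 + ["control"] * 1 + ["change"] * 1
weights = inverse_frequency_weights(labels)
```

With these weights, the rare "change" and "control" examples contribute more to a weighted loss than the frequent "nc" examples.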
Reference