R语言命令Tutorial-更新后,CCA ,RDA,PCA, heatmap
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
一、GeoChip 数据处理
1 准备数据
•登录数据库,用户名ieg\jianqiang,PW:ieg123?
•选择GeoChip4数据,再次输入用户密码;
•点击Prepare microarray data,点击选择要分析的数据,点击submit,勾选“Remove the spots SNR less than 2”,此即为SNR数据;若勾选“Adjust SNR according to Thermophile probes less than 5%”,此即为Thermo数据;
•勾选“by dividing the mean of each sample”,即为DBM数据,若不勾选,则为relative abundance(RA)数据;
•这样就有四套数据分别为SNR DBM、SNR RA、Thermo DBM、Thermo RA,选择好要下载的数据后,输入“Experimental Name:”,点击submit,点击“set it as default”
点击go,随后点击main回到主界面,点击“Analyze microarray data”在跳出的对话框中点cancel,点击如下图所示项目后,点击submit;
在show information框中选择除了“All targets”外的所有项,点击submit;
点击download后,待所有数据出来后另存为文本文件,这样就准备好一套数据,将所有4套数据都如此下载好。
•数据下载后,在excel中去除各样品重复中只有一个重复有检测到基因信号的数据,即为cut 1;
•Relative abundance数据需要将各基因信号值分别除以该样品中所有基因信号值得和,再乘以各个样品基因信号值和的平均值,即data1/sum1*average。
这样即得到Relative abundance数据
•Relative abundance数据继续做两种处理,一是将数据+1后取ln,一是将数据除以1000;这样总共是6套数据,将所有数据中0值替换为空白,同时只留下gene ID和genename两项,
另存为tab delimited txt文件,即可用于DCA(Detrended Correspondence Analysis)、Dissimilarity Test、cluster(A simple hierarchical clustering analysis)分析;
2 数据预分析
2.1 DCA分析
在数据分析界面点击以下项后,上传刚刚准备的数据,即可做DCA分析,
结果可获得DCA图及DCA数据,可拷贝出数据自行作图;
2.2 Dissimilarity Test
点击后,上传数据,选择需要比较的样品,即可做MRPP、anosim、adonis比较,记录distance和sig值;
2.3 cluster分析
点击,将数据按各样品取平均值后上传分析,即可得cluster图。
比较以上6套数据分析结果,选择最好的一套进行数据处理,目前一般SNR RA ln的数据最好,可只比较其与Thermo RA ln之间的差异。
3 数据处理
3.1 Diversity and summary
3.1.1 alpha-diversity
点击,上传用于做DCA的数据,即可得到alpha-diversity;
3.1.2 beta-diversity
点击,上传用于做DCA的数据,即可得到beta-diversity;
3.1.3 Summary gene categories
点击,上传数据,此时只用relative abundance的数据,不要取ln,若保留genename 则只统计gene水平,保留各个不同级别的category,则统计为各个category的
加和;数据上传后选择show intensity,将所得的结果保存;求出各个样品的平均值,取ln去掉0值后保存为txt文件,即可用于genecluster分析;
3.1.4 Summary phylogenetic
点击,同上但只保留Organism或Lineage即可对phylogenetic基因分布进行统计;
3.2 Gene cluster
将3.1.3得到文件进行GeneCluster分析,打开GeneCluster软件,点击“load files”载入文件,如下图勾选计算后可达到3个文件cdt、atr、gtr,atr为每列信号和,gtr为每行信号和;
用GeneTreeView打开cdt文件即可看到图,save tree image保存树图,save zoomed image保存信号图,其中红色方块深浅代表信号值,即该基因丰度,若采用category数据,则表示某类基因丰度,将树图和该图用photoshop拼合即可。
3.3 Response ratio
点击,上传与3.1.3相同数据,gene 水平和category水平数据均可
3.4 环境因子拟合
采用3.1.3中基因水平或category水平的相对丰度,计算标准差,或采用total intensity,直接与环境因子拟合,correlation test相关性分析即可,excel即可做。
3.5CCA using R
•打开R软件,输入
All=read.table('D:/桌面/SNR RA ln for cca.csv',header=T, sep=","),输入准确的路径名,以及要读入的数据文件名,采用csv格式数据,Relative abundance取ln的数据,空白值替换为0,不要genename等,格式如下:
dim(All)
T.All<-t(All)
dim(T.All),这三个命令是将数据转置为样品在列;
•All.group=read.table('D:/桌面/Chemical property.csv',header=T, sep=",") Y=T.All
X=All.group,读入理化参数数据,格式如下
行为参数值,列为样品名,参数不要用符号空格。
•library(vegan),调用vegan程序包;
•输入,A=cca(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X),即所有的参数名字,即开始计算CCA.
•输入a(A),看计算结果,可导出自行作图,用“Site scores (weighted averages of species scores)”数据样品点,“Biplot scores for constraining variables”做环境因子;
•输入a(A),看各参数的vif值,<10表示该参数为单独影响,>10表示该参数为共同影响,一般选择<20的参数;
•输入plot(A,dis=c('wa','cn'))输出CCA排序图;
•输入a(A, by = "term", step = 200)计算方差,各环境因子显著性检验;
输入anova(A,by="axis",perm=1000),看各轴解释量;再次输入A=cca(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X),可得以下结果:解释量即为某轴值除以Total值,如高亮部分,其解释量为12.81%
Inertia Proportion Rank
Total 0.3537 1.0000
Constrained 0.1336 0.3778 8
Unconstrained 0.2201 0.6222 30
Inertia is mean squared contingency coefficient
414 species (variables) deleted due to missingness
Eigenvalues for constrained axes:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8
0.045309 0.034580 0.013917 0.009750 0.009280 0.008328 0.007542 0.004894
Eigenvalues for unconstrained axes:
CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
0.036623 0.013366 0.011339 0.010286 0.009655 0.009085 0.008615 0.007996
(Showed only 8 of all 30 unconstrained eigenvalues)
•输入bioenv(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X),
该项也可在网站上进行分析,可得出矩阵,各个因子,上传DCA数据和环境因子数据。
Best Subset of Environmental Variables with Maximum (Rank) Correlation with Community Dissimilarities
3.6 Partial CCA
输入cca(Y ~ CEC+Condition(pH+TOC+TN+TC+TS+NH4+AK), data=X),可得CEC的部分解释量
cca(Y ~ pH+ TOC+Condition(CEC+ TN+TC+TS+NH4+AK), data=X),则得到pH和TOC 的解释量,最终做出Variation partitioning analysis图,即VPA分析。
3.7 Partial Mantel Test
To calculate the correlation between microarray data and environmental factor(s)
点击,上传做DCA的数据,再上传做CCA的环境因子数据,即可针对不同基因和不同基因category做mantel test,若只选择几个环境因子数据,则为Partial Mantel Test。
结果如下:
r为相关系数,significance
3.8 MRT
Multivariate Regression Tree
读入数据与CCA分析相同,先读基因数据
All=read.table('D:/桌面/SNR RA ln for cca.csv',header=T, sep=",")
dim(All)
T.All<-t(All)
dim(T.All),这三个命令是将数据转置为样品在列;
再读treatment数据
All.group=read.table('D:/桌面/treatment.csv',header=T, sep=",")
Y=T.All
X=All.group
MRT=rpart(Y ~ age+plant+landusing,data=X),加亮为处理名字
plot(MRT)
text(MRT)
treatment数据如下图排列
所得结果中字母a b即按照处理中字母顺序排列,如Ancient 和Modern,前者为a,后者为b。
画图前应将绘图窗口放大。
二、Vegan tutor
Common commands
write.table(shan,quote=FALSE,sep="\t",file="shan.txt")
x11(1),打开一个绘图窗口
dev.list(),看有几个设备
dev.set(2), 改变当前设备为2
1 DCA using R
选用做DCA的数据,只保留geneID即可。
library(vegan)
x<-read.csv(file.choose(), s=1)
x[is.na(x)]<-0
dim(x)
X<-t(x)
dim(X)
X.dca<-decorana(X)
summary(X.dca)
a = cca(X)
total.eigen = a$tot.chi
dca1.percent = round(X.dca$evals[1]/total.eigen,digits=3)*100 dca2.percent = round(X.dca$evals[2]/total.eigen,digits=3)*100 bel =paste("DCA1 ",dca1.percent, "%",sep="")
bel =paste("DCA2 ",dca2.percent,"%",sep="")
jpeg(file="1304744832.jpeg")
plot(X.dca,dis="site",type="text",xlab=bel,ylab=bel) points(X.dca, pch=21, col="red", bg="red", cex=0.5)
dev.off()
2 Diversity using R
2.1 a-diversity
数据表格同3.9 DCA分析
library(vegan)
X<-read.csv(file.choose(), s=1)
X[is.na(X)]<-0
dim(X)
spe<-t(X)
dim(spe) 转置,数据量大,excel无法转置
X<-read.csv("Chem property.csv",header=T,s=1,sep=",")
dim(env)
Shannon指数:diversity(spe, index = "shannon", MARGIN = 1, base = exp(1))或者:shan<-diversity(spe)
Simpson指数:
simp <- diversity(spe, "simpson")
invsimp <- diversity(spe, "inv")
pairs(cbind(shan, simp, invsimp), pch="+", col="blue", type="text") Species richness:S <- specnumber(spe)
Pielou's evenness:J <- shan/log(S)
2.2 b-diversity
library(MASS)
Non-metric Multidimensional scaling
BrayCurtis:bray<-vegdist(spe),vegdist的默认index是braycurtis write.matrix(bray, file = "", sep = " ") 即可写出数据,拷贝出来
vare.mds0<-isoMDS(bray)
stressplot(vare.mds0,bray)
ordiplot(vare.mds0,type="t")
Community dissimilarities
rankindex(scale(env), spe, c("euc", "man", "bray", "jac", "kul"))
euclidean indice:euclidean <- vegdist(decostand(A, "norm"), "euclidean")
euclidean <- vegdist(decostand(spe, "hell"), "euclidean")
chao <- vegdist(decostand(spe, "norm"), "chao")
vare.mds0<-isoMDS(euclidean)
stressplot(vare.mds0,euclidean)
ordiplot(vare.mds0,type="t")
3 PCA using R
数据表格同DCA分析(上接DCA)
vare.pca <- rda(X)
vare.pca
summary(vare.pca)
bel =paste("PCA1 ",10.61, "%",sep="")
bel =paste("PCA2 ",3.69,"%",sep="")
plot(vare.pca, dis="site",type="text", xlab=bel,ylab=bel) points(vare.pca, pch=21, col="red", bg="red", cex=0.5)
spe.h=decostand(spe,"hellinger")
spe.pca=rda(spe.h)
plot(spe.pca, display = "sites", scaling = 1, type = "text")
plot(spe.pca, display = "sites", scaling = 1, type = "text", choices = c(1,3)) plot(spe.pca, display = "sites", scaling = 1, type = "n", choices = c(1, 2)) sc <- scores(spe.pca, display = "sites", scaling = 1, choices = c(1, 2)) my.colors <- as.character(as.integer(group$group) + 1)
points(sc, pch = 21, col = my.colors, bg = my.colors)
Correspondence analysis (CA):
vare.ca <- cca(spe)
vare.ca
summary(vare.ca)
Correspondence analysis is based on Chi-squared distance
chisq.test(spe/sum(spe))
env.s=decostand(env,"standardize")标准化环境因子
4 Environmental interpretation using R
4.1 Vector fitting using R
vare.mds <- metaMDS(spe, trace = FALSE)
ef <- envfit(vare.mds, env, permu = 999)
ef
plot(vare.mds, display = "sites", type="text")
plot(ef, p.max = 0.1) 该值为自己设定的p最大值,环境样品可为0.1 4.2 Surface fitting
ef <- envfit(vare.mds ~ NO3N + AP, env)
plot(vare.mds, display = "sites", type="text")
plot(ef)
tmp <- with(env, ordisurf(vare.mds, NO3N, add = TRUE))
with(env, ordisurf(vare.mds, AP, add = TRUE,+ col = "green4"))
4.3 Factors
spe.ca <- cca(spe)
ef <- envfit(spe.ca, env, permutations = 999)
ef
plot(spe.ca, display = "sites", type="text")
plot(ef)
4.4 Non-metric Multidimensional scaling
library(vegan)
library(MASS)
pyrospe=read.csv(file.choose(),s=1)
pyrospe=t(pyrospe)
pyrospe.dis <- vegdist(pyrospe)
pyrospe.mds0 <- isoMDS(pyrospe.dis)
Nmds maps observed community dissimilarities nonlinearly onto ordination space and it can handle nonlinear species responses of any shape
stressplot(pyrospe.mds0, pyrospe.dis)
ordiplot(pyrospe.mds0, type = "t")
pyrospe.mds <- metaMDS(pyrospe, trace = FALSE)
plot(pyrospe.mds, type = "t")
plot(pyrospe.mds, dis="site", type = "t")
5 Constrained ordination
5.1 Model specification
用~符号,左边是群落数据,右边是限制方程
a <- cca(spe ~ Nit+AMM+AP+pH+AK, env)
a
bel =paste("CCA1 ",30, "%",sep="")
bel =paste("CCA2 ",20,"%",sep="")
plot(a) or plot(a, dis=c('wa','cn'))
points(a, pch=21, col="red", bg="red", cex=0.5)
library(scatterplot3d)
ordiplot3d(a, type = "h")
pl <- ordiplot3d(a, angle=15, type="n")
points(pl, "points", pch=16, col="red", cex = 0.7)
text(pl, "arrows", col="blue", pos=3)
text(pl, "points", col="blue", pos=1, cex = 1.2)
5.2 Permutation tests
anova(a)
anova(a, by="term", step = 200) 计算方差,各环境因子显著性检验;anova(a, by = "margin", perm = 500)
anova(a,by="axis",perm=1000) 看各轴解释量
6 Corelation test using R
Y<-read.csv("correlation test.csv",s=1)
or Y<- read.csv(file.choose(), s=1)可选择文件
attach(E3)将每列都设为一个变量,名字为每列的名字
Y.cor<-cor(nosZ,nirS, method = "method")得出两个变量之间的相关系数,默认是pearson,还有"kendall" or "spearman"
E1.cor.test<-cor.test(nosZ,nirS, method = "method")显著性检验
E3.cor.mat<-cor(E3)计算各个变量之间的相关系数,生成相关系数矩阵symnum(Y.cor.mat)符号化相关程度
plot(x=nosZ, y=nirS, xlab="nosZ", ylab="nirS", pch=21)画出散点图
abline(lm(nirS~nosZ))加一条拟合曲线
7 Classification
7.1 Cluster analysis
dis <- vegdist(A)或者dis <- dist(X)
clus <- hclust(dis, "single")
plot(clus)
cluc <- hclust(dis, "complete")
plot(cluc)
clua <- hclust(dis, "average")
plot(clua)
range(dis)
cor(dis, cophenetic(clus))
cor(dis, cophenetic(cluc))
cor(dis, cophenetic(clua)),计算各方法的异化性
7.2 Display and interpretation of classes
plot(cluc)
rect.hclust(cluc, 3)
grp <- cutree(cluc, 3) 分为不同级别的cluster
boxplot(Nit ~ grp, data = env, notch = TRUE)
ord <- cca(dune)
plot(ord, display = "sites")
ordihull(ord, grp, lty = 2, col = "red") ordispider and ordiellipse 都可使用7.3 Classfied community tables
wa <- scores(ord, display = "sites", choices = 1)
den <- as.dendrogram(clua)
oden <- reorder(den, wa, mean)
op <- par(mfrow = c(2, 1), mar = c(3, 5, 1, 2) +0.1)
plot(den)
plot(oden)
par(op)
8 Dissimilarities and environment
8.1 adonis: Multivariate ANOVA based on dissimilarities
betad <- betadiver(spe, "z")
adonis(betad ~ Nit, env, perm = 200)
adonis(betad ~ Nit+AMM+AP+pH+TC+TN, env, perm = 200)
8.2 Homogeneity of groups and beta diversity
mod <- with(env, betadisper(betad, Nit+AMM+AP+pH+TC+TN))
plot(mod)
boxplot(mod)
anova(mod)
permutest(mod)
8.3 Mantel test
data(varespec)
data(varechem)
veg.dist <- vegdist(varespec) # Bray-Curtis
env.dist <- vegdist(scale(varechem), "euclid")
mantel(veg.dist, env.dist)
mantel(veg.dist, env.dist, method="spear")
8.4 Protest: Procrustes test
pc <- scores(pc, choices = 1:2)
pro <- protest(vare.mds, pc)
plot(pro)
pro
8.5 anosim
spe.dist <- vegdist(spe) or spe.dist <- vegdist(spe, method="euclidean") group=read.csv(file.choose(),s=1)如下图
attach(group)
spe.ano <- anosim(spe.dist, usage)
summary(spe.ano)
plot(spe.ano)
adonis(spe ~ usage, data=group, permutations=99,method="bray")
spe.mrpp <- mrpp(spe, group$usage, distance="bray")
9 Multivariate homogeneity of groups dispersions (variances)
用于检验重复中的outlier,此处spe用非相对丰度数据
spe.h=decostand(spe,"hellinger")
spe.h.dist=vegdist(spe.h, "euclidean")
var.test <- betadisper(spe.h.dist, group = group$group, type = "median", bias.adjust = TRUE)
boxplot(var.test)
permutest(var.test)
permutest(var.test, pairwise = TRUE)
10 PcoA Principal Coordinates Analysis
library(labdsv)
data(spe)
dis.bc <- dsvdis(spe,'bray/curtis')
veg.pco <- pco(dis.bc,k=4)
plot(veg.pco, dis="site",type="text")
三、Other application
1 seqRFLP
1.1 Rename sequence
library(seqRFLP)
x=read.fasta(file.choose())
gnames.fas(x)
y=rename.fas(x, name = paste("P04B1_", as.character(1:3678), sep = ""))
write.fasta(y,"P04B1.fasta")
2 heatmap
2.1 Simple heatmap
数据如下图
P1=read.csv(file.choose(),s=1)
P1_matrix <- data.matrix(P1)
P1_heatmap <-heatmap(P1_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
cm.colors看使用什么颜色,改变col参数
nba <- nba[order(nba$PTS),],使其按PTS由小到大排列
2.2 pheatmap function
X=read.csv(file.choose(),s=1)
X=as.matrix(X)
library(pheatmap)
pheatmap(X)
pheatmap(P1.n,scale="row")
pheatmap(spe,scale="row", clustering_distance_rows = "euclidean", clustering_distance_cols="euclidean", clustering_method = "average", display_numbers = "T", filename="4.pdf", fontsize = 8)
3 Phylogenetic tree
I made the tree figure with phyloseq’s plot_tree function:
plot_tree(temp, nodelabf=nodeplotblank, color="Group", size="abundance", label.tips="taxa_names", plot.margin=0.6)
where temp was a phyloseq object containing only the 16 OTUs. These OTUs were were all unclassified Bacteria with more than 500 sequences each.
四、Network analysis (igraph)
pheatmap(spe,scale="row", clustering_distance_rows = "FALSE", clustering_distance_cols="FALSE", clustering_method = "average", display_numbers = "T", filename="4.pdf", fontsize = 8)。