R语言命令Tutorial-更新后,CCA ,RDA,PCA, heatmap

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

一、GeoChip 数据处理
1 准备数据
•登录数据库，用户名ieg\jianqiang，PW：ieg123?
•选择GeoChip4数据，再次输入用户密码；
•点击Prepare microarray data，点击选择要分析的数据，点击submit，勾选“Remove the spots SNR less than 2”，此即为SNR数据；若勾选“Adjust SNR according to Thermophile probes less than 5%”，此即为Thermo数据；
•勾选“by dividing the mean of each sample”，即为DBM数据，若不勾选，则为relative abundance（RA）数据；
•这样就有四套数据分别为SNR DBM、SNR RA、Thermo DBM、Thermo RA，选择好要下载的数据后，输入“Experimental Name:”，点击submit，点击“set it as default”
点击go，随后点击main回到主界面，点击“Analyze microarray data”在跳出的对话框中点cancel，点击如下图所示项目后，点击submit；
在show information框中选择除了“All targets”外的所有项，点击submit；
点击download后，待所有数据出来后另存为文本文件，这样就准备好一套数据，将所有4套数据都如此下载好。

•数据下载后，在excel中去除各样品重复中只有一个重复有检测到基因信号的数据，即为cut 1；
•Relative abundance数据需要将各基因信号值分别除以该样品中所有基因信号值得和，再乘以各个样品基因信号值和的平均值，即data1/sum1*average。

这样即得到Relative abundance数据
•Relative abundance数据继续做两种处理，一是将数据+1后取ln，一是将数据除以1000；这样总共是6套数据，将所有数据中0值替换为空白，同时只留下gene ID和genename两项，
另存为tab delimited txt文件，即可用于DCA（Detrended Correspondence Analysis）、Dissimilarity Test、cluster（A simple hierarchical clustering analysis）分析；
2 数据预分析
2.1 DCA分析
在数据分析界面点击以下项后，上传刚刚准备的数据，即可做DCA分析，
结果可获得DCA图及DCA数据，可拷贝出数据自行作图；
2.2 Dissimilarity Test
点击后，上传数据，选择需要比较的样品，即可做MRPP、anosim、adonis比较，记录distance和sig值；
2.3 cluster分析
点击，将数据按各样品取平均值后上传分析，即可得cluster图。

比较以上6套数据分析结果，选择最好的一套进行数据处理，目前一般SNR RA ln的数据最好，可只比较其与Thermo RA ln之间的差异。

3 数据处理
3.1 Diversity and summary
3.1.1 alpha-diversity
点击，上传用于做DCA的数据，即可得到alpha-diversity；
3.1.2 beta-diversity
点击，上传用于做DCA的数据，即可得到beta-diversity；
3.1.3 Summary gene categories
点击，上传数据，此时只用relative abundance的数据，不要取ln，若保留genename 则只统计gene水平，保留各个不同级别的category，则统计为各个category的
加和；数据上传后选择show intensity，将所得的结果保存；求出各个样品的平均值，取ln去掉0值后保存为txt文件，即可用于genecluster分析；
3.1.4 Summary phylogenetic
点击，同上但只保留Organism或Lineage即可对phylogenetic基因分布进行统计；
3.2 Gene cluster
将3.1.3得到文件进行GeneCluster分析，打开GeneCluster软件，点击“load files”载入文件，如下图勾选计算后可达到3个文件cdt、atr、gtr，atr为每列信号和，gtr为每行信号和；
用GeneTreeView打开cdt文件即可看到图，save tree image保存树图，save zoomed image保存信号图，其中红色方块深浅代表信号值，即该基因丰度，若采用category数据，则表示某类基因丰度，将树图和该图用photoshop拼合即可。

3.3 Response ratio
点击，上传与3.1.3相同数据，gene 水平和category水平数据均可
3.4 环境因子拟合
采用3.1.3中基因水平或category水平的相对丰度，计算标准差，或采用total intensity，直接与环境因子拟合，correlation test相关性分析即可，excel即可做。

3.5CCA using R
•打开R软件，输入
All=read.table('D:/桌面/SNR RA ln for cca.csv',header=T, sep=",")，输入准确的路径名，以及要读入的数据文件名，采用csv格式数据，Relative abundance取ln的数据，空白值替换为0，不要genename等，格式如下：
dim(All)
T.All<-t(All)
dim(T.All)，这三个命令是将数据转置为样品在列；
•All.group=read.table('D:/桌面/Chemical property.csv',header=T, sep=",") Y=T.All
X=All.group，读入理化参数数据，格式如下
行为参数值，列为样品名，参数不要用符号空格。

•library(vegan)，调用vegan程序包；
•输入，A=cca(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X)，即所有的参数名字，即开始计算CCA.
•输入a(A)，看计算结果，可导出自行作图，用“Site scores (weighted averages of species scores)”数据样品点，“Biplot scores for constraining variables”做环境因子；
•输入a(A)，看各参数的vif值，<10表示该参数为单独影响，>10表示该参数为共同影响，一般选择<20的参数；
•输入plot(A,dis=c('wa','cn'))输出CCA排序图；
•输入a(A, by = "term", step = 200)计算方差，各环境因子显著性检验；
输入anova(A,by="axis",perm=1000)，看各轴解释量；再次输入A=cca(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X)，可得以下结果：解释量即为某轴值除以Total值，如高亮部分，其解释量为12.81%
Inertia Proportion Rank
Total 0.3537 1.0000
Constrained 0.1336 0.3778 8
Unconstrained 0.2201 0.6222 30
Inertia is mean squared contingency coefficient
414 species (variables) deleted due to missingness
Eigenvalues for constrained axes:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8
0.045309 0.034580 0.013917 0.009750 0.009280 0.008328 0.007542 0.004894
Eigenvalues for unconstrained axes:
CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
0.036623 0.013366 0.011339 0.010286 0.009655 0.009085 0.008615 0.007996
(Showed only 8 of all 30 unconstrained eigenvalues)
•输入bioenv(Y ~ pH+TOC+TN+TC+TS+CEC+NH4+AK, data=X)，
该项也可在网站上进行分析，可得出矩阵，各个因子，上传DCA数据和环境因子数据。

Best Subset of Environmental Variables with Maximum (Rank) Correlation with Community Dissimilarities
3.6 Partial CCA
输入cca(Y ~ CEC+Condition(pH+TOC+TN+TC+TS+NH4+AK), data=X)，可得CEC的部分解释量
cca(Y ~ pH+ TOC+Condition(CEC+ TN+TC+TS+NH4+AK), data=X)，则得到pH和TOC 的解释量，最终做出Variation partitioning analysis图，即VPA分析。

3.7 Partial Mantel Test
To calculate the correlation between microarray data and environmental factor(s)
点击，上传做DCA的数据，再上传做CCA的环境因子数据，即可针对不同基因和不同基因category做mantel test，若只选择几个环境因子数据，则为Partial Mantel Test。

结果如下：
r为相关系数，significance
3.8 MRT
Multivariate Regression Tree
读入数据与CCA分析相同，先读基因数据
All=read.table('D:/桌面/SNR RA ln for cca.csv',header=T, sep=",")
dim(All)
T.All<-t(All)
dim(T.All)，这三个命令是将数据转置为样品在列；
再读treatment数据
All.group=read.table('D:/桌面/treatment.csv',header=T, sep=",")
Y=T.All
X=All.group
MRT=rpart(Y ~ age+plant+landusing,data=X)，加亮为处理名字
plot(MRT)
text(MRT)
treatment数据如下图排列
所得结果中字母a b即按照处理中字母顺序排列，如Ancient 和Modern，前者为a，后者为b。

画图前应将绘图窗口放大。

二、Vegan tutor
Common commands
write.table(shan,quote=FALSE,sep="\t",file="shan.txt")
x11(1)，打开一个绘图窗口
dev.list()，看有几个设备
dev.set(2), 改变当前设备为2
1 DCA using R
选用做DCA的数据，只保留geneID即可。

library(vegan)
x<-read.csv(file.choose(), s=1)
x[is.na(x)]<-0
dim(x)
X<-t(x)
dim(X)
X.dca<-decorana(X)
summary(X.dca)
a = cca(X)
total.eigen = a$tot.chi
dca1.percent = round(X.dca$evals[1]/total.eigen,digits=3)*100 dca2.percent = round(X.dca$evals[2]/total.eigen,digits=3)*100 bel =paste("DCA1 ",dca1.percent, "%",sep="")
bel =paste("DCA2 ",dca2.percent,"%",sep="")
jpeg(file="1304744832.jpeg")
plot(X.dca,dis="site",type="text",xlab=bel,ylab=bel) points(X.dca, pch=21, col="red", bg="red", cex=0.5)
dev.off()
2 Diversity using R
2.1 a-diversity
数据表格同3.9 DCA分析
library(vegan)
X<-read.csv(file.choose(), s=1)
X[is.na(X)]<-0
dim(X)
spe<-t(X)
dim(spe) 转置，数据量大，excel无法转置
X<-read.csv("Chem property.csv",header=T,s=1,sep=",")
dim(env)
Shannon指数：diversity(spe, index = "shannon", MARGIN = 1, base = exp(1))或者：shan<-diversity(spe)
Simpson指数：
simp <- diversity(spe, "simpson")
invsimp <- diversity(spe, "inv")
pairs(cbind(shan, simp, invsimp), pch="+", col="blue", type="text") Species richness：S <- specnumber(spe)
Pielou's evenness：J <- shan/log(S)
2.2 b-diversity
library(MASS)
Non-metric Multidimensional scaling
BrayCurtis：bray<-vegdist(spe)，vegdist的默认index是braycurtis write.matrix(bray, file = "", sep = " ") 即可写出数据，拷贝出来
vare.mds0<-isoMDS(bray)
stressplot(vare.mds0,bray)
ordiplot(vare.mds0,type="t")
Community dissimilarities
rankindex(scale(env), spe, c("euc", "man", "bray", "jac", "kul"))
euclidean indice：euclidean <- vegdist(decostand(A, "norm"), "euclidean")
euclidean <- vegdist(decostand(spe, "hell"), "euclidean")
chao <- vegdist(decostand(spe, "norm"), "chao")
vare.mds0<-isoMDS(euclidean)
stressplot(vare.mds0,euclidean)
ordiplot(vare.mds0,type="t")
3 PCA using R
数据表格同DCA分析(上接DCA)
vare.pca <- rda(X)
vare.pca
summary(vare.pca)
bel =paste("PCA1 ",10.61, "%",sep="")
bel =paste("PCA2 ",3.69,"%",sep="")
plot(vare.pca, dis="site",type="text", xlab=bel,ylab=bel) points(vare.pca, pch=21, col="red", bg="red", cex=0.5)
spe.h=decostand(spe,"hellinger")
spe.pca=rda(spe.h)
plot(spe.pca, display = "sites", scaling = 1, type = "text")
plot(spe.pca, display = "sites", scaling = 1, type = "text", choices = c(1,3)) plot(spe.pca, display = "sites", scaling = 1, type = "n", choices = c(1, 2)) sc <- scores(spe.pca, display = "sites", scaling = 1, choices = c(1, 2)) my.colors <- as.character(as.integer(group$group) + 1)
points(sc, pch = 21, col = my.colors, bg = my.colors)
Correspondence analysis (CA)：
vare.ca <- cca(spe)
vare.ca
summary(vare.ca)
Correspondence analysis is based on Chi-squared distance
chisq.test(spe/sum(spe))
env.s=decostand(env,"standardize")标准化环境因子
4 Environmental interpretation using R
4.1 Vector fitting using R
vare.mds <- metaMDS(spe, trace = FALSE)
ef <- envfit(vare.mds, env, permu = 999)
ef
plot(vare.mds, display = "sites", type="text")
plot(ef, p.max = 0.1) 该值为自己设定的p最大值，环境样品可为0.1 4.2 Surface fitting
ef <- envfit(vare.mds ~ NO3N + AP, env)
plot(vare.mds, display = "sites", type="text")
plot(ef)
tmp <- with(env, ordisurf(vare.mds, NO3N, add = TRUE))
with(env, ordisurf(vare.mds, AP, add = TRUE,+ col = "green4"))
4.3 Factors
spe.ca <- cca(spe)
ef <- envfit(spe.ca, env, permutations = 999)
ef
plot(spe.ca, display = "sites", type="text")
plot(ef)
4.4 Non-metric Multidimensional scaling
library(vegan)
library(MASS)
pyrospe=read.csv(file.choose(),s=1)
pyrospe=t(pyrospe)
pyrospe.dis <- vegdist(pyrospe)
pyrospe.mds0 <- isoMDS(pyrospe.dis)
Nmds maps observed community dissimilarities nonlinearly onto ordination space and it can handle nonlinear species responses of any shape
stressplot(pyrospe.mds0, pyrospe.dis)
ordiplot(pyrospe.mds0, type = "t")
pyrospe.mds <- metaMDS(pyrospe, trace = FALSE)
plot(pyrospe.mds, type = "t")
plot(pyrospe.mds, dis="site", type = "t")
5 Constrained ordination
5.1 Model specification
用~符号，左边是群落数据，右边是限制方程
a <- cca(spe ~ Nit+AMM+AP+pH+AK, env)
a
bel =paste("CCA1 ",30, "%",sep="")
bel =paste("CCA2 ",20,"%",sep="")
plot(a) or plot(a, dis=c('wa','cn'))
points(a, pch=21, col="red", bg="red", cex=0.5)
library(scatterplot3d)
ordiplot3d(a, type = "h")
pl <- ordiplot3d(a, angle=15, type="n")
points(pl, "points", pch=16, col="red", cex = 0.7)
text(pl, "arrows", col="blue", pos=3)
text(pl, "points", col="blue", pos=1, cex = 1.2)
5.2 Permutation tests
anova(a)
anova(a, by="term", step = 200) 计算方差，各环境因子显著性检验；anova(a, by = "margin", perm = 500)
anova(a,by="axis",perm=1000) 看各轴解释量
6 Corelation test using R
Y<-read.csv("correlation test.csv",s=1)
or Y<- read.csv(file.choose(), s=1)可选择文件
attach(E3)将每列都设为一个变量，名字为每列的名字
Y.cor<-cor(nosZ,nirS, method = "method")得出两个变量之间的相关系数，默认是pearson，还有"kendall" or "spearman"
E1.cor.test<-cor.test(nosZ,nirS, method = "method")显著性检验
E3.cor.mat<-cor(E3)计算各个变量之间的相关系数，生成相关系数矩阵symnum(Y.cor.mat)符号化相关程度
plot(x=nosZ, y=nirS, xlab="nosZ", ylab="nirS", pch=21)画出散点图
abline(lm(nirS~nosZ))加一条拟合曲线
7 Classification
7.1 Cluster analysis
dis <- vegdist(A)或者dis <- dist(X)
clus <- hclust(dis, "single")
plot(clus)
cluc <- hclust(dis, "complete")
plot(cluc)
clua <- hclust(dis, "average")
plot(clua)
range(dis)
cor(dis, cophenetic(clus))
cor(dis, cophenetic(cluc))
cor(dis, cophenetic(clua))，计算各方法的异化性
7.2 Display and interpretation of classes
plot(cluc)
rect.hclust(cluc, 3)
grp <- cutree(cluc, 3) 分为不同级别的cluster
boxplot(Nit ~ grp, data = env, notch = TRUE)
ord <- cca(dune)
plot(ord, display = "sites")
ordihull(ord, grp, lty = 2, col = "red") ordispider and ordiellipse 都可使用7.3 Classfied community tables
wa <- scores(ord, display = "sites", choices = 1)
den <- as.dendrogram(clua)
oden <- reorder(den, wa, mean)
op <- par(mfrow = c(2, 1), mar = c(3, 5, 1, 2) +0.1)
plot(den)
plot(oden)
par(op)
8 Dissimilarities and environment
8.1 adonis: Multivariate ANOVA based on dissimilarities
betad <- betadiver(spe, "z")
adonis(betad ~ Nit, env, perm = 200)
adonis(betad ~ Nit+AMM+AP+pH+TC+TN, env, perm = 200)
8.2 Homogeneity of groups and beta diversity
mod <- with(env, betadisper(betad, Nit+AMM+AP+pH+TC+TN))
plot(mod)
boxplot(mod)
anova(mod)
permutest(mod)
8.3 Mantel test
data(varespec)
data(varechem)
veg.dist <- vegdist(varespec) # Bray-Curtis
env.dist <- vegdist(scale(varechem), "euclid")
mantel(veg.dist, env.dist)
mantel(veg.dist, env.dist, method="spear")
8.4 Protest: Procrustes test
pc <- scores(pc, choices = 1:2)
pro <- protest(vare.mds, pc)
plot(pro)
pro
8.5 anosim
spe.dist <- vegdist(spe) or spe.dist <- vegdist(spe, method="euclidean") group=read.csv(file.choose(),s=1)如下图
attach(group)
spe.ano <- anosim(spe.dist, usage)
summary(spe.ano)
plot(spe.ano)
adonis(spe ~ usage, data=group, permutations=99,method="bray")
spe.mrpp <- mrpp(spe, group$usage, distance="bray")
9 Multivariate homogeneity of groups dispersions (variances)
用于检验重复中的outlier，此处spe用非相对丰度数据
spe.h=decostand(spe,"hellinger")
spe.h.dist=vegdist(spe.h, "euclidean")
var.test <- betadisper(spe.h.dist, group = group$group, type = "median", bias.adjust = TRUE)
boxplot(var.test)
permutest(var.test)
permutest(var.test, pairwise = TRUE)
10 PcoA Principal Coordinates Analysis
library(labdsv)
data(spe)
dis.bc <- dsvdis(spe,'bray/curtis')
veg.pco <- pco(dis.bc,k=4)
plot(veg.pco, dis="site",type="text")
三、Other application
1 seqRFLP
1.1 Rename sequence
library(seqRFLP)
x=read.fasta(file.choose())
gnames.fas(x)
y=rename.fas(x, name = paste("P04B1_", as.character(1:3678), sep = ""))
write.fasta(y,"P04B1.fasta")
2 heatmap
2.1 Simple heatmap
数据如下图
P1=read.csv(file.choose(),s=1)
P1_matrix <- data.matrix(P1)
P1_heatmap <-heatmap(P1_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
cm.colors看使用什么颜色，改变col参数
nba <- nba[order(nba$PTS),]，使其按PTS由小到大排列
2.2 pheatmap function
X=read.csv(file.choose(),s=1)
X=as.matrix(X)
library(pheatmap)
pheatmap(X)
pheatmap(P1.n,scale="row")
pheatmap(spe,scale="row", clustering_distance_rows = "euclidean", clustering_distance_cols="euclidean", clustering_method = "average", display_numbers = "T", filename="4.pdf", fontsize = 8)
3 Phylogenetic tree
I made the tree figure with phyloseq’s plot_tree function:
plot_tree(temp, nodelabf=nodeplotblank, color="Group", size="abundance", label.tips="taxa_names", plot.margin=0.6)
where temp was a phyloseq object containing only the 16 OTUs. These OTUs were were all unclassified Bacteria with more than 500 sequences each.
四、Network analysis (igraph)
pheatmap(spe,scale="row", clustering_distance_rows = "FALSE", clustering_distance_cols="FALSE", clustering_method = "average", display_numbers = "T", filename="4.pdf", fontsize = 8)。