LncRNA assembly and identification (downstream workflow)

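Our 生信技能树 (Bioinformatics Skill Tree) channel on Bilibili has a hands-on video course on lncRNA data analysis, but it lacked accompanying notes.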

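Before the Lunar New Year I hired an intern to follow my course systematically and carry out lncRNA assembly and identification on transcriptome sequencing data from a kidney cancer cohort of Chinese patients, producing a thorough booklet of notes along the way.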

Earlier we ran the hisat2 and stringtie workflow and obtained the assembled gtf file.

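For details, see: LncRNA identification, upstream analysis
Next, the lncRNA candidates in the assembled gtf file need to go through a series of evaluation and filtering steps.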

Gffcompare: assessing the transcript assembly
The code I used:
gtf=$HOME/reference/human/gtf/gencode.v37.annotation.gtf
nohup gffcompare -R -r $gtf -o ./merged ../05.stringtie/02.merge_gtf/stringtie_merged.gtf > gffcompare.log 2>&1 &
# Check the sensitivity and precision of the assembly against the reference annotation.

cat merged.stats
#= Summary for dataset: ../05.stringtie/02.merge_gtf/stringtie_merged.gtf
#     Query mRNAs :  433529 in   70170 loci  (403256 multi-exon transcripts)
#            (26951 multi-transcript loci, ~6.2 transcripts per locus)
# Reference mRNAs :  232728 in   56612 loci  (207676 multi-exon)
# Super-loci w/ reference transcripts:    50985
#-----------------| Sensitivity | Precision  |
        Base level:    99.6     |    25.3    |
        Exon level:    85.7     |    58.0    |
      Intron level:    99.4     |    65.9    |
Intron chain level:    98.7     |    50.8    |
  Transcript level:    97.4     |    52.3    |
       Locus level:    96.0     |    71.4    |

     Matching intron chains:  204982
       Matching transcripts:  226593
              Matching loci:   54373

          Missed exons:     453/638637  (  0.1%)
           Novel exons:  185652/937901  ( 19.8%)
        Missed introns:     667/388679  (  0.2%)
         Novel introns:  100294/586077  ( 17.1%)
           Missed loci:       0/56612   (  0.0%)
            Novel loci:   19167/70170   ( 27.3%)

 Total union super-loci across all input datasets: 70152
433529 out of 433529 consensus transcripts written in ./merged.annotated.gtf (0 discarded as redundant)
# Count the class_code types
awk '$3!~/class/ {print $3}' merged.stringtie_merged.gtf.tmap | sort -V | uniq -c
1075 c
1 e
23378 i
95142 j
21682 k
8203 m
22690 n
7640 o
201 p
27 s
13609 u
10689 x
1087 y
228105 =
This produces the following files:
total 607M
145 Feb 18 20:24 gffcompare.log
488M Feb 18 20:24 merged.annotated.gtf
15M Feb 18 20:24 merged.loci
1.4K Feb 18 20:24 merged.stats
12M Feb 18 20:24 merged.stringtie_merged.gtf.refmap
45M Feb 18 20:24 merged.stringtie_merged.gtf.tmap
49M Feb 18 20:24 merged.tracking
The following steps mainly work on the merged.stringtie_merged.gtf.tmap file.

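step1: keep transcripts with the specified class_code
Filter to keep only transcripts with class_code "u", "x", "i", "j" or "o". For the meaning of each code, refer to the class code classification on the stringtie website: "u" is intergenic (unknown), "x" is an exonic overlap on the opposite strand, "i" lies entirely within a reference intron, "j" is a multi-exon transcript sharing at least one junction with a reference transcript, and "o" is another same-strand overlap with reference exons.
The script I used:
# Filter: keep only transcripts with class_code "u", "x", "i", "j" or "o"
awk '{if ($3=="u" || $3=="x" || $3=="i" || $3=="j" || $3=="o") {print $0}}' ~/lncRNA_project/06.gffcompare/merged.stringtie_merged.gtf.tmap > filter1_by_uxijo.tmap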
$ wc -l filter1_by_uxijo.tmap
150458 filter1_by_uxijo.tmap
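# Get the exon coordinates of the remaining transcripts, extract the sequences and assemble them into transcript sequences
# Get the IDs of the remaining transcripts
awk '{print $5}' filter1_by_uxijo.tmap > filter1_transcript_ID
$ wc -l filter1_transcript_ID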
150458 filter1_transcript_ID
# Build a gtf of the remaining transcripts
grep -w -Ff filter1_transcript_ID ~/lncRNA_project/06.gffcompare/merged.annotated.gtf > filter1_transcript.gtf
# Replace class_code "=" with L in filter1_transcript.gtf
# Remove the class_code "=" records that are still left
awk -F ';' '{if ($8!=" L"){print $0}}' filter1_transcript.gtf > filter1
mv filter1 filter1_transcript.gtf
$ wc -l filter1_transcript.gtf
1596482 filter1_transcript.gtf
step2: filter by length
Filter to keep only transcripts with more than one exon and a length of at least 200 bp
# Filter: keep only transcripts with exon>1 and length>=200bp
awk '($6>1 && $10>=200){print$0}' ../step1/filter1_by_uxijo.tmap > filter2_by_exon_length.tmap
$ wc -l filter2_by_exon_length.tmap
145635 filter2_by_exon_length.tmap
awk '{print $5}' filter2_by_exon_length.tmap > filter2_transcript_ID
grep -Ff filter2_transcript_ID -w ../step1/filter1_transcript.gtf > filter2_transcript.gtf
# Build a gtf from the exons of the remaining transcripts
awk '($3=="exon"){print$0}' filter2_transcript.gtf > filter2_transcript_exon.gtf
# Extract genomic sequence by exon coordinates and assemble the transcript sequences
gffread -w filter2_transcript_exon.fa -g /home/data/server/reference/genome/hg38/hg38.fa ./filter2_transcript_exon.gtf
$ grep -c "^>" filter2_transcript_exon.fa
145635
Whether to require exon>1 in this step is debatable; many pipelines do not apply this filter at all.

step3: predicting transcript coding potential
Coding potential is predicted mainly with 4 tools, and the intersection of their results is taken:
nohup cpat.py -x ../dat/Human_Hexamer.tsv \
-d ../dat/Human_logitModel.RData \
-g ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \
-o ~/lncRNA_project/07.identification/step3/CPAT/cpat_result.txt > cpat.log 2>&1 &
nohup Rscript ./LncFinder.R > LncFinder.log 2>&1 &
# Python 2.7 environment; create it with either conda or mamba
conda create -n py2test python=2.7
mamba create -n py2test python=2.7
conda install biopython=1.70
nohup python CPC2.py -i ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa -o ~/lncRNA_project/07.identification/step3/CPC2/CPC2_result.txt > cpc2.log 2>&1 &
nohup python CNCI.py \
-f ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \
-o ~/lncRNA_project/07.identification/step3/CNCI \
-m ve \
-p 4 > cnci.log 2>&1 &
nohup python PLEK.py \
-fasta ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \
-out ~/lncRNA_project/07.identification/step3/plek/plek \
-thread 4 > plek.log 2>&1 &
less CPC2_result.txt|grep 'noncoding'|awk '{print $1}'> CPC2_id.txt
$ wc -l CPC2_id.txt
55359 CPC2_id.txt
less -S cpat_result.txt|awk '($6<0.364){print $1}' > cpat_id.txt
sed -i '1d' file
$ wc -l cpat_id.txt
51956 cpat_id.txt
less lncFinder_result.txt|grep -w 'NonCoding' > lncFinder_id.txt
$ wc -l lncFinder_id.txt
52442 lncFinder_id.txt
less -S plek | grep -w 'Non-coding' |awk '{print $3}' |sed 's/>//g' > plek_id.txt
$ wc -l plek_id.txt
55996 plek_id.txt
# Take the intersection of the 4 tools
cat *txt |sort |uniq -c |awk '{if( $1==4){print}}'|wc
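The `uniq -c` pipeline above only reports how many IDs are shared by all four tools. As a minimal sketch on toy data, the same idea can also write the shared IDs to a file for the next step. It assumes each ID file holds one transcript ID per line; the `demo_*` file names and the IDs are hypothetical.

```shell
# Toy ID lists standing in for the noncoding calls of the four tools (hypothetical data).
printf 'MSTRG.1.1\nMSTRG.2.1\nMSTRG.3.1\n' > demo_cpc2_id.txt
printf 'MSTRG.1.1\nMSTRG.3.1\nMSTRG.4.1\n' > demo_cpat_id.txt
printf 'MSTRG.1.1\nMSTRG.3.1\nMSTRG.3.1\n' > demo_lncfinder_id.txt
printf 'MSTRG.3.1\nMSTRG.1.1\nMSTRG.5.1\n' > demo_plek_id.txt

# De-duplicate each list first, so that an ID repeated inside one file
# cannot reach a count of 4 on its own, then keep IDs seen in all four lists.
for f in demo_cpc2_id.txt demo_cpat_id.txt demo_lncfinder_id.txt demo_plek_id.txt; do
    sort -u "$f"
done | sort | uniq -c | awk '$1 == 4 {print $2}' > demo_filter3_transcript_ID

cat demo_filter3_transcript_ID
```

Listing the input files explicitly, rather than using `cat *txt`, also keeps unrelated text files in the same directory out of the count.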
step4: search against the Pfam database
Search against the Pfam database and filter (E-value < 1e-5)
# Use transeq to translate each transcript sequence into its 6 possible protein sequences
transeq ../step3/intersection/filter3_by_noncoding_exon.fa filter4_protein.fa -frame=6
conda install pfam_scan
nohup pfam_scan.pl -fasta ./filter4_protein.fa -dir /home/data/lihe/database/Pfam/ -out Pfam_scan.out > Pfam_scan.log 2>&1 &
# Filter (E-value < 1e-5)
grep -v '^#' Pfam_scan.out | grep -v '^\s*$' | awk '($13< 1e-5){print $1}'| awk -F "_" '{print$1}' | sort | uniq > coding.ID
wc coding.ID
3751
## Just filter the transcript ID list directly
grep -v -f coding.ID ../step3/intersection/filter3_transcript_ID > filter4_transcript_ID
$ wc -l filter4_transcript_ID
30259 filter4_transcript_ID
grep -Ff filter4_transcript_ID -w ../step3/intersection/filter3_by_noncoding.gtf > filter4_by_pfam.gtf
awk '($3=="exon"){print$0}' filter4_by_pfam.gtf > filter4_by_pfam_exon.gtf
gffread -w filter4_by_pfam_exon.fa -g /home/data/server/reference/genome/hg38/hg38.fa filter4_by_pfam_exon.gtf
$ grep -c '^>' filter4_by_pfam_exon.fa
30259
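One detail worth checking in this step: the `grep -v -f coding.ID` call above is run without `-F`, `-w` or `-x`, so each ID is treated as a regular expression and can match inside a longer ID. The toy sketch below (hypothetical IDs and `demo_*` file names) shows the difference; whether it matters for the real data depends on the IDs involved, and switching flags may change the counts reported above.

```shell
# Hypothetical ID lists: one coding ID to remove from three candidates.
printf 'MSTRG.1.1\nMSTRG.1.10\nMSTRG.2.1\n' > demo_all_ID
printf 'MSTRG.1.1\n' > demo_coding_ID

# Regex/substring matching: MSTRG.1.10 is removed together with MSTRG.1.1.
grep -v -f demo_coding_ID demo_all_ID > demo_loose_ID

# Fixed-string, whole-line matching: only the exact ID is removed.
grep -v -x -F -f demo_coding_ID demo_all_ID > demo_exact_ID

wc -l demo_loose_ID demo_exact_ID
```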
step5: search against the NR database
diamond blastx against the NR database, filter (E-value < 1e-5)
nohup diamond blastx -e 1e-5 -d ~/database/blastDB/nr/diamond/nr -q ../step4/filter4_by_pfam_exon.fa -f 6 -o ./dna_matches.txt &
$ less -S dna_matches.txt |cut -f 1 | sort -V | uniq -c |wc -l
25012
# Filter by E-value; diamond was already run with -e 1e-5, so the 1e-2 cutoff below removes nothing further
awk '($11<1e-2){print$1}' dna_matches.txt | sort | uniq > lncRNA.ID
grep -v -f ../step5/lncRNA.ID -w ../step4/filter4_transcript_ID > filter5_by_nr_ID
grep -Ff filter5_by_nr_ID -w ../step4/filter4_by_pfam.gtf > filter5_by_nr.gtf
step6: filter out lowly expressed lncRNA
Filter out lowly expressed lncRNA by count or FPKM
# Count with featureCounts
nohup featureCounts -T 8 -a \
~/lncRNA_project/07.identification/step7/filter5_by_nr.gtf \
-o ./raw_count.txt -p -B -C -f -t transcript -g transcript_id \
~/lncRNA_project/04.mapping/*.bam > transcript_featureCount.log 2>&1 &
# Compute FPKM in R and keep transcripts with FPKM > 0 in at least one sample, giving the lncRNA ID list
rm(list=ls())
# make count table
raw_df <- read.table(file = "~/lncRNA_project/test/08.featurecounts/raw_count.txt",header = T,skip = 1,sep = "\t")
count_df <- raw_df[ ,c(7:ncol(raw_df))]
metadata <- raw_df[ ,1:6] # the annotation columns that precede the count data
rownames(count_df) <- as.character(raw_df[,1])
colnames(count_df) <- paste0("SRR10744",251:439)
# calculate FPKM
countToFpkm <- function(counts, effLen)
{
counts <- as.matrix(counts)
N <- colSums(counts)
# divide each row by its transcript length and each column by its library size
fpkm <- sweep(counts, 1, effLen, "/")
fpkm <- sweep(fpkm, 2, N, "/")
fpkm * 1e9
}
options(scipen = 200) # avoid scientific notation for numbers up to 200 digits wide
fpkm = countToFpkm(count_df, metadata$Length)
# View(fpkm)
# FPKM > 0 in at least one sample
count_df.filter <- count_df[rowSums(fpkm)>0,]
write.table(rownames(count_df.filter), file="~/lncRNA_project/test/09.all_lncRNA/filter6_by_fpkm_id", sep="\t", quote=F, row.names=F, col.names=F)
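For reference, the quantity that `countToFpkm` is meant to compute can be written out as below. This is a sketch of the usual FPKM definition, with the symbols chosen here for illustration: $c_{ij}$ is the count of transcript $i$ in sample $j$, $L_i$ is the transcript length (`metadata$Length`), and $N_j$ is the column sum of the count table for sample $j$.

```latex
N_j = \sum_i c_{ij},
\qquad
\mathrm{FPKM}_{ij} = \frac{10^{9} \, c_{ij}}{L_i \, N_j}
% Worked example: c = 50, L = 2000 bp, N = 5 \times 10^{6}
% FPKM = 10^{9} \cdot 50 / (2000 \cdot 5 \times 10^{6}) = 5
```

Because $N_j$ is taken from this count table, it covers only the fragments assigned to the transcripts in filter5_by_nr.gtf, not all mapped fragments in the sample. The "FPKM > 0 in at least one sample" filter is unaffected by that choice, since a transcript has FPKM > 0 exactly when its count is above zero.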
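# In Linux, extract the gtf of the final lncRNA set
grep -Ff filter6_by_fpkm_id -w ~/lncRNA_project/07.identification/step7/filter5_by_nr.gtf > lncRNA.gtf
# Extract genomic sequence by exon coordinates and assemble the transcript sequences
gffread -w lncRNA.fa -g /home/data/server/reference/genome/hg38/hg38.fa lncRNA.gtf
$ grep -c '^>' lncRNA.fa
Quantifying the assembled lncRNA, mRNA and other_RNA with featureCounts
With the assembled gtf file in hand, the bam files produced by aligning the original sequencing data can now be quantified.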

############## lncRNA featureCounts ######################
nohup featureCounts -t transcript -g transcript_id \
-Q 10 --primary -s 0 -p -f -T 8 \
-a ../lncRNA.gtf \
-o ./raw_count.txt \
~/lncRNA_project/04.mapping/*.bam \
> featureCounts.log 2>&1 &
#########----------- protein_coding -------------------- #############
less -S gencode.v37.annotation.gtf | grep -w 'gene_type "protein_coding"' > protein_coding_gene.gtf
less -S gencode.v37.annotation.gtf | grep -w 'gene_type "lncRNA"' > known_lncRNA_gene.gtf
# Quantify with featureCounts
nohup featureCounts -t gene -g gene_id \
-Q 10 --primary -s 0 -p -f -T 8 \
-a ../protein_coding_gene.gtf \
-o ./raw_count.txt \
~/lncRNA_project/04.mapping/*.bam \
> featureCounts.log 2>&1 &
#########----------- other_RNA -------------------- #############
less -S ~/lncRNA_project/10.mRNA/gencode.v37.annotation.gtf | grep -w -v 'gene_type "protein_coding"' > other_coding_gene.gtf
# Quantify with featureCounts
nohup featureCounts -t gene -g gene_id \
-Q 10 --primary -s 0 -p -f -T 8 \
-a ../other_coding_gene.gtf \
-o ./raw_count.txt \
~/lncRNA_project/04.mapping/*.bam \
> featureCounts.log 2>&1 &
NONCODE v6_human database
blastn against NONCODE v6_human to split the assembled lncRNA into known_lncRNA and novel_lncRNA
#######---- blastn --> NONCODEv6_human ------------------#############
cat NONCODEv6_human.fa |seqkit rmdup -s -o clean.fa -d duplicated.fa -D duplicated.detail.txt
nohup makeblastdb -in clean.fa -dbtype nucl -parse_seqids -out NONCODEv6_human &
# Hits with e-value<=1e-10, min-identity=80% and min-coverage=50% in the blastn result are taken as matched sequences
blastn -db NONCODEv6_human -evalue 1e-10 -num_threads 10 -max_target_seqs 5 -query ~/lncRNA_project/09.lncRNA/test.fa -outfmt '6 qseqid sseqid pident qcovs length mismatch gapopen qstart qend sstart send qseq sseq evalue bitscore' -out test.txt
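The thresholds in the comment above (e-value <= 1e-10, identity >= 80%, coverage >= 50%) still have to be applied to the tabular output. A minimal sketch on toy rows follows. It assumes that "coverage" means the `qcovs` column, which is column 4 in the `-outfmt` layout used here, with `pident` in column 3 and `evalue` in column 14. All rows and `demo_*` file names are hypothetical.

```shell
# Toy hits in the 15-column layout requested by -outfmt (hypothetical values;
# qseq/sseq are shortened placeholders).
printf 'MSTRG.1.1\tNONHSAT000001.2\t98.5\t90\t500\t7\t0\t1\t500\t1\t500\tACGT\tACGT\t1e-120\t800\n'  > demo_blastn.txt
printf 'MSTRG.2.1\tNONHSAT000002.2\t75.0\t95\t400\t100\t0\t1\t400\t1\t400\tACGT\tACGT\t1e-30\t200\n' >> demo_blastn.txt
printf 'MSTRG.3.1\tNONHSAT000003.2\t92.0\t30\t150\t12\t0\t1\t150\t1\t150\tACGT\tACGT\t1e-40\t250\n'  >> demo_blastn.txt
printf 'MSTRG.1.1\tNONHSAT000004.2\t88.0\t60\t300\t36\t0\t1\t300\t1\t300\tACGT\tACGT\t1e-50\t300\n'  >> demo_blastn.txt

# Queries with at least one hit passing all three thresholds count as known lncRNA.
awk -F '\t' '($3 + 0) >= 80 && ($4 + 0) >= 50 && ($14 + 0) <= 1e-10 {print $1}' demo_blastn.txt \
    | sort -u > demo_known_lncRNA_ID

# Everything else in the full ID list counts as novel lncRNA.
printf 'MSTRG.1.1\nMSTRG.2.1\nMSTRG.3.1\n' > demo_all_lncRNA_ID
grep -v -x -F -f demo_known_lncRNA_ID demo_all_lncRNA_ID > demo_novel_lncRNA_ID

cat demo_known_lncRNA_ID demo_novel_lncRNA_ID
```

Here MSTRG.2.1 fails on identity and MSTRG.3.1 fails on coverage, so both end up in the novel list.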
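Inferring lncRNA function
The functions of most lncRNA are unknown, but they mainly act as cis-regulators, so their function can be approximated from that of neighbouring protein-coding genes. Expression correlation can be used for the same kind of inference.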

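Inference from genomic position
Use tools such as bedtools. Extracting the cis target mRNAs of the differentially expressed lncRNA (DEL) yields a list of lncRNA locus / protein-coding locus pairs.
bedtools window -a lncRNA_metadata.gtf -b ~/lncRNA_project/10.protein_coding_RNA/gencode.v37.annotation.gtf -l 100000 -r 10000 > test.txt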
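As a rough illustration of what a window search with `-l 100000 -r 10000` reports, the awk sketch below extends each lncRNA interval by 100000 bp on the left and 10000 bp on the right and lists the genes that overlap that window. It is an approximation of the idea rather than bedtools' exact implementation. It assumes simple BED-like input (chrom, start, end, name) and ignores strand; the coordinates and `demo_*` file names are hypothetical.

```shell
# Toy intervals: one lncRNA and four candidate genes (hypothetical coordinates).
printf 'chr1\t500000\t502000\tlnc_A\n' > demo_lnc.bed
printf 'chr1\t420000\t430000\tgene_up\n'         > demo_gene.bed
printf 'chr1\t505000\t509000\tgene_near\n'      >> demo_gene.bed
printf 'chr1\t520000\t530000\tgene_far\n'       >> demo_gene.bed
printf 'chr2\t500000\t501000\tgene_other_chr\n' >> demo_gene.bed

# First file: store each lncRNA window. Second file: print lncRNA/gene pairs
# on the same chromosome whose intervals overlap the window.
awk -F '\t' -v OFS='\t' -v L=100000 -v R=10000 '
    NR == FNR {
        chrom[FNR] = $1
        ws[FNR] = ($2 > L ? $2 - L : 0)   # window start, clamped at 0
        we[FNR] = $3 + R                  # window end
        name[FNR] = $4
        n = FNR
        next
    }
    {
        for (i = 1; i <= n; i++)
            if ($1 == chrom[i] && $2 < we[i] && $3 > ws[i]) print name[i], $4
    }
' demo_lnc.bed demo_gene.bed > demo_pairs.txt

cat demo_pairs.txt
```

With these toy values the window for lnc_A spans 400000 to 512000 on chr1, so gene_up and gene_near are paired with it while gene_far and gene_other_chr are not.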
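Expression correlation
Your own transcriptome data already contain expression matrices for the protein-coding genes, known_lncRNA and novel_lncRNA, so the correlations between them can be computed directly.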
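A minimal sketch of such a correlation, computed with awk on a toy matrix: the first row is one lncRNA, the remaining rows are protein-coding genes, and each gene gets its Pearson correlation with the lncRNA. The values and `demo_*` file names are hypothetical. A real analysis would more likely use R's `cor()` on the full matrices, and rows with zero variance would need to be excluded first.

```shell
# Toy expression matrix: feature ID followed by five samples (hypothetical values).
printf 'lnc_A\t1\t2\t3\t4\t5\n'      > demo_expr.tsv
printf 'gene_pos\t2\t4\t6\t8\t10\n' >> demo_expr.tsv
printf 'gene_neg\t5\t4\t3\t2\t1\n'  >> demo_expr.tsv
printf 'gene_mix\t1\t3\t2\t5\t4\n'  >> demo_expr.tsv

# Pearson correlation of every later row against the first row.
awk -F '\t' '
    NR == 1 { n = NF - 1; for (j = 2; j <= NF; j++) x[j] = $j; next }
    {
        sx = sy = sxx = syy = sxy = 0
        for (j = 2; j <= NF; j++) {
            sx += x[j]; sy += $j
            sxx += x[j] * x[j]; syy += $j * $j; sxy += x[j] * $j
        }
        r = (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        printf "%s\t%.2f\n", $1, r
    }
' demo_expr.tsv > demo_cor.txt

cat demo_cor.txt
```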

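Databases are another option:
For example, the Cancer Medicine (2020) article "Genome-wide DNA methylation analysis by MethylRad and the transcriptome profiles reveal the potential cancer-related lncRNAs in colon cancer" ran functional enrichment analysis of colorectal cancer-related lncRNA by using LncRNA2Target v2.0 and StarBase to find the protein-coding genes co-expressed with 15 lncRNA. For the lncRNA HULC and ZNF667-AS1, 28 and 9 co-expressed protein-coding genes were identified, respectively.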
LncSEA database
LncSEA (/LncSEA/index.php) focuses on collecting published information on human lncRNA, and can annotate and run enrichment analysis on lncRNA sets submitted by users. It provides more than 40,000 reference lncRNA sets covering 18 categories (miRNA, drug, disease, methylation pattern, cancer specific phenotype, lncRNA binding protein, cancer hallmark, subcellular localization, survival, lncRNA-eQTL, cell marker, enhancer, super-enhancer, transcription factor, accessible chromatin and smORF, exosome and conservation) and 66 sub-categories, containing more than 50,000 lncRNA.

LncSEA has 5 main functional modules: Analysis, Search, Browse, ID conversion and Download.