RepeatScout Operating Steps (Instructions)
The purpose of the RepeatScout software is to identify repeat family sequences from genomes where hand-curated repeat databases (a la RepBase Update) are not available. In fact, the output of this program can be used as input to RepeatMasker as a way of automatically masking newly-sequenced genomes.
Included in this package is a more or less arbitrary 3 Mb of the human X chromosome for your testing and debugging purposes.
Installation: download the package with wget, unpack it with tar -xzvf, then change into the extracted directory and run make.
Running RepeatScout proceeds in four phases.
First, build_lmer_table creates a file that tabulates the frequency of all l-mers in the sequence to be analyzed.
Step 1: Use the build_lmer_table command to generate a frequency table for the whole genome, which picks out every l-mer (k-mer) that occurs more than once.
E.g.: build_lmer_table -sequence genome.fa -freq genome.fq
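As a rough illustration of what an l-mer frequency table contains, the following Python sketch counts every length-l window in a sequence and keeps those seen more than once. It is not the build_lmer_table implementation: it looks only at the forward strand, skips windows with ambiguous bases, and the names count_lmers and repeated_lmers are hypothetical. The default l=12 is taken from the tutorial text further down and is only illustrative.

```python
from collections import Counter


def count_lmers(sequence, l=12):
    """Count every length-l substring (l-mer) on the forward strand only."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(lmer) <= set("ACGT"):
            counts[lmer] += 1
    return counts


def repeated_lmers(counts, min_count=2):
    """Keep only the l-mers that occur at least min_count times."""
    return {lmer: n for lmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    toy = "ACGTACGTACGTTTGACGTACGTACGT"  # made-up sequence for illustration
    for lmer, n in sorted(repeated_lmers(count_lmers(toy, l=12)).items()):
        print(lmer, n)
```

On a real genome this table would be far too large to build this way; the sketch is only meant to make the idea of "frequency of all l-mers" concrete.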
Second, RepeatScout takes this table and the sequence and produces a fasta file that contains all the repetitive elements that it could find.
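Step 2: Use the RepeatScout command, which takes the frequency table and the genome sequence and writes a file containing all the repetitive elements it could find.
E.g.: RepeatScout -sequence genome.fa -output repeats.fa -freq genome.fq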
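The first two steps can be chained from a small Python wrapper. The sketch below only assembles the two command lines shown above, using the same flags and the example file names; the function names repeatscout_commands and run_pipeline are hypothetical, and the wrapper assumes both programs are on the PATH. By default it prints the commands without running them.

```python
import shlex
import subprocess


def repeatscout_commands(genome="genome.fa", freq="genome.fq", output="repeats.fa"):
    """Build the argv lists for steps 1 and 2, using only the flags shown above."""
    return [
        ["build_lmer_table", "-sequence", genome, "-freq", freq],
        ["RepeatScout", "-sequence", genome, "-output", output, "-freq", freq],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline if a step exits with an error.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline(repeatscout_commands())
```

Keeping the frequency file name in one place avoids the easy mistake of passing a different -freq file to the second step than the one written by the first.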
Third, the "filter-stage-1.prl" script is run on the output of RepeatScout to remove low-complexity and tandem elements;
RepeatMasker is then run on the sequence of interest using this filtered RepeatScout library.
The program "filter-stage-2.prl" then filters out any repeat element that does not appear a certain number of times (by default, 10).
Step 3: Use the filter-stage-1.prl script to filter out low-complexity and tandem repeat elements.
E.g.: filter-stage-1.prl repeats.fa > repeats.fa.filter_1
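The copy-number filter described for filter-stage-2.prl can be pictured with a short Python sketch. This is not the script itself: it assumes the library is already loaded as a dict of family name to sequence, that the RepeatMasker hits have been reduced to a list of family names (one entry per hit), and that a family seen exactly 10 times is kept. The function name filter_by_copy_number is hypothetical.

```python
from collections import Counter


def filter_by_copy_number(library, hit_names, thresh=10):
    """Keep families that were hit at least `thresh` times.

    library   -- dict mapping family name to consensus sequence
    hit_names -- iterable of family names, one entry per masked hit
    thresh    -- minimum number of hits (10 mirrors the default quoted above)
    """
    counts = Counter(hit_names)
    # Counter returns 0 for families that were never hit, so they are dropped.
    return {name: seq for name, seq in library.items() if counts[name] >= thresh}


if __name__ == "__main__":
    library = {"R=0": "ACGTACGT", "R=1": "GGCCGGCC"}  # made-up entries
    hits = ["R=0"] * 12 + ["R=1"] * 3
    print(sorted(filter_by_copy_number(library, hits)))
```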
Finally, the locations of the repeats found by RepeatMasker are used, in conjunction with GFF files that describe segmental duplications, exons, or other such "uninteresting" regions, to remove sequences from the library that are unlikely to be mobile elements; the program "compare-out-to-gff.prl" does exactly this.
Step 4:
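The idea behind this last phase, comparing repeat locations against annotated regions, can be sketched in Python as an interval-overlap test. This is not how compare-out-to-gff.prl is implemented: the hit tuples, the function names, and the 0.5 cutoff are all assumptions made for illustration. Only the GFF column layout (sequence name in column 1, 1-based inclusive start and end in columns 4 and 5) is taken from the format itself.

```python
def parse_gff_intervals(lines):
    """Return {seqid: [(start, end), ...]} from tab-separated GFF lines."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        intervals.setdefault(seqid, []).append((start, end))
    return intervals


def overlaps(hit, intervals):
    """True if hit (seqid, start, end) overlaps any interval on the same sequence."""
    seqid, start, end = hit
    return any(start <= e and s <= end for s, e in intervals.get(seqid, []))


def families_to_drop(hits_by_family, intervals, max_fraction=0.5):
    """Flag families whose hits mostly fall in annotated regions.

    max_fraction is a hypothetical cutoff, not a value from the tool.
    """
    drop = set()
    for family, hits in hits_by_family.items():
        if not hits:
            continue
        inside = sum(overlaps(h, intervals) for h in hits)
        if inside / len(hits) > max_fraction:
            drop.add(family)
    return drop
```

A linear scan per hit is enough for a sketch; for whole-genome data a sorted interval list or an interval tree would be the usual choice.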
RepeatScout
RepeatScout is run in several steps using several different programs. This tutorial assumes you are using a fasta file named genome.fa. You could, however, use any fasta file simply by replacing genome.fa with the name of the fasta file being used. Also, running any of the listed commands without any additional input (or with the -h option) will display the help message, including a description of the program, its usage, and all options.
1) Count every 12 base pair sequence in the genome, and when it is done, look at the result file:
$ build_lmer_table -sequence genome.fa -freq genome.freq
$ less genome.freq
2) Extend the 12 base pair sequences to form initial consensus repeats. This will take a minute.