Bioperl经典教程Bioperl_II.talk

合集下载

Perl 快速 入门 教程

Perl 快速 入门 教程
输出结果如下:
d:\>perl riusksk.pl
Hello,riusksk!
在这里点号“.”起到一个连接字符串的作用,在后面关于溢出exploit编写中,你就可以体会到它的方便之处了,而且增强了代码的易读性。接下来,我们讲讲关于操作符方面的知识。算术操作符主要有:
+(加)、-(减)、*(乘)、/(除)、**(乘幂)、%(取余)、-(单目负)。
perl example.pl
EditPlus是Windows下的一个简便的Internet 32位文本编辑器、HTML编辑器和程序设计员的编辑器,支持HTML、CSS、PHP、ASP、Perl、C/C++、Java、JavaScript和VBScript等多种语言的代码高亮显示,使用者也可根据自己需要到其官方网站下载语言包去支持其它语言(如汇编语言)的代码高亮,读者可也根据自己喜好选择其它编辑器,如Notepad++。现在我们可新建一个example.pl:
while ($x < 5) {
$x++;
}
print $x;
输出结果为:
d:\>perl riusksk.pl
5
(2)until循环语句
格式:
until (express){
statemet
}
例如:
$y = A;
until ($y ge C) {
$y++;
}
print $y;
输出结果:
d:\>perl riusksk.pl
整数比较操作符主要有:
< (小于)、> (大于)、==(等于)、<= (小于等于)、>= (大于等于)、!= (不等于)、<=> (比较,返回1,0,或者-1)。

4-Perl基础【兰州大学生物信息学】

4-Perl基础【兰州大学生物信息学】
Perl 的一些基础概念
提纲
• 1. Perl 的数据类型 • 2. Perl 的控制结构 • 3.Perl 中的正则表达式 • 4. 其他高频函数 • 5. BioPerl的简单使用 • 6. 关于Perl 应该努力的方向 • 7. 几个值得保持的好习惯 • 调试常见问题
1. Perl 的数据类型
网络资源
• /yumtaoist题
• 变量没有赋值 • 变量没有定义 (拼写错误) • 文件格式不标准
• 意想不到的奇怪空白 (s/\s+//g); • 文件不完整 (重新下载)
谢谢
1.3 哈希
• 定义赋值: my %rpkm; $rpkm{$geneId}{$sample}=$rpkm; • 调用: $rpkm{$geneId}{$sample}; • 处理方法: keys, values, exists, delete • 遍历多层哈希 %{ }
Foreach my $geneId (keys %rpkm){ foreach my $sample(keys %{$rpkm{$geneId}}){ print “$rpkm{$geneId}{$sample}\n”; }
• 标量变量 $ • 列表 @ • 哈希 %
1.1 标量变量
• 定义赋值: my $value = 1 • 调用: 正确书写变量名字 $value • 处理方法: 各种运算符
1.2 列表
• 定义赋值: My @array =(‘a’,1,”world”); • 调用: @array,$array[0] • 处理方法: push, pop, reverse, sort
一般紧随文件读取操作使用43根据坐标提取序列substr起始坐标为044脚本内调用linux命令关于perl应该努力的方向增加程序可读性一致清晰的变量命名添加必要注释61增加程序稳定性和准确性62子程序的使用大型的工作流程给未来写一个readme网络资源google调试常见问题文件不完整重新下载谢谢常用的有用数据库plantensemblephytoz基础概念同源基因

Perl 入门介绍

Perl 入门介绍

Perl 入门介绍(1)perl是一门免费及开源的脚本语言,由Larry Wall所创造,这门语言以实用,快速开发为主要目标,与当前流行的面向对象,结构化编程有些格格不入。

但是这并不妨碍perl被广泛流传和使用,世界范围内围绕Perl建立起了非常活跃的社区,很多人在其中不断帮助完善文档,创建示例代码,提供一些第三库等等。

具体可以浏览以下两个网页: ,.perl最出名也最擅长的要数文本处理了,很多其它语言要花几十上百行代码才能完成的任务,perl可能只需要几行代码就能完成。

但这些优势是有代价的,Perl书写代码的风格有时常常被诟病,典型的面向过程式语言。

还有就是,Perl一直以来所坚持的哲学:只要不引起冲突误解,代码可以写成多种形式--There's More Than One Way To Do It。

这导致Perl在语法上具有非常松散随意的特点,同一件事情常常有多种不同的写法,有些看起来还很奇特,体现在语法上,Perl的符号特别多,让Perl在一定程度比较难学易忘。

本文主要简单介绍一下Perl的基本语法,目的是通过本文,你能基本了解Perl的写法,能够顺利的去读别人的代码。

天下的语言在一定程度上是相通的,如果有一门语言的基础,去学另一本语言,入门基本不会太难,但要用到熟,得心应手,第三库的使用等,这些就需要耐心,假以时日了。

本节主要介绍一下各种数据类型.(一)语句及注释:Perl语句以分号(;)结尾,用#作为一行的注释,没有其它语言中那种跨行的注释。

代码块用大括号围起来,这个和c类似。

但这个大括号在有些地方是强制要求,如在if ,for,do, while等语句中。

它不像其它语言一样会用缩进来判断块。

(二) 变量: 标量(scalar) & 列表(list)Perl把简单的数据类型,如字符串,数字等“单数”的东西统称为标量,与之相对的,就是“复数”的东西,如数组。

标量的声明都是$开始,如: $str = "abc"多个标量也可以放一处一起声明:($x, $y, $z) = (11, 22, "no", 4);而数组的声明则是以@开头,如:@arr = ("abc","edf")Perl的变量声明和很多其它脚本语言一样,不需要指明类型,直接声明赋值就可以使用。

Bioperl操作指南

Bioperl操作指南

Bioperl 操作指南camelbbs@Bioperl 为许多经典的生物信息学程序提供了软件模块,这些包括: 从本地或远程数据库获取数据; 转换数据库或文件记录的格式; 操作单个序列; 搜索相似序列;创建和进行序列比对;搜索基因组上的基因及其它结构; 发展机器可读的序列注释;下面的章节将描述bioperl 怎样执行这些任务;III.1从本地和远程数据库中获取数据bioperl 主要集中于序列操作,但是在用bioperl 操作序列之前,需要获取序列数据。

现在你可以直接将序列数据输入到bioperl 的Seq 对象,例如:$seq = Bio::Seq->new(-seq => 'actgtggcgtcaact',-desc => 'Sample Bio::Seq object', -display_id => 'something', -accession_number => 'accnum', -alphabet => 'dna' );然而,在大多数时候,从在线文档及数据库中获取序列更优越。

注意在生物信息学的传统叫法中有时候被称作“数据库”的很可能是一个“索引平台文件”。

Bioperl 支持远程数据获取,也可为访问本地数据库创建索引。

有两个普通的方法完成这个。

如果你知道序列储存在什么样的数据库中(例如文本文件、本地关系型数据库或一个internet 上可访问的远程数据库),你可以写一个脚本特定地从这些数据库中获得数据。

这种方法将在III.1.1 节和III.1.2节中描述,这两节分别讲如何从远程数据库和本地的索引平台文件中获取数据。

明确地从本地关系型数据库中获取序列数据需要安装和设置bioperl-db 库和BioSQL 计划中的模块,更多介绍可见IV .3节。

另一个方法是使用最近发展起来的OBDA(Open Bioinformatics Data Access)注册系统。

Bioperl经典教程Bioperl_II.talk

Bioperl经典教程Bioperl_II.talk
use Bio::DB::Taxonomy; my $db = Bio::DB::Taxonomy->new( -source => ‘flatfile’, -nodesfile=> $nodefile, -namesfile=> $namesfile); $node = $db->get_Taxonomy_Node( -taxonid => ‘9606’); $node = $db->get_Taxonomy_Node( -name => ‘Homo sapiens’);
Graphics
Simple code to render Sequence with ‘tracks’ Basic component of Generic Genome Browser (GBrowse) Highly customizable, extensible
Generate a Graphics Panel
批注本地保存成功开通会员云端永久保存去开通
Bioperl II
Jason Stajich Duke University
Multiple Sequence Alignments
Bio::AlignIO to read alignment files Produces Bio::SimpleAlign objects Interface and objects designed for roundtripping and some functional work Could really use an overhaul or a parallel MSA representation
Position Weight Matricies
Bio::Matrix::PSM meme, mast, transfac parsers For now, just a parsing and representation system Also see Boris Lenhard’s TFBS perl modules (depend on Bioperl) - some analysis tools

perl教程7——经典小骆驼配套讲解

perl教程7——经典小骆驼配套讲解

第七章Bioperl模块唐玉荣tangyurong@主要内容7.1 Bioperl模块简介7.2 Bioperl功能与用法1.Sequences类2.Alignments类3.Analysis类4.Databases类7.3 实例7.1 Bioperl模块简介1.什么是Bioperl–Bioperl是一组Perl模块,它的主要目的在于利用Perl解决生物学研究中的一些问题,主要是生物信息学中的各种实际问题,如获取分子数据,分析序列文件,序列比对,大批量BLAST等。

–Bioperl开发项目正式开始于1995年。

它是一个开放源代码项目,使用Perl语言开发,由OBF(Open BioinformaticsFoundation) 所支持。

Bioperl的官方主页是:2.Bioperl相关文档–Bioperl tutorial/Core/Latest/bptutorial.html –Mastering Perl for Bioinformatics: Introduction to bioperl /catalog/mperlbio/chapter/ch09.pdf –Bioperl documentation/Core/Latest/modules.html –Modules listing/bioperl-live/–Jason Stajich的Bioperl教程/Bioperl_Tutorials/–Beginner’s HOWTO by Brian Osborne & James Thompson /HOWTOs/Beginners/index.html–法国Pasteur Institute的Bioperl课程http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/–WIBR Bioinformatics and Research Computing /bio/education/bioinfo-mini/unix-perl/ /~lehvasla/bioperl/7.2 Bioperl功能与用法1.Sequences类2.Alignments类3.Analysis类4.Databases类1.Sequences类–The Bio::SeqIO class–Sequence classes–Features and Location classes –Sequence analysis tools•The Bio::SeqIO class文档:/releases/bioperl-1.0.1/Bio/SeqIO.html 功能:格式转换例如:SwissProt格式转化为Fasta格式#! /usr/bin/perl–wuse strict;use Bio::SeqIO;my $in = Bio::SeqIO->newFh( -file => ’<seqs.sp’, -format => ’swiss’);my $out = Bio::SeqIO->newFh( -file => ’>seqs.fasta’, -format => ’fasta’);print $out $_ while <$in>;phred output phdQuality values (get a sequence of quality scores)qual GAME XML format gameACeDB sequence format aceRaw format (one sequence per line, no ID) rawGCG format GCGProtein Information Resource format PIRSCF tracefile format SCFSwissprot format swissGenBank format GenBankEMBL format EMBLFASTA format Fastadescription formatBio::SeqIO->newFh 中的format 说明:Swissprot格式ID 7LES_DROME STANDARD; PRT; 2554 AA.AC P13368; Q9TYI0; Q9U5V7; Q9VZ36;DT 01-JAN-1990 (Rel. 13, Created)DT 16-OCT-2001 (Rel. 40, Last sequence update)DT 15-JUN-2004 (Rel. 44, Last annotation update)DE Sevenless protein (EC 2.7.1.112).GN SEV OR HD-265 OR CG18085.OS Drosophila melanogaster(Fruit fly).OC Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;OC Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;OC Ephydroidea; Drosophilidae; Drosophila.OX NCBI_TaxID=7227;RN [1]RP SEQUENCE FROM N.A.RC STRAIN=Canton-S;RX MEDLINE=88282538; PubMed=2840202;RA Basler K., Hafen E.;RT "Control of photoreceptor cell fate by the sevenless protein requiresRT a functional tyrosine kinase domain.";RL Cell 54:299-311(1988).RN [2]RP SEQUENCE FROM N.A.RC STRAIN=Oregon-R;RX MEDLINE=88329706; PubMed=3138161;RA Bowtell D.L.L., Simon M.A., Rubin G.M.;RT "Nucleotide sequence and structure of the sevenless gene ofRT Drosophila melanogaster.";RL Genes Dev. 2:620-634(1988).CC -!-FUNCTION: Receptor for an extracellular signal required toCC instruct a cell to differentiate into an R7 photoreceptor. TheCC ligand for sev is the boss (bride of sevenless) protein on theCC surface of the neighboring R8 cell.CC -!-CATALYTIC ACTIVITY: ATP + a protein tyrosine = ADP + protein CC tyrosine phosphate.CC -!-SUBUNIT: May form a complex with drk and Sos.CC -!-DOMAIN: It is unclear whether the potential membrane spanning CC region near the N-terminus is present as a transmembrane domain inCC the native protein or serves as a cleaved signal sequence.CC -!-SIMILARITY: Belongs to the Tyr family of protein kinases. Insulin CC receptor subfamily.CC -!-SIMILARITY: Contains 7 fibronectin type III domains.CCDR EMBL; J03158; AAA28882.1; -.DR EMBL; X13666; CAA31960.1; ALT_INIT.DR EMBL; X13666; CAB55310.1; -.DR EMBL; AE003484; AAF47992.2; -.DR EMBL; AJ002917; CAA05752.1; -.DR PIR; A28912; TVFF7L.DR HSSP; P08069; 1JQH.DR FlyBase; FBgn0003366; sev.DR GO; GO:0005886; C:plasma membrane; IDA.DR GO; GO:0004713; F:protein-tyrosine kinase activity; IDA.DR GO; GO:0045467; P:R7 development; NAS.DR GO; GO:0008293; P:torso signaling pathway; NAS.DR InterPro; IPR003961; FN_III.DR InterPro; IPR008957; FN_III-like.DR InterPro; IPR000033; Ldl_receptor_rep.DR InterPro; IPR000719; Prot_kinase.DR InterPro; IPR002011; RecepttyrkinsII.DR InterPro; IPR001245; Tyr_pkinase.DR InterPro; IPR008266; Tyr_pkinase_AS.DR Pfam; PF00041; fn3; 6.DR Pfam; PF00069; Pkinase; 1.DR PRINTS; PR00109; TYRKINASE.DR ProDom; PD000001; Prot_kinase; 1.DR SMART; SM00060; FN3; 6.DR SMART; SM00135; LY; 2.DR SMART; SM00219; TyrKc; 1.DR PROSITE; PS50853; FN3; 7.DR PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.DR PROSITE; PS50011; PROTEIN_KINASE_DOM; 1.DR PROSITE; PS00109; PROTEIN_KINASE_TYR; 1.DR PROSITE; PS00239; RECEPTOR_TYR_KIN_II; 1.KW Transferase; Tyrosine-protein kinase; Receptor; Vision; Transmembrane; KW Glycoprotein; ATP-binding; Phosphorylation; Repeat.FT DOMAIN 1 2123 Extracellular(Potential).FT TRANSMEM 2124 2147 Potential.FT DOMAIN 2148 2554 Cytoplasmic(Potential).FT DOMAIN 311 431 Fibronectin type-III 1.FT DOMAIN 436 528 Fibronectin type-III 2.FT DOMAIN 822 921 Fibronectin type-III 3.FT DOMAIN 1298 1392 Fibronectin type-III 4.FT DOMAIN 1680 1794 Fibronectin type-III 5.FT DOMAIN 1797 1897 Fibronectin type-III 6.FT DOMAIN 1898 1988 Fibronectin type-III 7.FT DOMAIN 2038 2046 Poly-Arg.FT DOMAIN 2209 2485 Protein kinase.FT CONFLICT 2271 2271C -> R (in Ref. 1).SQ SEQUENCE 2554 AA; 287022 MW; 09E238A0F27684F8 CRC64;MTMFWQQNVD HQSDEQDKQA KGAAPTKRLN ISFNVKIAVN VNTKMTTTHI NQQAPGTSSS SSNSQNASPS KIVVRQQSSS FDLRQQLARL GRQLASGQDG HGGISTILII NLLLLILLSI CCDVCRSHNY TVHQSPEPVS KDQMRLLRPK LDSDVVEKVA IWHKHAAAAP PSIVEGIAIS SRPQSTMAHH PDDRDRDRDP SEEQHGVDER MVLERVTRDC VQRCIVEEDL FLDEFGIQCE KADNGEKCYK TRCTKGCAQW YRALKELESC QEACLSLQFY PYDMPCIGAC EMAQRDYWHL QRLAISHLVE RTQPQLERAP RADGQSTPLT IRWAMHFPEH YLASRPFNIQ YQFVDHHGEE LDLEQEDQDA SGETGSSAWF NLADYDCDEY YVCEILEALI PYTQYRFRFE LPFGENRDEV LYSPATPAYQ TPPEGAPISA PVIEHLMGLD DSHLAVHWHP GRFTNGPIEG YRLRLSSSEG•Sequence类–创建一条序列use Bio::Seq;my $seq1 = Bio::Seq->new (-seq=> ’ATGAGTAGTAGTAAAGGTA’,-id => ’my seq’,-desc=> ’this is a new Seq’);–读取序列文件use Bio::SeqIO;my $seqin= Bio::SeqIO->new( -file => ’seq.fasta’,-format => ’fasta’);my $seq3 = $seqin->next_seq();use Bio::SeqIO;my $seqin2 = Bio::SeqIO->newFh( -file => ’seq.fasta’,-format => ’fasta’); my $seq4 = <$seqin2>;–从远程数据库中获取序列举例:•Create a sequence object for the MALK_ECOLI SwissProtentry•Print its Accession number and descriptionuse Bio::DB::SwissProt;my $database = new Bio::DB::SwissProt;my $seq= $database->get_Seq_by_id(’MALK_ECOLI’);print "Seq: ", $seq->accession_number(), " --", $seq->desc(), "\n\n"; Seq: P68187--Maltose/maltodextrin import ATP-binding protein malK(EC3.6.3.19).–根据SwissProt entry获取PDB结构entry$annotation = $seq->annotation();my @structures = ();foreach my $link ( $annotation->get_Annotations(’dblink’) ) { if ($link->database() eq’PDB’) {push (@structures, $link->primary_id());}}print "\nPDB Structures: ", join (" ", @structures), "\n";•Features and Location classes –Features class–Location class•Features Classes–Bio::SeqFeatureI–Bio::SeqFeature::Generic–Bio::SeqFeature::FeaturePair–Bio::SeqFeature::SimilarityPair–Bio::SeqFeature::Similarity–Bio::SeqFeature::Gene::–Bio::SeqFeature::Gene::ExonI–Bio::SeqFeature::Gene::Exon–Bio::SeqFeature::Gene::Transcript–Bio::SeqFeature::Gene::TranscriptI–Bio::SeqFeature::Gene::GeneStructureI –Bio::SeqFeature::Gene::GeneStructure–Adding a feature to a Genbank entryuse Bio::SeqFeature::Generic;use Bio::SeqIO;$in = Bio::SeqIO->newFh(-file => $ARGV[0]);$out = Bio::SeqIO->newFh();$seq= <$in>;$feat = new Bio::SeqFeature::Generic( -start => 10, -end => 100,-strand => -1,-primary => 'repeat',-source => 'repeatmasker',-score => 1000,-tag => {new => 1,author => 'someone',sillytag=> 'this is silly!' } ); $seq->add_SeqFeature($feat);print $out $seq;增加前的特征项:FEATURES Location/Qualifiers source 1..16705/organism="Pteropus dasymallus"/organelle="mitochondrion"/mol_type="genomic DNA"/db_xref="taxon:126282"gene 1..69/gene="tRNA-Phe“……………gene 15321..15390/gene="tRNA-Thr"tRNA15321..15390/gene="tRNA-Thr"gene complement(15390..15455)/gene="tRNA-Pro"tRNA complement(15390..15455)/gene="tRNA-Pro"D-loop 15456..16705/note="D-loop region"增加后的特征项:FEATURES Location/Qualifiers……….D-loop 15456..16705/note="D-loop region"repeat complement(10..100)/new="1"/author="someone"/sillytag="this is silly!"/note="score=1000"•Location Classes–Bio::LocationI–Bio::Location::Simple–Bio::Location::SplitLocationIet Bio::Location::Split–Bio::Location::FuzzyLocationIet Bio::Location::Fuzzy –Coordinate policies, for fuzzy location start and enddetermination:•interface: Bio::Location::CoordinatePolicyI,•default: Bio::Location::WidestCoordPolicy•Bio::Location::NarrowestCoordPolicy,Bio::Location::AvWithinCoordPolicy–specifies a location whose begin is before 30 and whose end is at 90my $fuzzylocation= new Bio::Location::Fuzzy(-start => ’<30’,-end => 90,-loc_type=> ’.’);符号说明:–’..’: EXACT–’^’: BETWEEN (a site between 2 bases), exemple: 123^124a site between bases 123 and 124. 145^177 a site between 2adjacent bases, somewhere between bases 145 and 177.–’.’: WITHIN (a base within a range), example : (102.110) is a site wihtin the 102 .. 110 range, inclusive.–’<’: BEFORE–’>’: AFTER•Sequence analysis tools–Bio::Tools::SeqStats/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html –Bio::Tools::SeqWordshttp:/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html/–Bio::Tools::SeqPattern/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html –Bio::SeqUtils/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html –Bio::Tools::RestrictionEnzyme/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html–Bio::Tools::Sigcleave/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html–Bio::Tools::SeqStats$seq_stats= Bio::Tools::SeqStats->new(-seq=>$seqobj); $seq_stats->count_monomers();$seq_stats->count_codons();$weight = $seq_stats->get_mol_wt($seqobj);–Bio::Tools::SeqWords$seq_word= Bio::Tools::SeqWords->new(-seq=> $seqobj); $seq_stats->count_words($word_length);–Bio::Tools::SeqPattern$pat = ’T[GA]AA...TAAT’;$pattern = new Bio::Tools::SeqPattern(-SEQ =>$pat, -TYPE=>’Dna’); $pattern->expand;$pattern->revcom;$pattern->alphabet_ok;–Bio::Tools::RestrictionEnzyme$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>’EcoRI’);@fragments = $re1->cut_seq($seqobj);$re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>’EcoRV--GAT^ATC’, -MAKE =>’custom’);–Bio::Tools::Sigcleave$sigcleave_object= new Bio::Tools::Sigcleave(’-file’=>’sigtest.aa’,’-threshold’=>’3.5’,’-desc’=>’test sigcleave protein seq’,’-type’=>’AMINO ’); %raw_results= $sigcleave_object->signals;$formatted_output= $sigcleave_object->pretty_print;2.Alignments类–Bio::AlignIO/releases/bioperl-1.0/Bio/AlignIO.html –Bio::SimpleAlign/releases/bioperl-1.0/Bio/SimpleAlign.html –Bio::Tools::CodonTable/releases/bioperl-1.0/Bio/Tools/CodonTable.html–Bio::AlignIOFormat conversions with AlignIO:use Bio::AlignIO;$inputfilename= "testaln.fasta";$in = Bio::AlignIO->new(-file => $inputfilename, '-format' => 'fasta'); $out = Bio::AlignIO->new(-file => ">out.aln.pfam" , '-format' => 'pfam'); while ( my $aln= $in->next_aln() ) {$out->write_aln($aln);}Alignment result:use Bio::AlignIO;my $in = new Bio::AlignIO( -file =>, $ARGV[0], -format => 'clustalw' );my $aln= $in->next_aln();print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n"; print "alignment length: ", $aln->length(), "\n";printf"identity: %.2f %%\n", $aln->percentage_identity();printf"identity of conserved columns: %.2f %%\n",$aln>overall_percentage_identity();infile.ALN文件内容:CLUSTAL W (1.82) multiple sequence alignmentBACR_HALAR ---------------MPEPGSEAIWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALHP -----------------------IWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALVA ---------------MPAPEGEAIWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALHS -------------------------LWLGTAGMFLGMLYFIARGWGETDGRRQKFYIATI BACR_HALSR --------MLQSGMSTYVPGGESIFLWVGTAGMFLGMLYFIARGWSVSDQRRQKFYIATI BACR_HALHA MLELLPTAVEGVSQAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITT BACR_HALHM --------------------------GIGTLLMLIGTFYFIARGWGVTDKKAREYYAITI. :** * :* :**:.:* . :* :::* *BACR_HALAR LITAIAFVNYLAMALGFGLTIVEFAGEEHP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALHP LITAIAFVNYLAMALGFGLTIVEFAGEEHP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALVA LITAIAFVNYLAMALGFGLTIVEIAGEQRP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALHS LITAIAFVNYLAMALGFGLTFIEFGGEQHP--IYWARYTDWLFTTPLLLYDLGLLAGADR BACR_HALSR MIAAIAFVNYLSMALGFGVTTIELGGEERA--IYWARYTDWLFTTPLLLYDLALLAGADR BACR_HALHA LVPAIAFTMYLSMLLGYGLTMVPFGGEQNP--IYWARYADWLFTTPLLLLDLALLVDADQ BACR_HALHM LVPGIASAAYLSMFFGIGLTTVEVAGMAEPLEIYYARYADWLFTTPLLLLDLALLANADR ::..** . **:* :* *:* : ..* .. **:***:********** **.**..**:BACR_HALAR NTITSLVSLDVLMIGTGLVATLSPGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALHP NTITSLVSLDVLMIGTGLVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALVA NTISSLVSLDVLMIGTGLVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALHS NTIYSLVSLDVLMIGTGVVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALSR NTIYSLVGLDVLMIGTGALATLSAGSGVLPAGAERLVWWGISTGFLLVLLYFLFSNLTDR BACR_HALHA GTILALVGADGIMIGTGLVGALT------KVYSYRFVWWAISTAAMLYILYVLFFGFTSK BACR_HALHM TTIGTLIGVDALMIVTGLIGALS------HTPLARYTWWLFSTIAFLFVLYYLLTVLRSA ** :*:. * :** ** :.:*:..:. . * .** :** :* :** *: : .BACR_HALAR VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLIGTEGIGLVGIGIETAGFMVIDLTAKVGF BACR_HALHP VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLIGTEGIGLVGIGIETAGFMVIDLTA----BACR_HALVA VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLVGTEGIGLVGIGIETAGFMVIDLVAKVGF BACR_HALHS VANLPSDTRSTFKTLRNLVTVVWLVYPVWWLVGSEGLGLVGIGIETAGFMVIDLVA----BACR_HALSR ASELSGDLQSKFSTLRNLVLVLWLVYPVLWLVGTEGLGLVGLPIETAAFMVLDLTAKIGF BACR_HALHA AESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMVLDVSAKVGF BACR_HALHM AAELSEDVQTTFNTLTALVAVLWTAYPILWIIGTEGAGVVGLGVETLAFMVLDVTA----. .: : :.*..* :. *:* .**: *::*:** *:* : :** ***:*: *. . BACR_HALAR GIILLRS---HGVLDGAAETTGTGATPADDBACR_HALHP ------------------------------BACR_HALVA GIILLRS---HGVLDGAAETTGAGATATADBACR_HALHS ------------------------------BACR_HALSR GIILLQS---HAVLD-EGQTASEGAAVAD-BACR_HALHA GLILLRSRAIFGEAEAPEPSAGDGAAATSDBACR_HALHM ------------------------------. : . .. ::. .:: : .屏幕输出:same length of all sequences: yesalignment length: 270identity: 69.99 %identity of conserved columns: 28.52 %3.Analysis类–Blast/bioperl-live/–Genscan/releases/bioperl-1.0/Bio/Tools/Genscan.html•Blast–Bio::Tools::Run::StandAloneBlastRunning local Blastuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $Seq_in= Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');my $query = $Seq_in->next_seq();my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp','database' => 'swissprot',);my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";}Running Blast on a Swissprot entryuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;use Bio::DB::SwissProt;my $database = new Bio::DB::SwissProt;my $query = $database->get_Seq_by_id(’TAUD_ECOLI’);my $factory = Bio::Tools::Run::StandAloneBlast->new(’program’=> ’blastp’,’database’=> ’swissprot’,); my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name()," significance: ", $hit->significance(), "\n";}Parsing from a Blast run resultuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $Seq_in= Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');my $query = $Seq_in->next_seq();my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp','database' => 'swissprot',); my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";while( my $hsp= $hit->next_hsp()) {print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n";}}Parsing from a Blast fileuse Bio::SearchIO;my $blast_report= new Bio::SearchIO('-format' => 'blast', '-file' => $ARGV[0]); my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), "\n";while( my $hsp= $hit->next_hsp()) {print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n";}}Running bl2sequse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' ); my $query1 = <$query1_in>;my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' ); my $query2 = <$query2_in>;$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp'); $report = $factory->bl2seq($query1, $query2);while(my$hsp= $report->next_feature) {print "homology seq:\n", $hsp->homologySeq, "\n";print "sbjctSeq:\n", $hsp->sbjctSeq, "\n";}•Genscan–Bio::Tools::GenscanGenscan parsinguse Bio::Tools::Genscan;my $genscan_file= $ARGV[0];$genscan= Bio::Tools::Genscan->new(-file => $genscan_file);while(my$gene = $genscan->next_prediction()) {my $prot= $gene->predicted_protein;print "protein :\n", $prot->seq, "\n\n";}Genscan.out文件内容:GENSCAN 1.0Date run: 20-Jan-98Time: 11:02:22Sequence HUMRASH : 6453 bp: 68.19% C+G : Isochore 4 (57.00 -100.00 C+G%) Parameter matrix: HumanIso.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..------------------------------------------------------1.01 Init + 1664 1774 111 1 0 94 83 212 0.997 21.331.02 Intr+ 2042 2220 179 1 2 104 66 408 0.997 40.121.03 Intr+ 2374 2533 160 1 1 89 94 302 0.999 32.081.04 Term + 3231 3350 120 2 0 115 48 202 0.961 18.31Predicted peptide sequence(s):>HUMRASH|GENSCAN_predicted_peptide_1|189_aa MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDL AARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS屏幕显示内容:protein: MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLC VFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV REIRQHKLRKLNPPDESGPGCMSCKCVLSGenscan parsing, with sub-sequencesuse Bio::Tools::Genscan;my $genscan_file= $ARGV[0];$genscan= Bio::Tools::Genscan->new(-file => $genscan_file); while(my$gene = $genscan->next_prediction()) {my $prot= $gene->predicted_protein;print "protein :\n", $prot->seq, "\n\n";# display genscan predicted cds(if -cds genscan option) my $predicted_cds= $gene->predicted_cds;print "predicted CDS: \n", $predicted_cds->seq, "\n\n";foreach my $exon($gene->exons()) {my $loc = $exon->location;print "exon-primary_tag: ",$exon->primary_tag, " coding? ",$exon->is_coding(), " start-end: ", $loc->start, "-", $loc->end, "\n";}foreach my $intron($gene->introns()) {my $loc = $intron->location;print "intron primary_tag: ",$intron->primary_tag, " start-end: ",$loc->start, "-", $loc->end, "\n";}print "---------------------------------\n";}输入文件:GENSCAN 1.0Date run: 30-Oct-101Time: 17:32:04Sequence HUMRASH : 6453 bp: 68.19% C+G : Isochore 4 (57 -100 C+G%)Parameter matrix: HumanIso.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..------------------------------------------------------1.01 Init + 1664 1774 111 1 0 94 83 212 0.997 21.33 1.02 Intr+ 2042 2220 179 1 2 104 66 408 0.997 40.12 1.03 Intr+ 2374 2533 160 1 1 89 94 302 0.999 32.08 1.04 Term + 3231 3350 120 2 0 115 48 202 0.961 18.31Predicted peptide sequence(s):Predicted coding sequence(s):>HUMRASH|GENSCAN_predicted_peptide_1|189_aa MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDL AARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPG CMSCKCVLS>HUMRASH|GENSCAN_predicted_CDS_1|570_bp atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc atccagctgatccagaaccattttgtggacgaatacgaccccactatagaggattcctac cggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggc caggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgt gtgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagggagcagatc aaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaagtgtgacctg gctgcacgcactgtggaatctcggcaggctcaggacctcgcccgaagctacggcatcccc tacatcgagacctcggccaagacccggcagggagtggaggatgccttctacacgttggtg cgtgagatccggcagcacaagctgcggaagctgaaccctcctgatgagagtggccccggc tgcatgagctgcaagtgtgtgctctcctga输出文件:protein: MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLC VFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV REIRQHKLRKLNPPDESGPGCMSCKCVLSpredicted CDS: atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgaccatccagctgatccagaacca ttttgtggacgaatacgaccccactatagaggattcctaccggaagcaggtggtcattgatggggagacgtgcctgttgg acatcctggataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgt gtgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagggagcagatcaaacgggtgaaggactcgga tgacgtgcccatggtgctggtggggaacaagtgtgacctggctgcacgcactgtggaatctcggcaggctcaggacctcg cccgaagctacggcatcccctacatcgagacctcggccaagacccggcagggagtggaggatgccttctacacgttggtg cgtgagatccggcagcacaagctgcggaagctgaaccctcctgatgagagtggccccggctgcatgagctgcaagtgtgt gctctcctgaexon-primary_tag: Initial coding? 1 start-end: 1664-1774exon-primary_tag: Internal coding? 1 start-end: 2042-2220exon-primary_tag: Internal coding? 1 start-end: 2374-2533exon-primary_tag: Terminal coding? 1 start-end: 3231-3350intron primary_tag: intron start-end: 1775-2041intron primary_tag: intron start-end: 2221-2373intron primary_tag: intron start-end: 2534-3230---------------------------------4.Databases类–Bio::Index::Swissprotuse Bio::Index::Swissprot;use Bio::SeqIO;my $out = Bio::SeqIO->newFh(-fh=>\*STDOUT, -format=>'fasta');my $Index_File_Name= shift;my $inx= Bio::Index::Swissprot->new(-filename=>$Index_File_Name); foreach my $id (@ARGV) {my $seq= $inx->fetch($id);print $out $seq;}7.3 实例1.获取序列统计信息–分子量、残基、密码子使用频度#!/usr/bin/perluse warnings;use Bio::Seq;use Bio::Tools::SeqStats ;$seqobj = Bio::Seq->new(-seq =>"aaaatgggggggggggccccgtt",-display_id => "#12345",-desc => "example 1" );$seq_stats = Bio::Tools::SeqStats->new($seqobj);$weight = $seq_stats->get_mol_wt();$monomer_ref = $seq_stats->count_monomers();$codon_ref = $seq_stats->count_codons();print "$weight\n$monomer_ref\n$codon_ref\n";ARRAY(0x1bf9024)HASH(0x1be0e98)HASH(0x1bf94d4)2.识别限制性酶切位点#!/usr/bin/perluse warnings;use Bio::Seq;use Bio::Tools::RestrictionEnzyme;$seqobj= Bio::Seq->new(-seq=>"aaaatgggggggggggccccgtt",-display_id=> "#12345",-desc=> "example 1" );$re = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');@sixcutters= $re->available_list();$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');@fragments = $re1->cut_seq($seqobj);print "@sixcutters\n@fragments";3.识别氨基酸剪切位点#!/usr/bin/perl use warnings;use Bio::Seq;use Bio::Tools::Sigcleave;$seqobj = Bio::Seq->new(-seq=>"AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV");$sigcleave_object = new Bio::Tools::Sigcleave(-seq=>$seqobj, -threshod=>3.5,-desc=>'test sigcleave protein seq',-type=>'AMINO');$formatted_output = $sigcleave_object->pretty_print;print "$formatted_output\n";SIGCLEAVE of NONAME from: 1 to 37Report scores over 3.5Maximum score 3.6 at residue 3Sequence: AA--------------LLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV | (signal) | (mature peptide)1 3。

BioPerl使用手册讲解学习

BioPerl使用手册讲解学习

BioPerl使用手册第一弹—Bio::SeqIO篇注:本手册假设你已经拥有了一定的Perl编程经验,对个中术语(terms)不再进行赘述。

1.让我们开始吧为了让第一次使用本手册的同志们在刚开始就有成功的喜悦,这里给出一个例子,大家准备好自己手中的fasta 文件吧!请在文本编辑器中写入如下程序,并在终端运行:#! /usr/bin/perl –wuse strict;use Bio::SeqIO;my $file = “*********”; # Please use your path to replace the startsmy $seqio_object = Bio::SeqIO -> new(-file = $file);my $seq_obj = $seqio_object -> next_seq;printf “$seq_obj\n”;如果你成功键入了以上程序并且没有报错发生,那么屏幕上面就会正常显示出你的fasta序列。

那么恭喜你,你已经成功调用了BioPerl的模块,并且完成了一个面对对象的程序。

下面我们就来看一下我们第一个认识的BioPrel 的模块Bio::SeqIO。

2.关于Bio::SeqIO的那些事儿在介绍Bio::SeqIO之前,先来说一下为什么会产生BioPerl这个东西。

在生物信息学起步之初Prel语言强悍的字符串处理能力以及执行效率,毫无疑问的被各位从计算机和数学行业转行过来的“生物学家”选为工具语言(在生物信息数据处理方面放眼望去毫无疑问是Perl语言的天下,近来对大规模数据的处理方面R语言亦有崛起之势)。

但是,对于这海量的数据,同样丰富多彩的数据格式以及花样繁多的数据分析;每次处理数据都要重新自己编写正则表达式未免效率过于低下。

于是,在Perl一次重大的更新之后(引入面对对象编程,后面都将使用OOP代替面对对象编程),几个不太勤快的学生物的程序员看到了通用编程的可能,于是就有了我们现在广泛应用的BioPerl。

bioperl

bioperl

BioPerlT utorial - a tutorial for bioperlAUTHORWritten by Peter Schattner <schattner@>DESCRIPTIONThis tutorial includes "snippets" of code and text from variousBioperl documents including module documentation, example scriptsand "t" test scripts. You may distribute this tutorial under thesame terms as perl itself.This document is written in Perl POD (plain old documentation)format. You can run this file through your favorite pod translator(pod2html, pod2man, pod2text, etc.) if you would like a moreconvenient formatting.zhihuazhang@ May 17, 2003I. Introduction (4)I. Introduction (4)I.4 Installation (4)I.5 Additional comments for non-unix users (4)I.6 Places to look for additional documentation (5)II. Brief introduction to bioperl's objects (5)II.1 Sequence objects (5)II.2 Location objects (6)II.4 Interface objects and implementation objects (7)III. Using bioperl (7)III.1 Accessing sequence data from local and remote databases (7)III.1.1 Accessing remote databases (Bio::DB::GenBank, etc) (8)III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, (8)III.2 Transforming formats of database/ file records (9)III.2.1 Transforming sequence files (SeqIO) (9)III.2.2 Transforming alignment files (AlignIO) (10)III.3 Manipulating sequences (11)III.3.1 Manipulating sequence data with Seq methods (11)III.3.2 Obtaining basic sequence statistics- molecular weights, residue & (13)III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme) (13)III.3.4 Identifying amino acid cleavage sites (Sigcleave) (13)III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern (14)III.3.6 Sequence manipulation without creating Bioperl ``objects'' (15)III.4 Searching for ``similar'' sequences (15)III.4.1 Running BLAST (using RemoteBlast.pm) (15)III.4.2 Parsing BLAST and FASTA reports with Search and SearchIO (16)III.4.3 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq (17)III.4.4 Parsing HMM reports (HMMER::Results, SearchIO) (18)III.4.5 Running BLAST locally (StandAloneBlast) (18)III.6 Searching for genes and other structures on genomic DNA (Genscan, (20)III.7 Developing machine readable sequence annotations (21)III.7.1 Representing sequence annotations (21)III.7.2 Representing and large and/or changing sequences (22)III.7.3 Representing related sequences - mutations, polymorphisms etc (23)III.7.4 Incorporating quality data in sequence annotation (SeqWithQuality) (24)III.7.5 Sequence XML representations - generation and parsing (25)III.7.6 Representing Sequences using GFF (Bio:DB:GFF ) (25)III.8 Manipulating clusters of sequences (Cluster, ClusterIO) (26)III.9 Representing non-sequence data in Bioperl: structures, trees and maps (26)III.9.3 Map objects for manipulating genetic maps (Map::MapI, MapIO) (27)III.9.5 Graphics objects for representing sequence objects as images (27)III.10 Bioperl alphabets (28)III.10.1 Extended DNA / RNA alphabet (28)III.10.2 Amino Acid alphabet (28)IV. Auxilliary Bioperl Libraries (Bioperl-run, Bioperl-db, etc.) (29)IV.1 Using the Bioperl Auxilliary Libraries (29)IV.2 Running programs (Bioperl-run, Bioperl-ext) (30)IV.2.2 Aligning 2 sequences with Blast using bl2seq and AlignIO (31)IV.2.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm) (31)IV.3 bioperl-db (32)IV.4 Other Bioperl auxilliary libraries (32)V Appendix (33)V.1 Appendi x: Finding out which methods are used by which Bioperl Objects (33)V.2 Appendix: Tutorial demo scripts (33)I. IntroductionI.4 InstallationThe actual installation of the various system components is accomplished in the standard manner:Locate the package on the networkDownloadDecompress (with gunzip or a similiar utility)Remove the file archive (eg with tar -xvf)Create a ``makefile'' (with ``perl Makefile.PL'' for perl modules or a supplied ``install'' or ``configure'' program for non-perl program Run ``make'', ``make test'' and ``make install'' This procedure must be repeated for every CPAN module, bioperl-extension and external module to be installed. A helper module CPAN.pm is available from CPAN which automates the process for installing the perl modules. The CPAN module can also be used to install all of the modules listed above in a single step as a ``bundle'' of modules, Bundle::BioPerl, eg$>perl -MCPAN -e shellcpan>install Bundle::BioPerl<installation details....>cpan>install B/BI/BIRNEY/bioperl-1.0.1.tar.gz<installation details....>cpan>quitBe advised that version numbers change regularly, so the number used above may not apply. A disadvantage of the ``bundle'' approach is that if there's a problem installing any individual module it may be a bit more difficult to isolate. See the package's INSTALL file for more details. For the external programs (clustal, Tcoffee, ncbi-blast), there is an extra step:Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or BLASTDIR) to the directory holding the executable in your startup file - eg in .bashrc. (For running local blasts, it is also necessary that the name of local-blast database directory is known to bioperl. This will typically happen automatically, but in case of difficulty, refer to the documentation in the Bio::Tools::Run::StandAloneBlast manpage) The only likely complication (at least on unix systems) that may occur is if you are unable to obtain system level writing privileges. For instructions on modifying the installation in this case and for more details on the overall installation procedure, see the INSTALL file in the bioperl distribution as well as the README files in the external programs you want to use (eg bioperl-ext, clustalw, TCoffee, NCBI-blast).I.5 Additional comments for non-unix usersBioperl has mainly been developed and tested under various unix environments (including Linux and MacOSX). In addition, this tutorial has been written largely from a Unix perspective. Mac users may find Steve Cannon's installation notes and suggestions for Bioperl on OS X at/~cann0010/Bioperl_OSX_install.html helpful. Also Todd Richmond has written of his experiences with BioPerl on MacOS 9 at /Core/mac-bioperl.html The bioperl core has also been tested and should work under most versions of Microsoft Windows. For many windows users the perl and bioperl distributions from Active State, at / has been quite helpful. Other windows users have had success running bioperl under Cygwin (). See the package'sINSTALL.WIN file for more details.Many bioperl features require the use of CPAN modules, compiled extensions or external programs. These features probably will not work under some or all of these other operating systems. If a script attempts to access these features from a non-unix OS, bioperl is designed to simply report that the desired capability is not available. However, since the testing of bioperl in these environments has been limited, the script may well crash in a less ``graceful'' manner.I.6 Places to look for additional documentationThis tutorial does not intend to be a comprehensive description of all the objects and methods available in bioperl. For that the reader is directed to the documentation included with each of the modules. A very usefulinterface for finding one's way within all the module documentation can be found at /bioperl-live/ This interface lists all bioperl modules and descriptions of all of their methods. In addition, beginner questions can often be answered by looking at the FAQ, INSTALL and README files in the top-level directory of the bioperl distribution. One potential problem in locating the correct documentation is that multiple methods in different modules may all share the same name. Moreover, because of perl's complex method of ``inheritance'', it is not often clear which of the identically named methods is being called by a given object. One way to resolve this question is by using the software described in Appendix V.1.For those who prefer more visual descriptions, /Core/Latest/modules.html also offers links to PDF files which contain schematics that describe how many of the bioperl objects related to one another. In addition, a bioperl online course is available on the web at http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/ The user is also referred to numerous bioperl scripts in the scripts/ and examples/ directories (see bioscripts.pod for a description of these scripts).II. Brief introduction to bioperl's objectsThe purpose of this tutorial is to get you using bioperl to solve real-life bioinformatics problems as quickly as possible. The aim is not to explain the structure of bioperl objects or perl object-oriented programming in general. Indeed, the relationships among the bioperl objects is not simple; however, understanding them in detail is fortunately not necessary for successfully using the package. Nevertheless, a little familiarity with the bioperl object ``bestiary'' can be very helpful even to the casual user of bioperl. For example there are (at least) eight different ``sequence objects'' - Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, SeqI, and SeqWithQuality. Understanding the relationships among these objects - and why there are so many of them - will help you select the appropriate one to use in your script.II.1 Sequence objectsSeq is the central sequence object in bioperl. When in doubt this is probably the object that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most common sequence manipulations can be performed with Seq. These capabilities are described in sections III.3.1 and III.7.1, or in the Bio::Seq manpage. Seq objects can be created explicitly (see section III.2.1 for an example). However usually Seq objects will be created for you automatically when you read i n a file containing sequence data using the SeqIO object. This procedure is described in section III.2.1. In addition to storing its identification labels and the sequence itself, a Seq object canstore multiple annotations and associated ``sequence features''. This capability can be very useful - especially in development of automated genome annotation systems, see section III.7.1. RichSeq objects store additional annotations beyond those used by standard Seq objects. If you are using sources with very ric h sequence annotation, you may want to consider using these objects which are described in section III.7.1. SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred. These objects are described in section III.7.4, the Bio::Seq::RichSeqI manpage, and in the Bio::Seq::SeqWithQuality manpage. On the other hand, if you need a script capable of simultaneously handling many (hundreds or thousands) sequences at a time, then the overhead of adding annotations to each sequence can be significant. For such applications, you will want to use the PrimarySeq object. PrimarySeq is basically a ``stripped down'' version of Seq. It contains just the sequence data itself and a few identifying labels (id, accession number, molecule type = dna, rna, or protein). For applications with hundreds or thousands or sequences, using PrimarySeq objects can significantly speed up program execution and decrease the amount of RAM the program requires. See the Bio::PrimarySeq manpage for more details. What is (for historical reasons) called a LocatableSeq object might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which is part of a multiple sequence alignment. It has ``start'' and ``end'' positions indicating from where in a larger sequence it may have been extracted. It also may have ``gap'' symbols corresponding to the alignment to which it belongs. It is used by the alignment object SimpleAlign and other modules that use SimpleAlign objects (eg AlignIO.pm, pSW.pm). I n general you don't have to worry about creating LocatableSeq objects because they will be made for you automatically when you create an alignment (using pSW, Clustalw, Tcoffee or bl2seq) or when you input an alignment data file using AlignIO. However if you need to input a sequence alignment by hand (eg to build a SimpleAlign object), you will need to input the sequences as LocatableSeqs. Other sources of information include the Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO manpage, and the Bio::Tools::pSW manpage. The RelSegment object is also a type of bioperl Seq object. RelSegment objects are useful when you want to be able to manipulate the origin of the genomic coordinate system. This situation may occur when looking at a sub-sequence (e.g. an exon) which is located on a longer underlying underlying sequence such as a chromosome or a contig. Such manipulations may be important, for example when designing a graphical genome ``browaer''. If your code may need such a capability, look at the documentation the Bio::DB::GFF::RelSegment manpage which describes this feature in detail.A LargeSeq object is a special type of Seq object used for handling very long ( eg > 100 MB) sequences. If you need to manipulate such long sequences see section III.7.2 which describes LargeSeq objects, or the Bio::Seq::LargeSeq manpage. A LiveSeq object is another specialized object for storing sequence data. LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes -locations which can change as higher quality sequencing data becomes available. Although a LiveSeq object is not implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below). Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section III.7.2 and the Bio::LiveSeq manpage contain further discussion of LiveSeq objects. SeqI objects are Seq ``interface objects'' (see section II.4 and the Bio::SeqI manpage). They are used to ensure bioperl's compatibility with other software packages. SeqI and other interface objects are not likely to be relevant to the casual bioperl user. *** Having described these other types of sequence objects, the ``bottom line'' still is that if you store your sequence data in Seq objects (which is where they'll be if you read them in with SeqIO), you will usually do just fine. ***II.2 Location objectsA Location object is designed to be associated with a Sequence Feature object to indicate where on a largerstructure (eg a chromosome or contig) the feature can be found. The reason why this simple concept has evolved into a collection of rather complicated objects is that: 1) Some objects have multiple locations or sub-locations (e.g. a gene's exons may have multiple start and stop locations) 2) In unfinished genomes, the precise locations of features is not known with certainty. Bioperl's various Location objects address these complications. In addition there are ``CoordinatePolicy'' objects that allow the user to specify how to measure the ``length'' of a feature if its precise start and end coordinates are not know. In most cases, you will not need to worry about these complications if you are using bioperl to handle simple features with well-defined start and stop locations. However, if you are using bioperl to annotate partially or unfinished genomes or to read annotations of such genomes with bioperl, understanding the various Location objects will be important. See the documentation of the various modules in the Bio::Locations directory or the Bio::Location::CoordinatePolicyI manpage for more information.II.4 Interface objects and implementation objectsOne goal of the design of Bioperl is to separate interface and implementation objects. An interface is solely the definition of what methods one can call on an object, without any knowledge of how it is implemented. An implementation is an actual, working implementation of an object. In languages like Java, interface definition is part of the language. In Perl, you have to roll your own. In bioperl, the interface objects usually have names like Bio::MyObjectI, with the trailing I indicating it is an interface object. The interface objects mainly provide documentation on what the interface is, and how to use it, without any implementations (though there are some exceptions). Although interface objects are not of much direct utility to the casual bioperl user, being aware of their existence is useful since they are the basis to understanding how bioperl programs can communicate with other bioinformatics projects and computer languages such as Ensembl and biopython and biojava. For more discussion of design and development issues please see the biodesign.pod file.III. Using bioperlBioperl provides software modules for many of the typical tasks of bioinformatics programming. These include: Accessing sequence data from local and remote databasesTransforming formats of database/ file recordsManipulating individual sequencesSearching for ``similar'' sequencesCreating and manipulating sequence alignmentsSearching for genes and other structures on genomic DNADeveloping machine readable sequence annotationsThe following sections describe how bioperl can help perform all of thesetasks.III.1 Accessing sequence data from local and remote databasesMuch of bioperl is focused on sequence manipulation. However, before bioperl can manipulate sequences, it needs to have access to sequence data. Now one can directly enter data sequence data into a bioperl Seq object, eg:$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact','-desc'=>'Sample Bio::Seq object','-display_id' => 'something', '-accession_number' => 'accnum','-alphabet' => 'dna' );However, in most cases, it is preferable to access sequence data from some online data file or database (Note that in common with conventional bioinformatics usage we will sometimes call a ``database'' what might be more appropriately referred to as an ``indexed flat file''.) Bioperl supports accessing remote databases as well as developing indices for setting up local databases.III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)Accessing sequence data from the principal molecular biology databases is straightforward in bioperl. Data can be accessed by means of the sequence's accession number or id. Batch mode access is also supported to facilitate the efficient retrieval of multiple sequences. For retrieving data from genbank, for example, the code could be as follows:$gb = new Bio::DB::GenBank();# this returns a Seq object :$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');# this returns a Seq object :$seq2 = $gb->get_Seq_by_acc('AF303112'))# this returns a SeqIO object :$seqio = $gb->get_Stream_by_id("J00522","AF303112","2981014");Bioperl currently supports sequence data retrieval from the genbank, genpept, RefSeq, swissprot, and EMBL databases. See the Bio::DB::GenBank manpage, the Bio::DB::GenPept manpage, the Bio::DB::SwissProt manpage, the Bio::DB::RefSeq manpage and the Bio::DB::EMBL manpage for more information. A user can also specify a different database mirror for a database - this is especially relevent for the SwissProt resource where there a re many ExPaSy mirrors. There are also configuration options for specifying local proxy servers for those behind firewalls. The retrieval of NCBI RefSeqs sequences is supported through a special module called Bio::DB::RefSeq which actually queries an EBI server. Please see the Bio::DB::RefSeq manpage before using it as there are some caveats with RefSeq retrieval. RefSeq ids in Genbank begin with ``NT_'', ``NC_'',``NG_'', ``NM_'', ``NP_'', ``XM_'' ``XR_'', or ``XP_'' (for more information see /LocusLink/refseq.html). Bio::DB::GenBank can be used to retrieve entries corresponding to these ids but bear in mind that these are not Genbank entries, strictly speaking. See the Bio::DB::GenBank manpage for special details on retrieving e ntries beginning with ``NT_'', these are specially formatted ``CONTIG'' entries. Bioperl also supports retrieval from a remote Ace database. This capability requires the presence of the external AcePerl module. You need to download and install the aceperl module from /AcePerl/ An additional module is available for accessing remote databases, BioFetch, which queries the dbfetch script at EBI. The available databases are EMBL, GenBank, or SWALL, and the entries can be retrieved in different formats as objects or streams (SeqIO objects), or as ``tempfiles''. See the Bio::DB::BioFetch manpage for the details.III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl,bpfetch.pl, Bio::DB::*)Alternately, bioperl permits indexing local sequence data files by means of the Bio::Index or Bio::DB::Fasta objects. The following sequence data formats are supported by Bio::Index: genbank, swissprot, pfam, embl and fasta. Once the set o f sequences have been indexed using Bio::Index, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases. For example, if one wants to set up an indexed flat-file database of fasta files, and later wants then to retrieve one file, one could write scripts like:# script 1: create the indexuse Bio::Index::Fasta; # using fasta file formatuse strict; # some users have reported that this is necessarymy $Index_File_Name = shift;my $inx = Bio::Index::Fasta->new(-filename => $Index_File_Name,-write_flag => 1);$inx->make_index(@ARGV);# script 2: retrieve some filesuse Bio::Index::Fasta;use strict; # some users have reported that this is necessarymy $Index_File_Name = shift;my $inx = Bio::Index::Fasta->new($Index_File_Name);foreach my $id (@ARGV) {my $seq = $inx->fetch($id); # Returns Bio::Seq object# do something with the sequence}To facilitate the creation and use of more complex or flexible indexing systems, the bioperl distribution includes two sample scripts in the scripts/ directory, bpindex.pl and bpfetch.pl. These scripts can be used as templates to develop customized local data-file indexing systems. Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to Bio::Index::Fasta but offers more methods, eguse Bio::DB::Fasta;use strict;my $db = Bio::DB::Fasta->new($file); # one file or many filesmy $seqstring = $db->seq($id); # get a sequence as stringmy $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq objmy $desc = $db->header($id); # get the header, or description, lineThis module also offers the user the ability to designate a specific string within the fasta header as the desired id, such as the gi number within the string ``gi|4556644|gb|X45555'' (use the -makeid option for this capability). See the Bio::DB::Fasta manpage for more information on this fully-featured module. The core bioperl installation does not support accessing sequences stored in relational databases. However, such capabilility is available with the auxilliary bioperl-db library. See section IV.3 for more information.III.2 Transforming formats of database/ file recordsIII.2.1 Transforming sequence files (SeqIO)A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in a single or in multiple files - in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, phd/phred, Ace, fastq, or raw (plain sequence). Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq objects. Moreover, the Seq objects can then be written to another file (againusing SeqIO) in any of the supported data formats making data converters simple to implement, for example: use Bio::SeqIO;$in = Bio::SeqIO->new('-file' => "inputfilename",'-format' => 'Fasta');$out = Bio::SeqIO->new('-file' => ">outputfilename",'-format' => 'EMBL');while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }In addition, perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print operations to read and write sequence objects, eg:$in = Bio::SeqIO->newFh('-file' => "inputfilename" ,'-format' => 'Fasta');$out = Bio::SeqIO->newFh('-format' => 'EMBL');print $out $_ while <$in>;If the ``-format'' argument isn't used then Bioperl will guess the format based on the file's suffix in a case-insensitive manner. Here are the current interpretations:Format Suffixesfasta fasta|fast|seq|fa|fsa|nt|aagenbank gb|gbank|genbankscf scfpir pirembl embl|ebl|emb|datraw txtgcg gcgace acebsml bsm|bsmlswiss swiss|spphd phd|phredfastq fastqFor more information see the Bio::SeqIO manpage.III.2.2 Transforming alignment files (AlignIO)Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the bioperl object for data conversion of alignment files. AlignIO is patterned on the SeqIO object and shares most of SeqIO's features. AlignIO currently supports input in the following formats:fastamase (Seaview)stockholmprodomselex (HMMER))bl2seqclustalw (.aln)msf (GCG)water*phylip (interleaved)stockholmnexusmegamemepsi (PSI-BLAST)*used by EMBOSS, see IV.2.1AlignIO supports output in these formats: fasta, mase, selex, clustalw, msf/gcg, and phylip (interleaved). One significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:use Bio::AlignIO;$in = Bio::AlignIO->new('-file' => "inputfilename" ,'-format' => 'fasta');$out = Bio::AlignIO->new('-file' => ">outputfilename",'-format' => 'pfam');while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }The only difference is that here, the returned object reference, $aln, is to a SimpleAlign object rather than a Seq object. AlignIO also supports the tied filehandle syntax described above for SeqIO. See the Bio::AlignIO manpage and section III.5 for more information.III.3 Manipulating sequencesBioperl contains many modules with functions for sequence analysis. And if you cannot find the function you want in bioperl you may be able to find it in EMBOSS or PISE , which are accessible through the bioperl-run auxilliary library (see IV.2.1).III.3.1 Manipulating sequence data with Seq methodsOK, so we know how to retrieve sequences and access them as Seq objects. Let's see how we can use the Seq objects to manipulate our sequence data and retrieve information. Seq provides multiple methods for performing many common (and some not-so-common) tasks of sequence manipulation and data retrieval. Here are some of the most useful: The following methods return strings$seqobj->display_id(); # the human read-able id of the sequence$seqobj->seq(); # string of sequence$seqobj->subseq(5,10); # part of the sequence as a string$seqobj->accession_number(); # when there, the accession number$seqobj->alphabet(); # one of 'dna','rna','protein'$seqobj->primary_id(); # a unique id for this sequence irregardless# of its display_id or accession number$seqobj->desc(); # a description of the sequenceIt is worth mentioning that some of these values correspond to specific fields of given formats. For example, the display_id method returns the LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a SwissProt file, and so on. The desc() method will return the DEFINITION line of a Genbank file, the line following the display_id in a Fasta file, and the DE field in a SwissProt file. The following methods return an array of Bio::SeqFeature objects$seqobj->get_SeqFeatures; # The 'top level' sequence features$seqobj->get_all_SeqFeatures; # All sequence features, including sub# seq featuresFor a comment annotation, you can use:use Bio::Annotation::Comment;。

基于BioPerl实现从NCBI中精确下载LEA基因序列

基于BioPerl实现从NCBI中精确下载LEA基因序列

Hans Journal of Computational Biology 计算生物学, 2014, 4, 13-19Published Online June 2014 in Hans. /journal/hjcb/10.12677/hjcb.2014.42002Based on BioPerl Realize AccuratelyDownload LEA Gene Sequences fromthe NCBIXiaojing Zhang*, Xingqin Cao, Weimin Pan#School of Life Sciences, Xinjiang Normal University, UrumchiEmail: 313741033@, #379483304@Received: Apr. 11th, 2014; revised: Apr. 18th, 2014; accepted: Apr. 22nd, 2014Copyright © 2014 by authors and Hans Publishers Inc.This work is licensed under the Creative Commons Attribution International License (CC BY)./licenses/by/4.0/AbstractRecently, researchers have paid more and more attention to the resistance gene research; in Xin-jiang, especially the research of drought resistance gene has attached great importance. Based on these factors, according to the conservative domain structure of LEA gene (late embriogenesis ab-undant gene, LEA) and the corresponding keywords, this paper designed a program that LEA gene sequences was downloaded accurately from NCBI based on the BioPerl. This procedure not only solves the precise acquisition of LEA gene, but also provides a better solution to download differ-ent types of sequence exactly.KeywordsBioPerl, Conservative Structure Domain, Feature List, Key Words, LEA Gene基于BioPerl实现从NCBI中精确下载LEA基因序列张晓婧*,曹兴芹,潘伟民#新疆师范大学生命科学学院,乌鲁木齐Email: 313741033@, #379483304@#通讯作者。

经典的Bioedit使用说明书

经典的Bioedit使用说明书

选择的类型是“Normal Arrow ”、“Wide Arrow ”、“Normal Box” 和、“Wide Box ”。在上面例子中的所有特征是“常规”宽度的。如果特征是一个箭头,箭头的方向将是从起点位置到终点位置。
2
增加特征或酶时,他们各自的标记增加在外面,中心是可能的尺寸。标记可以被选择工具选择、移动、编辑和缩放。
综合序列分析软件BioEdit
PLEASE ENTER YOUR TITLE HERE
2003级 高芳銮
BioEdit是一个性能优良的免费的生物序列编辑器,可在Windows 95/98/NT/2000中运行,它的基本功能是提供蛋白质、核酸序列的编辑、排列、处理和分析。
与DNAMAN相比,其分析内容相对丰富一些,而且提供了很多网络程序的分析界面和接口,与DNAMAN等软件配合使用更好。
使用BioEdit质粒绘图功能,序列可以通过自动的位置标记,自动修改成环形质粒。特征、多连接位点和限制性位点可以通过使用对话框增加。当将一个序列进入质粒图时,在背景上出现一个限制性内切酶图谱,所以可以通过对话框选择可以增加限制性位点。它们自动增加到当前的位点。质粒功能提供简单的绘制和标记工具。标签和绘图可以通过鼠标移动和缩放。想要编辑目标性质,双击目标。
05
核酸分析:组成、互补、反转、翻译、质粒、限制性内切酶;
06
蛋白质分析:氨基酸成分、疏水性轮廓 、疏水力矩平均数
07
翻译或反翻译:把DNA或RNA翻译成蛋白质;
08
切换翻译:在核酸和编码蛋白质序列中切换核苷酸序列;
09
点图[成对比较]:相互比较两序列的矩阵,生成一个点图。
序列的常规操作:
BLAST
瞬间疏水性轮廓(hydrophobic moment profile)

神奇的perl-最佳PERL入门教程

神奇的perl-最佳PERL入门教程

第一章
拥抱 Perl.................................................................................................................................. 6
1.
Perl 是什么? ............................................................................................................................ 6
2.2 程序执行 ................................................................................................................................... 9
3.
变量与常量 ............................................................................................................................10
1.2 注释............................................................................................................................................. 9
1.3 语句块 ........................................................................................................................................ 9

Bioperl经典教程Database-SQL

Bioperl经典教程Database-SQL
5
SQL datatypes
• datatype -- describes the data stored in a particular column of a table • typically is either numeric or character strings • SQL defines subtypes that set different upper limits on the size of text or numerical data • Also -- special types such as DATE, MONEY
(if time)
Ensembl web interface Ensembl Biomart
3
RDBMS
• Oracle and Sybase (many others) – industry standard, commercial products – development and management tools • PostgresSQL – full-featured relational DBMS – open source – found in most linux distributions – handles unusual datatypes well which adds flexibility for future extensions • MySQL – open-source relational DBMS – easy to setup and use – Linux/Windows/Mac
6
SQL datatypes
• • • • • • • • • INT FLOAT REAL (larger float) CHAR -- fixed length text string TEXT -- variable length text BLOB -- variable length binary field DECIMAL -- real number stored as character string DATE TIMESTAMP -- value updates every time the record is modified • ENUM -- limited set of options (numeric or named) • SET -- value is one of a limited set

BioPerl使用手册讲解学习

BioPerl使用手册讲解学习

BioPerl使用手册第一弹—Bio::SeqIO篇注:本手册假设你已经拥有了一定的Perl编程经验,对个中术语(terms)不再进行赘述。

1.让我们开始吧为了让第一次使用本手册的同志们在刚开始就有成功的喜悦,这里给出一个例子,大家准备好自己手中的fasta 文件吧!请在文本编辑器中写入如下程序,并在终端运行:#! /usr/bin/perl –wuse strict;use Bio::SeqIO;my $file = “*********”; # Please use your path to replace the startsmy $seqio_object = Bio::SeqIO -> new(-file = $file);my $seq_obj = $seqio_object -> next_seq;printf “$seq_obj\n”;如果你成功键入了以上程序并且没有报错发生,那么屏幕上面就会正常显示出你的fasta序列。

那么恭喜你,你已经成功调用了BioPerl的模块,并且完成了一个面对对象的程序。

下面我们就来看一下我们第一个认识的BioPrel 的模块Bio::SeqIO。

2.关于Bio::SeqIO的那些事儿在介绍Bio::SeqIO之前,先来说一下为什么会产生BioPerl这个东西。

在生物信息学起步之初Prel语言强悍的字符串处理能力以及执行效率,毫无疑问的被各位从计算机和数学行业转行过来的“生物学家”选为工具语言(在生物信息数据处理方面放眼望去毫无疑问是Perl语言的天下,近来对大规模数据的处理方面R语言亦有崛起之势)。

但是,对于这海量的数据,同样丰富多彩的数据格式以及花样繁多的数据分析;每次处理数据都要重新自己编写正则表达式未免效率过于低下。

于是,在Perl一次重大的更新之后(引入面对对象编程,后面都将使用OOP代替面对对象编程),几个不太勤快的学生物的程序员看到了通用编程的可能,于是就有了我们现在广泛应用的BioPerl。

bioperl入门级教程可编辑

bioperl入门级教程可编辑

初品BioPerl(第一篇:让BioPerl在你的电脑上安家)九月18, 2010Perl的模块有两类,一类是内置在Perl中的,比如小驼书中提到的CGI,File::Basename 等,所以你无须另外下载安装这些模块即可使用它们;另一类则是与Perl相分离的,所以你要下载并安装才可以使用它们。

很不幸,BioPerl属于后者。

而且,安装过程对某些人某些电脑来说不是很容易。

首先,如果你恰巧跟上了因Ubuntu10.04的发行带来的Linux崇拜潮流,已经成功抛弃了Windows Xp转向Ubuntu的话,那么恭喜你,安装BioPerl和安装gimp之类的应用软件一样简单!打开新立德软件包管理器,输入BioPerl搜索,出现的第一项就是(目前的版本是1.6.1),然后右键点击选择“安装”,系统就会自动把所有依赖的软件包全部安装上。

怎么样,是不是很方便呢?而且将来某一天若BioPerl有了更新,可以很方便地使用新立德来升级。

当然,如果大家都这么方便的话,我就不用写这篇文章啦!对于使用其他Linux和Unix系统的人来说可能未必有这种软件包管理器(即使有,在软件源里也未必有BioPerl的安装文件)。

但是所有的类Unix系统应该都支持“源代码编译安装”的方式(这是不少Linux高手自认为很酷的装软件方法,哪怕现在还有好多人在用。

新手最好不要用,很容易出现问题)。

对于这些类Unix系统的用户来说,安装BioPerl和安装一些普通的软件方法是一样的,步骤是:(1)下载BioPerl的源代码,并解压。

我提供两个网址:/DIST/BioPerl-1.6.1.tar.gz/CPAN/authors/id/C/CJ/CJFIELDS/BioPerl-1.6.1.tar .gz(2)进入该目录,然后执行下列命令:$ perl Build.PL$ ./Build test# ./Build install注意:执行./Build install时必须要有系统管理员权限。

perl简明教程perl教程集合

perl简明教程perl教程集合

perl简明教程perl教程集合前⾔:perl是什么,⼲什么⽤的?perl原来设计者的意图是⽤来处理字符的,80%的强项是处理字符,当然其它的很多都可以。

现在很多⽹页也是⽤perl的,通常需要CGI环境,⽐如 $char =~ /语⾔/ ,意思是查找含有“语⾔”这两个字的字符串。

它还能做unix和linux系统管理,⽂件内容处理(基于awk和sed等⼯具的功能),其他好多你想做的事。

⼀、perl环境配置1.获取perl Perl通常位于/usr/local/bin/perl或/usr/bin/perl中。

你可以在Internet⽤匿名FTP免费得到它,如ftp:///pub/gnu/perl-5.004.tar.gz2.安装过程为:(1)解压:$gunzip perl-5.004.tar.gz$tar xvf - <perl-5.004.tar.gz(2)编译:$make makefile(3)放置:将编译⽣成的可执⾏⽂件拷贝到可执⾏⽂件通常所在⽬录,如:$copy <compiled excutable file> /usr/local/bin/perl3.运⾏ ⽤⽂本编辑器编辑好你的Perl程序,加上可执⾏属性:$chmod +x <program>就可以执⾏了:$./<program>。

如果系统提⽰:"/usr/local/bin/perl not found",则说明你没有安装成功,请重新安装。

注:你的程序的第⼀⾏必须为#!/usr/local/bin/perl(perl所在位置)。

4注释:注释的⽅法为在语句的开头⽤字符#,如:# this line is a comment注:建议经常使⽤注释使你的程序易读,这是好的编程习惯。

⼆、常量、变量等问题1.单引号与双引号 双引号内的字符串中⽀持简单变量替换如:$text = "This text contains the number $number."; 双引号内的字符串中⽀持转义字符 单引号字符串与双引号字符串有两个区别,⼀是没有变量替换功能,⼆是反斜线不⽀持转义字符。

用bioperl实现种子植物18SrRNA基因序列的大规模获取3

用bioperl实现种子植物18SrRNA基因序列的大规模获取3

收稿日期:2005Ο02Ο053国家自然科学基金(90202016)和湖北省首届新世纪人才工程项目(010*******)资助33通讯作者E Οmail :yulj @ 向 福,男,1977年生,华中科技大学生命科学与技术学院在读硕士生.武汉430074用biop erl 实现种子植物18S rRNA基因序列的大规模获取3向 福 余龙江33 栗茂腾 刘 智(华中科技大学生命科学与技术学院,武汉430074)摘要 基于bioperl 设计了大规模获取基因序列的方法程序,成功实现了Genbank 中种子植物18S rRNA 基因序列的大规模获取,解决了构建种子植物18S rRNA 基因序列二级数据库的大规模数据获取问题。

同时,该程序为不同类型序列的大规模获取都提供了一种较好的解决办法。

关键词 18S rRNA ;序列获取;bioperl 中图法分类号 Q 949.4 TP 319 大规模获取18S rRNA 基因序列数据是构建种子植物的18S rRNA 基因序列二级数据库的关键,同时该方法的实现对基于大量序列的研究工作都能够奠定良好的工作基础,有着重要的现实意义。

由于分子生物学实验技术的进步,人类基因组计划的实施,以及各种物种基因组研究的相继展开,有关核酸、蛋白质的序列和结构数据呈爆炸趋势增长。

但是从某种意义上来说,人们已经不象最初获得这类信息时那样兴奋,因为这些序列的本身已经变得没有意义了。

要真正利用、发掘这些序列信息的内在含义,就需要对它们进行集中深入的比对、分析研究,而这些工作的前提就是对所有获得数据的全部收集、整理、管理和系统化,并通过切实可行的方法提供给所有可能需要这些信息的人们。

构建二级数据库就是基于这种目的。

但是,目前还没有一种好的方法来实现大规模序列数据的获取。

通常做法都是通过手工直接查询Genbank 等数据库,找到需要的序列部分,然后拷贝下来经过格式等方面的处理才能用来作为研究数据。

perl教程

perl教程

字符串操作符
. 连接两个字符串
"hello" . "world" = "helloworld"
x 串重复操作符 "yang" x 3="yangyangyang" 次数使用前截成一个整数。4.8=4,小
于 1 的拷贝次数会得到一个空串。
数字与字符串的自动转换 依赖于作用在标量值上的操作符。如果是+则是数据,如果是.则是字符串。 "z".6*7="z42" ,"12"*"3"=36
第一章 简介
第二章 标量数据
什么是村量数据? 标量(scalar)是 perl 中最基本的元素。大多数标量要么是一个数字,要么是一个字符串。 perl 基本上把它们视为可相互替换的东西。
数字 所有数字的内部格式都一样 在内部,perl 一律把整数,浮点数按双精度浮点数进行计算。perl 内部没有整数值----程序中 的一个整数常量被当作等值的浮点数。
0x50657c八进制直接量以十六进制直接量以二进制以从数值操作符23加32减23乘23除103模23指数字符串单引号字符串直接量在引号间除了单引号或反斜杠以外的任何字符包括换行符如果该字符串连续占几行在字符串中表示该字符本身
我的评价: 本书是 perl 的经典入门书籍,介绍了 perl 中最基本的概念和语法,是 perl 入门的好书,我 向所有想了解或学习 perl 语言的朋友推荐本书。书中穿插有 perl 语法形成的历史和原因, 使你能充分感受到 perl 语言的文化,这也是 perl 流行的原动力。本书行文流畅,各知识点 介绍到位,令人很容易明白,达到入门点睛的效果。但本书的作用也就是入门而已,目的很 明确,它没有深入到 perl 脚本语言的高级部份。如果你想了解 perl 的高级功能或使用 perl 来更好地完成你的日常工作,还需进一步学习,《perl f 值,可以用古怪的 undef 操作符
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
use Bio::DB::Taxonomy; my $db = Bio::DB::Taxonomy->new( -source => ‘flatfile’, -nodesfile=> $nodefile, -namesfile=> $namesfile); $node = $db->get_Taxonomy_Node( -taxonid => ‘9606’); $node = $db->get_Taxonomy_Node( -name => ‘Homo sapiens’);
while( my $aln = $in->next_aln ) { print $aln->no_sequences,” sequence in alignment\n”; for my $sequence( $aln->each_seq ) { print $sequence->display_id, “\n”; } my $slice = $aln->slice(10,30); # slice of alignment $out->write_aln($slice); }
use Bio::Tools::Run::RemoteBlast; my $prog = ‘blastp’; my $db = ‘ecoli’; my $e_val= ‘1e-10’; my $remote_blast = Bio::Tools::Run::RemoteBlast->new( -prog => $prog, -data => $db, -expect => $e_val); my $r = $remote_blast->submit_blast($inputfilename); while( my @rids = $remote_blast->each_rid ) { for my $rid ( @rids ) { my $rc = $remote_blast->retrieve_blast($rid); if( ! ref($rc) ) { if( $rc < 0 ) { $remote_blast->remove_rid($rid); } print STDERR “.”; sleep(10); } else { $remote_blast->remove_rid($rid); my $result = $rc->next_result; while( my $hit = $result->next_hit ) { print $hit->name, “ “, $hit->significance, “\n”; } }
-height => 8,
-label => 1,
-description => 1,
); } print $panel->png;
Graphics output
[babelfish]$ perl graphics.pl AB077698.gbk > AB077698.png
Sequence Retrieval from Local Database
use Bio::DB::Fasta; my $db = new Bio::DB::Fasta($dir_with_fa_files); my $seqstr = $db->seq(‘SEQUENCE1’); my $seq_part = $db->seq(‘SEQUENCE2’, 100 => 300); my $seqobj = $db->get_Seq_by_acc(‘SEQUENCE1’); $db = Bio::DB::Flat->new(-directory => '/usr/share/embl', -dbname => 'mydb', -format => 'embl', -index => 'bdb', -write_flag => 1); $db->build_index('/usr/share/embl/primate.embl', '/usr/share/embl/protists.embl'); $seq = $db->get_Seq_by_id('BUM'); @sequences = $db->get_Seq_by_acc('DIV' => 'primate'); $raw = $db->fetch_raw('BUM');
Sequence Retrieval Script
#!/usr/bin/perl -w use strict; use Bio::DB::GenPept; use Bio::DB::GenBank; use Bio::SeqIO; my $db = new Bio::DB::GenPept(); # my $db = new Bio::DB::GenBank(); # if you want NT seqs # use STDOUT to write sequences my $out = new Bio::SeqIO(-format => 'fasta'); my $acc = ‘AB077698’; my $seq = $db->get_Seq_by_acc($acc); if( $seq ) { $out->write_seq($seq); } else { print STDERR "cannot find seq for acc $acc\n"; } $out->close();
Getting Sequences from GenBank
Through Web Interface Bio::DB::GenBank (don’t abuse!!) Alternative is to download all of genbank, index with Bio::DB::Flat (will be much faster in long run)
Tools for Evolutionary and Population analyses
Population Genetics Modules Taxonomy Molecular Evolution Phylogenetic Tree Building and Manipulation
Taxonomy data
-glyph => 'generic',
-bgcolor => $colors[$idx++ %; 'black',
-font2color => 'red',
-key => "${tag}s",
-bump => +1,
Bioperl II
Jason Stajich Duke University
Multiple Sequence Alignments
Bio::AlignIO to read alignment files Produces Bio::SimpleAlign objects Interface and objects designed for roundtripping and some functional work Could really use an overhaul or a parallel MSA representation
use Bio::Graphics; use Bio::SeqIO; use Bio::SeqFeature::Generic; my $file = shift or die "provide a sequence file as the argument"; my $io = Bio::SeqIO->new(-file=>$file) or die "couldn't create Bio::SeqIO"; my $seq = $io->next_seq or die "couldn't find a sequence in the file"; my @features = $seq->all_SeqFeatures; # sort features by their primary tags my %sorted_features; for my $f (@features) { my $tag = $f->primary_tag; push @{$sorted_features{$tag}},$f; } my $panel = Bio::Graphics::Panel->new( -length => $seq->length, -key_style => 'between', -width => 800, -pad_left => 10, -pad_right => 10); $panel->add_track(arrow => Bio::SeqFeature::Generic->new(-start => 1, -end => $seq->length), -bump => 0, -double=>1, -tick => 2); $panel->add_track(generic => Bio::SeqFeature::Generic->new(-start => 1, -end => $seq->length, -bgcolor => 'blue', -label => 1,); # general case my @colors = qw(cyan orange blue purple green chartreuse magenta yellow aqua); my $idx = 0; for my $tag (sort keys %sorted_features) { my $features = $sorted_features{$tag}; $panel->add_track($features,
相关文档
最新文档