A Basic Bioinformatics Tutorial
This tutorial was developed by:
Dr Michael Bunce (Murdoch University, Perth, Australia), and the Biomatters Team
Once DNA has been sequenced it is deposited in a database called Genbank (/Genbank). Genbank holds over 100 gigabases of DNA sequence data. Molecular biologists rely heavily on existing Genbank sequence data in their day-to-day research activities. Learning basic bioinformatics skills, including search and retrieve, "BLAST", sequence alignment and analysis, is now central to molecular biology.
Geneious provides a user-friendly interface into Genbank so that information is retrieved intuitively and visually. It also allows you to set up agents that search for data automatically, and even give your data a relevancy score based on machine learning techniques. Geneious is also a powerful aid in protein and DNA sequence analysis, combining a number of core bioinformatics tools into a single, integrated system.
Using this tutorial, within 2 hours you will know all the core elements of bioinformatics in Geneious, and have experienced "doing it" for yourself. This tutorial covers basic techniques in downloading, analysing and manipulating DNA/protein sequence data.
Exercise 1: Viewing the structure of a protein
Exercise 2: Accessing Genbank and interpreting its output
Exercise 3: Importing, aligning and building trees using Geneious
Exercise 4: Designing PCR primer pairs.
Exercise 1: Viewing the structure of a protein:
Have a look around the Geneious interface - familiarise yourself with the location of the buttons and menus. In this first exercise we will visualise a 3-D molecule. Crystallising a protein (or DNA) is technically challenging - the structure of DNA was determined by Watson and Crick in 1953 based on an X-ray crystallography image (below) generated by Rosalind Franklin.
Click Here to open the nucleosome core document.
A nucleosome core is the basic building block of a chromosome. It is comprised of 146 base-pairs of DNA wrapped around 8 histone proteins (an octamer). Click some of the buttons to the right of the crystal structure - these toggle on and off the bonds, ribbons, etc. Also increase and decrease the atom size. The spin function allows you to view the structure from all angles and you can zoom and rotate the molecule using the mouse and shift keys.
Question: Why might knowing something about the shape of a molecule aid in designing therapeutic drugs?
Answer:
>
Exercise 2: Accessing Genbank and interpreting its output.
Given a nucleotide or protein sequence it is possible to search for similar sequences using BLAST. BLAST stands for Basic Local Alignment Search Tool and finds regions of local similarity between a target sequence and a set of stored sequences. BLAST also calculates the statistical significance of matches based on both similarity and the number of sequences in the set.
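To make "local similarity" concrete, here is a minimal Python sketch of Smith-Waterman local alignment scoring, the exact method that BLAST approximates with faster heuristics. It is not how BLAST itself is implemented, and the function name and scoring values (match, mismatch, gap) are illustrative assumptions rather than BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b.

    The scoring values are illustrative assumptions, not BLAST's defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared region "ACG" is found even though the flanking bases differ.
print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

A higher score means a longer or cleaner shared region; BLAST additionally converts such scores into a statistical significance, which this sketch does not attempt.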
2.1: Copy the unknown DNA sequence (below) and paste it into the search box within Geneious's NCBI blastn window (located on the service tree on the left panel) - click the search button on the right. As with all databases, output can be slow if many people are accessing it simultaneously. Be patient. An estimate of the approximate search time will appear below the menu buttons.
CAT CCG TTG CCC ACA CAT GTC GTG ATG TAC AGT ACG GCT GAT TAA TCC G
Click on the first (or top) Genbank "hit" to display the result in the sequence viewer below. The viewer shows both query and match regions. Note that in this case there is an exact match. Click on the second hit on this list in the document table - note that most of the bases are the same but there are a number of differences.
Now click back to the top match and click on "Download Full Documents". This will download the complete Genbank sequence. Read the information in the summary column: from this information you should be able to answer the following questions:
From the closest Genbank match to the query sequence what is the:
Genus name (e.g. Homo in Homo sapiens)
>
Species name (e.g. sapiens in Homo sapiens)
>
Genbank number (this is a unique number assigned to every DNA sequence on Genbank)
>
Name of gene
>
Is the origin of the DNA nuclear or mitochondrial?
>
2.2: Now click on the NCBI/Taxonomy database (located in the service tree) and search for the genus and species names you recorded above. This information yields the entire taxonomic lineage of the species (phylum, order, family etc.). If you click on the Lineage link it will take you to the NCBI website - on this website click on the species name corresponding to the "animal diversity web" (under the Linkout heading): the link will take you to some information about the species in question.
To check you have accessed this material answer:
What is the common name for this species?
>
Where was this species found?
>
What is the estimated body mass of this species?
>
2.3: Now click on the NCBI/Nucleotide database (located in the left hand service tree) and search for the genus and species names you recorded above.
How many Genbank entries are there for this species?
>
What is another gene present in Genbank for this species?
>
Select the cytochrome B sequence and then click on the text view tab on the lower document pane (this changes the view from the sequence view option). Next click on the Google Scholar link. Some of you may be familiar with Google Scholar - it is a Google search engine focusing on scholarly articles. In short, Google Scholar returns far less general web content than the normal Google search engine, making it a valuable tool in the search for academic studies. The Google Scholar search should have returned a couple of results. Click on the top link (it should refer you to /cgi/content/full/295/5560/1683), which is the original paper that describes this DNA sequence. The authors of this sequence deposited it on Genbank. Read the first paragraph of the paper; it will give you a little perspective on why the researchers conducted this research.
To check that you have read this paragraph answer:
What is the presumed closest relative of Raphus cucullatus:
>
What was the name of the "first" author on this paper?
Firstname:
>
Lastname:
>
2.4: Now click on the NCBI/PubMed database (located in the service tree) and search for the name of the author (first and last name) that you recorded above. PubMed is one of many online databases that record articles published in scientific journals. It is possible to download articles from this list into citation software packages (such as Endnote) so that you do not have to enter all the references by hand. Have a read through the list of titles for this author and list two other extinct species (common or scientific names) on which this author has published papers:
Species 1:
>
Species 2:
>
Exercise 3: Importing, aligning and building trees using Geneious
The aim of this exercise is to become familiar with importing DNA sequences, aligning them and then analysing the output.
The DNA sequences we will be using in this exercise originate from various species of Bears. Before you begin you should try and become familiar with the names and locations of the bear species that you will be analysing.
List of the Ursidae family:
GIANT PANDA (Ailuropoda melanoleuca)
MALAYAN SUN BEAR (Helarctos malayanus)
SLOTH BEAR (Melursus ursinus)
ASIATIC BLACK BEAR (Selenarctos thibetanus)
SPECTACLED BEAR (Tremarctos ornatus)
BLACK BEAR (Ursus americanus)
POLAR BEAR (Thalarctos maritimus)
BROWN BEAR (Ursus arctos)
3.1: Click Here to open the first bear species on the list (Ailuropoda melanoleuca). Copy the entire DNA sequence. This is most easily achieved by a "Select All" (Ctrl-A) followed by a Copy (Ctrl-C) in the sequence viewer. Now paste (Ctrl-V) the DNA sequence into the search box within the NCBI/blastn interface and click the search button. From the BLAST results, what mitochondrial gene does this DNA sequence originate from?
Mitochondrial gene:
>
3.2: Click Here to highlight all 8 sequences simultaneously in the tutorial folder. You will note that the sequences are not aligned despite the fact that all the sequences are from the same mitochondrial gene. Geneious can do this alignment for you: click the alignment button on the toolbar while the sequences are highlighted. Geneious will prompt you with a number of options for the alignment. These options fine-tune the alignment algorithm, e.g. the relative cost of a match vs. a mismatch, and whether a gap should be inserted at any given position. The alignment options we will use for this mitochondrial dataset are:
-Cost matrix: 93% similarity
-Alignment type: Global alignment with free end gaps
-Refinement iterations: 1
Once the alignment is complete, alter some of the options on the panel to the right of the alignment. These options make it possible to visualise the alignment in different ways and using different colours - for example, many find it easier to view the alignment using block colours for each of the nucleotides - this option is available under the color options (click on the icon to toggle on these options).
Using the statistics window answer the following:
What is the total length of the alignment: _______ base pairs
>
How many identical sites are there in the alignment: __________ (or ____ %)
>
What is the %GC content of the alignment?
>
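For readers who want to see how such statistics can be derived, the following Python sketch computes alignment length, identical sites and %GC from aligned sequences. The function name and the toy sequences are hypothetical, and the conventions used (a site is "identical" only if every sequence has the same base with no gap; %GC is taken over non-gap bases) are assumptions that may differ from what the Geneious statistics window reports.

```python
def alignment_stats(aligned):
    """Summarise equal-length aligned sequences ('-' marks a gap)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    rows = [seq.upper() for seq in aligned]
    # A column is identical when all sequences share one base and none has a gap.
    identical = sum(
        1 for col in zip(*rows) if len(set(col)) == 1 and col[0] != "-"
    )
    bases = "".join(rows).replace("-", "")
    gc = sum(bases.count(base) for base in "GC")
    return {
        "length": length,
        "identical_sites": identical,
        "percent_identical": 100.0 * identical / length,
        "percent_gc": 100.0 * gc / len(bases),
    }


# Toy alignment (hypothetical data, not the bear sequences).
toy = ["ATGCGT-A", "ATGCGTTA", "ATGAGTTA"]
stats = alignment_stats(toy)
print(stats)
```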
3.3: Phylogenetic reconstructions: As you scan the alignment you have constructed you should notice that the DNA sequences are similar, but not identical. If you look closer at some of the nucleotide differences you will probably be able to see that some bear sequences are more closely related than others. Rather than eyeball the sequences to guess evolutionary relationships it is possible to use the DNA changes to statistically infer the evolutionary relationships from DNA sequences. This modelling is known as phylogenetics. In this exercise we will build a phylogeny of the Ursidae family (bears). Select the alignment that you generated and click on the tree icon in the menu. Geneious will prompt you for a number of options. These options alter the way that the program models DNA sequences on a tree - in this example we are building a very simple tree. Under Genetic Distance Model select: HKY and UPGMA for the tree building method. Leave the other boxes unchecked then click OK.
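As a rough illustration of the UPGMA method named above, the Python sketch below repeatedly joins the two closest clusters and averages their distances to all other clusters. It takes the pairwise distances as given (Geneious derives them from the alignment using the chosen model, here HKY); the function name, input format and toy distances are hypothetical.

```python
from itertools import combinations


def upgma(distances):
    """Build a UPGMA tree from pairwise distances and return it as Newick text.

    distances maps a pair of leaf names (a, b) to their distance; every
    unordered pair of leaves must appear exactly once.
    """
    dist = {frozenset(pair): value for pair, value in distances.items()}
    # Each cluster is keyed by its Newick text and stores (size, node height).
    clusters = {name: (1, 0.0) for pair in distances for name in pair}
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest average distance.
        x, y = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: dist[frozenset(pair)],
        )
        (size_x, height_x), (size_y, height_y) = clusters.pop(x), clusters.pop(y)
        height = dist[frozenset((x, y))] / 2
        merged = f"({x}:{height - height_x:.3f},{y}:{height - height_y:.3f})"
        # Distance to every remaining cluster is the size-weighted average.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_x * dist[frozenset((x, other))]
                + size_y * dist[frozenset((y, other))]
            ) / (size_x + size_y)
        clusters[merged] = (size_x + size_y, height)
    return next(iter(clusters)) + ";"


# Hypothetical distances for three taxa A, B and C.
tree = upgma({("A", "B"): 0.02, ("A", "C"): 0.10, ("B", "C"): 0.10})
print(tree)
```

Because UPGMA places every tip at the same distance from the root, it implicitly assumes a constant rate of change along all lineages, which is one reason it is described here as a very simple tree.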
Once Geneious has finished constructing the tree select the graphical tree view tab. This is a phylogenetic reconstruction of the bear alignment you made. A few things you should know about phylogenetic trees:
1) The tips correspond to extant taxa
2) A tree summarises the relatedness of all taxa (i.e. a family tree)
3) The internal nodes, or internal vertices, correspond to ancestral (hypothetical) taxa
4) The branch lengths represent evolutionary distances. Longer branches mean less sequence similarity. For relatively short branches (say below 0.2) the distance is roughly the proportion of differing sites between the two sequences, so sequence similarity is roughly 1 minus the distance, i.e. a distance of 0.04 means about 96% sequence similarity.
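The relationship between distance and similarity in point 4 can be checked with a few lines of Python. This is a sketch of the simple uncorrected "p-distance" (proportion of differing sites); the function name and toy sequences are hypothetical, and skipping gapped columns is an assumption. Model-based distances such as HKY correct for multiple changes at one site, so they are slightly larger than this value.

```python
def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b) for a, b in zip(seq1.upper(), seq2.upper()) if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


# Toy example: 1 difference in 25 sites gives a distance of 0.04.
d = p_distance("A" * 25, "A" * 24 + "G")
print(f"distance = {d:.2f}, similarity = {100 * (1 - d):.0f}%")
```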
From the phylogenetic tree you have constructed answer the following questions:
Which two bear species (common names) in your phylogenetic tree are most closely related:
>
and
>
Which bear species (common name) is the most basal on the phylogenetic tree?
>
When "Show Branch Labels" is checked the length of each branch is shown above it. Use this information to compute the genetic distance of S. thibetanus and U. americanus back to their common ancestral node and estimate the sequence divergence.
Approximately what % sequence divergence?
>
Compare the number above with actual data. Select the two sequences in the viewer and look at the statistics panel.
What is the observed % sequence divergence?
>
If this mitochondrial gene mutated at 1% per million years (an approximate rate for mitochondrial DNA) then how many years ago did the common ancestor of S.thibetanus and U.americanus diverge?
>
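The molecular clock arithmetic in this question is a single division, sketched below in Python. The function name and the 5% example value are hypothetical, and the sketch assumes the quoted rate describes the divergence that accumulates between the two sequences; if the rate were instead meant per lineage, the resulting time would be halved.

```python
def divergence_time_years(percent_divergence, rate_percent_per_myr=1.0):
    """Estimate time since divergence from observed % sequence divergence.

    Assumes rate_percent_per_myr is the pairwise divergence rate
    (percent difference between two sequences per million years).
    """
    return percent_divergence / rate_percent_per_myr * 1_000_000


# Hypothetical example: 5% divergence at 1% per million years.
print(divergence_time_years(5.0))
```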
Given that S. thibetanus has an Asian/Russian distribution and U. americanus an American distribution, what geographical "feature" may have caused the speciation "event" in the common ancestor of these bears?
>
3.4: You have just constructed a tree of all the extant (living) bears. There are also two bears that have gone extinct in the past 20,000 years. There is a Genbank record for the mitochondrial cytochrome B gene of the extinct cave bear (Ursus spelaeus). This DNA sequence was isolated from a fossil bone - the retrieval of "old" degraded DNA is known as ancient DNA. It is technically challenging, as the DNA is degraded into small pieces (typically 100-200 base-pairs). The goal of this exercise is to find the cytochrome B DNA sequence for Ursus spelaeus and find its closest living relative by integrating it into your existing bear phylogenetic tree.
Go to the NCBI/Nucleotide search (in the service tree) and input: Ursus spelaeus cytochrome B and search. By doing this you are searching Genbank for a species and gene name - which is often the easiest way to locate DNA sequences. You can pick up the correct DNA entry from this search window and drop it in the tutorial folder. Check that the file now appears in this folder.
One of the columns to the right of the sequence summary window is labelled "name". At present the Genbank number AF264047.1 has been entered. Click on this name and change it to Ursus spelaeus. This will ensure that the species name appears in your tree.
Now select the alignment you generated earlier at the same time as the new Ursus spelaeus sequence you have just imported (hold down the Ctrl key and click to select multiple items). Then click the alignment button and perform the alignment (as done previously). Once the alignment is complete build a new tree (as before). View the new tree - what are the closest living relative(s) of the extinct cave bear?
>
3.5: An Open Reading Frame (ORF) is a region of the sequence that could potentially encode a protein. The DNA sequence of Ursus spelaeus that you just downloaded codes for a protein called cytochrome B that plays a role in mitochondrial function. Select the Ursus spelaeus record that you downloaded previously. Right click while your mouse is over the sequence and select the find ORF option. By conducting an ORF search you are asking Geneious to locate what could be protein coding regions within this DNA sequence. The find ORF option will prompt you for a number of parameters. Make the ORF size 99 and for the genetic code select: vertebrate mitochondrial. Also check the box to include interior ORFs. Click OK to run the ORF search. Geneious will add the ORFs as annotations on the sequence. By selecting the annotations you can toggle this information on and off.
How many ORFs are in this sequence?
>
Look at the nucleotide codon (3 bases) at the start of each ORF. What do they all have in common?
>
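To show what an ORF search does in principle, here is a simplified Python sketch that scans the three forward reading frames for a start codon followed in-frame by a stop codon. It is not Geneious's algorithm: it ignores the reverse strand and interior ORFs, the function name is hypothetical, the default start codon set is an assumption that can be changed, and counting the stop codon in the ORF length is also an assumption. The stop codons are those of the vertebrate mitochondrial genetic code (NCBI translation table 2).

```python
# Stop codons in the vertebrate mitochondrial code (NCBI translation table 2).
MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def find_orfs(dna, min_length=99, starts=frozenset({"ATG"}), stops=MITO_STOPS):
    """Return (start, end) positions of forward-strand ORFs, 0-based, end exclusive.

    Simplified sketch: forward strand only, outermost ORFs only, and the
    reported length includes the stop codon (all assumptions).
    """
    dna = dna.upper().replace(" ", "")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in stops:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (hypothetical): one short ORF in the third reading frame.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_length=9))
```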
Select the entire DNA sequence (you can do this by clicking on the DNA sequence within the sequence view window and selecting edit/select all from the Geneious menu). When the entire DNA sequence is highlighted click on the translate button (located just above the DNA sequence window). The genetic code for this translation is: vertebrate mitochondrial - select this option and click OK. A new file should have appeared that contains the cytochrome B protein sequence for Ursus spelaeus.
To check that you have completed the translation correctly answer:
What are the last 4 amino acids in the sequence?
>
What is the length of the cytochrome B protein (without stop codon)?
>
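The choice of genetic code matters because the vertebrate mitochondrial code differs from the standard code at four codons: TGA encodes tryptophan, ATA encodes methionine, and AGA and AGG are stop codons. The Python sketch below builds that codon table (NCBI translation table 2) and translates a DNA string; the function name and the toy sequence are hypothetical, and unknown codons are shown as "X" by assumption.

```python
# Vertebrate mitochondrial genetic code (NCBI translation table 2),
# listed in the conventional T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CCWW"  # TNN codons (TGA = W rather than stop)
    "LLLLPPPPHHQQRRRR"  # CNN codons
    "IIMMTTTTNNKKSS**"  # ANN codons (ATA = M; AGA and AGG = stop)
    "VVVVAAAADDEEGGGG"  # GNN codons
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, to_stop=False):
    """Translate DNA in frame 1; '*' marks a stop, 'X' an unrecognised codon."""
    dna = dna.upper().replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence (hypothetical) using codons that differ from the standard code.
print(translate("ATGATATGAAGA"))
```

With the standard code the same toy sequence would read M, I, stop, R, which is why translating a mitochondrial gene with the wrong table produces a truncated protein.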
In the same way that it is possible to search Genbank with a DNA sequence it is also possible to search with a protein. Copy and paste your newly translated protein sequence into the Blast/blastp window (located on the service tree) and click search - this may take a few minutes.
Other than Ursus spelaeus what bear species are the next 2 closest matches to this protein sequence?
>
and
>
Does this result agree or disagree with the phylogenetic tree that you constructed earlier?
>
Exercise 4: Designing PCR primer pairs.
There is one other extinct bear species for which no DNA sequences yet exist. In this exercise you will design a polymerase chain reaction (PCR) assay for the cytochrome B gene for Arctodus simus, the giant short-faced bear. If you are not confident about what PCR is then visit the following website for a quick refresher course: /wiki/PCR
Read the following information about this species:
Giant short-faced bear
Arctodus simus
The giant short-faced bear was the biggest bear ever to have lived. Standing 1.5 metres at the shoulder and equipped with powerful jaws, this bear would have been an intimidating sight.
Statistics: Height: 1.5 m at the shoulder and an impressive 3 m when standing on its hind legs. Weight: 600-800 kg.
Physical Description: The giant short-faced bear was a large bear, bigger than any living species of bear. Compared to modern brown bears it had much longer limbs and was generally more slender. It had a very short, broad muzzle which gives rise to its name, and which gave it a very powerful bite.
Former Distribution: North America.
Habitat: Giant short-faced bears inhabited the open areas of ice age North America from steppe tundra in the far north to grasslands further south.
Diet: The giant short-faced bear was a carnivore, probably a scavenger, but would have taken live prey at times.
Behaviour: Little is known of the behaviour of short-faced bears. Studies of the bone chemistry show that they were predominantly carnivorous, but whether they were predators or scavengers remains a contentious issue. Recent research favours the scavenger theory. With long legs, they were adapted to ranging far and wide in search of carrion, and their powerful bite enabled them to crack open bones to reach rich marrow.
Conservation status: Extinct from approximately 12,500 years ago (end of ice age).
History: The short-faced bear belongs to a group of bears known as the Tremarctine bears, which are of New World origin. The earliest member of the Tremarctinae is Plionarctos of the Pliocene age (about 5-2 million years ago) from Texas. It is likely the spectacled bear (Tremarctos) is the closest relative of the short-faced bear. Although the early history of Arctodus is poorly known, it evidently became widespread in North America about 800,000 years ago.
4.1: Formulating a plan for PCR design:
You have extracted some ancient DNA from an Arctodus bone (pictured) - 14C dating of the bone demonstrated that the bone is 18,500 years old and has intact biomolecules (DNA and protein). Based on the information about Arctodus and what you know about ancient DNA answer the following questions:
What DNA sequence would you use as the template for designing a PCR assay given that no sequence yet exists for the short faced bear?
Species:
>
Reason:
>
Given that the Arctodus DNA is likely degraded, what might be a reasonable size for the PCR amplicon (the DNA "unit" that amplifies)?
Between:
>
and
>
base pairs.
Identifying which extant bear species might be most suitable to use as a template for designing a PCR assay is very important. However, inferring relationships from bones is very difficult and paleontologists have been known to make errors regarding the relationships of extinct animals. There is little doubt that Arctodus was a bear, but its exact relationship is still unknown. Given this uncertainty, another factor that you need to consider is whether or not the PCR primers that you design will work on other bear species. Ideally any PCR primers you design should amplify all bear species, but this is often impossible to accomplish. Have a scan back through the DNA alignment that you generated earlier and consider where on this sequence might be a good place to situate PCR primers. When you have generated what you think is a good primer pair it would be advantageous to consider them with regard to other bear sequences. You should remember that it is the 3' end of the primer that is most crucial in the PCR reaction, as this is the end from which the DNA polymerase extends the new strand.
4.2: PCR Primer Design: The design of PCR primers is relatively simple from a computational point of view: just search along a sequence and find short sub-sequences that fit certain criteria. However, the molecular biology of PCR is very complex and the design of primers is best accomplished with the aid of computer programs that can help decide what is a "good" and a "bad" primer.
Some general guidelines are:
-primers should ideally be 18-27 base pairs long
-have at least 40% G/C content
-anneal at a temperature in the range of 50 to 65 degrees
-avoid primers containing repetitions of the same nucleotide.
-primers should be specific to the target (often difficult to ascertain)
Usually higher annealing temperatures (Tm) are better (i.e. more specific for your desired target).
In addition, the forward and reverse primer should anneal at approximately the same temperature (allowing perhaps 3 or 4 degrees of difference between them).
Next, you have to consider the formation of self-annealing regions within each primer (i.e. hairpin and foldback loops) as well as direct annealing between two primers to form "primer dimers". Hairpins and dimers will adversely affect your PCR. Geneious will discard PCR primer pairs that form such structures.
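Some of the guidelines above can be checked with a few lines of Python, as in the sketch below. It reports primer length, %GC, the longest run of a single nucleotide, and a rough melting temperature from the Wallace rule, Tm = 2(A+T) + 4(G+C) degrees C, which is only a crude approximation for short oligos; primer design software generally uses more accurate thermodynamic models. The function names, the example primer and the maximum run length of 4 are hypothetical choices, and hairpins and primer dimers are not checked here.

```python
from itertools import groupby


def primer_report(primer):
    """Return basic properties of a primer given as a 5'-to-3' DNA string."""
    seq = primer.upper().replace(" ", "")
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    longest_run = max(len(list(group)) for _, group in groupby(seq))
    return {
        "length": len(seq),
        "gc_percent": 100.0 * gc / len(seq),
        "tm_wallace": 2 * at + 4 * gc,  # rough estimate in degrees C
        "longest_run": longest_run,
    }


def meets_guidelines(report, max_run=4):
    """Check a report against the general guidelines (max_run is an assumption)."""
    return (
        18 <= report["length"] <= 27
        and report["gc_percent"] >= 40
        and 50 <= report["tm_wallace"] <= 65
        and report["longest_run"] <= max_run
    )


# Hypothetical primer, not taken from any bear sequence.
report = primer_report("ACGTACGTACGTACGTACGT")
print(report, meets_guidelines(report))
```

Running the same report on both the forward and reverse primer makes it easy to compare their estimated melting temperatures, which should differ by only a few degrees.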
Select the cytochrome B DNA sequence for the bear species you have chosen to use as a template for an Arctodus PCR. Right click on the sequence in the folder, and choose "Primers".
Have a look at all of the options in the primer design parameters (there are many). Most of these should be left as the default settings but one important one to alter is the product size which should be altered to reflect the fact that you are dealing with degraded DNA. Click the "OK" button when you are ready. Geneious will now update the sequence with primers displayed as annotations. Consider these primers in the context of the bear alignment you generated.
Once you have decided upon a good primer pair it is always good to check them out using an oligo analyser. These programs determine if the primers you have designed are prone to dimers and hairpins. Input your primers into the program - you should record some of the details of these outputs for your report.
A suitable program is IDT's oligo analyser (focus on hairpins, self dimers and hetero dimers):
/analyzer/Applications/OligoAnalyzer/
