Difference between revisions of "Pancrustacean MegaTree"

Revision as of 00:32, 3 July 2012

Introduction

Methods

Data Sets

6-gene Data Set

  • Seth compiled the 6-gene data set. He sent a file to THO and others on May 15 in an email entitled "Final files: MTrDNA, MTproteom, six gene data set". There were minor changes, and a new file was sent on May 16. THO found a few duplications, including Nebalia histone and Nebalia 18S. Pollicipes_polymerus is also duplicated, with 12S present in both the 6-gene data set and mtRNAgenomes. Seth then wrote: "There was a typo in the file. That is why I didn't delete the duplicate Pollicipes_polymerus 12S between the two files, Pollicipes_polymerus 16S will also be duplicated. So, you are going to have to cut that out of the six gene data set."

  • After uploading to Galaxy, THO used find and replace to change "sp." to "sp". The "." after sp can be problematic in some programs. To standardize species names, we should always leave off the period after sp.
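
The two clean-up steps above (dropping the period after "sp" and spotting entries that appear twice) can be sketched in Python. This is only an illustration, not the find-and-replace that THO ran in Galaxy: the function names, the (species, gene) row layout, and the sample row "Nebalia sp." are assumptions made for the example.

```python
import re
from collections import Counter

# Matches "sp." only when it is not the tail of a longer word,
# e.g. "Nebalia sp." or "Nebalia_sp._1", but not "subsp.".
_SP_PERIOD = re.compile(r"(?<![A-Za-z])sp\.")


def normalize_sp(name):
    """Drop the period after the abbreviation 'sp' in a species name."""
    return _SP_PERIOD.sub("sp", name)


def find_duplicates(rows):
    """Return sorted (species, gene) pairs that occur more than once.

    rows is any iterable of (species, gene) tuples (an assumed layout).
    Pool the rows from both data sets to catch entries duplicated
    between them.
    """
    counts = Counter((normalize_sp(species), gene) for species, gene in rows)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Illustrative rows only; not the contents of the real files.
    six_gene = [("Pollicipes_polymerus", "12S"), ("Nebalia sp.", "18S")]
    mt_rna = [("Pollicipes_polymerus", "12S"), ("Pollicipes_polymerus", "16S")]
    print(find_duplicates(six_gene + mt_rna))
```

Names are normalized before counting, so "Nebalia sp." and "Nebalia sp" are treated as the same taxon when looking for duplicates.
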

Mitochondrial Genomes

  1. Accession numbers are available on the NCBI website by searching with the taxon ID for Pancrustacea, which is 197562 [1]. As of July 1, 2012, there were 373 accessions available. Note that some of these are subspecies. The list of accessions can be downloaded by clicking "Download" near the top right of the page.
  2. The accessions downloaded in May along with scripts to directly pull and parse the data, written by THO, sit on macroevolution in the following directory: /labdata/nfs/lab/scripts/ATOLmt/
  3. The accessions pulled at that time are in the file AccList.tx, which lists 365 accessions.
  4. In addition, alignments for each gene can be downloaded by clicking on links toward the top of the page. These are saved in individual FASTA files named by gene as *.fa in the directory above. THO then converted them to tabular format using the shell script 1_make_tables, which calls the Perl script getSpeciesofGB.pl for each gene. That Perl script pulls the species name of each accession from GenBank and writes a tabular file; these tables are concatenated into the file
   mtGenome.tab
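
The FASTA-to-tabular conversion described above can be sketched roughly as follows. This is not the code of 1_make_tables or getSpeciesofGB.pl. The column order (species, gene, accession, sequence), the assumption that the first token of each FASTA header is the accession, and the lookup table species_by_accession (a stand-in for the species names pulled from GenBank) are all hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines, gene, species_by_accession):
    """Build tab-delimited rows: species, gene, accession, sequence.

    Assumes the first whitespace-delimited token of each header is the
    accession. species_by_accession is a hypothetical dict standing in
    for the GenBank species lookup; missing entries become "unknown".
    """
    rows = []
    for header, seq in read_fasta(lines):
        parts = header.split()
        accession = parts[0] if parts else ""
        species = species_by_accession.get(accession, "unknown")
        rows.append("\t".join([species, gene, accession, seq]))
    return rows
```

Calling fasta_to_rows once per *.fa file and writing all returned rows to a single file would correspond to the concatenation step.
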
  1. That is the proteome data. For the rDNA data, THO wrote scripts to parse data from GenBank files. These are in the subdirectory gbstrip of the directory listed above.
  2. The next step is to use BioPerl to download all the GenBank files directly from GenBank, using the list of genome accession numbers explained above. This is done using the script getGB.pl. The actual command is:
   ./getGB.pl AccList.tx > mtGenomes.gb

This pulls the accessions listed in AccList.tx from GenBank and writes the data to the file mtGenomes.gb.
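
For readers without BioPerl, the same download can be sketched in Python against the NCBI E-utilities efetch endpoint. This is a swapped-in technique, not what getGB.pl does internally. It assumes that the accession list holds one accession per line. The function names and the batch size of 100 are illustrative choices.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(lines):
    """Return non-empty, de-duplicated accessions in their original order."""
    seen, out = set(), []
    for line in lines:
        acc = line.strip()
        if acc and acc not in seen:
            seen.add(acc)
            out.append(acc)
    return out


def efetch_urls(accessions, batch_size=100):
    """Build one efetch URL per batch, requesting GenBank flat-file text."""
    urls = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        query = urlencode({"db": "nuccore", "id": ",".join(batch),
                           "rettype": "gb", "retmode": "text"})
        urls.append(f"{EFETCH}?{query}")
    return urls


def fetch_genbank(accessions, out_path):
    """Download each batch in turn and write all records to out_path."""
    with open(out_path, "w") as out:
        for url in efetch_urls(accessions):
            with urlopen(url) as response:
                out.write(response.read().decode("utf-8"))
```

Reading AccList.tx with read_accessions and passing the result to fetch_genbank with the output path mtGenomes.gb would mirror the command above. Only fetch_genbank touches the network.
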

  1. The next step is to use genbankstrip.pl, a script written by Olaf Bininda-Emonds. The script pulls genes from a GenBank file based on name. Many genes, especially mt genes, have many different synonyms. These can be added to the script so that all synonyms are pulled out into one file. All the synonyms for 12S and 16S collapse into mtrnr1 and mtrnr2.
  2. All the other genes collapse into a single file as well. However, some synonyms might not be completely accounted for. Final tallies for each gene are listed in the file synonyms_list.dat.
  3. All the files can be rebuilt by simply calling the shell script named 1_pull_nucs. This calls genbankstrip.pl for each gene and calls a custom Perl script to get additional data for phytab format, including species name and GenBank ID.
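
The synonym collapsing described above can be illustrated with a small Python sketch. It is not genbankstrip.pl. The synonym table below is a short illustrative guess at common labels for 12S and 16S, and the real lists inside the script are not reproduced here. Labels that match nothing are returned separately, which is one way to notice synonyms that have not been accounted for.

```python
from collections import Counter

# Illustrative synonym table (an assumption); the real lists live in
# genbankstrip.pl and may differ.
SYNONYMS = {
    "mtrnr1": {"12s", "12s rrna", "12s ribosomal rna", "rrns", "s-rrna"},
    "mtrnr2": {"16s", "16s rrna", "16s ribosomal rna", "rrnl", "l-rrna"},
}

# Reverse lookup: normalized synonym -> collapsed gene name.
CANONICAL = {syn: name for name, syns in SYNONYMS.items() for syn in syns}


def canonical_gene(label):
    """Map a gene label to its collapsed name, or None if unknown."""
    key = " ".join(label.lower().split())  # lowercase, squeeze whitespace
    return CANONICAL.get(key)


def tally_genes(labels):
    """Return (tally per collapsed name, counts of unmatched labels)."""
    tally, unmatched = Counter(), Counter()
    for label in labels:
        name = canonical_gene(label)
        if name is None:
            unmatched[label] += 1
        else:
            tally[name] += 1
    return tally, unmatched
```

Reviewing the unmatched counter after each run and adding any genuine synonyms to the table keeps the per-gene tallies honest.
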

Results

Discussion