Pancrustacean MegaTree
Introduction
Methods
Data Sets
EST
New 454
Full Genomes
Regier 62-gene protein coding
6-gene Data Set
- Seth compiled the 6-gene data set. On May 15 he emailed a file to THO and others under the subject "Final files: MTrDNA, MTproteom, six gene data set". A new file with minor changes followed on May 16. THO found a few duplications, including Nebalia histone and Nebalia 18S. Pollicipes_polymerus is also duplicated, with its 12S present in both the 6-gene data set and mtRNAgenomes. Seth then wrote: "There was a typo in the file. That is why I didn't delete the duplicate Pollicipes_polymerus 12S between the two files, Pollicipes_polymerus 16S will also be duplicated. So, you are going to have to cut that out of the six gene data set."
- After uploading to Galaxy, THO used find and replace to change "sp." to "sp", because the "." after sp can be problematic in some programs. To standardize species names, we should always leave off the period after sp.
- THO uploaded the 6-gene NT data set to Galaxy. It can be found in the kGalaxy history linked here: [http://knot.cnsi.ucsb.edu:8080/u/ostratodd/h/pancrustacea-megatree-data]
- That kGalaxy history has a MUSCLE alignment and a phylocatenator grid for viewing the presence/absence of each gene.
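The "sp." cleanup described above could also be scripted instead of done by hand with find and replace. The following is a minimal sketch in Python, assuming that genus and "sp." are separated by a space or an underscore; the function name strip_sp_period is hypothetical and is not part of the Galaxy history.

```python
import re

# Match "sp." only when it stands alone as a word (preceded by a space or
# underscore, followed by a space, underscore, or end of name), so a longer
# word that happens to end in "sp." is left alone.
_SP_PERIOD = re.compile(r"(?<=[ _])sp\.(?=$|[ _])")


def strip_sp_period(name: str) -> str:
    """Return the species name with a standalone 'sp.' replaced by 'sp'."""
    return _SP_PERIOD.sub("sp", name)
```

Names without a standalone "sp." pass through unchanged, so the function can be applied to every name in a file without checking first.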
Amino Acid Data
- The 6-gene dataset was first compiled as DNA. However, our other datasets are mainly analyzed as amino acids. As such, we need the translated data for CO1 and Histone. The other genes are rDNA.
- In an email dated June 29, Seth sent the translated H3 sequences in a file. THO uploaded it to the Galaxy history linked above.
- Seth is working on the CO1 AA data.
Mitochondrial Genomes
Amino Acids
- Accession numbers are available on the NCBI website by searching with the taxon ID for Pancrustacea, which is 197562. As of July 1, 2012, there were 373 accessions available. Note that some of these are subspecies. The list of accessions can be downloaded by clicking "Download" near the top right of the page.
- The accessions downloaded in May along with scripts to directly pull and parse the data, written by THO, sit on macroevolution in the following directory: /labdata/nfs/lab/scripts/ATOLmt/
- The accessions pulled at that time are in the file AccList.tx. There are 365 accessions in that list.
- In addition, alignments for each gene can be downloaded by clicking on links toward the top of the page. These are saved in individual FASTA files, named by gene as *.fa, in the directory above. THO then converted them to tabular format using the shell script 1_make_tables, which calls the perl script getSpeciesofGB.pl for each gene. That perl script pulls the species name of each accession from GenBank and writes a tabular file; these are concatenated into the file mtGenome.tab.
- For no apparent reason, a few amino acid sequences did not download, even though NT sequences and, in at least some cases, translations are available. These should perhaps be added manually. The list is here:
Sasakia_charonda ATP6
Sasakia_charonda ATP8
Sasakia_charonda COI
Sasakia_charonda COII
Sasakia_charonda COIII
Sasakia_charonda CYTB
Hipparchia_autonoe CYTB
Anopheles_funestus ATP6
Anopheles_funestus ATP8
Anopheles_funestus COIII
Anopheles_funestus ND2
Anopheles_funestus ND4
Anopheles_funestus ND4L
Agriosphodrus_dohrni COI
Agriosphodrus_dohrni COII
Agriosphodrus_dohrni COIII
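The FASTA-to-tabular conversion described above (1_make_tables calling getSpeciesofGB.pl) can be illustrated with a short Python sketch. This is not the original Perl script. It assumes that the first word of each FASTA header is the accession, that the output columns are species, gene, accession, and sequence, and that a hypothetical dictionary species_of stands in for the GenBank species lookup.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines: Iterable[str], gene: str,
                  species_of: Dict[str, str]) -> List[str]:
    """Build tab-delimited rows: species, gene, accession, sequence.

    species_of maps accession -> species name (an assumption; the original
    workflow looks the species up in GenBank instead).
    """
    rows = []
    for header, seq in read_fasta(lines):
        accession = header.split()[0]
        species = species_of.get(accession, "UNKNOWN")
        rows.append("\t".join([species, gene, accession, seq]))
    return rows
```

Calling fasta_to_rows once per gene file and writing all returned rows to one file mirrors the concatenation into a single table. Accessions missing from the lookup are labeled UNKNOWN so they are easy to spot.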
Nucleotide Data
- For the rDNA data, THO wrote scripts to parse data from GenBank files. These are in the subdirectory gbstrip of the directory listed above.
- The next step is to use BioPerl to download all the GenBank files directly from GenBank, using the list of genome accession numbers explained above. This is done using the script getGB.pl. The actual command is:
./getGB.pl AccList.tx > mtGenomes.gb
This pulls the accessions listed in AccList.tx from GenBank and writes the records to the file mtGenomes.gb.
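For readers without BioPerl, the same download step can be sketched in Python against the NCBI E-utilities efetch endpoint. This is an illustration, not a port of getGB.pl. It assumes AccList.tx holds one accession per line. The batch size of 100, the half-second pause, and the function names are arbitrary choices made for this sketch.

```python
import sys
import time
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(lines):
    """Return non-empty accessions with duplicates removed, keeping order."""
    seen, out = set(), []
    for line in lines:
        acc = line.strip()
        if acc and acc not in seen:
            seen.add(acc)
            out.append(acc)
    return out


def efetch_urls(accessions, batch_size=100):
    """Build one efetch URL per batch, requesting GenBank flat-file text."""
    urls = []
    for i in range(0, len(accessions), batch_size):
        query = urllib.parse.urlencode({
            "db": "nuccore",
            "id": ",".join(accessions[i:i + batch_size]),
            "rettype": "gb",
            "retmode": "text",
        })
        urls.append(f"{EFETCH}?{query}")
    return urls


def fetch_genbank(accessions, out=sys.stdout):
    """Download every batch and write the GenBank records to out."""
    for url in efetch_urls(accessions):
        with urllib.request.urlopen(url) as response:
            out.write(response.read().decode("utf-8"))
        # NCBI asks for at most 3 requests per second without an API key.
        time.sleep(0.5)
```

To reproduce the command above, open AccList.tx, pass its lines through read_accessions, and call fetch_genbank with an output file opened as mtGenomes.gb.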
- The next step is to use genbankstrip.pl, a script written by Olaf Bininda-Emonds. The script pulls genes from a GenBank file based on name. Many genes, especially mt genes, have many different synonyms. These can be added to the script so that all synonyms are pulled out into one file. All the synonyms for 12S and 16S collapse into mtrnr1 and mtrnr2, respectively.
- All the other genes collapse into a single file each as well. However, some synonyms might not be completely accounted for. Final tallies for each gene are listed in the file synonyms_list.dat.
- All the files can be rebuilt by simply calling the shell script named 1_pull_nucs. This calls genbankstrip.pl for each gene and calls a custom perl script to get additional data for phytab format, including species name and GenBank ID.
- THO uploaded the NT dataset for mt coding genes to the history above. THO subtracted datasets to find genes present as NT but not as AA. There are a few such genes, listed above.
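The two ideas in this list, collapsing gene synonyms to one name and subtracting the AA data set from the NT data set, can be sketched together in Python. The synonym table below is illustrative only: the names mtrnr1 and mtrnr2 come from the notes above, but the synonyms listed under them are assumptions and may not match what genbankstrip.pl actually contains. The function names are hypothetical.

```python
# Illustrative synonym table (assumption); the real lists live in
# genbankstrip.pl and are tallied in synonyms_list.dat.
SYNONYMS = {
    "mtrnr1": {"12s", "12s rrna", "12s ribosomal rna", "rrns", "s-rrna"},
    "mtrnr2": {"16s", "16s rrna", "16s ribosomal rna", "rrnl", "l-rrna"},
}

# Invert the table once: synonym -> canonical name.
_LOOKUP = {syn: canon for canon, syns in SYNONYMS.items() for syn in syns}


def canonical_gene(name):
    """Map a gene name to its canonical form, ignoring case and extra spaces.

    Names without a known synonym are returned lowercased, so that "COI"
    and "coi" still compare equal across data sets.
    """
    key = " ".join(name.lower().split())
    return _LOOKUP.get(key, key)


def present_only_in_nt(nt_rows, aa_rows):
    """Return sorted (species, gene) pairs found in nt_rows but not aa_rows.

    Each row is a (species, gene) pair; gene names are canonicalized first.
    """
    def pairs(rows):
        return {(species, canonical_gene(gene)) for species, gene in rows}

    return sorted(pairs(nt_rows) - pairs(aa_rows))
```

Canonicalizing before the set difference keeps a gene from being reported as missing only because the two data sets spell its name differently.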