Introduction

Methods

Data Sets

EST

New 454

Full Genomes

Regier 62-gene protein coding

6-gene Data Set

Seth compiled the 6-gene data set. He sent a file to THO and others on May 15 in an email entitled "Final files: MTrDNA, MTproteom, six gene data set". After minor changes, a new file was sent on May 16. THO found a few duplications, including Nebalia histone and Nebalia 18S. Pollicipes_polymerus is also duplicated, with 12S present in both the 6-gene file and mtRNAgenomes. Seth then wrote: "There was a typo in the file. That is why I didn't delete the duplicate Pollicipes_polymerus 12S between the two files, Pollicipes_polymerus 16S will also be duplicated. So, you are going to have to cut that out of the six gene data set."
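Duplications of this kind can be checked mechanically. The following is a minimal sketch, assuming each file has already been reduced to (species, gene) pairs; the function name and the sample rows are hypothetical and are not taken from the actual files.

```python
from collections import Counter


def find_duplicates(rows):
    """Return the (species, gene) pairs that occur more than once in rows."""
    counts = Counter((species, gene) for species, gene in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical (species, gene) pairs standing in for the two files.
six_gene = [("Pollicipes_polymerus", "12S"), ("Species_a", "18S")]
mt_rdna = [("Pollicipes_polymerus", "12S"), ("Pollicipes_polymerus", "16S")]

# Pairs present more than once across the combined data sets.
duplicates = find_duplicates(six_gene + mt_rdna)
print(duplicates)
```

Any pair reported here would need to be cut from one of the two files before the data sets are combined.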

THO uploaded the 6-gene NT data set to Galaxy, and it can be found in the kGalaxy history linked here [http://knot.cnsi.ucsb.edu:8080/u/ostratodd/h/pancrustacea-megatree-data]
After uploading, THO used find and replace to change "sp." to "sp", because the "." after sp can be problematic in some programs. To standardize species names, we should always leave off the period after sp.
That kGalaxy history also has a MUSCLE alignment and a phylocatenator grid to view presence/absence of each gene.
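The "sp." to "sp" standardization was done with find and replace in Galaxy. Outside Galaxy, the same rule could be sketched as below; the function name and the example names are hypothetical, and the pattern assumes "sp." only needs changing when it is not the tail of a longer word.

```python
import re

# "sp." that is not preceded by a letter, so a longer word that merely
# ends in "sp." is left untouched.
SP_PERIOD = re.compile(r"(?<![A-Za-z])sp\.")


def strip_sp_period(name):
    """Replace the placeholder 'sp.' with 'sp' in a species name."""
    return SP_PERIOD.sub("sp", name)


# Hypothetical names, with both space and underscore separators.
for name in ["Genus sp.", "Genus_sp._A", "Genus_species"]:
    print(name, "->", strip_sp_period(name))
```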

Amino Acid Data

The 6-gene dataset was first compiled as DNA. However, our other datasets are mainly analyzed as amino acids. As such, we need the translated data for CO1 and Histone. The other genes are rDNA.
- In an email dated June 29, Seth sent a file with the translated H3 sequences. THO uploaded it to the Galaxy history linked above.
- Seth is working on the CO1 AA data.

Mitochondrial Genomes

Amino Acids

Accession numbers are available on the NCBI website by searching with the taxon ID for Pancrustacea, which is 197562 [[1]]. As of July 1, 2012, there were 373 accessions available. Note that some of these are subspecies. The list of accessions can be downloaded by clicking "Download" near the top right of the page.
The accessions downloaded in May, along with scripts written by THO to pull and parse the data directly, are on macroevolution in the following directory: /labdata/nfs/lab/scripts/ATOLmt/
The accessions pulled at that time are in the file AccList.tx. There are 365 accessions in that list.
In addition, alignments for each gene can be downloaded by clicking on links toward the top of the page. These are saved in individual FASTA files, named by gene as *.fa, in the directory above. THO then converted them to tabular format using the shell script 1_make_tables, which calls the Perl script getSpeciesofGB.pl for each gene. That Perl script pulls the species name of each accession from GenBank and writes a tabular file; the per-gene files are concatenated into the file mtGenome.tab.
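The FASTA-to-tabular conversion can be illustrated as follows. This is only a sketch and not the code in 1_make_tables or getSpeciesofGB.pl: the column order (species, gene, accession, sequence), the function names, and the accession-to-species lookup dictionary are all assumptions, and the lookup stands in for the GenBank query the real script performs.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines, gene, species_of):
    """Build tab-separated rows: species, gene, accession, sequence."""
    rows = []
    for header, seq in read_fasta(lines):
        accession = header.split()[0]  # first token of the header line
        species = species_of.get(accession, "unknown")
        rows.append("\t".join([species, gene, accession, seq]))
    return rows


# Hypothetical input: placeholder accessions and sequences.
fasta = [">ACC0001 example record", "MTNIRK", "SHPL", ">ACC0002", "MAHQ"]
species_of = {"ACC0001": "Species_one", "ACC0002": "Species_two"}

rows = fasta_to_rows(fasta, "COI", species_of)
print("\n".join(rows))
```

Concatenating the rows produced for every gene would give a single table comparable in layout to mtGenome.tab, under the column assumption stated above.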
For no apparent reason, a few amino acid sequences did not download, even though the NT sequences and, in at least some cases, the translations are available. These should perhaps be added manually. The list is here:

   Sasakia_charonda	ATP6
   Sasakia_charonda	ATP8
   Sasakia_charonda	COI
   Sasakia_charonda	COII
   Sasakia_charonda	COIII
   Sasakia_charonda	CYTB
   Hipparchia_autonoe	CYTB
   Anopheles_funestus	ATP6
   Anopheles_funestus	ATP8
   Anopheles_funestus	COIII
   Anopheles_funestus	ND2
   Anopheles_funestus	ND4
   Anopheles_funestus	ND4L
   Agriosphodrus_dohrni	COI
   Agriosphodrus_dohrni	COII
   Agriosphodrus_dohrni	COIII

Nucleotide Data

For the rDNA data, THO wrote scripts to parse data from GenBank files. These are in the subdirectory gbstrip of the directory listed above.
The next step is to use BioPerl to download all the GenBank files directly from GenBank, using the list of genome accession numbers explained above. This is done using the script getGB.pl. The actual command is:

   ./getGB.pl AccList.tx > mtGenomes.gb

This pulls the accessions listed in AccList.tx from GenBank and writes the data to the file mtGenomes.gb.
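getGB.pl uses BioPerl and is not reproduced here. As a rough stand-in, the same download could be sketched against the NCBI E-utilities EFetch endpoint using only the Python standard library. The function names, the batch size, and the placeholder accessions are assumptions; only the URL builder is exercised below, so running the block makes no network request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(lines):
    """Return the non-empty, stripped lines of an accession list."""
    return [line.strip() for line in lines if line.strip()]


def efetch_url(accessions):
    """Build an EFetch URL requesting GenBank flat files as plain text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def fetch_genbank(accessions, batch_size=50):
    """Download records in batches and return them as one string.

    Not called in this sketch; batch_size is an arbitrary assumption.
    """
    records = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        with urlopen(efetch_url(batch)) as response:
            records.append(response.read().decode("utf-8"))
    return "".join(records)


# Placeholder accessions, not real GenBank identifiers.
accessions = read_accessions(["ACC0001\n", "\n", "ACC0002\n"])
print(efetch_url(accessions))
```

Writing the string returned by fetch_genbank to a file would give the counterpart of mtGenomes.gb.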

The next step is to use genbankstrip.pl, a script written by Olaf Bininda-Emonds, which pulls genes from a GenBank file based on name. Many genes, especially mt genes, have many different synonyms. These can be added to the script so that all synonyms are pulled out into one file. All the synonyms for 12S and 16S collapse into mtrnr1 and mtrnr2.
Each of the other genes collapses into a single file as well. However, some synonyms might not be completely accounted for. Final tallies for each gene are listed in the file synonyms_list.dat.
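The idea of collapsing synonyms can be sketched as a lookup table. This is not the code or the synonym list inside genbankstrip.pl; the synonyms shown are illustrative examples of names commonly used for the two mt rRNA genes, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative synonym lists; the real script's lists may differ.
SYNONYMS = {
    "mtrnr1": ["12S", "12S rRNA", "rrnS", "s-rRNA"],
    "mtrnr2": ["16S", "16S rRNA", "rrnL", "l-rRNA"],
}


def normalize(name):
    """Lower-case and trim a gene name so matching ignores case."""
    return name.strip().lower()


LOOKUP = {
    normalize(syn): canonical
    for canonical, syns in SYNONYMS.items()
    for syn in syns
}


def canonical_gene(name):
    """Return the canonical gene name, or None if the synonym is unknown."""
    return LOOKUP.get(normalize(name))


def tally(names):
    """Count names per canonical gene and collect unmatched names."""
    counts, unmatched = Counter(), []
    for name in names:
        gene = canonical_gene(name)
        if gene is None:
            unmatched.append(name)
        else:
            counts[gene] += 1
    return counts, unmatched


counts, unmatched = tally(["12S rRNA", "rrnS", "16S", "small subunit RNA"])
print(dict(counts), unmatched)
```

Listing the unmatched names is one way to spot synonyms that are not yet accounted for.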
All the files can be rebuilt by simply calling the shell script 1_pull_nucs. This calls genbankstrip.pl for each gene and then calls a custom Perl script to get the additional data needed for phytab format, including species name and GenBank ID.
THO uploaded the NT dataset for mt coding genes to the history above. THO subtracted datasets to find genes present as NT but not as AA. There are a few such genes, listed in the Amino Acids section above.
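The subtraction was done in Galaxy. The underlying operation is a set difference on (species, gene) pairs, sketched here under the assumption that both data sets are tab-separated with species in the first column and gene in the second; the function names and rows are hypothetical.

```python
def species_gene_pairs(rows):
    """Extract the (species, gene) pair from each tab-separated row."""
    pairs = set()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        pairs.add((fields[0], fields[1]))
    return pairs


def present_in_nt_only(nt_rows, aa_rows):
    """Return pairs found in the NT data set but missing from the AA one."""
    return sorted(species_gene_pairs(nt_rows) - species_gene_pairs(aa_rows))


# Hypothetical rows with placeholder IDs and sequences.
nt_rows = [
    "Species_a\tCOI\tid1\tACGT",
    "Species_a\tCYTB\tid2\tACGT",
    "Species_b\tCOI\tid3\tACGT",
]
aa_rows = [
    "Species_a\tCOI\tid1\tMT",
    "Species_b\tCOI\tid3\tMT",
]

missing = present_in_nt_only(nt_rows, aa_rows)
print(missing)
```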

Pancrustacean MegaTree

Results

Discussion
