Introduction

Methods

Seth

Accession numbers are available on the NCBI website by searching with the taxon ID for Pancrustacea, 197562 [[1]]. As of July 1, 2012, 373 accessions were available. Note that some of these are subspecies. To download the list of accessions, click "Download" near the top right of the page.
The accessions downloaded in May, along with scripts written by THO to pull and parse the data directly, are stored on macroevolution in the following directory: /labdata/nfs/lab/scripts/ATOLmt/
The accessions pulled at that time are in the file AccList.tx. There are 365 accessions in that list.
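As a quick sanity check on the count above, the accession list can be read and de-duplicated with a few lines of Python. This is only a sketch: it assumes the list holds one accession per line, and the function name read_accessions and the handling of blank or "#" lines are illustrative choices, not part of THO's scripts.

```python
from pathlib import Path


def read_accessions(path):
    """Return the unique accessions in a one-per-line list, in file order."""
    seen = set()
    accessions = []
    for line in Path(path).read_text().splitlines():
        acc = line.strip()
        # Assumption: blank lines and '#' comment lines carry no accession.
        if not acc or acc.startswith("#"):
            continue
        if acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions
```

Calling len(read_accessions("AccList.tx")) from the directory above should then give the number of distinct accessions, which can be compared with the 365 reported here.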
I think Seth and Heather pulled down the proteins manually and aligned them, perhaps before visiting UCSB. In any event, the genes are in individual FASTA files, named by gene as *.fa, in the directory above. THO then converted these to tabular format using the shell script 1_make_tables, which calls the Perl script getSpeciesofGB.pl for each gene. That Perl script pulls the species name of each accession from GenBank and writes a tabular file; the tabular files are concatenated into the file

   mtGenome.tab

That is the proteome data. For the rDNA data, THO wrote scripts to parse data from GenBank files. These are in the subdirectory gbstrip of the directory listed above.
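To illustrate the FASTA-to-table conversion described above, here is a minimal Python sketch. It is not a port of 1_make_tables or getSpeciesofGB.pl: the column order (species, accession, gene, sequence) is an assumption, since the layout of mtGenome.tab is not documented here, and the species lookup is replaced by a plain dictionary (species_by_accession, hypothetical) instead of a GenBank query. It also assumes each FASTA header begins with the accession.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(text, gene, species_by_accession):
    """Turn one gene's FASTA text into tab-separated rows.

    Assumed columns: species, accession, gene, sequence.
    """
    rows = []
    for header, seq in parse_fasta(text):
        parts = header.split()
        accession = parts[0] if parts else ""  # assumes accession comes first
        species = species_by_accession.get(accession, "unknown")
        rows.append("\t".join([species, accession, gene, seq]))
    return rows
```

Running fasta_to_rows on each *.fa file and writing all the rows to one output file would mimic the concatenation step, under the assumptions stated above.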
The next step is to use BioPerl to download all the GenBank files directly from GenBank. This is done using the script getGB.pl. The actual command is:

   ./getGB.pl AccList.tx > mtGenomes.gb

This pulls the accessions listed in AccList.tx from GenBank and writes the data to the file mtGenomes.gb.
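For readers without BioPerl, the same download step could be approximated with the NCBI E-utilities efetch endpoint. The sketch below is an alternative under stated assumptions, not a description of what getGB.pl does internally: the function names, the batch size of 50, and the half-second pause are illustrative choices. It requests GenBank flat files from the nuccore database and appends them to one output file.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL that returns GenBank flat files as plain text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_genbank(accessions, out_path, batch_size=50, pause=0.5):
    """Fetch records in batches and write them to a single file."""
    with open(out_path, "w") as out:
        for batch in batches(accessions, batch_size):
            with urlopen(efetch_url(batch)) as response:
                out.write(response.read().decode("utf-8"))
            # Pause between requests to stay within NCBI's rate limits
            # (three requests per second without an API key).
            time.sleep(pause)
```

With a list of accessions in hand, download_genbank(accessions, "mtGenomes.gb") would produce a file comparable to the one written by the command above, although record order and formatting details may differ from the BioPerl output.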