Implement EvolMap -- Assigned to Roger
- Implement EvolMap [1] program in Galaxy.
- Write a script to parse the output. (This could also be done by modifying the EvolMap Java code. Note that the GUI version already offers this functionality through manual manipulation, so what we need could be programmed.)
The goal is to obtain genes that have 1 and only 1 representative in each species. For the dataset called "algae_genomes" this information is present in the file:
algae_genomes.ancestors_pass2.rn
That is a large file in which, essentially, each line is one gene family.
The file begins with a line that starts with
ANCESTOR
Following "ANCESTOR" is a list of species. The first ANCESTOR line lists all species in the analysis and refers to their common ancestor. Below that line come the gene families for this ancestor, sorted into different categories. After all the gene families for the first ANCESTOR, the remaining gene families are grouped by ancestral node, each node introduced by its own ANCESTOR line.
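The block structure described above could be read with something like the following. This is only a minimal sketch in Python (standing in for the script language, which has not been decided). It assumes that fields on every line are separated by whitespace and that the first word of each non-ANCESTOR line is its category. The name read_families is made up for illustration.

```python
def read_families(lines):
    """Group gene-family lines under the ANCESTOR line that precedes them.

    Returns a list of (ancestor_fields, families) pairs, where families is a
    list of (category, genes) tuples in file order.
    Assumption: fields are whitespace-separated.
    """
    blocks = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "ANCESTOR":
            # Start a new block for this ancestral node.
            blocks.append((fields[1:], []))
        elif blocks:
            # Any other line is a gene family of the current ancestor.
            blocks[-1][1].append((fields[0], fields[1:]))
    return blocks
```

The first element of the returned list corresponds to the common ancestor of all species, which is the block of interest here.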
Lines representing gene families below each ANCESTOR begin with the following words:
PRESENT
Indicates that a gene was inferred to be present in the common ancestor. The line then lists all the genes in this gene family.
DIVERGED
The gene is inferred not to be present in the ancestor, but is duplicated from the source gene in one of the descendant lineages.
SINGULAR
The gene is not present in this ancestor and is not gained in any of the descendant lineages [but gained in a later branch].
For now, we are only interested in the gene families with 1 and only 1 representative in each species of the analysis. This is because, in our next steps, we want to study genes that are only rarely duplicated and lost.
So, a simple Perl script could be written to search the output file line by line for lines that meet the following criteria:
- line must begin with PRESENT
- line must contain X gene names, where X = number of species in the original analysis. This number could be obtained from the first line of the file by counting the _ characters. (The gene names are tab-delimited, so a simple count of tabs or words could easily check for this.) In the algae genome analysis, there are 8 species.
- line must contain 1 and only 1 gene from each of the 8 species. (A line with 8 genes can still fail this test: one species may have 2 or more genes while other species have zero.) This step may require some fiddling because the gene names are derived directly from the input files. In some cases the gene names refer to the species they come from, but in other cases the genes are named simply with numbers. The Java EvolMap program retains information about which species a gene comes from, but this does not seem to be in the output file.
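The three criteria above could be checked roughly as follows. This is a hedged sketch in Python rather than Perl, not a finished implementation. Because the output file does not seem to record which species a gene belongs to, the sketch assumes a lookup table gene_to_species has been built beforehand, for example from the per-species input files. Both that table and the function name is_single_copy are hypothetical.

```python
from collections import Counter

def is_single_copy(line, gene_to_species, all_species):
    """Return True if the line is a PRESENT family with exactly one gene
    from every species in all_species.

    gene_to_species: dict mapping gene name -> species name (assumed to be
    built separately, since the output file lacks this information).
    """
    fields = line.split()
    # Criterion 1: the line must begin with PRESENT.
    if not fields or fields[0] != "PRESENT":
        return False
    genes = fields[1:]
    # Criterion 2: exactly X gene names, X = number of species.
    if len(genes) != len(all_species):
        return False
    # Criterion 3: one and only one gene per species.
    # Genes missing from the lookup map to None and make the family fail.
    counts = Counter(gene_to_species.get(g) for g in genes)
    return all(counts.get(sp, 0) == 1 for sp in all_species)
```

For the algae analysis, all_species would hold the 8 species names. A family with 8 genes but two from the same species is rejected by the third check.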
For each of the gene families that meet the criteria (1 and only 1 gene in each species), we want a file containing each of the X genes, and only those genes, in FASTA format. These sequences would be pulled from a different file: either the whole genome database or the original input file.
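Pulling the sequences for one family could look like the sketch below (again Python, and again only an illustration). It assumes the gene names in the EvolMap output match the first word of the FASTA header lines in the source file, which may not hold for every input. The names read_fasta and family_to_fasta are made up for this example.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict {sequence_id: sequence}.

    Assumption: the sequence id is the first word after '>'.
    """
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}

def family_to_fasta(genes, seqs, width=60):
    """Return FASTA text containing exactly the given genes, in order.

    Raises KeyError if a gene is not found in seqs, so that a name
    mismatch is noticed instead of silently producing a short file.
    """
    out = []
    for gene in genes:
        seq = seqs[gene]
        out.append(">" + gene)
        # Wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

The returned text would then be written to one file per gene family.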