Implement EvolMap -- Assigned to Roger
- Implement EvolMap [1] program in Galaxy.
- Write a script to parse the output (this could also be done by modifying the EvolMap Java code)
- The goal is to obtain genes that have 1 and only 1 representative in each species. For the dataset called "algae_genomes" this information is present in the file:
algae_genomes.ancestors_pass2.rn
That is a large file, which contains all gene families, line by line. Essentially, each line of the file is a gene family.
The file begins with a line that starts with
ANCESTOR
Following "ANCESTOR" is a list of species. The first ANCESTOR line lists all species in the analysis and refers to their common ancestor. Following each ANCESTOR line are different categories of gene families. After all the gene families for the first ANCESTOR, the remaining gene families are grouped by ancestral node, each introduced by its own ANCESTOR line.
Lines representing gene families below each ANCESTOR begin with the following words:
PRESENT
Indicates a gene was inferred present in the common ancestor. This line then contains a list of all the genes in this gene family that are present in
DIVERGED
Indicates a gene is inferred not present in the ancestor, but is duplicated in one of the descendant lineages from the source gene.
SINGULAR
Indicates a gene is not present in this ancestor and is not gained in any of the descendant lineages [but gained in a later branch]
For now, we are only interested in gene families that are 1) present in the common ancestor and 2) have 1 and only 1 gene present in each descendant. We then want a set of files. Each file should contain a gene family, and the gene sequence of one gene of that family from each
This could be checked
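The filtering described above could be sketched in Python roughly as follows. This is only a starting point under stated assumptions, not the EvolMap output specification: it assumes tokens on each line are whitespace-separated, and that the species can be read from a gene id of the hypothetical form "<species>_<gene>". The function names and the sample lines are made up for illustration and should be checked against a real algae_genomes.ancestors_pass2.rn file.

```python
from collections import Counter

FAMILY_KEYWORDS = ("PRESENT", "DIVERGED", "SINGULAR")


def parse_ancestors(lines):
    """Group gene-family lines under the ANCESTOR line that precedes them."""
    nodes = []
    for raw in lines:
        tokens = raw.split()  # ASSUMPTION: whitespace-separated fields
        if not tokens:
            continue
        keyword, rest = tokens[0], tokens[1:]
        if keyword == "ANCESTOR":
            nodes.append({"species": rest, "families": []})
        elif nodes and keyword in FAMILY_KEYWORDS:
            nodes[-1]["families"].append((keyword, rest))
    return nodes


def species_of(gene_id):
    # ASSUMPTION: gene ids look like "<species>_<gene>"; adapt to the real naming.
    return gene_id.split("_", 1)[0]


def single_copy_families(node):
    """Return PRESENT families with exactly one gene for every species of the node."""
    wanted = set(node["species"])
    selected = []
    for keyword, genes in node["families"]:
        if keyword != "PRESENT":
            continue
        counts = Counter(species_of(g) for g in genes)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(genes)
    return selected


# Invented sample lines, only to show the intended flow.
sample = [
    "ANCESTOR spA spB spC",
    "PRESENT spA_g1 spB_g7 spC_g3",
    "PRESENT spA_g2 spA_g9 spB_g4 spC_g5",
    "DIVERGED spA_g8 spB_g2",
    "ANCESTOR spA spB",
    "PRESENT spA_g5 spB_g6",
]
nodes = parse_ancestors(sample)
for family in single_copy_families(nodes[0]):
    print(" ".join(family))
```

The first node returned by parse_ancestors corresponds to the first ANCESTOR line, i.e. the common ancestor of all species. Writing one sequence file per selected family would be a separate step that looks up each gene id in the input sequence data.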