Difference between revisions of "Program to find probes"
From Ucsbgalaxy
Line 1: | Line 1: | ||
− | project on | + | project on galaxy-dev in PIA user |
+ | opsinomics.pl | ||
Program flow: | Program flow: | ||
− | *1 | + | *1. Reads in sequence.fasta sequence by sequence |
− | + | **1b. creates a hash of all sequence ids to their corresponding sequences | |
− | ** | + | *2. Reads in tree.tre |
− | *3. | + | *3. Reroots the tree on the midpoint |
− | ** | + | *4. Gets all the nodes sorted by how many generations they are from the root and adds them to a nodes array |
+ | *5. Goes through each node in reverse order (starting farthest from the root) | ||
+ | **5b. Retrieves the node's sibling(s) | ||
+ | *** If one of the siblings is blank, writes the other's sequence to the output file | ||
+ | **5c. Uses blastn's bl2seq to compare the two sequences | ||
+ | **5d. Uses SimpleAlign to get the best consensus string | ||
+ | *** If the consensus is smaller than $MINLENGTH, writes it to the output file | ||
+ | **5e. Updates the sequence hash, writing the consensus string to the ancestor node | ||
Verbal description of program: | Verbal description of program: |
Revision as of 11:23, 28 February 2014
project on galaxy-dev in PIA user
opsinomics.pl Program flow:
- 1. Reads in sequence.fasta sequence by sequence
- 1b. creates a hash of all sequence ids to their corresponding sequences
- 2. Reads in tree.tre
- 3. Reroots the tree on the midpoint
- 4. Gets all the nodes sorted by how many generations they are from the root and adds them to a nodes array
- 5. Goes through each node in reverse order (starting farthest from the root)
- 5b. Retrieves the node's sibling(s)
- If one of the siblings is blank, writes the other's sequence to the output file
- 5c. Uses blastn's bl2seq to compare the two sequences
- 5d. Uses SimpleAlign to get the best consensus string
- If the consensus is smaller than $MINLENGTH, writes it to the output file
- 5e. Updates the sequence hash, writing the consensus string to the ancestor node
- 5b. Retrieves the node's sibling(s)
Verbal description of program:
- Stage 1.
- After finding 'all' opsins (>9000 genes), I aligned them with MAFFT and calculated a NJ tree using Clearcut.
- A new program uses a phylogeny and unaligned sequences as input.
- Using the tree, the most closely related genes are compared and aligned using blast2seq
- If the sequences we compare have a match, the matching sequence is propagated down to the ancestral node of the 2 sequences.
- More distant relatives are compared using blast2seqs, and any time there is sufficient similarity, the consensus match is propagated "down" the tree toward the root.
- As such, a sequence that matches all sequences in a clade is propagated down the tree, until the next distant relative no longer has a match.
- These results are currently written to a file called output.txt .
- Tests of output.txt indicate that all known opsin sequences will have blast similarity with some sequence(s) in the output.txt file
- The remaining challenge is that many of the sequences in output.txt are too long to be probes. So, we need to find the sub-sequence of each result that has the most coverage across genes. This will require the Stage 2 algorithm.
- Stage 2 (Not written yet)
- Use a 'sliding window' approach to test sub-sequences (putative-probe) of each full sequence in the output file.
- Use blast to find all full sequences the putative-probe hits, with particular similarity and length parameters.
- Find the PD (phylogenetic diversity=sum of branch lengths) of the full sequences that were hit
- Use 1 or 2 putative probes from each sequence that hit the maximum PD