Difference between revisions of "Program to find probes"

From Ucsbgalaxy
Line 1:
− project in /home/likewise-open/ADS/oakley/labdata/OpsinGenomes
+ project on oakley-dev in PIA user
  
 

Revision as of 11:51, 27 February 2014

Project location: on oakley-dev, under the PIA user.

Program flow:

  • 1. Creates a BLAST+ database from sequence.fasta
  • 2. Reads sequence.fasta one sequence at a time
    • 2b. Creates a hash that maps every sequence ID to false
  • 3. Reads through sequence.fasta a second time, running blastn on each sequence
    • 3b. For each hit, sets the hash entry keyed by the display_id to true
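The program flow above can be sketched in Python as follows. This is only an illustration, not the program on oakley-dev: the function names (`read_fasta`, `make_db`, `blastn_hits`, `mark_hits`) are hypothetical, and the search step is passed in as a function so the bookkeeping can be tried without BLAST+ installed. Two assumptions go beyond the steps listed: the hash key that is set is the ID of the sequence that was hit, and self-hits are skipped, since every sequence would otherwise match its own entry in the database.

```python
import subprocess
from typing import Callable, Dict, Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence); the id is the first word of the header."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def make_db(fasta_path: str, db_name: str) -> None:
    """Step 1: build a nucleotide BLAST+ database (needs makeblastdb on PATH)."""
    subprocess.run(
        ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name],
        check=True,
    )


def blastn_hits(seq_id: str, sequence: str, db_name: str) -> List[str]:
    """Step 3: run blastn on one sequence and return the ids of its hits."""
    result = subprocess.run(
        ["blastn", "-db", db_name, "-query", "-", "-outfmt", "6 sseqid"],
        input=f">{seq_id}\n{sequence}\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def mark_hits(
    fasta_path: str, search: Callable[[str, str], List[str]]
) -> Dict[str, bool]:
    """Steps 2 to 3b: start every id at False, then flag each id that is hit."""
    seen = {seq_id: False for seq_id, _ in read_fasta(fasta_path)}  # step 2b
    for seq_id, sequence in read_fasta(fasta_path):  # second pass, step 3
        for hit_id in search(seq_id, sequence):
            # Assumption: ignore self-hits and ids that are not in the file.
            if hit_id != seq_id and hit_id in seen:
                seen[hit_id] = True  # step 3b
    return seen
```

With BLAST+ available, a run might look like `make_db("sequence.fasta", "opsin_db")` followed by `mark_hits("sequence.fasta", lambda i, s: blastn_hits(i, s, "opsin_db"))`; the database name `opsin_db` is made up for the example.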

Verbal description of program:

  • Stage 1.
    • After finding 'all' opsins (>9000 genes), I aligned them with MAFFT and calculated a neighbor-joining (NJ) tree using Clearcut.
    • A new program takes a phylogeny and the unaligned sequences as input.
    • Using the tree, the most closely related genes are compared and aligned with blast2seq.
    • If the two sequences being compared match, the matching sequence is propagated down to their ancestral node.
    • More distant relatives are then compared with blast2seq, and whenever there is sufficient similarity, the consensus match is propagated "down" the tree toward the root.
    • As a result, a sequence that matches every sequence in a clade keeps being propagated down the tree until the next most distant relative no longer matches.
    • These results are currently written to a file called output.txt.
    • Tests of output.txt indicate that every known opsin sequence has BLAST similarity to at least one sequence in output.txt.
    • The remaining challenge is that many of the sequences in output.txt are too long to serve as probes, so we need to find the sub-sequence of each result that covers the most genes. This requires the Stage 2 algorithm.
  • Stage 2 (not written yet)
    • Use a 'sliding window' approach to test sub-sequences (putative probes) of each full sequence in the output file.
    • Use BLAST to find all full sequences that each putative probe hits, given particular similarity and length parameters.
    • Find the PD (phylogenetic diversity = sum of branch lengths) of the full sequences that were hit.
    • From each sequence, use the 1 or 2 putative probes whose hits have the maximum PD.
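Since Stage 2 is not written yet, the following Python sketch shows only one way its steps could fit together. Everything named here is hypothetical: the tree is assumed to be stored as a dictionary from each node to its parent and the length of the branch leading to that parent, PD is assumed to be measured on the rooted tree (each branch between a hit sequence and the root counted once), and `find_hits` stands in for the BLAST step with whatever similarity and length thresholds are eventually chosen.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Set, Tuple

# Hypothetical tree layout: node -> (parent, length of the branch to the parent).
# The root has parent None and branch length 0.0.
Tree = Dict[str, Tuple[Optional[str], float]]


def sliding_windows(sequence: str, width: int, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, sub-sequence) for every window of the given width."""
    for start in range(0, len(sequence) - width + 1, step):
        yield start, sequence[start:start + width]


def phylogenetic_diversity(tree: Tree, leaves: Iterable[str]) -> float:
    """Sum the branch lengths on the paths from the given leaves to the root,
    counting each branch once even when several leaves share it."""
    counted: Set[str] = set()
    total = 0.0
    for leaf in leaves:
        node: Optional[str] = leaf
        while node is not None and node not in counted:
            parent, length = tree[node]
            counted.add(node)
            total += length
            node = parent
    return total


def best_probes(
    sequence: str,
    width: int,
    tree: Tree,
    find_hits: Callable[[str], Iterable[str]],
    keep: int = 2,
    step: int = 1,
) -> List[Tuple[float, int, str]]:
    """Score every window by the PD of the sequences it hits and return the
    top `keep` windows as (pd, start, window), highest PD first."""
    scored = []
    for start, window in sliding_windows(sequence, width, step):
        pd = phylogenetic_diversity(tree, find_hits(window))
        scored.append((pd, start, window))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:keep]
```

Passing the hit finder in as a function keeps the window scoring separate from the choice of BLAST parameters, so the ranking logic can be checked on a toy tree before any real searches are run. Overlapping windows with nearly identical hits will tend to rank together, so a real version would probably need a rule for spacing the chosen probes apart.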