Program to find probes

From Ucsbgalaxy
Jump to: navigation, search

project on galaxy-dev in PIA user ~/opsonomics directory


opsinomics.pl

Usage: ./opsinomics.pl sequence_file_fasta tree_file_newick output_file_fasta [min_seq_length]

Program flow:

  • 1. Reads in sequence.fasta sequence by sequence
    • 1b. creates a hash of all sequence ids to their corresponding sequences
  • 2. Reads in tree.tre
  • 3. Reroots the tree on the midpoint
  • 4. Gets all the nodes sorted by how many generations they are from the root and adds them to a nodes array
  • 5. Goes through each node in reverse order (starting farthest from the root)
    • 5b. Retrieves the node's sibling(s)
      • If one of the siblings is blank, writes the other's sequence to the output file
    • 5c. Uses blastn's bl2seq to compare the two sequences
    • 5d. Uses SimpleAlign to get the best consensus string
      • If the consensus is smaller than $MINLENGTH, writes it to the output file
    • 5e. Updates the sequence hash, writing the consensus string to the ancestor node

opsintest.pl

Checks the output of opsinomics.pl against the sequence file and calculates the similarity of each sequence

  • 1. Reads the output fasta and creates a BLAST+ database from it
  • 2. Reads through the sequence file and creates a hash to store the results (sets similarity to 0 for each)
  • 3. Reads through the sequence file again
    • 3b. Uses blastn to find hits in the output.fasta blast database for the sequence
      • Note: word_size parameter dramatically changes the number of hits found
    • 3c. Stores the percent identity in the results hash
    • 3d. Keeps track of the lowest score

Verbal description of program:

  • Stage 1.
    • After finding 'all' opsins (>9000 genes), I aligned them with MAFFT and calculated a NJ tree using Clearcut.
    • A new program uses a phylogeny and unaligned sequences as input.
    • Using the tree, the most closely related genes are compared and aligned using blast2seq
    • If the sequences we compare have a match, the matching sequence is propagated down to the ancestral node of the 2 sequences.
    • More distant relatives are compared using blast2seqs, and any time there is sufficient similarity, the consensus match is propagated "down" the tree toward the root.
    • As such, a sequence that matches all sequences in a clade is propagated down the tree, until the next distant relative no longer has a match.
    • These results are currently written to a file called output.txt .
    • Tests of output.txt indicate that all known opsin sequences will have blast similarity with some sequence(s) in the output.txt file
    • The remaining challenge is that many of the sequences in output.txt are too long to be probes. So, we need to find the sub-sequence of each result that has the most coverage across genes. This will require the Stage 2 algorithm.
  • Stage 2 (Not written yet)
    • Use a 'sliding window' approach to test sub-sequences (putative-probe) of each full sequence in the output file.
    • Use blast to find all full sequences the putative-probe hits, with particular similarity and length parameters.
    • Find the PD (phylogenetic diversity=sum of branch lengths) of the full sequences that were hit
    • Use 1 or 2 putative probes from each sequence that hit the maximum PD