Program to find probes

project on galaxy-dev in PIA user ~/opsonomics directory

opsinomics.pl

Usage: ./opsinomics.pl sequence_file_fasta tree_file_newick output_file_fasta [min_seq_length]

Program flow:

1. Reads in sequence.fasta sequence by sequence
- 1b. creates a hash of all sequence ids to their corresponding sequences
2. Reads in tree.tre
3. Reroots the tree on the midpoint
4. Gets all the nodes sorted by how many generations they are from the root and adds them to a nodes array
5. Goes through each node in reverse order (starting farthest from the root)
- 5b. Retrieves the node's sibling(s)
  - If one of the siblings is blank, writes the other's sequence to the output file
- 5c. Uses blastn's bl2seq to compare the two sequences
- 5d. Uses SimpleAlign to get the best consensus string
  - If the consensus is smaller than $MINLENGTH, writes it to the output file
- 5e. Updates the sequence hash, writing the consensus string to the ancestor node

opsintest.pl

Checks the output of opsinomics.pl against the sequence file and calculates the similarity of each sequence

1. Reads the output fasta and creates a BLAST+ database from it
2. Reads through the sequence file and creates a hash to store the results (sets similarity to 0 for each)
3. Reads through the sequence file again
- 3b. Uses blastn to find hits in the output.fasta blast database for the sequence
  - Note: word_size parameter dramatically changes the number of hits found
- 3c. Stores the percent identity in the results hash
- 3d. Keeps track of the lowest score

Verbal description of program:

Stage 2 (Not written yet)
- Use a 'sliding window' approach to test sub-sequences (putative-probe) of each full sequence in the output file.
- Use blast to find all full sequences the putative-probe hits, with particular similarity and length parameters.
- Find the PD (phylogenetic diversity=sum of branch lengths) of the full sequences that were hit
- Use 1 or 2 putative probes from each sequence that hit the maximum PD

Navigation menu