Greedy motif search
Sorry, because it is a graded problem I do not provide the source code. But I try to describe the solution as clear as possible to help you implementing the pseudo code. If you have problems with the understanding or questions to the pseudo code feel free to contact me.
This is the code which is visible at the course’s interactive text book:
GREEDYMOTIFSEARCH(Dna, k, t) BestMotifs ← motif matrix formed by first k-mers in each string from Dna for each k-mer Motif in the first string from Dna Motif_1 ← Motif for i = 2 to t form Profile from motifs Motif_1, …, Motif_i - 1 Motif_i ← Profile-most probable k-mer in the i-th string in Dna Motifs ← (Motif_1, …, Motif_t) if Score(Motifs) &lt; Score(BestMotifs) BestMotifs ← Motifs output BestMotifs
If you find the pseudo code clear feel free to skip this whole article. If not keep on reading.
Well, the pseudo code reads itself a bit hard — or at least for me. So here is a little explanation.
The input fields are the following:
- Dna is the list of the DNA strings to find the motifs in it
- k is the k-mer size to find the best scoring motifs
- t is the length of the Dna list
The BestMotifs in the beginning contains the first k-mers of each string of the provided Dna list.
Then the first for-loop iterates over the k-mers in the first string of the Dna list and adds this k-mer (called Motif) to the list of Motifs.
The second for loop forms a profile from the motifs contained in the list of Motifs, then it selects the most probable k-mer from the actual string taken from the Dna list and adds it to the list of Motifs.
After the second for loop the score of the new Motifs list is compared to the score of the BestMotifs. If the score is lower than from the actual best (lower score means less distance) then the current Motifs will become the BestMotifs.
Pseudo code for Score(Motifs)
Score(Motifs) consensus = Consensus(Motifs) score = 0 for motif in motifs: score = score + HammingDistance(consensus, motif) return score
Motifs is the list of Motifs to calculate the score for.
Pseudo code for Consensus(Motifs)
Consensus(Motifs) consensus = &quot;&quot; for i in 1..|Motifs| count[A] = 0 count[C] = 0 count[G] = 0 count[T] = 0 for motif in Motifs count[motif[i]] = count[motif[i]] + 1 consensus = consensus + key(count) where max(value)
Motifs is the list of Motifs to find the consensus for.
As you can see, a consensus is created from the most popular letter in each column of the motif matrix. If you encounter a tie you should take the lexicographically first one. For example if there are 4 motifs and each motif contains at position i a different nucleotide then the consensus would contain A at the position i.
The pseudo code seems a bit big but in Python for example it can be solved under 10 lines of code.
Pseudo code for HammingDistance(genome_a, genome_b)
HammingDistance(genome_a, genome_b) if genome_a == genome_b return 0 distance = 0 for i = 1..min(|genome_a|,|genome_b|) if genome_a[i] != genome_b[i] distance = distance + 1 distance = distance + abs(|genome_a| - |genome_b|) return distance
genome_a and genome_b are the two genomes we compare. |genome_a| denotes the length of the genome. abs is the absolute value of the difference of the two lengths.
As you can see, this pseudo code works for genomes of different lengths too.
The example is based on the sample input provided with the code challenge:
k = 3 t = 5 Dna: GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
The first step is to calculate the list of BestMotifs initially. This means we take the first 3-mer of each element of the Dna list.
GGC AAG CAA CAC CAA
Now we have our BestMotifs (with the score of 6) and we can start the loop.
The first for loop iterates over the 3-mers in the genome GGCGTTCAGGCA. These 3-mers are: GGC GCG CGT GTT TTC TCA CAG AGG GGC GCA.
For every loop the first element of the Motifs list is the current 3-mer from the list above. Then for each remaining genome in the Dna list we select the most probable 3-mer for the profile generated from Motifs (for the first time it is only GGC). For the genome AAGAATCAGTCA the most probable 3-mer is AAG so we add it to the Motifs list. Now we take the third genome from the Dna list and select the most probable 3-mer against the profile formed from the Motifs list. And so on… After the iterations we end up with the following Motifs list: GCG AAG AAG CAC CAA. The score of this list is 7 which is higher than the score of the current BestMotifs which is 6 so the BestMotifs stays as it is.
I will skip the walkthrough for each iteration, however I will show the Motifs and their score after each iteration and indicate whether they switch to be the best motifs or not:
GCG AAG AAG ACG CAA, Score: 5 --&gt; gets BestMotifs CGT AAG AAG AAT AAT, Score: 4 --&gt; gets BestMotifs GTT AAG AAG AAT AAT, Score: 4 (not better) TTC AAG AAG ATC TTC, Score: 6 (not better) TCA TCA CAA TCA TAA, Score: 3 --&gt; gets BestMotifs CAG CAG CAA CAA CAA, Score: 2 --&gt; gets BestMotifs AGG AAG AAG CAC CAA, Score: 5 (not better) GGC AAG AAG CAC CAA, Score: 7 (not better) GCA AAG AAG ACG CAA, Score: 6 (not better)
As you can see, the best motifs at the end are CAG CAG CAA CAA CAA with the score of 2. And this is the result of the GreedyMotifSearch algorithm — and this is the same than the example shows at the code challenge.