Did you know: programmers convert coffe to code?

If you like my articles, support me with a small amount.


Buy me a coffee

Greedy motif search

I like learning new things. This is why I’ve joined Finding Hidden Messages in DNA (Bioinformatics I) at Coursera. The pace of this course is really good, however I needed some time figuring out the solution for the “Greedy Motif Search” algorithm — or at least how to implement it in Python.

Source code

Sorry, because it is a graded problem I do not provide the source code. But I try to describe the solution as clear as possible to help you implementing the pseudo code. If you have problems with the understanding or questions to the pseudo code feel free to contact me.

Pseudo code

This is the code which is visible at the course’s interactive text book:

GREEDYMOTIFSEARCH(Dna, k, t)
    BestMotifs ← motif matrix formed by first k-mers in each string from Dna
        for each k-mer Motif in the first string from Dna
            Motif_1 ← Motif
            for i = 2 to t
                form Profile from motifs Motif_1, …, Motif_i - 1
                Motif_i ← Profile-most probable k-mer in the i-th string in Dna
            Motifs ← (Motif_1, …, Motif_t)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifs ← Motifs
    output BestMotifs

If you find the pseudo code clear feel free to skip this whole article. If not keep on reading.

Explanation

Well, the pseudo code reads itself a bit hard — or at least for me. So here is a little explanation.

The input fields are the following:

  • Dna is the list of the DNA strings to find the motifs in it
  • k is the k-mer size to find the best scoring motifs
  • t is the length of the Dna list

The BestMotifs in the beginning contains the first k-mers of each string of the provided Dna list.

Then the first for-loop iterates over the k-mers in the first string of the Dna list and adds this k-mer (called Motif) to the list of Motifs.

The second for loop forms a profile from the motifs contained in the list of Motifs, then it selects the most probable k-mer from the actual string taken from the Dna list and adds it to the list of Motifs.

After the second for loop the score of the new Motifs list is compared to the score of the BestMotifs. If the score is lower than from the actual best (lower score means less distance) then the current Motifs will become the BestMotifs.

Pseudo code for Score(Motifs)

Score(Motifs)
    consensus = Consensus(Motifs)
    score = 0
    for motif in motifs:
        score = score + HammingDistance(consensus, motif)
    return score

Motifs is the list of Motifs to calculate the score for.

Pseudo code for Consensus(Motifs)

Consensus(Motifs)
    consensus = ""
    for i in 1..|Motifs[1]|
        count[A] = 0
        count[C] = 0
        count[G] = 0
        count[T] = 0
        for motif in Motifs
            count[motif[i]] = count[motif[i]] + 1
        consensus = consensus + key(count) where max(value)

Motifs is the list of Motifs to find the consensus for.

As you can see, a consensus is created from the most popular letter in each column of the motif matrix. If you encounter a tie you should take the lexicographically first one. For example if there are 4 motifs and each motif contains at position i a different nucleotide then the consensus would contain A at the position i.

The pseudo code seems a bit big but in Python for example it can be solved under 10 lines of code.

Pseudo code for HammingDistance(genome_a, genome_b)


HammingDistance(genome_a, genome_b)
    if genome_a == genome_b
        return 0
    distance = 0
    for i = 1..min(|genome_a|,|genome_b|)
        if genome_a[i] != genome_b[i]
            distance = distance + 1
    distance = distance + abs(|genome_a| - |genome_b|)
    return distance

genome_a and genome_b are the two genomes we compare. |genome_a| denotes the length of the genome. abs is the absolute value of the difference of the two lengths.

As you can see, this pseudo code works for genomes of different lengths too.

An example

The example is based on the sample input provided with the code challenge:

k = 3
t = 5
Dna:
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG

The first step is to calculate the list of BestMotifs initially. This means we take the first 3-mer of each element of the Dna list.

GGC
AAG
CAA
CAC
CAA

Now we have our BestMotifs (with the score of 6) and we can start the loop.

The first for loop iterates over the 3-mers in the genome GGCGTTCAGGCA. These 3-mers are: GGC GCG CGT GTT TTC TCA CAG AGG GGC GCA.

For every loop the first element of the Motifs list is the current 3-mer from the list above. Then for each remaining genome in the Dna list we select the most probable 3-mer for the profile generated from Motifs (for the first time it is only GGC). For the genome AAGAATCAGTCA the most probable 3-mer is AAG so we add it to the Motifs list. Now we take the third genome from the Dna list and select the most probable 3-mer against the profile formed from the Motifs list. And so on… After the iterations we end up with the following Motifs list: GCG AAG AAG CAC CAA. The score of this list is 7 which is higher than the score of the current BestMotifs which is 6 so the BestMotifs stays as it is.

I will skip the walkthrough for each iteration, however I will show the Motifs and their score after each iteration and indicate whether they switch to be the best motifs or not:

GCG AAG AAG ACG CAA, Score: 5 --> gets BestMotifs
CGT AAG AAG AAT AAT, Score: 4 --> gets BestMotifs
GTT AAG AAG AAT AAT, Score: 4 (not better)
TTC AAG AAG ATC TTC, Score: 6 (not better)
TCA TCA CAA TCA TAA, Score: 3 --> gets BestMotifs
CAG CAG CAA CAA CAA, Score: 2 --> gets BestMotifs
AGG AAG AAG CAC CAA, Score: 5 (not better)
GGC AAG AAG CAC CAA, Score: 7 (not better)
GCA AAG AAG ACG CAA, Score: 6 (not better)

As you can see, the best motifs at the end are CAG CAG CAA CAA CAA with the score of 2. And this is the result of the GreedyMotifSearch algorithm — and this is the same than the example shows at the code challenge.

GHajba
 

Senior developer, consultant, author, mentor, apprentice. I love to share my knowledge and insights what I achieve through my daily work which is not trivial -- at least not for me.

Click Here to Leave a Comment Below 14 comments