Notes on Phyolgenetic Trees, Needleman-Wunsch, And Single-Linkage Clustering

Chris Tralie

These notes go over the background topics for homework 5, which bring together minimum spanning tree algorithms, dynamic programming, and a cool application to biology and the "tree of life."

    Table of Contents

    Amino Acids

    A DNA sequence can be thought of as a string. Roughly, DNA encodes building blocks known as amino acids, which, when decoded and strung together in sequence, form proteins. The sequence has a large impact on the way the protein structure folds and ultimately forms a 3D shape, leading towards a particular biomolecular role.

    Click here to see more information about each type of amino acid. In assignment 5, we will be considering an alphabet of 23 amino acids, including all of the standard 20, as well as the B,Z,X special ones highlighted in red in the aforementioned link.

    The Needleman-Wunsch Algorithm

    Since DNA is a string, we can compare two DNA sequences with string comparison methods, but when comparing the sequences across organisms, the technique needs to account for mutations that have occurred over evolutionary histories, including additions, deletions, and substitutions of individual amino acids. We've discussed dynamic programing techniques for computing the string edit distance, which accounts for such mutations. However, it models a unit cost for an addition, deletion, and substitution alike, and there are biological reasons that we may want to have costs that are more less expensive for certain amino acid edits, as explained below.

    To address the need for variable costs, there is a variant of edit distance known as Needleman-Wunsch, in which the costs can change depending on what characters are involved. By convention, we actually switch from a "minimizing cost" mindset to a maximizing score mindset. Given a string of length M and a string of length N, the optimal Needleman-Wunsch score can be computed in O(MN) time using a similar dynamic programming algorithm to edit distance and dynamic time warping. In particular, to match a string of length M to a string of length N, the vanilla version proceeds by filling in an (M+1) x (N+1) table S* storing all solutions to sub-problems, using the following recurrence relation, where di and dj refer to the cost of deleting the ith and jth characters of the first and second strings, respectively, and cij refers to the cost of substituting the ith character of the first string for the jth character of the second string (and everything is 0-indexed):

    \[ S_{ij} = \left\{ \begin{array}{cc} S_{i-1, 0} + d_{i-1} & j = 0 \\ S_{0, j-1} + d_{j-1} & i = 0 \\ \max \left\{ \begin{array}{c} S_{i-1, j-1} + c_{i-1, j-1} \\ S_{i, j-1} + d_{j-1} \\ S_{i-1, j} + d_{i-1} \end{array} \right\} & i > 0, j > 0 \end{array} \right\} \]

    (*) Note that there is a version that uses only Θ(N) memory, which uses a similar trick to a recent algorithm I developed with an Ursinus student for dynamic time warping

    To help you explore examples of Needleman-Wunsch scoring, I have provided a live demo app below. In the default example, we have an alphabet of two letters a and b, and the scores are as follows

    • A -1 penalty for deleting a
    • A -2 penalty for deleting b
    • A -3 penalty for swapping an a for a b or a b for an a
    • A +2 score for matching an a to an a
    • A +3 score for matching a b to a b
    All of this information is specified in a dictionary, which you can see in the "pairwise costs" menu, and the format of this dictionary mirrors the format of the Python dictionaries you will be using in assignment 5. In particular, a single character key represents the cost of deleting that character, while a key of concatenated characters represents the cost (or score) of substituting one character for the other. For instance, {"ab":-3} means that it costs 3 both to swap a in for b and to swap b in for a, and {"a":-1} refers to a penalty of -1 for deleting a.

    To generate examples, when you hit "match strings" in the live demo below, this program will run the Needleman-Wunsch algorithm to fill out the dynamic programming table. It will also perform a backtracing to find one of the optimal-cost sequence of edit operations, though this is not something you have to do in assignment 5. Play around with different examples and explore the results until you feel comfortable.

    Needleman-Wunsch Interactive Applet

    String 1

    String 2

    Alphabet

    Pairwise
    Costs

    Optimal Matching Score

    BLOSUM Amino Acid Substitution Costs

    Now that we have an alphabet for DNA strings and an algorithm to compare them, we can define meaningful costs for matchings and substitutions of amino acids. We will be using tables obtained from an experimental, data-driven technique that constructs a "Blocks Substitution Matrix," or "BLOSUM" for short, as described in this paper. In a nutshell, the technique computes statistical likelihoods that substitutions take place by examining many well-aligned sequences.

    There are different similarity thresholds at which BLOSUM tables can be constructed, and higher numbers mean that we're more conservative in which symbols we allow to be aligned. In assignment 5, we'll be considering BLOSUM50 and BLOSUM62, as obtained from ftp://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/blosum.tar.Z. Click the "show/hide" buttons to view each of these below. You'll notice that we get a positive score when matching an amino acid to itself, and a negative score when swapping amino acids and when deleting them (matching to a *), so this is compatible with Needleman-Wunsch. You'll also notice that this matrix is symmetric (e.g. A to D and D to A are both -2), so we will only be storing the non-redundant parts in a lookup dictionary.

    BLOSUM 62

    BLOSUM 50

    ARNDCQEGHILKMFPSTWYVBZX*
    A4-1-2-20-1-10-2-1-1-1-1-2-110-3-20-2-10-4
    R-150-2-310-20-3-22-1-3-2-1-1-3-2-3-10-1-4
    N-2061-30001-3-30-2-3-210-4-2-330-1-4
    D-2-216-302-1-1-3-4-1-3-3-10-1-4-3-341-1-4
    C0-3-3-39-3-4-3-3-1-1-3-1-2-3-1-1-2-2-1-3-3-2-4
    Q-1100-352-20-3-210-3-10-1-2-1-203-1-4
    E-1002-425-20-3-31-2-3-10-1-3-2-214-1-4
    G0-20-1-3-2-26-2-4-4-2-3-3-20-2-2-3-3-1-2-1-4
    H-201-1-300-28-3-3-1-2-1-2-1-2-22-300-1-4
    I-1-3-3-3-1-3-3-4-342-310-3-2-1-3-13-3-3-1-4
    L-1-2-3-4-1-2-3-4-324-220-3-2-1-2-11-4-3-1-4
    K-120-1-311-2-1-3-25-1-3-10-1-3-2-201-1-4
    M-1-1-2-3-10-2-3-212-150-2-1-1-1-11-3-1-1-4
    F-2-3-3-3-2-3-3-3-100-306-4-2-213-1-3-3-1-4
    P-1-2-2-1-3-1-1-2-2-3-3-1-2-47-1-1-4-3-2-2-1-2-4
    S1-110-1000-1-2-20-1-2-141-3-2-2000-4
    T0-10-1-1-1-1-2-2-1-1-1-1-2-115-2-20-1-10-4
    W-3-3-4-4-2-2-3-2-2-3-2-3-11-4-3-2112-3-4-3-2-4
    Y-2-2-2-3-2-1-2-32-1-1-2-13-3-2-227-1-3-2-1-4
    V0-3-3-3-1-2-2-3-331-21-1-2-20-3-14-3-2-1-4
    B-2-134-301-10-3-40-3-3-20-1-4-3-341-1-4
    Z-1001-334-20-3-31-1-3-10-1-3-2-214-1-4
    X0-1-1-1-2-1-1-1-1-1-1-1-1-1-200-2-1-1-1-1-1-4
    *-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4
    ARNDCQEGHILKMFPSTWYVBZX*
    A5-2-1-2-1-1-10-2-1-2-1-1-3-110-3-20-2-1-1-5
    R-27-1-2-410-30-4-33-2-3-3-1-1-3-1-3-10-1-5
    N-1-172-20001-3-40-2-4-210-4-2-340-1-5
    D-2-228-402-1-1-4-4-1-4-5-10-1-5-3-451-1-5
    C-1-4-2-413-3-3-3-3-2-2-3-2-2-4-1-1-5-3-1-3-3-2-5
    Q-1100-372-21-3-220-4-10-1-1-1-304-1-5
    E-1002-326-30-4-31-2-3-1-1-1-3-2-315-1-5
    G0-30-1-3-2-38-2-4-4-2-3-4-20-2-3-3-4-1-2-2-5
    H-201-1-310-210-4-30-1-1-2-1-2-32-400-1-5
    I-1-4-3-4-2-3-4-4-452-320-3-3-1-3-14-4-3-1-5
    L-2-3-4-4-2-2-3-4-325-331-4-3-1-2-11-4-3-1-5
    K-130-1-321-20-3-36-2-4-10-1-3-2-301-1-5
    M-1-2-2-4-20-2-3-123-270-3-2-1-101-3-1-1-5
    F-3-3-4-5-2-4-3-4-101-408-4-3-214-1-4-4-2-5
    P-1-3-2-1-4-1-1-2-2-3-4-1-3-410-1-1-4-3-3-2-1-2-5
    S1-110-10-10-1-3-30-2-3-152-4-2-200-1-5
    T0-10-1-1-1-1-2-2-1-1-1-1-2-125-3-200-10-5
    W-3-3-4-5-5-1-3-3-3-3-2-3-11-4-4-3152-3-5-2-3-5
    Y-2-1-2-3-3-1-2-32-1-1-204-3-2-228-1-3-2-1-5
    V0-3-3-4-1-3-3-4-441-31-1-3-20-3-15-4-3-1-5
    B-2-145-301-10-4-40-3-4-200-5-3-452-1-5
    Z-1001-345-20-3-31-1-4-10-1-2-2-325-1-5
    X-1-1-1-1-2-1-1-2-1-1-1-1-1-2-2-10-3-1-1-1-1-1-5
    *-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5

    NOTE: Interestingly, as shown in this paper, the original authors had a normalization error when computing the BLOSUM matrices according to their procedure, but the matrices somehow yield better search results. So we will be using the original matrices in assignment 5.

    Phylogenetic Trees

    Now that we have ways of measuring similarity between amino acid sequences, we can use these similarity measures as a proxy for overall similarity between species. Amazingly, if we choose the right genes to analyze, we can use Needleman-Wunsch similarity to build a "tree of life" from the bottom up. In this context, the tree will show branching in evolutionary history, and it is known as a Phylogenetic tree. To build it, we'll merge nodes together first that have the highest similarity score, and they will end up towards the bottom of the tree. Internal nodes can then be thought of as common ancestors, and eventually we will end up at a root which can be thought of as the origin of life.

    When inferring the evolution of species, good genes to examine come from mitochondrial DNA (DNA describing the energy factory in cells), as the DNA mutates quickly and is only passed on from the mother. Using https://www.ncbi.nlm.nih.gov/nuccore/, I've provided amino acid sequences data for 71 species in the file organisms.json from the COX-3 mitochondrial gene on homework 5. The figure below shows the phylogenetic tree of these species obtained using Needleman-Wunsch with BLOSUM62 costs. We will walk through how to construct this in the next section.

    Single-Linkage Clustering

    We will now describe an algorithm, known as "single linkage clustering with the min rule," that can be used to construct phylogenetic trees. In this context, a phylogenetic tree is a specific example of a more general construct known as a dendrogram.

    Algorithm 1: O(N3) Naive Single-Linkage Clustering

    We will first describe an algorithm that will help to explain conceptually how the tree is built, but this algorithm is inefficient, so we will then explain a better algorithm that computes the same information, and you will implement the latter algorithm.

    1. Start off with each species in its own cluster, which is represented by a leaf node in the tree.
    2. Merge the two clusters with the highest similarity, according to the Needleman-Wunsch distance. Similarity is defined as the maximum similarity between all pairs of items in one and the other cluster. Create a new internal node in the tree whose children are the root nodes of each cluster that's being merged.

      As an example, suppose that Fugu rubipres and Goldfish are paired together, and Eastern newt and Fire salamander are paired together. Then the BLOSUM62-based Needleman-Wunsch scores are as follows:

      Eastern newt Fire salamander
      Fugu rubipres 1319 1294
      Goldfish 1341 1300
      In this case, the maximum similarity between members of the two clusters occurs between the Eastern Newt and Goldfish at 1341. So if at this step 1341 is the largest similarity between all pairs of clusters that exist, then we merge these four and form an internal node representing that.

    3. Repeat step 2 as long as there is more than one cluster.
    4. The final node that's created is the root of the tree.

    Algorithm 2: Kruskal's Algorithm for O(N2logN) Single Linkage Clustering

    The algorithm we've described so far totally works, but the problem is that a straightforward implementation from this description takes O(N3) time for N items. That's OK for only 71 species in our example, but these sorts of trees are often built on images, where every pixel starts off in its own node. Even a 100x100 image has 10,000 pixels, so N3 is a trillion at that point. So even though we can get away with a naive algorithm in assignment 5, for the purposes of the class, we will be implementing a more efficient version based on an algorithm known as Kruskal's algorithm, using union find to help us. The algorithm uses union find to keep track of clusters as clusters are merged together from most to least similar until everything is in one tree. The algorithm maintains three data structures in tandem as it goes along

    1. The tree that's being built
    2. A union find data structure to keep track of which leaf nodes (corresponding to species) are part of the same connected component
    3. An array or dictionary that keeps track of the node of the tree corresponding to a root in the union find data structure
    In this way, we can quickly look up whether a pair of nodes is part of the same cluster by seeing if their roots are the same. Then, it's just a matter of merging nodes from most similar to least similar. The steps below spell this out.

    1. Create a node for each species, and create a union find structure where each node starts off as its own component. Set the roots of each of these components to be the leaf nodes in the tree in a separate array.
    2. Compute all pairwise similarities between nodes based on Needleman-Wunsch.
    3. Sort all pairs of nodes in decreasing order of similarity to ensure that the most similar pairs get grouped together first. This step incurs the dominant cost of O(N2 log N) since there are O(N2) pairs.
    4. Now we need to merge the nodes together from the bottom up according to our sort. In particular, for each pair of nodes in the above order, check to see if they are part of different clusters by comparing their roots in union find. If they have different roots, merge them together in the union find data structure, create a new internal node for them, and set the new root to point to this internal node. Otherwise, if they are already part of the same cluster, they have the same root (adding an edge would form a cycle), and you can simply skip this pair.
    5. The last node to be added after repeating step 4 for all pairs is the root of the tree.

    This algorithm is tricky enough to warrant an example showing the dynamics of the three data structures. For simplicity of illustration, we'll show building a dendrogram corresponding to a minimal spanning tree on a set of points in the plane, though the algorithm is very similar for phylogenetic trees on Needleman Wunsch; we basically just switch sorting in ascending order of length to sorting in descending order of similarity. The example is shown below. The left plot shows the tree being built on the original point cloud, as well as the parents pointers of the union find. The right plot shows the dendrogram being constructed as new internal nodes are being added, and it also shows the array that keeps track of which union find roots correspond to which dendrogram roots.

    NOTE ALSO: The maximum rule algorithm is not necessarily the best algorithm for constructing phylogenetic trees, because it tends to produce subtrees whose most similar nodes are quite similar, but whose least similar nodes can be quite different. A better scheme is the "average rule," in which nodes are merged together based on average similarity of pairs between them. But this is beyond the scope of assignment 5.