This course covers fundamental algorithms for efficient analysis of biological sequences and for building evolutionary trees. This is an undergraduate course focusing on the ideas and concepts behind the most central algorithms in biological sequence analysis. Dynamic Programming, Alignment, Hidden…
Some suggestions of where the student can get more exposure to algorithms for bioinformatics and computational biology.
Building evolutionary trees from sequence data. The Maximum Parsimony criteria, the special case of Perfect Phylogeny, and the Fitch-Hartigon dynamic program to minimize mutations when the tree and a sequence alignment are known.
Additive trees and their construction. The Neighbor-Joining algorithm and its use with near-additive data. Bootstrap values and their misuse.
Algorithms for constructing an Ultrametric Tree from an Ultrametric Matrix, and the relationship of ultrametrics to the molecular clock.
lntroduction to trees that represent evolution. We start with the case of perfect data: the Ultrametric tree case.
What the Backwards algorithm computes and why we want it. Profile HMMs and their use. Cleaning up some topics in sequence analysis (running out of time); PSI-BLAST and its dangers.
This class finishes the discussion of the Vitterbi algorithm, its time analysis and the traceback algorithm. Introduction to the Forward algorithm to compute the probability that a given sequence is generate by the HMM.
Finish the discussion of HMMs for CpG islands. Introduction to the Vitterbi algorithm (really dynamic programming) to find the most likely Markov Chain generating a given sequence.
Hidden Markov models to identify CpG islands. This lecture follows the discussion in Durbin and Eddy.
Finish the discussion of profiles and log-odds ratios. introduction to Markov Models and Hidden Markov Models
Use of multiple sequence alignment to build a model of a set of biologically related sequences. Profiles, log-odds ratios.
The center tree method and analysis; progressive alignment, guide trees, CLUSTAL, uses of multiple alignment
Continuation of multiple sequence alignment; sum-of-pairs objective function; tree consistency theorem; factor-of-two approximation.
Start of discussion on Multiple Sequence Alignment. sum-of-pairs objective function. Dynamic program solution for three sequences. Program MSA
Further discussion of probability and database search. Granin and BRCA1 story. Database search used as a filter, not an oracle.
E-values, extreme value distribution, probability of a match
Discussion of hashing kmers, E-values, statistics, BLAST II
Continuation of the topic of probability of matching. Here we look at the probability that a query string matches completely, at least once, in a much larger database of strings.
Completion of the analysis of the expected length of the longest common substring in two random strings
End-gap-free alignment using dynamic programming. Example from whole-genome shotgun sequencing.
We discuss the expected length of the longest common subsequence between two random strings of lengths n each, and show that it grows linearly with n. This lecture originally contained a discussion both of the expected length of the longest common subsequence and the expected length of the longest common substring. But the video has been split to overcome some technical problems that showed up in the original video.
We discuss the expected length of the longest common substring (not subsequence) between two random strings of length n each, and show that it grows only logarithmically as a function of n - much slower than the growth of the expected longest common subsequence discussed in Lecture 11a.
In depth treatment of local alignment using dynamic programming.
Continuation of the discussion of how to compute similarity and optimal sequence alignment using dynamic programming. Local as well as global alignment.
In this class, we move from the visual alignment graph to a purely symbolic treatment of how to compute sequence similarity and optimal alignment using dynamic programming.
Continuation of the discussion of how to efficiently compute the similarity of two sequences. Introduction to the traceback operation to find the optimal alignment.
Introduction to computational efficiency. Introduction to how we actually compute sequence similarity efficiently.
Review of the definition of sequence similarity and extensions of the model for greater biological fidelity. Introduction to parametric sequence alignment.
Definition of sequence similarity and string alignment, counting alignments the need for fast computation.
Continuation of the introduction to the course and to bioinformatics. The biological utility of the similarity of sequences. The first fact of bio-sequence analysis.
Introduction to the course and bioinformatics. Why we do bioinformatics, how it relates to genomics and to the changing modalities of biology.