Machine Learning in Genomic Medicine: A Review of Computational Problems and Data Sets

Leung, Michael K. K., et al., 2016

Summary

Major points made by the article:

“Our view is that to make genomic medicine a reality, we must develop computer systems that can accurately interpret the text of the genome just as the machinery inside the cell does”.
“Protein-coding exons are the most understood regions in the genome” (re: “start” and “stop” codons).
A long-standing open problem is predicting whether a mutation will disrupt the stability or structure of the final protein molecule.
“Predicting phenotypes (e.g., traits and disease risks) from biomarkers such as the genome is, in principle, a supervised machine learning problem”. In practice it is not so simple: the computational model should be trained to predict measurable intermediate cell variables, also known as molecular phenotypes, and these variables can then be linked to phenotypes.
Alternative Splicing (AS) is the selection and ligation of specific exons during post-transcriptional modification.
On average, each protein-coding gene has approximately four transcripts (a transcript being one way of selecting and combining the available exons). We would like to be able to predict splicing by discovering the instructions that control it.

Computational Model of Splicing

By accurately modeling splicing and AS computationally, researchers have been able to predict how it is affected by variations in the genome, and then to assess whether a mutation in the genome affects disease risk.
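As a rough illustration of this workflow, the sketch below scores a single-nucleotide variant by running a splicing model on the reference and mutant sequences and taking the difference. Everything here is hypothetical: `toy_splicing_model`, the motif it counts, and the example sequence are made-up stand-ins for a trained model and real data, not the method used in the paper.

```python
import math

def toy_splicing_model(sequence):
    """Hypothetical stand-in for a trained splicing model.

    Returns a pseudo 'percent spliced in' value in (0, 1) that grows with
    the number of occurrences of a made-up enhancer motif.
    """
    motif = "GAAGAA"  # assumption: illustrative motif only
    count = sum(
        1
        for i in range(len(sequence) - len(motif) + 1)
        if sequence[i:i + len(motif)] == motif
    )
    # Logistic squashing so the output looks like an inclusion level.
    return 1.0 / (1.0 + math.exp(-(count - 1)))

def apply_snv(sequence, position, alt_base):
    """Return a copy of the sequence with one base substituted (0-based position)."""
    return sequence[:position] + alt_base + sequence[position + 1:]

def variant_effect(model, reference, position, alt_base):
    """Predicted change in the model output caused by the variant."""
    mutant = apply_snv(reference, position, alt_base)
    return model(mutant) - model(reference)

reference = "CCGAAGAATTGAAGAACC"  # made-up sequence with two motif copies
delta = variant_effect(toy_splicing_model, reference, 4, "T")
print(f"predicted change in inclusion: {delta:+.3f}")
```

A negative value here means the toy model predicts less exon inclusion for the mutant than for the reference. Linking such a change to disease risk is a separate step that this sketch does not attempt.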

Computational Model of Protein-DNA and Protein-RNA binding

“Accurate models of protein-sequence binding are essential for interpreting the genome and for predicting the effects of mutations…Biologists have developed high-throughput experiments that measure the sequence specificity of individual proteins.”

Example computational model: the input is a genomic sequence and the output is a binding score. One would like to predict the “motifs”, or patterns, that a particular protein binds to.
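The sequence-in, score-out idea can be sketched with a sliding-window scan. The weight matrix below is hand-written for illustration (it simply rewards the made-up motif TGCA), and the function names are my own. A real model would learn its weights from binding experiments.

```python
# Hypothetical per-position weights for a 4-base motif (rewards "TGCA").
PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
]

def score_window(window, pwm):
    """Sum the weight of each base at its position in the window."""
    return sum(column[base] for column, base in zip(pwm, window))

def binding_scores(sequence, pwm):
    """Score every window of motif width along the sequence."""
    width = len(pwm)
    return [
        score_window(sequence[i:i + width], pwm)
        for i in range(len(sequence) - width + 1)
    ]

def best_site(sequence, pwm):
    """Return (start index, score) of the highest-scoring window."""
    scores = binding_scores(sequence, pwm)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

start, score = best_site("AATGCATT", PWM)
print(start, score)
```

The highest-scoring window marks the most likely binding site under this toy scoring scheme.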

Specific Discussion Related to Deep Learning

Deep learning has been used to improve predictive performance; see feedforward NNs for AS patterns.
CNNs have been used to improve predictive performance for binding specificity.
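To make the CNN idea concrete, here is a minimal forward pass in plain Python: one-hot encode the sequence, slide a single convolutional filter over it, apply a ReLU, and take the maximum over positions. The filter weights and bias are hand-set assumptions chosen to detect the made-up motif GAT. In a real network they would be learned, and there would be many filters plus further layers.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]

def conv1d(encoded, kernel, bias=0.0):
    """Slide one filter (width x 4 weights) along the encoded sequence."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(total)
    return outputs

def relu(values):
    return [max(0.0, v) for v in values]

def predict_binding(sequence, kernel, bias=0.0):
    """Global max pooling over positions: the strongest filter match wins."""
    return max(relu(conv1d(one_hot(sequence), kernel, bias)))

# Hypothetical filter that fires only on an exact "GAT" match.
KERNEL = [
    [0.0, 0.0, 1.0, 0.0],  # G
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 0.0, 0.0, 1.0],  # T
]
BIAS = -2.0

print(predict_binding("CCGATCC", KERNEL, BIAS))
print(predict_binding("CCCCCCC", KERNEL, BIAS))
```

Each filter behaves much like a motif detector scanned along the sequence, which is one way to see why convolutional layers suit binding-specificity data.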
Cellular processes are highly stochastic, and hence the genotype of an individual may not be sufficient to completely determine their phenotype.
Measuring hundreds of thousands of cell variables per patient, even for a small group of people, potentially gives a better chance at deciphering the genomic instructions of the cell, since the model has more data to learn from.
This makes “large-scale machine learning” necessary.
RNNs can be useful for the following:
- Genome annotation
- Modelling of cell variable dynamics through time
- Creating a sequential state model of protein binding based on RNNs or LSTMs
- Imputation of epigenomic tracks - seq2seq
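For the annotation use case in the list above, a recurrent model reads the sequence one base at a time and emits a label per position. The sketch below is only a forward pass of a simple (Elman-style) RNN with random, untrained weights. The parameter names, hidden size, and the single per-position probability are assumptions for illustration, not an architecture from the paper.

```python
import math
import random

BASES = "ACGT"

def init_rnn(hidden_size, seed=0):
    """Create small random weights (untrained, for illustration only)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": matrix(hidden_size, len(BASES)),   # input-to-hidden weights
        "W_rec": matrix(hidden_size, hidden_size),  # hidden-to-hidden weights
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)],
    }

def annotate(sequence, params):
    """Return one probability per base, e.g. 'this position is in an exon'."""
    hidden = [0.0] * len(params["w_out"])
    probabilities = []
    for base in sequence:
        x = BASES.index(base)  # one-hot input selects one column of W_in
        hidden = [
            math.tanh(
                params["W_in"][j][x]
                + sum(w * h for w, h in zip(params["W_rec"][j], hidden))
            )
            for j in range(len(hidden))
        ]
        logit = sum(w * h for w, h in zip(params["w_out"], hidden))
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return probabilities

params = init_rnn(hidden_size=4)
print(annotate("ACGTAC", params))
```

Because the hidden state carries information forward, the output at each position can depend on all bases seen so far. An LSTM replaces the plain tanh update with gated updates but keeps the same per-position structure.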
Machine learning models need to be more interpretable for genomics!

Notes

Since this is my first foray into computational biology, I’m going to keep track of a lot of terminology here:

Protein-coding genes describe how to build large molecules made from amino-acid chains (the human genome contains ~20,000).
Non-coding genes describe how to build small molecules made from ribonucleic acid (RNA) chains (the human genome contains ~25,000).
Introns and exons are the alternating regions that make up a typical gene; exons are kept in the mature transcript, while introns are spliced out.
Protein-sequence binding is the binding of proteins to nucleotide sequences.
Position-Frequency Matrix (PFM) - counts of each nucleotide at each position across a set of aligned binding sites; the "workhorse of binding site modeling".
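Since the position-frequency matrix comes up so often, here is a minimal sketch of building one from aligned binding sites and converting it to log-odds weights. The example sites, the pseudocount of 1, and the uniform background of 0.25 are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Count each base at each position across equal-length aligned sites."""
    width = len(sites[0])
    pfm = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        for position, base in enumerate(site):
            pfm[position][base] += 1
    return pfm

def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

sites = ["TGCA", "TGCA", "TGCT", "AGCA"]  # made-up aligned binding sites
pfm = position_frequency_matrix(sites)
pwm = position_weight_matrix(pfm)
print(pfm[0])
print(pwm[1])
```

The pseudocount keeps bases that were never observed at a position from producing a log of zero.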

Strengths

Excellent first paper for machine learning researchers who want to get into genomics.
