
Applications of Machine Learning in Bioinformatics

recap

Machine learning methods

general-purpose approaches to learn functional relationships from data without the need to define them a priori

Advantage

derive predictive models without the need for strong assumptions about underlying mechanisms, which are frequently unknown or insufficiently defined (especially for genomic data)

Example

The most accurate predictions of gene expression levels are currently made from a broad set of epigenetic features using linear models or random forests

However, how the selected features determine the transcript levels remains an active research topic
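
A minimal sketch of this kind of model, assuming a simulated matrix of epigenetic features (e.g. averaged signal per mark per gene) and simulated expression values; the feature set, data and hyperparameters here are illustrative, not those of the cited studies.

```python
# Sketch: random forest regression of expression on simulated epigenetic features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_genes, n_marks = 1000, 8                           # hypothetical: 8 epigenetic marks per gene
X = rng.normal(size=(n_genes, n_marks))              # e.g. averaged signal of each mark
y = X @ rng.normal(size=n_marks) + rng.normal(scale=0.5, size=n_genes)   # toy expression levels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out genes:", r2_score(y_test, model.predict(X_test)))
```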

four steps

Most machine learning applications in genomics can be described within the canonical machine learning workflow of four steps (a sketch of the workflow follows the list below)

  • data cleaning and pre-processing

  • feature extraction

  • model fitting/training

  • evaluation
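
A compact sketch of the four steps on simulated tabular data with scikit-learn; the preprocessing choices (mean imputation, standardization) and the logistic-regression model are placeholders, not recommendations for any particular genomic task.

```python
# Sketch of the canonical four-step workflow on simulated data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                       # simulated input samples x
y = (X[:, :3].sum(axis=1) > 0).astype(int)           # simulated output labels y
X[rng.random(X.shape) < 0.05] = np.nan               # simulate missing measurements

# Hold out a test set before any fitting, so evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 1-2: data cleaning / pre-processing and (here trivial) feature extraction,
# fitted on the training data only.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_f = prep.fit_transform(X_train)
X_test_f = prep.transform(X_test)

# Step 3: model fitting / training.
clf = LogisticRegression(max_iter=1000).fit(X_train_f, y_train)

# Step 4: evaluation.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_f)))
```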

It is customary to denote one data sample, including all covariates and features, as input x (usually a vector of numbers), and to label it with its response variable or output value y (usually a single number) when available.

Supervised machine learning models

aim to learn a mapping from inputs x to output labels y from examples of labeled (x, y) pairs.

Unsupervised machine learning

approaches aim to discover patterns from the data samples x themselves, without the need for output labels y.

Methods such as clustering and principal component analysis are typical examples of unsupervised models applied to biological data
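
A small sketch of both examples on simulated, unlabeled samples x; the number of components and clusters are arbitrary choices for illustration.

```python
# Unsupervised sketch: PCA for dimensionality reduction, k-means for clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated samples x only (no labels y): three shifted groups in 50 dimensions.
X = np.vstack([rng.normal(loc=c, size=(100, 50)) for c in (-2.0, 0.0, 2.0)])

X_2d = PCA(n_components=2).fit_transform(X)                               # principal component analysis
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)   # clustering
print(np.bincount(clusters))                                              # group sizes found from x alone
```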

Deep neural networks

Neural networks

An artificial neural network, initially inspired by neural networks in the brain, consists of layers of interconnected compute units (neurons).

The depth of a neural network corresponds to the number of hidden layers, and the width to the maximum number of neurons in one of its layers.

As it became feasible to train networks with larger numbers of hidden layers, artificial neural networks were rebranded as “deep networks”.

The term “neural network” largely refers to the hypothesis class part of a machine learning algorithm (a sketch of the three parts follows the list below):

  • Hypothesis: non-linear hypothesis functions, which involve compositions of multiple linear operators (e.g. matrix multiplications) and element-wise nonlinear functions
  • Loss: “Typical” loss functions for classification and regression: logistic, softmax (multiclass logistic), etc.
  • Optimization: Gradient descent
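
A bare-bones NumPy sketch of the three parts for binary classification: a one-hidden-layer hypothesis built from matrix multiplications and an element-wise ReLU, a logistic loss, and plain gradient descent; layer sizes, learning rate and data are arbitrary.

```python
# Sketch: hypothesis (linear maps + element-wise nonlinearity), logistic loss,
# gradient descent, all written out in NumPy on simulated data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary labels

# Parameters of a one-hidden-layer network (10 -> 16 -> 1).
W1, b1 = rng.normal(scale=0.1, size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for step in range(500):
    # Hypothesis: composition of matrix multiplications and an element-wise ReLU.
    h = np.maximum(X @ W1 + b1, 0.0)
    logits = (h @ W2 + b2).ravel()
    p = sigmoid(logits)

    # Loss: logistic (binary cross-entropy).
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

    # Gradients of the loss with respect to all parameters (backpropagation).
    d_logits = (p - y)[:, None] / len(y)
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    dh = d_logits @ W2.T
    dh[h <= 0] = 0.0                              # ReLU derivative
    dW1, db1 = X.T @ dh, dh.sum(axis=0)

    # Optimization: one gradient descent step.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", round(loss, 3), "training accuracy:", np.mean((p > 0.5) == y))
```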

architectures

Recurrent neural networks

  • Maintain a hidden state over time; the hidden state is a function of the current input and the previous hidden state (see the sketch after this list)
  • Traditional RNNs have trouble capturing long-term dependencies
  • The problem has to do with vanishing gradients: for many activations, such as the sigmoid, gradients get smaller and smaller over subsequent layers
  • One solution, long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), has a more complex structure that specifically encodes memory and pass-through features and is able to model long-term dependencies
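
A minimal NumPy sketch of the plain RNN recurrence from the first bullet, with the hidden state computed from the current input and the previous hidden state; dimensions and weights are arbitrary, and the LSTM's gated cell is only summarized in a comment.

```python
# Plain RNN recurrence: h_t = tanh(W_x x_t + W_h h_{t-1} + b), on a toy sequence.
import numpy as np

rng = np.random.default_rng(4)
input_dim, hidden_dim, seq_len = 8, 16, 20
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))     # one simulated input sequence
h = np.zeros(hidden_dim)                          # initial hidden state
for x_t in x_seq:
    # The new hidden state depends on the current input and the previous hidden state.
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h.shape)                                    # (16,)

# Gradients flow backwards through this same recurrence; repeated multiplication by
# W_h and tanh derivatives < 1 is what makes them vanish over long sequences.
# An LSTM replaces the single tanh update with gated input/forget/output paths,
# which lets information (and gradients) pass through largely unchanged.
```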

application

DeepBind

CNN architectures to predict the specificities of DNA-binding and RNA-binding proteins

Outperformed existing methods, recovered known and novel sequence motifs, and could quantify the effect of sequence alterations and identify functional SNVs

The neurons in the convolutional layer scan for motif sequences and combinations thereof, similar to conventional PWMs

The learning signal from deeper layers informs the convolutional layer which motifs are the most relevant. The motifs recovered by the model can then be visualized as sequence logos
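
A toy sketch of the motif-scanning idea (not the actual DeepBind architecture or code): a one-hot encoded DNA sequence is scanned by a single convolutional filter that plays the role of a PWM; the motif TGCA and the filter weights are invented for illustration.

```python
# Toy motif scan: a convolutional filter over one-hot DNA acts like a PWM.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) binary matrix."""
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, BASES.index(base)] = 1.0
    return m

# Hypothetical filter of width 4 that "prefers" the motif TGCA
# (rows = positions, columns = A, C, G, T); in DeepBind such weights are learned.
motif_filter = np.array([
    [0.0, 0.0, 0.0, 1.0],   # T
    [0.0, 0.0, 1.0, 0.0],   # G
    [0.0, 1.0, 0.0, 0.0],   # C
    [1.0, 0.0, 0.0, 0.0],   # A
])

seq = one_hot("ACGTGCATTT")
width = motif_filter.shape[0]
# Convolution step: slide the filter along the sequence and score every window.
scores = [float((seq[i:i + width] * motif_filter).sum())
          for i in range(len(seq) - width + 1)]
print(scores)   # the highest score marks the window containing TGCA
```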

paper

  • Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, doi:10.1038/nbt.3300

Need a sufficient amount of labelled data

Need to be trained, selected and tested on independent data sets to avoid overfitting and to ensure that the model will generalize

  • The training set is used to learn models with different parameters, which are then assessed on the validation set.

  • The model with the best performance, for example prediction accuracy or mean squared error, is selected and further evaluated on the test set to quantify the performance on unseen data and for comparison to other methods.

  • Typical data set proportions are 60% for training, 10% for validation and 30% for testing. If the data set is small, k-fold cross-validation can be used (a sketch follows this list)
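
A brief sketch of the described procedure with scikit-learn: a 60/10/30 train/validation/test split, model selection on the validation set, final evaluation on the test set, and k-fold cross-validation as the small-data alternative; the candidate models and data are placeholders.

```python
# Sketch: 60/10/30 train/validation/test split, model selection on the validation
# set, final evaluation on the test set, and k-fold CV for small data sets.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] > 0).astype(int)

# 30% held out for testing, then 1/7 of the remaining 70% (10% overall) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/7, random_state=0)

# Learn candidate models with different parameters and assess them on the validation set ...
candidates = [LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
              for c in (0.01, 0.1, 1.0)]
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# ... then quantify performance on unseen data with the test set.
print("test accuracy:", best.score(X_test, y_test))

# If the data set is small, use k-fold cross-validation instead of a fixed split.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold CV accuracy:", cv_scores.mean())
```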

Categorical features such as DNA nucleotides need to be encoded numerically. They are typically represented as binary vectors with all but one entry set to zero (one-hot encoding).

The four bits of each encoded base are commonly considered analogously to the color channels of an image, to preserve the identity of a nucleotide
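
A short sketch of this encoding; the A/C/G/T column order is a convention chosen here, and the transposed, channels-first view at the end illustrates the color-channel analogy.

```python
# One-hot encoding of DNA: each base becomes a binary vector with a single 1.
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}     # column order is a convention

def one_hot(seq):
    encoded = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        encoded[i, BASE_INDEX[base]] = 1.0
    return encoded

x = one_hot("GATTACA")
print(x.shape)      # (7, 4): sequence length x 4 bits per base
print(x.T.shape)    # (4, 7): channels-first view, analogous to an image's color channels
```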

