序列比对其他相关问题

2017-08-14, 1322 words, 6 min read

全局比对 Global Alignment

由芝加哥的 Needleman 和 Wunsch 两位于上个世纪 70 年代初提出，常被称之为 Needleman-Wunsch 算法。算法针对用户指定的打分函数，确定性地找出两条序列间的最优比对。

Needle-Wunsch 算法对两条序列所有残基进行全局比对的局限性。

功能相关的蛋白之间虽然可能在整体序列上相差甚远，却常常会具有相同的功能域
序列片段能够独立发挥特定的生物学功能，却在不同蛋白之间相当保守
仅靠全局比对的算法无法发现这样的片段
内含子的发现使得在做核酸水平的序列比对时必须要正确处理内含子导致的大片段的差异

局部比对 Local Alignment

1981 年，物理学家 Temple Smith 和数学家 Michael Waterman 在 Journal of Molcular Biology 上发表了一篇仅有四页的文章，提出 Smith-Waterman 局部比对算法

race back begins at the highest score inthe matrix and continues until you reach 0。And also the secondary best alignment

核心思想是给分数增加了下限 0 分

所有的回溯都是局部的，所有的最终比对也是局部的。引入止损下限，差异扩大之后重启比对，找到局部水平的相似性

空位罚分的改进

Affine gap penalty

opening a gap receives a score ofd
extending a gap receives a score of e
Penalty = d + (n-1)* e

有限向量机
alt

扩展公式

全局比对的时间复杂性

O(mn) 正比于 m*n

全基因组比对

同源 homology 的分类

直系同源 ortholog 来自于不同的物种，演化过程中基因没有丢失，各物种中都有

chaining
旁系同源 paralog 来自于一个物种内部基因组的复制

netting

NGS: Sequence alignment

Map the large numbers of short reads to a reference genome

In a broader sense: Identify similar sequences (DNA, RNA, or protein) inconsequence of functional, structural, or evolutionary relationships between the them
Applications: Genome assembly, SNP detection, homology search, etc

short: greater search sensitivity

large: faster search speed

In-exact alignment

BWA 和 bowtie 的相关算法，大大减少了对服务器的要求

如何快速的知道某段序列大约在基因组的哪个位置

如何定义大约这个概念

Hamming Distance or Sequence Similarity
Ungapped vs Gapped Global vs Local
All positions or the single best

Efficiency depends on the data characteristics & goals

Smith-Waterman: Exhaustive search for optimal alignments
BLAST: Hash-table based homology searches
Bowtie: BWT alignment for short read mapping

BWT 算法

核心是绕最后一个序列转圈

先给每一个 T 做 rotations, 再进行 sort, 生成 bw 矩阵，最后一列从头到尾就是 BWT
回溯

给每一个 T 的字母一个出现次数的排序，图示如下

Suffix（后缀） Arrays

类似于电话本中的索引结构

What if we need to check many queries

We don't need to check every page of the phone book to find 'Ma'
Sorting alphabetically lets us immediately skip 96% (25/26) of the book withoutany loss in accuracy

Sorting the genome: Suffix Array (Manber & Myers1991)

Sort every suffix of the genome

所有具有相同 prefix（前缀）的 suffixes（后缀）会聚在一起，这样就可以进行类似于二分法的排除

全基因组建立 index

An array of integers giving the startingpositions of suffixes of a string inlexicographical order
从中间的 index 开始找，过滤一半

效率

Total Runtime: O(m log n)

More complicated, but much faster!
Looking up a query loops 32 times instead of 3B

Searching the array is very fast, but it takes time to construct

This time will be amortized over many, many searches
Run it once "overnight" and save it away for all future queries
非常消耗内存

BLAST/Dot matrix

Indexing-based local alignment

Basic Local Alignment Search Tool

围绕最优比对路径进行计算

BLAST Ideas 核心思想：Seeding‐and‐extending

Find matches (seed) between the query and subject 找到高度相似的小片段，种子
Extend seed into High Scoring Segment Pairs (HSPs) 向两端延伸并进行比对

Run Smith‐Waterman algorithm on the specified region only
Assess the reliability of the alignment 打分评估

将序列切分，在数据库中定位候选序列和位置

得到候选序列和查询序列的 heatmap

去掉零散的 hit, 直留下对角线，形成 hit cluster

以 hit cluster 为基础向左右进行延伸直到分数不符合要去

在扩展的区域进行局部比对

blast 加速

标记低复杂度，易产生假阳性
考虑与种子相似的邻居单字

分数评估，避免随机因素

E value

n 数据库大小
k 和打分矩阵相关
m 长度
s 比对的分数

在随机情况下获得比当前分数高的可能比对条数，不是概率是个望值。p 为 0.05 时，E 也是 0.05。

BLAST 是一种启发式算法，不确定有最优解

只在有效区域应用动态规划算法

本文作者：思考问题的熊

如果你对这篇文章感兴趣，欢迎通过邮箱订阅我的 「熊言熊语」会员通讯，我将第一时间与你分享肿瘤生物医药领域最新行业研究进展和我的所思所学所想，点此链接即可进行免费订阅。

· 分享链接 https://kaopubear.top/blog/2017-08-14-longxing-bioinfo-blastmore/

生物信息