variant分析阶段小结3-对变异进行注释

2017-12-12, 4085 words, 20 min read

variant annotation

通过上面几步内容，我们找到了一些可信度相对高的突变位置，接下来一个很重要的内容就是对这些突变位点进行注释和功能预测。注释目前常用的工具有两种，一个是 snpEFF，另一个是 annovar。注释的思路也可以分为两类，一类是按照基因注释，另一类是按照位置注释。

SnpEFF

首先来说说SnpEFF，这个软件的用法说难不难，但是需要注意的地方不少。输入文件是前面生成的 filter 过的 vcf 文件（也可以是跑 ChIP-seq 等出来的 peak 文件），正式使用前首先需要下载好所研究物种的数据库，比对后的新 vcf 文件会在 INFO 这一列生成一个名字是 ANN 的 tag 。软件下载安装好之后，注释过程主要有三步：

查看官方已经准备好的数据库

据说目前已经有了超过 2500 个物种的注释信息，植物研究常用的拟南芥和水稻自然是有的。

java -jar snpEff.jar databases | less

下载自己需要的数据库

这里我们下载水稻的数据库信息为例

java -jar /home/zf/software/snpEff/snpEff.jar Oryza_sativa

下载好之后，会在软件安装目录中的 data 目录下出现名字是 Oryza_sativa，里面会有一个后缀是 bin 的文件，这个文件就是后面注释时要使用的文件。

画外音：开始我使用的是官方已经准备好的数据库，但是因为这个植物相关的数据库来自 Ensembl，里面的信息很多物种并不是最新的，而且会有很多我们不想要的注释内容存在。又因为水稻目前主要有来自 MSU 和 RAPDB 的两版注释信息，所有我们选择重新构建自己的水稻注释数据库。

构建自己的数据库

# 修改 snpEff.config 文件

vi snpEff.config

# 添加如下两行信息，含义是构建两个版本的Rice数据库
# 分别是msu和rapdb
-----
rapdb_rice.genome: Rice
msu_rice.genome: Rice
-----

mkdir ./data/msu_rice
mv ./data/msu_rice
# 下载相应genome,cds,protein文件和gtf文件到该文件夹
# 利用gffread 把gff转换为gtf
# 重命名如下
data/msu_rice/
├── cds.fa
├── genes.gtf
├── protein.fa
├── sequences.fa

# build database
java -Xmx20g -jar snpEff.jar build -v msu_rice 2>&1 | tee msu_rice.build

# 构建完成后结构如下

data/msu_rice/
├── cds.fa
├── genes.gtf
├── protein.fa
├── sequences.fa
├── snpEffectPredictor.bin # 后续注释需要的文件

对 vcf 结果文件进行注释

java -Xmx32g -jar /home/zf/software/snpEff/snpEff.jar msu_rice filtered_freeabyes.vcf -v > filtered_freeabyes_anno.vcf

说明

在注释的过程中，会输出所用数据库统计信息和运行中产生的警告。如下所示，可以看到我注释使用的 Rice Genome 的 msu 版本。一共有 55986 个基因。

00:00:12	Genome stats :
#-----------------------------------------------
# Genome name                : 'Rice'
# Genome version             : 'msu_rice'
# Genome ID                  : 'msu_rice[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 55986
# Protein coding genes       : 55986
#-----------------------------------------------
# Transcripts                : 66338
# Avg. transcripts per gene  : 1.18
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences :  66112 ( 99.66% )
#              DNA sequences :  66135 ( 99.69% )
#-----------------------------------------------
# Protein coding transcripts : 66338
#              Length errors :     25 ( 0.04% )
#  STOP codons in CDS errors :      5 ( 0.01% )
#         START codon errors :      0 ( 0.00% )
#        STOP codon warnings :      0 ( 0.00% )
#              UTR sequences :  35273 ( 53.17% )
#               Total Errors :     25 ( 0.04% )
#-----------------------------------------------
# Cds                        : 292025
# Exons                      : 312496
# Exons with sequence        : 312496
# Exons without sequence     : 0
# Avg. exons per transcript  : 4.71
# WARNING                    : No mitochondrion chromosome found
#-----------------------------------------------
# Number of chromosomes      : 14
# Chromosomes                : Format 'chromo_name size codon_table'
#		'1'	43270923	Standard
#		'3'	36413819	Standard
#		'2'	35937250	Standard
#		'4'	35502694	Standard
#		'6'	31248787	Standard
#		'5'	29958434	Standard
#		'7'	29697621	Standard
#		'11'	29021106	Standard
#		'8'	28443022	Standard
#		'12'	27531856	Standard
#		'10'	23207287	Standard
#		'9'	23012720	Standard
#		'Un'	633585	Standard
#		'Sy'	592136	Standard

...
...
WARNINGS: Some warning were detected
Warning type	Number of warnings
INFO_REALIGN_3_PRIME	171
WARNING_TRANSCRIPT_INCOMPLETE	126
WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS	14

rapdb 版本的基因组信息如下

00:00:07	Genome stats :
#-----------------------------------------------
# Genome name                : 'Rice'
# Genome version             : 'rapdb_rice'
# Genome ID                  : 'rapdb_rice[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 37851
# Protein coding genes       : 35667
#-----------------------------------------------
# Transcripts                : 44618
# Avg. transcripts per gene  : 1.18
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences :  42215 ( 99.99% )
#              DNA sequences :  42219 ( 94.62% )
#-----------------------------------------------
# Protein coding transcripts : 42219
#              Length errors :   1254 ( 2.97% )
#  STOP codons in CDS errors :      0 ( 0.00% )
#         START codon errors :   9239 ( 21.88% )
#        STOP codon warnings :    677 ( 1.60% )
#              UTR sequences :  38131 ( 85.46% )
#               Total Errors :  10099 ( 23.92% )
#-----------------------------------------------
# Cds                        : 164901
# Exons                      : 196575
# Exons with sequence        : 196575
# Exons without sequence     : 0
# Avg. exons per transcript  : 4.41
# WARNING                    : No mitochondrion chromosome found
#-----------------------------------------------
# Number of chromosomes      : 12
# Chromosomes                : Format 'chromo_name size codon_table'
#		'1'	43270923	Standard
#		'3'	36413819	Standard
#		'2'	35937250	Standard
#		'4'	35502694	Standard
#		'6'	31248787	Standard
#		'5'	29958434	Standard
#		'7'	29697621	Standard
#		'11'	29021106	Standard
#		'8'	28443022	Standard
#		'12'	27531856	Standard
#		'10'	23207287	Standard
#		'9'	23012720	Standard

结果文件

snpEff_genes.txt
snpEff_summary.html
filtered_indels_gatk_anno.vcf

其中 html 文件会给出主要（详细）的注释统计信息，所有关心的统计信息这个 html 里面都有而且还进行了简单的可视化，除了丑没啥大毛病。我这里仅用了之前 freebayes 产生的结果作为测试文件。例如：

整体情况

突变的影响和功能

突变分布的区域

再比如 SNP 的类型

snpEFF_gene.txt 这个文件里面统计了每一个基因的相关突变情况，其基因信息主要来自 Ensembl 数据库。

filtered_indels_gatk_anno.vcf 是主要的注释 vcf 文件，这里面和源文件相比变化的是 INFO 列 ANN tag , 形式如下所示，每两个竖线之间具体表示一类信息。总的来说一条注释一共有 16 列,有些不一定每个位点都有，如果一个位点有多条注释则使用逗号分隔。可以参考官网的详细解释。我从生成的文件中提取了一个 ANN 注释信息，为了方便说明，整理成如下表格。

内容	解释
T	Allele (or ALT)
missense_variant	Annotated using Sequence Ontology terms.
MODERATE	Putative_impact A simple estimation of putative impact / deleteriousness : {HIGH, MODERATE, LOW, MODIFIER}
LOC_Os01g01380	Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”
LOC_Os01g01380	Gene ID
transcript	Feature type preferred to use Sequence Ontology (SO) terms
LOC_Os01g01380.1	Feature ID
protein_coding	Transcript biotype
6/6	Rank / total Exon or Intron rank / total number of exons or introns.
c.341C>T	HGVS.c Variant using HGVS notation (DNA level)
p.Ser114Phe	HGVS.p: If variant is coding, this field describes the variant using HGVS notation (Protein level)
1399/3482	cDNA_position / cDNA_len: Position in cDNA and trancript's cDNA length (one based).
341/1566	CDS_position / CDS_len: Position and number of coding bases (one based includes START and STOP codons).
114/521	Protein_position / Protein_len: Position and number of AA (one based, including START, but not STOP).
	Distance to feature All items in this field are options, so the field could be empty. Up/Downstream: Distance to first / last codon Intergenic: Distance to closest gene Distance to closest Intron boundary in exon (+/- up/downstream). If same, use positive number. Distance to closest exon boundary in Intron (+/- up/downstream) Distance to first base in MOTIF Distance to first base in miRNA Distance to exon-intron boundary in splice_site or splice _region ChipSeq peak: Distance to summit (or peak center) Histone mark / Histone state: Distance to summit (or peak center)
	Errors, Warnings or Information messages

多注释问题

通常情况下，一个位点会出现多个注释信息，比如可能是一个基因的上游，也可能是一个基因的下游，或者一个基因有过个转录本，再或者位点本身就是 MNPs。既然是多个就有一个谁在前谁在后的问题，SnpEff 的规则是影响大的在前，突变有害的在前，实在没得挑了，按照基因位置排序。如果一个位点有很多条注释信息并不利于我们进行后续统计，这是可以使用软件本身提供的一些脚本进行处理。

如果想保留所有的注释信息，可以让每一个注释信息独立一行

cat filtered_freeabyes_anno.vcf |/home/zf/software/snpEff/scripts/vcfEffOnePerLine.pl |less

也可以每个变异只保留第一条注释

cat filtered_freeabyes_anno.vcf |/home/zf/software/snpEff/scripts/vcfAnnFirst.py |less

进行 regulatory 或者 non-coding 注释

annovar

这个软件也还行，不过用起来没有 SnpEff 那么顺手，一是需要转换一下 vcf 文件变成软件支持的格式，二是生成的结果文件比较简单（是缺点但是有时候也是优点），三是生成数据库的时候需要对注释文件进行一些转换，格式但凡有些问题就会很心累。当然，如果做的是人的研究不存在这个问题，而且关于人各种注释信息 annovar 都支持的非常好。

研究植物的话，无论基于基因还是位置的注释都需要自己生成 gff3 或者 bed 文件。

使用过程主要有一下几个步骤。

生成注释数据库

和使用 SnpEff 类似的，首先准备好水稻两个版本注释的 gtf 文件和基因组文件。

首先要把 gtf 文件（如果是 gff 文件需要先用 gffread 转换成 gtf 文件）转换为 genepred 文件，话说这个 genepred 文件的有点还是非常之多的，介绍可以看我之前写的一篇博客。转换的工具是 gtfToGenePred，要想用这个软件，得先安装 kentUtils。多少有点麻烦。

# 转换rapdb文件
gffread transcripts.gff -T -o transcripts.gtf
gtfToGenePred -genePredExt transcripts.gtf rap_refGene.txt

# 报错
# transcripts.gtf doesn't appear to be a GTF file (GFF not supported by this program)

# 转换 msu文件
gffread genes.gff3 -T -o msu.gtf
gtfToGenePred -genePredExt msu.gtf msu_refGene.txt

#正常

神奇不神奇，惊喜不惊喜！一样的命令，msu 的文件转换成功但是 rapbd 的就不行。于是查找原因。

head -n1 msu.gtf
Chr1	MSU_osa1r7	exon	2903	3268	.	+	.	transcript_id "LOC_Os01g01010.1"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01g01010";

head -n1 transcripts.gtf
Chr01	irgsp1_rep	exon	2983	3268	.	+	.	transcript_id "Os01t0100100-01"

问题出来了，rapdb 转换的 gtf 文件少了两个 gene_id 和 gene_name，这是为什么呢？

再看原始 gff 文件

msu
Chr1	MSU_osa1r7	gene	2903	10817	.	+	.	ID=LOC_Os01g01010;Name=LOC_Os01g01010
Chr1	MSU_osa1r7	mRNA	2903	10817	.	+	.	ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Parent=LOC_Os01g01010
Chr1	MSU_osa1r7	exon	2903	3268	.	+	.	ID=LOC_Os01g01010.1:exon_1;Parent=LOC_Os01g01010.1

rapdb
chr01	irgsp1_rep	mRNA	2983	10815	.	+	.	ID=Os01t0100100-01;Name=Os01t0100100-01;Locus_id=Os01g0100100
chr01	irgsp1_rep	five_prime_UTR	2983	3268	.	+	.	Parent=Os01t0100100-01
chr01	irgsp1_rep	five_prime_UTR	3354	3448	.	+	.	Parent=Os01t0100100-01
chr01	irgsp1_rep	CDS	3449	3616	.	+	0	Parent=Os01t0100100-01

问题出来了，rapdb 的 gff 文件没有 gene 信息，也没有 gene_id，很尴尬。

sed 's/Locus_id/gene_id/;s/Name/gene_name/' transcripts.gff > new.gff
gffread new.gff -T -o new.gtf
awk -F';' '{printf $1";"$2";"$2"\n"}' new.gtf |sed 's/gene_id/gene_name/' > rapdb.gtf
gtfToGenePred -genePredExt rapdb.gtf rap_refGene.txt

到这里终于有了需要的基因注释文件

LOC_Os01g01010.1	Chr1	+	2902	10817	3448	10297	12	2902,3353,4356,5456,7135,8027,8231,8407,9209,10103,10273,10503,	3268,3616,4455,5560,7944,8150,8320,8608,9617,10187,10430,10817,	0	LOC_Os01g01010	incmpl	incmpl	-1,0,0,0,2,1,1,0,0,0,0,-1,

Os01t0100100-01	Chr1	+	2982	10815	3448	10297	12	2982,3353,4356,5456,7135,8027,8231,8407,9209,10101,10273,10503,	3268,3616,4455,5560,7944,8150,8320,8608,9615,10187,10430,10815,	0	Os01g0100100	incmpl	incmpl	-1,0,0,0,2,1,1,0,0,1,0,-1,

接下来是生成 m 的 fRNAasta 文件

retrieve_seq_from_fasta.pl --format refGene --seqfile ../msu_rice/sequences.fa msu_refGene.txt --out msu_refGeneMrna.fa
retrieve_seq_from_fasta.pl --format refGene --seqfile ../rap_rice/sequences.fa rap_refGene.txt --out rap_refGeneMrna.fa

然后把这两个文件分别放在 annovar 下的 msu 和 rap 文件夹即可。

生成输入文件

convert2annovar.pl -format vcf4 filtered_samtools.vcf > filtered_samtools.annovar.txt

根据基因信息进行注释

annotate_variation.pl -buildver msu filtered_samtools.annovar.txt /home/zf/software/annovar/msu/ --geneanno --outfil filtered_samtools.annovar

结果文件

filtered_samtools.annovar.variant_function

intergenic	LOC_Os01g01100(dist=1986),LOC_Os01g01110(dist=1684)	Chr1	55398	55398	A	G	hom	38.415	2
exonic	LOC_Os01g01380	Chr1	194123	194123	C	T	hom	39.4149	2
exonic	LOC_Os01g01380	Chr1	194204	194204	C	T	hom	102	4
exonic	LOC_Os01g01380	Chr1	194698	194698	C	T	hom	39.4149	2

filtered_samtools.annovar.exonic_variant_function

line2	nonsynonymous SNV	LOC_Os01g01380:LOC_Os01g01380.1:exon6:c.C260T:p.A87V,	Chr1	194123	194123	C	T	hom	39.4149	2
line3	nonsynonymous SNV	LOC_Os01g01380:LOC_Os01g01380.1:exon6:c.C341T:p.S114F,	Chr1	194204	194204	C	T	hom	102	4
line4	nonsynonymous SNV	LOC_Os01g01380:LOC_Os01g01380.1:exon6:c.C835T:p.L279F,	Chr1	194698	194698	C	T	hom	39.4149	2
line5	synonymous SNV	LOC_Os01g01380:LOC_Os01g01380.1:exon6:c.T894C:p.D298D,	Chr1	194757	194757	T	C	hom	74	3

关于结果的解读可以参考官网，和 SnpEff 相比，variant_function 文件第一列是突变的位置，第二列是相关基因，后面就是输入文件。而 exonic_variant_function 则是专门针对外显子区域的突变进行注释。第一列的信息是在原始注释文件的行数，第二列是突变的种类，和 SnpEff 不同的是，如果有多个突变注释信息，annovar 只会根据突变的权重列出最重要的一个。

关于突变区域的解释

Value	Default precedence	Explanation	Sequence Ontology
exonic	1	variant overlaps a coding	exon_variant (SO:0001791)
splicing	1	variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)	splicing_variant (SO:0001568)
ncRNA	2	variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation)	non_coding_transcript_variant (SO:0001619)
UTR5	3	variant overlaps a 5' untranslated region	5_prime_UTR_variant (SO:0001623)
UTR3	3	variant overlaps a 3' untranslated region	3_prime_UTR_variant (SO:0001624)
intronic	4	variant overlaps an intron	intron_variant (SO:0001627)
upstream	5	variant overlaps 1-kb region upstream of transcription start site	upstream_gene_variant (SO:0001631)
downstream	5	variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this)	downstream_gene_variant (SO:0001632)
intergenic	6	variant is in intergenic region	intergenic_variant (SO:0001628)

关于突变类型的解释

Annotation	Precedence	Explanation	Sequence Ontology
frameshift insertion	1	an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_elongation (SO:0001909)
frameshift deletion	2	a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_truncation (SO:0001910)
frameshift block substitution	3	a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence	frameshift_variant (SO:0001589)
stopgain	4	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as "stopgain"!	stop_gained (SO:0001587)
stoploss	5	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site	stop_lost (SO:0001578)
nonframeshift insertion	6	an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence	inframe_insertion (SO:0001821)
nonframeshift deletion	7	a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence	inframe_deletion (SO:0001822)
nonframeshift block substitution	8	a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence	inframe_variant (SO:0001650)
nonsynonymous SNV	9	a single nucleotide change that cause an amino acid change	missense_variant (SO:0001583)
synonymous SNV	10	a single nucleotide change that does not cause an amino acid change	synonymous_variant (SO:0001819)
unknown	11	unknown function (due to various errors in the gene structure definition in the database file)	sequence_variant (SO:0001060)

基于位置注释

基于位置注释有多种具体的思路，比如可以是组蛋白修饰区域，转录因子集合区域，或者是 miRAN 区域，也可以是基因组上的重复片段，还可以是基因组上一些特征区域比如启动子。

其中重复区域的注释比较重要，可以参考下面这段解释

Genetic variants that are mapped to segmental duplications are most likely sequence alignment errors and should be treated with extreme caution. Sometimes they manifest as SNPs with high fold coverage and probably high confidence score, but they may actually represent two non-polymorphic sites in the genomes that happen to have the same flanking sequence.

针对植物研究而言没有现成的数据下载，所以通常使用我们自己分析生成的 gff3 文件或者 bed 文件来进行注释。

比如我们利用 repeatmasker 软件找到了水稻中的简单重复区域，可以利用其生成的注释文件对我们的 vcf 结果进行基于位置信息的进一步注释。

说到 RepeatMasker 这个软件就再说两句，软件安装完成后利用以下命令，可以完成 repeat 的寻找

RepeatMasker -parallel 10 -species rice -gff -dir repeat rice.fa

会得到四个结果文件，其中 gff 和.out 文件是需要的，但是这个 gff 文件是 gff2 格式，并不是 annovar 要求的 gff3 格式（每一列必须有 ID=），虽然这个软件有一个脚本可以 out 文件改为 gff3，但还是不符合要求，需要讲结果文件的 Target=改为 ID=

cat /home/zf/software/annovar/msu/rice_repeat.gff3|sed 's/Target=/ID=/' |cut -d' ' -f1 >rice_repeat.gff

annotate_variation.pl -regionanno -buildver msu -dbtype gff3 -gff3dbfile rice_repeat.gff filtered_samtools.annovar.txt /home/zf/software/annovar/msu/ --outfile filtered_samtools.annovar

注释结果: 在测试文件的 9 万个变异位点中，有 1088 个注释到了重复区域。

gff3	Score=14;Name=(AAT)n	Chr1	444478	444478	-	GTA	hom	121	3
gff3	Score=14;Name=(AGGAA)n	Chr1	528020	528020	-	AAAGGGAAGAG	hom	30.4183	1
gff3	Score=60;Name=(AT)n	Chr1	563936	563936	A	T	hom	36.4154	2
gff3	Score=26;Name=(CAATAC)n	Chr1	574725	574725	A	G	hom	36.4154	2

可以看出富集最多的重复区域是 A-rich

cut -f2  filtered_samtools.annovar.msu_gff3|cut -d';' -f2|sort \
|uniq -c |sed 's/ *//'|tr ' ' '\t' |sort -k1,1nr|head

118	Name=A-rich
57	Name=(AT)n
40	Name=(TA)n
28	Name=GA-rich
16	Name=(TC)n
14	Name=(ATAT)n
14	Name=(ATT)n
12	Name=(AG)n

下一篇讲一些常用的下游分析

本文作者：思考问题的熊

如果你对这篇文章感兴趣，欢迎通过邮箱订阅我的 「熊言熊语」会员通讯，我将第一时间与你分享肿瘤生物医药领域最新行业研究进展和我的所思所学所想，点此链接即可进行免费订阅。

· 分享链接 https://kaopubear.top/blog/2017-12-12-variant3annotation/

variant