# 解决 Stringtie 基因重复定量的意外收获

Bioinformatics was like a box of chocolates. You never konw what you’re gonna get ——阿飞正传

## 惊喜

I am currently running stringtie on Arabidopsis rna-seq data sets and I am encountering duplication events that vary from sample to sample. The command I am running is:

I am using the arabidopsis gtf downloaded from TAIR, and the reads I am running are downloaded from NCBI. My workflow is trimmomatic, hisat2, samtools_sort, samtools_index, stringtie. I am running stringtie v1.3.4d. My workflow was run on a llinux slurm cluster, and I have also verified the same results on my local machine (Ubuntu 18.04).

I typically have 1 or 2 genes per run that have duplication events.

A colleague running the same workflow on bovine data is having the same issue.

Sorry, the name of the gtf was confusing, it makes more sense in the context of our workflow. Yes, the gtf was produced from the file on the TAIR page named

I cannot attach my file due to size constraints, but by downloading the above file and using that command you should be able to get the same reference file to work with. The file that results from the above command does not appear to have duplicates.

In the output files from stringtie on my arabidopsis data, each file seems to have a random duplication of 1 of three genes. Those genes are:

Here is a subset of my list of my output files (whole list is very long), and below each file name is what gene is duplicated (bash uniq command). The name of the file indicates the srx ID from NCBI. Each of these samples had multiple runs associated with it (sra). Before running through the pipeline these “sra” were combined into one large fastq file corresponding to their respective “srx”. You will notice that each file had different combinations of genes that were duplicated, but that they were always 1 of the 3 genes listed above.

As I was thinking about it, I got paranoid that this combining of sra’s into srx may have caused the issue, so I redid it with a sample without combining (used sample SRR4426362). I got the same results. The attached files are a subset of the resulting *.bam that will cause a duplication event when used with this command and the above mentioned reference file:

I used this command to see the duplicated gene in the gene abundance file:

I have also included the output gene abundance file from this subset bam.

I went through all of this again to verify that the results are still happening. Our workflow is using the latest editions of all of the other software as well.

I had to rename the attached files with a “.txt” extension to be able to upload them to git hub, so I hope that the bam still work when you get it.

Ah, I thought you somehow suggested that transcripts were duplicated (the fix mentioned in the release notes for v1.3.1 was about that), but now I see that your problem is that genes are duplicated in the “gene abundance” output file.

That’s actually not quite true; each line in that output file is rather about estimating abundance for”gene”regions (loci), formed by overlapping transcripts, and in some cases such gene regions are not uniquely identified by their ID/name, the coordinates (genomic location) of each such”gene region” (locus) sometimes make a very important distinction.

Look at the transcripts for that gene (AT1G58520) in the annotation file: there are 7 transcripts there, but one of them (AT1G58520.3) is actually not overlapping any of the others, so StringTie treats it as a separate locus (“gene”), and thus it creates a separate entry in the gene abundance file for that particular locus (AT1G58520|Chr1(+)21729913-21731344) which is different from the other, “main” locus which has all the other overlapping transcripts: AT1G58520|Chr1(+)21732566-21738808

StringTie cannot “trust” the reference annotation, as sometimes the gene ID can be just a duplicated string providing no true locus identity.. So you can think of it as StringTie splitting that “gene” into two non-overlapping gene regions and assessing the expression for each gene region independently. Again, the identity of a “gene” (actually “locus”) in that file is given by gene ID and the genomic location of the underlying cluster of transcripts which define that locus.

As you can see, that single transcript gene region (locus Chr1(+):21729913-21731344) has zero coverage so I don’t know if that gene region is even real (or perhaps just an annotation artifact).

If you really do not care about these situations (where gene definitions are not quite consistent with the transcripts so StringTie separates them like this), you could just add up the coverage values for all these lines with the same gene ID in the gene abundance file and call that the “total” abundance of that “gene” – but this could be really misleading if the gene ID is not uniquely identifying a locus, i.e. if it’s duplicated in other places on the genome (perhaps even on another chromosome).

Ok, that makes a lot of sense, I like that stringtie identifies genes this way so that it does not combine only on a gene ID. What I am seeing though is that stringtie will “duplicate” some genes in only some of the output files.

It would make sense if it did it all the time, but I only see it happening in some of them. I guess what I am concerned about is that stringtie is Identifying different amounts of genes each time. So far for arabidopsis I have seen it identify either 37363 or 37364 genes.

From your last response, it seems that it should be identifying scenarios like the AT1G58520 and the AT1G58520.3 every single time it is run, not just occasionally. I was looking back at my data, and it appears that yes, this is true for AT1G58520, but it is not for ATCG09900. At the bottom of this response I have a grep of my gene abundance files for gene ATCG09900, and it appears that in some samples stringtie identifies 2 copies, but sometimes only 1.

I looked at the reference.gtf, and it looks like it should have 2 copies every time like AT1G58520 because even though it has the same gene ID, it has different genomic location. Your response above makes a lot of sense to me, stringtie should require both gene ID and genomic location to avoid bad reference.gtf files. My problem now is that I don’t understand why in the case of ATCG09900 it is not being consistent across samples? I would think that the part of stringtie that identifies the gene from the reference.gtf would be independent of the sample.

Here is the reference.gtf for gene ATCG09900, I would expect that stringtie would always identify it as 2 genes from your last response:

Here is a section of a grep of my output gene abundance files that shows that ATCG09900 is sometimes represented by 2 copies, and sometimes only 1. It seems that its representation as 2 copies is independent of if it is actually expressed or not.

Good question – apologies for leaving out an important piece of information in my previous explanation: the fact that overlaps with read alignments from the BAM file are also taken into account when determining a “gene region” (locus), so they can act as a “glue”, or “bridge” which could put together “gene regions” otherwise separated when looking only at overlaps between reference transcripts. So it is likely that in some samples the two gene regions of ATCG09900 get “bridged” by read alignments overlapping both those regions.. Thus only one gene region is reported for such samples.

It is all related to the concept of “bundle” as used by StringTie. Read alignments and reference transcripts are binned together in a “bundle” defined as a transitive closure of the exon overlap relationship between these objects (read alignments or reference transcripts (guides)). StringTie analyses a “bundle” and unless “weak spots” are found (spurious alignments) to break the bundle into multiple regions, the “bundle” will end up as a “gene region” (locus) reported in the output.

This “clustering by overlap” approach can also have the downside of merging multiple otherwise clearly distinct gene regions together (i.e. with different reference gene IDs), when alignment artifacts and/or transcriptional noise artificially “bridge” over intergenic regions, in some rare cases (especially when the actual genes are very close to each other).

Ahh, very good. So the samples where there is only 1 gene region means that there was either a bridge between the two transcripts, or there was no spurious alignment to break apart the bundle (in my case, this often meant no alignment).

With that info, I think I am able to answer my last question on my own. Above, some of the samples has genes that had fpkm of 0, but they still split, while others had fpkm of 0 and did not split. For example:

It would appear that SRX2248582 and SRX2248583 should both fail to break the bundle since no alignment has happened in either that would cause the bundle to break. This is the case in SRX2248583 but not in SRX2248582.

I went and looked at the sam file for these, and found that SRX2248582 had a read in that region, while SRX2248583 did not (I also checked the region 100 before, but none overlapped with the region so for brevity I will omit them):

I suppose that this spurious alignment caused the split, but was not counted towards the fpkm for some other reason.

Thank you for all of your help and quick response times, I appreciate it a lot.