Modeling RNA degradation for RNA-Seq with applications
LIN WAN a, XITING YAN, TING CHEN b, FENGZHU SUN *c
a Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA and Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
b Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA
c Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA and Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, Beijing 100084, People’s Republic of China. E-mail: fsun@usc.edu
Abstract
RNA-Seq is widely used in biological and biomedical studies. Methods for estimating transcript abundance from RNA-Seq data have been intensively studied, and many of them assume that the short reads are uniformly distributed along the transcripts. However, short reads are found to be nonuniformly distributed along the transcripts, which can greatly reduce the accuracy of methods based on the uniformity assumption. Several methods have been developed to adjust for the biases induced by this nonuniformity, using the empirical distribution of short reads along transcripts. As an alternative, we found that RNA degradation plays a major role in the formation of the nonuniform short-read distribution, and we therefore developed a new approach that quantifies this distribution by precisely modeling RNA degradation. Our model of RNA degradation fits RNA-Seq data quite well, and based on this model we further developed a new statistical method to estimate the transcript expression level, as well as the RNA degradation rate, for individual genes and their isoforms. We showed that our method can improve the accuracy of transcript isoform expression estimation. The estimated RNA degradation rates of individual transcripts are consistent across samples and/or experiments/platforms. In addition, the RNA degradation rate from our model is independent of RNA length, consistent with previous studies on RNA decay rates.
Keywords: EM algorithm; Gene expression; Next generation sequencing; RNA degradation; RNA-Seq.