As the first step in performing the lentiviral mutagenesis, we developed a biochemical and analytical pipeline to detect lentiviral insertions in the genome that we named InSET (Insertion-based Screen for functional Elements and Transcripts) (Fig. 1, Additional file 1: Fig. S1, Methods). The biochemical basis of the method was similar to the FLEA-PCR [35] method and was further adapted for the NGS Illumina platform (Additional file 1: Fig. S1a and Methods). Then, we applied the pipeline to map insertions in 3 independently generated cell libraries — HepG2.LTR, K562.LTR, and K562.LTR2, each of which harbors lentivirus insertions of >5 kb in the genome that should disrupt the genomic elements (e.g., exons) in the insertion sites (Fig. 1 and Methods). In addition, due to the presence of enhancer/promoter sequences in the vectors required to drive expression of fluorescent proteins to allow for selection of transfected cells, certain amount of transactivation effect on nearby genes could be expected (see below).
Fig. 1Flow chart diagram illustrating the steps of the project. The diagram depicts (1) biological systems used in this this study, (2) concept and validation of InSET method, and (3) discovery and characterization of novel transcripts and genes corresponding to the novel functional elements found by InSET method
The 3 libraries were used in 4 different phenotypic assay systems. The HepG2.LTR and K562.LTR cell lines were expanded for 2 days post transfection, after which the cells were split and, for each replicate, one aliquot was used for genomic DNA isolation and another aliquot was allowed to grow with normal frequency of passaging for 1 month and used for genomic DNA isolation. The K562.LTR2 cell line was allowed to expand for 3 months after the transfection and then subjected to survival experiment in response to two anticancer drug treatments essentially as described previously [29]. Briefly, cells were treated with 0.5 μM imatinib or 40 μM etoposide for 48 and 24 h respectively, then resuspended in fresh medium for recovery and passaged daily until most cells recovered the normal shape or the doubling rate of the untreated cells. Five and one rounds of drug treatment and recovery were performed for imatinib and etoposide respectively. The insertional mutagenesis screens in each phenotypic assay system were performed in 3 biological replicates.
As the first step, we evaluated the potential of lentiviral insertions to affect cellular fitness since only one copy of a functional element would be inactivated by an insertion that would cause relatively modest knockdown in the aneuploid cancer genome like the one present in K562 cells [36]. Indeed, inactivation-based insertional mutagenesis has so far been tested only in the very few mammalian haploid or near-haploid cell lines like KBM7 or HAP1 [30,31,32,33]. If insertions affect fitness, then cells harboring them would be lost, resulting in depletion of insertions in essential genes — a phenomena that has been observed in retroviral mutagenesis based on a haploid cell line [33]. To investigate whether this is indeed the case, we took advantage of the high-throughput CRISPR/Cas9 screen by Wang et al. aimed at identifying essential genes in K562 cell line [22]. Each gene in that study was assigned a CRISPR score (CS) defined by the authors as the average log2 fold change in the abundance of all single-guide (sg) RNAs targeting a given gene [22]. Negative CS score means depletion of cells expressing sgRNAs targeting specific gene and would thus signify importance of the gene for cellular fitness. Thus, such genes should be less likely to contain insertions.
To test this, unique positions of insertions in the genome and number of insertions in each such position (depth) were determined for each sample and mapped to various types of genomic elements (Fig. 1 and Additional file 1: Fig. S1b). Since an insertion in an exon of a gene should destroy the function of that gene, we calculated density of insertions in exons of genes with CS < 0 and CS > 0 for every K562 sample based on unique positions of insertions (Fig. 2a and Additional file 2: Table S1). Indeed, for every sample, we observed a lower density of insertions in exons of the genes with CS < 0 compared to the genes with CS > 0 (Fig. 2a). This trend was statistically significant (p value of 1.4E−03, two-sided paired Student’s t test). We also observed lower average depth of insertions in the exons of genes with CS < 0 compared to depth in those with CS > 0 with the corresponding p value of 4.5E−03 (two-sided paired Student’s t test) (Fig. 2a and Additional file 2: Table S1). The reason for using both metrics (density of unique insertions and average depth) is that some insertion positions may have a very high depth and skew the results and using unique insertions would obviate this potential problem. On the other hand, the metric based on unique insertions could have lower sensitivity. Theoretically, true signal should have changes in both metrics.
Fig. 2Validation of InSET method. a Density of positions and average depth of insertions in the exons of the CS > 0 and CS < 0 genes. b Average TPM of all CS > 0 or CS < 0 genes and of those with insertions in exons in the growth or drug survival challenge assays. c, d Density of positions and average depth of insertions in the exons of the SS ≤ 0 and SS > 0 lncRNAs (c) and in the various distance bins upstream or downstream of TSSs of the CS > 0 and CS < 0 genes (d). e Density of positions and average depth of insertions in the introns and not within ±10 kb region of TSSs of the CS > 0 and CS < 0 genes. f Average TPM of all genes and those with IACFs in exons. For average and median depth of insertions, insertion positions with insertion number greater than or equal to 10,000 were removed as outliers to avoid the bias. For density of positions and average depth of insertions, error bars indicate the SD based on 5 K562 sample types. For average TPM, error bars indicate SD of 2 biological replicates. Red asterisks indicate significant differences under two-sided paired Student’s t test (p value < 0.05). Black asterisk indicates p value = 0.05. Source data are provided in Additional file 2: Tables S1-S3
An additional proof of the bias against the essential genes came from analysis of expression levels of all essential genes and those with insertions. As shown in Fig. 2b and Additional file 2: Table S2, the essential genes (CS < 0) with or without insertions had a strong tendency to have higher expression levels, as defined by transcripts per million (TPM) metrics measured by RNA-seq in untreated K562 cells, compared to the CS > 0 genes. Moreover, among the CS < 0 genes, the expression levels of genes with insertions were significantly lower than all genes with CS < 0 (p value < 0.05, two-sided paired Student’s t test) (Fig. 2b and Additional file 2: Table S2). These observations suggest that cells harboring insertions in higher expressed essential genes have lower survivability and thus are preferentially lost in the screens.
We then explored whether insertional mutagenesis can affect cellular fitness via inactivation of functional genomic elements other than exons of protein-coding genes. Since the annotated genes tested in the above study by Wang et al. [22] are mostly represented by canonical protein-coding mRNAs, we further tested effects of insertions in lncRNAs using a similar approach. The study by Liu et al. has conducted a genome-wide CRISPR/Cas9 screen for functional lncRNAs in K562 cell line and generated a screen score (SS) for every lncRNA transcript [37]. A lncRNA with a higher SS was considered to be more essential [37]. Similar to the results in the annotated genes above, the density and average depth of insertions in the exons of lncRNAs with SS > 0 were significantly lower than the corresponding values for the lncRNAs with SS ≤ 0, with the corresponding p values of 9.0E−04 and 8.2E−03 (two-sided paired Student’s t test) (Fig. 2c and Additional file 2: Table S3). Altogether, these results strongly suggest that detection of insertions is biased against more essential genes and lncRNAs, most likely because cells containing those insertions do not survive even after a very short time (2 days) in culture after lentivirus integration. These observations are consistent with the previous results obtained with a haploid cell line [33] and suggest that inactivation of one copy of a genomic element, such as an exon, via lentiviral insertion in an aneuploid cell line can have a measurable effect on fitness, and therefore insertional mutagenesis can reveal biologically relevant genomic elements in such system.
An inherent feature of insertional mutagenesis strategies based on retroviral insertions is that the insertions can sometimes activate expression of nearby genes via enhancer/promoter sequences present in the viral genomes [38, 39]. Upregulation of essential genes due to insertion of lentivirus in the vicinity of their transcription start sites (TSSs) could theoretically increase cellular fitness. To test whether this is indeed the case in our system, we estimated density and average depths of insertions in various bins of distances upstream or downstream from annotated TSSs for the genes with CS < 0 and CS > 0. To calculate the insertions in the regions downstream from TSSs, insertions in exons were not counted. The 0-5 kb and 5-10 kb upstream and downstream bins were the only 4 bins where most consistent statistically significant difference in both the density and average depths of insertions between the two groups of genes were observed (Fig. 2d and Additional file 2: Table S1). Also, in both distance bins, the density and average depths were higher in the CS < 0 group of genes (Fig. 2d and Additional file 2: Table S1). Therefore, these results are consistent with the notion that in addition to disruption of a genomic element, the effect of activation of nearby genes could also exist. However, this effect appears to be limited to insertions located within ±10 kb region of nearby TSSs. Furthermore, the effect of insertion-caused depletion was stronger than that of the transactivation according to the Hedge’s g estimators of the sizes of these two types of effects. Specifically, the absolute Hedge’s g value for the density of positions in exons of essential vs non-essential genes was 2.53 while the average Hedge’s g value for the 4 distance bins in ±10 kb region around TSSs was 1.37 (Additional file 2: Table S1). The corresponding values for the average depth of insertions were 2.69 and 2.26 respectively (Additional file 2: Table S1).
Conceivably, insertion of a retrovirus insertion into an intron of a gene could indirectly affect the function of that gene, for example by providing alternative splice sites or transcription termination signals. To estimate whether this could be a common phenomenon, we further calculated the density of insertion positions and depth of insertions in introns and not within ±10 kb region of TSSs in CS < 0 and CS > 0 genes. As shown in Fig. 2e and Additional file 2: Table S1, there was no bias against integrations into the introns of essential (CS < 0) genes. In fact, a significantly higher density of positions, but not depth of insertions, could be observed in CS < 0 genes. Therefore, in contrast to significant and consistent deleterious effect of insertions in exons (Fig. 2a, b), these results suggest that integration into introns do not have a widespread disrupting effect on the corresponding genes.
Taken together, these results suggest that the most common phenotypic effect of a lentiviral insertion, at least in our system, is mediated by direct disruption of the functional element, such as an exon of a spliced transcript, targeted by the insertions. The presence of some transactivation effect should also be considered if the insertions happen to appear within ±10 kb region of the TSSs of nearby genes. In the following part, we will focus on the genomic elements harboring insertions as the candidates for downstream analyses unless stated otherwise.
InSET reveals that most of sequences affecting cellular fitness map outside of annotated genomic elementsTo detect insertions that affect functional genomic elements, we identified insertion sites that affect fitness in various phenotypic assay systems used in this study. Unique sites enriched or depleted during normal growth of the HepG2.LTR and K562.LTR cell lines were obtained by comparing depth of insertions of each unique position after 1 month with that at 2 days as control (Fig. 1). Unique sites enriched or depleted following the anticancer drug survival experiment in the K562.LTR2 cell line were obtained by comparing the depth of each unique position in cells after the survival challenge to that in cells before the challenge (Fig. 1). The genomes of cells after growth and drug treatment should be the same as those of the control cells since they were based on the same initial population of transfected cells. Therefore, we do not expect that depletion or enrichment of insertions to be caused by the copy number variation. As summarized in Additional file 2: Table S4, we could detect 3,574,490 sites of insertions across the 4 phenotypic assay systems where the insertion sites in the same genomic location but detected in different systems were counted as independent insertions, of which 2,845,600 (79.6%) were unique insertion sites. Using 25% false discovery rate (FDR) threshold, 29,366/3,574,490 (0.8%) individual sites representing insertions affecting cellular fitness (IACFs) could be detected as either depleted or enriched in at least one system (Additional file 2: Table S4). Among the IACFs, 28,987 (98.7%) were unique. Genes harboring IACFs in exons had consistently lower expression levels than all genes (Fig. 2f and Additional file 2: Table S2), consistent with the results above showing that our system is biased against detecting essential genomic elements.
However, the analysis based on single insertion events would suffer from lower sensitivity and potentially higher false positive rate. Therefore, we used an independent analytical approach based on the assumption that true functional elements should be detected by multiple independent phenotypic insertion events. Such clusters were calculated in a two-step process. First, we used SICER software [40] with a very stringent 1% FDR threshold to identify genome-wide clusters of nearby insertion events that were either enriched or depleted in each of the 3 pairs of biological replicates of each phenotypic assay system (Figs. 1 and 3a, Additional file 1: Fig. S1, Methods). The input into the SICER program were the positions of the raw insertion events in each biological replicate of treatment with either 3-month culture or drug challenge and in the corresponding control biological replicate. Second, clusters shared by at least 2 out of 3 biological replicates were defined as clusters affecting cellular fitness (CACFs) (Fig. 1, Additional file 1: Fig. S1, Methods). We could detect 36,393 CACFs across the 4 phenotypic assay systems. The CACF analysis was based on raw insertions and therefore independent of IACF; however, most (27,972/29,366 or 95.3%) IACFs were found inside CACFs. On the other hand, since analysis based on multiple insertions would be expected to be more sensitive, many (25,918/36,393 or 71.2%) CACFs did not contain IACFs (Fig. 3b). CACFs with overlapping coordinates or detected in different systems were counted independently. However, merging coordinates of the 36,393 CACFs resulted in 18,531 (50.9%) unique CACFs, suggesting that the functional clusters were much more common to different systems than individual IACFs.
Fig. 3Genome-wide distribution of IACFs and CACFs. a Schematic diagram of the analytical pipeline for identification and hierarchical mapping of IACFs and CACFs to different genomic elements and regions shown in the pie charts in the panels e and f. Number of insertions in each site was compared among the 3 biological replicates of the samples subjected to a phenotypic challenge (time of growth or drug treatment) and the corresponding control to identify IACFs using FDR threshold of 25%. In the independent approach, clusters of phenotypic insertions were first identified using SICER with a FDR threshold of 1%, and those shared by at least 2 biological replicates were defined as CACFs. For more details, see Additional file 1: Fig. S1. b Venn diagram of overlap between IACFs (black) and CACFs (white). The numbers represent non-unique events where the count for the same insertion or cluster would be added for each system where it was found. c, d Ratio of real vs simulated SICER clusters (c) and CACFs (d) based on 100 simulations for one representative example of the simulation analysis for insertions depleted in one biological replicate of the K562 growth survival experiment (see Additional file 1: Fig. S2 for the complete set of simulation analyses). Error bars indicate SD. e, f Genome-wide distributions of all IACFs and CACFs across the 4 phenotypic assay systems. Each unique IACF and CACF position was assigned to only one genomic element or region as shown in a. g, h Odds ratios of enrichment of the phenotypic (depleted or enriched) and no phenotype insertions in different genomic elements. i Distribution of p values of IACFs and CACFs mapping inside or outside of the annotations defined as all elements from exons to insulators in panel a. j Conservation scores of the phenotypic vs non-phenotypic insertions based on IACFs (left) or insertions located inside or outside of the CACFs (right). Error bars indicate SD based on the 4 phenotypic assay systems. The p values were calculated with one-sided paired Student’s t test. Source data are provided in Additional file 2: Tables S4-S10
To validate the procedure of CACF detection, we performed a two-step simulation analysis as follows. For each sample, we generated the same number of insertions with random coordinates across the genome and used SICER to generate clusters. After 100 simulations, the maximum number of simulated clusters varied between 2.2 and 10.6% of the number of the real clusters for the different types of samples (Fig. 3c, Additional file 1: Fig. S2a, Additional file 2: Table S5). These results shows that SICER, originally designed to identify regions enriched in ChIP-seq analysis [40], can indeed identify truely enriched or depleted clusters of insertions in the lentiviral mutagenesis context. Second, when the simulated clusters shared by at least two replicates were used to generate simulated CACFs, virtually no CACFs were obtained: the average ratio of simulated CACFs compared to the real ones varied between 0 and 7 × 10−6 (Fig. 3d, Additional file 1: Fig. S2b, Additional file 2: Table S6). Overall, these results clearly show that CACFs were not random.
We then investigated genomic distribution of IACFs and CACFs (Fig. 3a and Additional file 2: Tables S4 and S7). A small fraction of IACFs and CACFs mapped to exons of known genes (3.8 and 5.6%, Fig. 3e, f), and comparable numbers of IACFs and CACFs (0.6 and 1.1% respectively) mapped to exons of annotated lncRNAs. Furthermore, additional 1.6 and 1.8% of IACFs and CACFs mapped to very long intergenic non-coding RNAs (vlincRNAs) [41] (Fig. 3e, f). We then explored contribution of potentially functional elements defined by potential regulatory elements detected based on DNA-protein interactions. Additional 0.4% and 0.3%, 4.3% and 3.7%, and 0.5% and 0.8% of IACFs and CACFs mapped to promoters, enhancers, or insulators found in K562 and HepG2 cells based on ChIP-seq mapping of chromatin states [42, 43] and located outside of the above mentioned elements (Fig. 3e, f).
We then calculated odds ratios of enrichment of IACFs and CACFs in the annotated functional elements to further characterize and validate performance of the method. Interestingly, the strongest enrichments were observed for depleted IACFs and CACFs found in exons of protein-coding genes, but only in the phenotypic assay systems where cells were challenged with the anticancer drugs (Fig. 3g and Additional file 2: Tables S8 and S9). This is most likely due to the fact that insertions in exons of genes required for growth have already been removed from the cellular population as has been shown above. However, new survival stress to which cells have not yet been exposed, like the one provided by the anticancer drug treatments, revealed additional genes required for stress resistance. These results also illustrate the limitation of the method — insertions in the elements required for growth and survival under normal conditions would be under-represented in insertional mutagenesis screens.
On the other hand, our results strongly argue that most of the phenotypic insertions do not exert their effect by somehow affecting protein-coding genes. As can be seen on the Fig. 3h and Additional file 2: Tables S8 and S9, depleted or enriched insertions were not significantly more enriched in the immediate vicinity of the TSSs of genes or inside the introns. However, if the phenotypic insertions identified in our screens did in fact function by somehow affecting functions of known genes, either by transactivation, or by affecting splicing or terminating transcription, then the opposite would be expected. In fact, strikingly, the vast majority (96.2 and 94.4%) of IACFs and CACFs were located outside of exons of known genes (Fig. 3e, f and Additional file 2: Tables S4 and S7). Even considering possible transactivating effects, most (81.9% and 77.3%) of IACFs and CACFs were located outside of exons of known genes, or 10 kb up- or downstream from their TSSs (Fig. 3e, f and Additional file 2: Tables S4 and S7). Moreover, we did not observe increase in enrichment of phenotypic insertions in annotated exons of lncRNAs (Fig. 3g and Additional file 2: Tables S8 and S9), arguing that the annotated spliced lncRNAs cannot explain most of the phenotypic insertions observed in this work, with the caveat though that many lncRNAs and their exons are not present in the current genomic annotations (see “ Discussion”). Even accounting for all of the annotated functional elements (exons of lncRNAs, regulatory elements, and so on), 69.8% and 74.5% of IACFs and CACFs did not map to any functional or potentially functional genomic region (Fig. 3e, f and Additional file 2: Tables S4 and S7). Furthermore, 34.4 and 34.8% of IACFs and CACFs were found in the intergenic space and outside of any functional or potentially functional genomic region (Fig. 3e, f and Additional file 2: Tables S4 and S7). Moreover, 16.8 and 18.7% of IACFs and CACFs mapped to “gene desert” regions, 50 kb from boundaries of known genes (Fig. 3e, f and Additional file 2: Tables S4 and S7). To estimate whether IACFs and CACFs mapping outside of the known functional or potentially functional elements were somehow biased to being less significant compared to those mapping to these annotations, we compared the p value distribution of the two groups of IACFs and CACFs, and found that the distribution of the p values of IACFs and CACFs mapping inside and outside of the annotated elements matched well (Fig. 3i). Taken together, these results show that the human genome is likely to contain multiple yet uncharacterized elements that have physiological significance.
To validate potential biological relevance of the novel elements tagged by the phenotypic insertions, we asked whether such insertions correspond to more evolutionarily conserved sequences compared to the insertions that did not result in phenotypes. In this analysis, insertions located in the annotated exons or within 10 kb upstream/downstream of TSSs of known genes were excluded. Interestingly, the evolutionary sequence conservation, as defined by the phastCons conservation scores of the genomic base pairs immediately adjacent to the integration sites (Methods), was significantly higher for the IACFs than insertions with no phenotypes, and for the insertions located inside CACFs than those located outside (Fig. 3j and Additional file 2: Table S10). Overall, these results argue that the phenotypic insertions detected in this work do indeed mark some unknown functional elements in the human genome.
IACFs or CACFs mapping to unannotated genomic regions can disrupt novel exonsIACFs or CACFs mapping outside of genomic regions of known functions could potentially affect multiple types of functional genomic elements, requiring a plethora of different types of assays to uncover their mechanisms of action. However, based on the analysis of the insertions mapping to exons of essential genes and lncRNAs shown above, it is reasonable to assume that IACFs or CACFs could exert their effects by disrupting exons of novel transcripts. Therefore, to prove that IACFs or CACFs that mapped to unannotated genomic space could potentially affect novel bona fide genomic elements, we focused on such IACFs or CACFs that mapped to exons predicted using in silico sequence analysis program GENSCAN [34] (Additional file 2: Tables S8 and S9). It is important to stress however that GENSCAN program was designed based on the general transcriptional, splicing and translational signals of protein-coding genes [34]; therefore, the corresponding predictions are biased towards sequences with protein-coding potential.
Strikingly, the phenotypic insertions, both depleted and enriched, were consistently enriched in exons found solely by GENSCAN program and not present in the databases of annotated human genes such as UCSC Genes [44] or GENCODE [45, 46] — the GENSCAN-specific exons (Fig. 4a and Additional file 2: Tables S8 and S9). Importantly, the enrichment was higher than that of the insertions with no phenotypes in any phenotypic assay system tested (Fig. 4a and Additional file 2: Tables S8 and S9). Interestingly, however, the increase in the enrichment of phenotypic insertions was limited to GENSCAN-specific exons, found in introns of known genes, as opposed to those found in the intergenic space (Fig. 4a and Additional file 2: Tables S8 and S9). Also, there was no increase in the enrichment of phenotypic insertions in the intronic regions outside of the GENSCAN-specific exons (Fig. 4b). This strongly suggests that phenotypes caused by insertions into intronic GENSCAN-specific exons were caused by direct insertions in these exons and not by indirect disruption of the host gene.
Fig. 4Characterization of InSETT and InSETG structures by Nanopore sequencing. a Odds ratios of enrichment of the phenotypic (depleted or enriched) and non-phenotypic insertions in the intronic and intergenic GENSCAN-specific exons. b Odds ratios of enrichment of IACFs, insertions in CACFs and non-phenotypic insertions in the intronic GENSCAN-specific exons or in intronic regions outside of the GENSCAN-specific exons. Error bars indicate SD based on the 4 systems (K562 growth, HepG2 growth, imatinib survival challenge, and etoposide survival challenge). Asterisks indicate significant differences per two-sided paired Student’s t test (p value < 0.05). c, d Overlaps of InSETes detected by the IACF and CACF analysis in the K562 drug survival assays genome-wide (c) and among the 131 exons selected for the RACE assays (d). e Average TPM of all protein-coding transcripts and those with insertions in their exons, and of all GENSCAN-specific exons and those with insertions. Error bars indicate SD based on 2 biological replicates. Asterisks indicate significant differences under two-sided paired Student’s t test (p value < 0.05). f Sensitivity of detection of novel functional transcripts with PacBio, Illumina, or with the RACE enrichment followed by Nanopore sequencing. Note: the relatively short Illumina reads overlapping InSETes could be derived from transcripts different from those detected by RACE/Nanopore. g Summary of the types of novel transcripts detected by the RACE combined with Nanopore sequencing for the 131 InSETes. Transcript classification was done based on RACE, unless only RT-PCR results were available. h CAGE signal in the real and simulated InSET TSS clusters. For each InSET TSS cluster (left) or intergenic InSET TSS cluster (right), the number of tissue or primary cell samples with overlapping CAGE tags (upper) and the normalized abundance of the CAGE signal (lower) were calculated. In the boxplots, center red lines indicate median; box limits indicate upper and lower quartiles; whiskers extend from the box limits no more than 1.5× interquartile range. In the violin plots, all violins were scaled to have the same maximum width; three horizontal white lines indicate upper, median, and lower quartiles of the density estimate respectively. All data including TSS clusters with 0 CAGE tags were plotted. To plot on the log scale, 0.1 was added to all values. The p values were calculated using Wilcoxon rank-sum test. Source data are provided in Additional file 2: Tables S8, S9, and S11-14
The results presented above strongly argue that GENSCAN-specific exons, especially those of annotated genes, represent novel functional elements. Since such exons were identified solely based on in silico sequence analysis and were not present in the current human gene annotations, we first focused on identification of transcripts corresponding to GENSAN-specific exons harboring phenotypic insertions found in any of the 4 phenotypic assay systems. Overall, 4.5% (1,320/29,366) of IACFs and 4.5% (1,651/36,393) of CACFs mapped to exons predicted only by GENSCAN. In the drug survival experiment, 151 and 445 such GENSCAN exons harbored IACFs and CACFs respectively, among which 137 harbored both IACFs and CACFs (Fig. 4c). For this analysis, we have chosen 131 GENSCAN exons harboring IACFs and/or CACFs (114 exons contained both (Fig. 4d)) that will be referred to as InSET exons or InSETes. The 131 InSETes consist of 40 intragenic exons defined as being inside the known genes and on the same strand as the genes, and 91 intergenic exons defined as being outside or antisense to known genes (Additional file 2: Table S11).
We initially tested the overall abundance of all GENSCAN-specific exons in the human genome and the 131 InSETes by RNA-seq analysis using high-throughput short-read Illumina platform. We found that while GENSCAN-specific exons could be detected, their average abundance (TPM = 1.31) was significantly lower than that of protein-coding transcripts (TPM = 2.31, p = 0.041, two-sided paired Student’s t test, Fig. 4e), potentially explaining why they remain largely unannotated. Furthermore, similar to the exons of known genes, GENSCAN-specific exons with insertions had even lower abundance than those without: the corresponding average TPM of 0.79 vs 1.31 (p = 0.028, two-sided paired Student’s t test, Fig. 4e), suggesting that the insertions of in silico predicted exons important for cell survival were also subjected to loss. Nonetheless, the RNA-seq analysis suggested that the exons predicted only in silico can in fact represent bona fide exons. In fact, 51 or 38.9% of the 131 InSETes could be detected with the coverage of over 0.5 based on Illumina deep RNA-seq (Fig. 4f). However, the relatively short reads were not sufficient to define the isoforms of the corresponding transcripts; therefore, we performed long-read NGS analysis using the PacBio platform. However, after analysis of 656,065 circular consensus sequencing (CCS) reads obtained in this experiment (average read length 3038 nt), we could identify 1 CCS read corresponding to only 1 out of 131 such exons (0.8%) (Fig. 4f). Therefore, long-read RNA-seq analysis of a complex RNA p
Comments (0)