Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity [METHODS]

Brianna Chrisman1,2, Chloe He3, Jae-Yoon Jung4, Nate Stockham5, Kelley Paskov3, Peter Washington1, Juli Petereit2 and Dennis P. Wall3,4

1Department of Bioengineering, Stanford University, Stanford, California 94305, USA
2Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA
3Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
4Department of Pediatrics (Systems Medicine), Stanford University, Stanford, California 94305, USA
5Department of Neuroscience, Stanford University, Stanford, California 94305, USA

Corresponding author: brianna.chrisman@gmail.com

Abstract

Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
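To make the idea of family-based maximum likelihood localization concrete, the following is a minimal illustrative sketch and not the ASLAN implementation. It assumes a strongly simplified model that is not taken from the article: the genome is divided into a few discrete regions, each child's phasing in a region is a pair of 0/1 labels (which maternal and which paternal haplotype was inherited), a sequence resides on exactly one parental haplotype per family, and presence or absence of the sequence in a child's unmapped reads is observed with a fixed error rate `eps`. The names `region_log_likelihood`, `localize`, and the toy `families` data are all hypothetical.

```python
import math


def region_log_likelihood(phasing, presence, eps=0.05):
    """Log-likelihood of one family's presence pattern if the sequence sits in this region.

    phasing:  list over children of (maternal_hap, paternal_hap), each 0 or 1 (assumed encoding)
    presence: list over children of bool, True if the sequence was seen in unmapped reads
    """
    best = -math.inf
    # Try each of the four parental haplotypes as the carrier and keep the best fit.
    for parent in (0, 1):  # 0 = mother, 1 = father
        for hap in (0, 1):
            ll = 0.0
            for child_haps, seen in zip(phasing, presence):
                carries = child_haps[parent] == hap
                # An observation that agrees with inheritance has probability 1 - eps.
                ll += math.log(1 - eps if carries == seen else eps)
            best = max(best, ll)
    return best


def localize(families, n_regions, eps=0.05):
    """Return (index of the maximum likelihood region, per-region log-likelihoods)."""
    scores = [
        sum(region_log_likelihood(f["phasings"][r], f["presence"], eps) for f in families)
        for r in range(n_regions)
    ]
    best_region = max(range(n_regions), key=scores.__getitem__)
    return best_region, scores


# Toy data (hypothetical): two families, three regions.
families = [
    {
        "presence": [True, False, False],
        "phasings": [
            [(0, 0), (0, 1), (1, 0)],  # region 0
            [(0, 0), (1, 0), (1, 1)],  # region 1
            [(1, 1), (1, 1), (0, 0)],  # region 2
        ],
    },
    {
        "presence": [True, True],
        "phasings": [
            [(0, 0), (1, 1)],  # region 0
            [(0, 1), (0, 0)],  # region 1
            [(0, 0), (1, 1)],  # region 2
        ],
    },
]

best, scores = localize(families, n_regions=3)
print(best, scores)
```

In this toy example only region 1 has a parental haplotype whose inheritance pattern matches the presence pattern in every child of both families, so it receives the highest summed log-likelihood. Summing over families is what gives the approach its resolving power: each additional family contributes recombination breakpoints that rule out more regions.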

Footnotes

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.277175.122.

Freely available online through the Genome Research Open Access option.

Received August 2, 2022. Accepted May 25, 2023.
