Research

List of recent talks, released software, services, and collaborators.


Talks


Selected Software

BLEND: BLEND is a mechanism that can efficiently find fuzzy (approximate) seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. To achieve this, BLEND provides mechanisms to generate the same hash value for highly similar seeds such that BLEND can find fuzzy seed matches with a single lookup from the hash values of seeds. BLEND can be integrated into any seeding mechanism by properly replacing the hash functions of these seeding technqiues. As proof of work, we integrate the BLEND mechanism into minimap2 to generate the hash values for minimizer and strobemer seeds.

GitHub Link | arXiv

AirLift: As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for better understanding the health data of a genome donor. Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear similarly in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for moving alignments from one genome to another. AirLift reduces 1) the number of reads that need to be fully mapped from the entire read set by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 19.6x, 6.6x, and 2.7x for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar rates of identifying ground truth SNPs and INDELs as fully mapping a read set.

GitHub Link | bioRxiv

Apollo: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively.We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Baum-Welch algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.

GitHub Link | Paper


Services & Activities


Collaborators

  • Onur Mutlu (ETH Zurich)
  • Can Alkan (Bilkent University)
  • Damla Senol Cali (Bionano Genomics)
  • Jeremie S. Kim (Apple)
  • Sreenivas Subramoney (Intel Labs)
  • Gurpreet S. Kalsi (Intel Labs)
  • Kamlesh Pillai (Intel Labs)
  • Taha Shahroodi (TU Delft)
  • Juan Gomez-Luna (ETH Zurich)
  • Mohammed Alser (ETH Zurich)
  • Giulia Guidi (UC Berkeley)
  • Joel Lindegger (ETH Zurich)
  • Nour Almadhoun Alserr (ETH Zurich)
  • Nika Mansouri Ghiasi (ETH Zurich)
  • Jisung Park (ETH Zurich)
  • Gagandeep Singh (ETH Zurich)
  • Konstantinos Kanellopoulos (ETH Zurich)
  • Saugata Ghose (University of Illinois Urbana-Champaign)
  • A. Ercument Cicek (Bilkent University)
  • Levent Onural (Bilkent University)
  • Ziv Bar-Joseph (Carnegie Mellon University)
  • Oznur Tastan (Sabanci University)