Research
List of recent talks, released software, services, and collaborators.
Talks
-
Can Firtina, “BLEND: A fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis”, 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB), Istanbul, Turkey, April 19 2023. [Video] [Slides (PDF)] [Slides (PPT)]
-
Can Firtina, “AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes”, The 21st Asia Pacific Bioinformatics Conference (APBC), Changsha, China, April 16 2023. [Video] [Slides (PDF)] [Slides (PPT)]
-
Can Firtina, “Enabling Accurate, Fast, and Memory-Efficient Genome Analysis via Efficient and Intelligent Algorithms”, Invited Talk, Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA, United States, May 27 2022. [Video] [Slides (PDF)] [Slides (PPT)]
-
Can Firtina, “Enabling Accurate, Fast, and Memory-Efficient Genome Analysis via Efficient and Intelligent Algorithms”, Invited Seminar Talk, BIN590 Graduate Seminar in Bioinformatics, Graduate School of Informatics, Middle East Technical University, Virtual, January 10 2022. [Video] [Slides (PDF)] [Slides (PPT)]
-
Can Firtina, “Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm”, Workshop Talk, ExaBiome/PASSION/SAFARI Workshop on Architectures and HPC for Genomics, Virtual, May 26 2021. [Video] [Slides (PDF)] [Slides (PPT)]
Selected Software
RawHash: RawHash is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., human genome.
Below figure shows the overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.
To efficiently identify similarities between a reference genome and reads, RawHash has two steps, similar to regular read mapping tools, 1) indexing and 2) mapping. The indexing step generates hash values from the expected signal representation of a reference genome and stores them in a hash table. In the mapping step, RawHash generates the hash values from raw signals and queries the hash table generated in the indexing step to find seed matches. To map the raw signal to a reference genome, RawHash performs chaining over the seed matches.
RawHash can be used to map reads from FAST5 or POD5 files to a reference genome in sequence format. We will provide the support for using SLOW5 files.
RawHash performs real-time mapping of nanopore raw signals. When the prefix of reads in FAST5 or POD5 file can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow the similar PAF template used in UNCALLED and Sigmap to report the mapping information.
BLEND: BLEND is a mechanism that can efficiently find fuzzy (approximate) seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. To achieve this, BLEND provides mechanisms to generate the same hash value for highly similar seeds such that BLEND can find fuzzy seed matches with a single lookup from the hash values of seeds. BLEND can be integrated into any seeding mechanism by properly replacing the hash functions of these seeding technqiues. As proof of work, we integrate the BLEND mechanism into minimap2 to generate the hash values for minimizer and strobemer seeds.
GitHub Link | NARGAB Paper Link
AirLift: As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for better understanding the health data of a genome donor. Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear similarly in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for moving alignments from one genome to another. AirLift reduces 1) the number of reads that need to be fully mapped from the entire read set by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 19.6x, 6.6x, and 2.7x for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar rates of identifying ground truth SNPs and INDELs as fully mapping a read set.
Apollo: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively.We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Baum-Welch algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.
GitHub Link | Bioinformatics Paper Link
Services & Activities
- Reviewer/sub-reviewer for RECOMB, RECOMB-Seq, Bioinformatics, BMC Bioinformatics, MICRO, ISCA, HPCA, ASPLOS, PACT, IEEE Micro, FAST, DSN, IEEE Transactions on Computers, ISCAS, ISPASS, USENIX ATC.
- Organized BIO-Arch: Workshop on Hardware Acceleration of Bioinformatics Workloads, jointly held with RECOMB 2023.
- Organized the Panel on Hardware Acceleration of Bioinformatics Workloads at RECOMB 2023.
- Organized the ExaBiome/PASSION/SAFARI Workshop on Architectures and HPC for Genomics workshop with Giulia Guidi. The workshop included 10 presenters from ExaBiome, PASSION, and SAFARI. May 26 2022, Virtual.
Collaborators
- Onur Mutlu (ETH Zurich)
- Can Alkan (Bilkent University)
- Damla Senol Cali (Bionano Genomics)
- Jeremie S. Kim (Apple)
- Sreenivas Subramoney (Intel Labs)
- Gurpreet S. Kalsi (Intel Labs)
- Kamlesh Pillai (Intel Labs)
- Taha Shahroodi (TU Delft)
- Juan Gomez-Luna (ETH Zurich)
- Mohammed Alser (ETH Zurich)
- Giulia Guidi (UC Berkeley)
- Joel Lindegger (ETH Zurich)
- Nour Almadhoun Alserr (ETH Zurich)
- Nika Mansouri Ghiasi (ETH Zurich)
- Jisung Park (ETH Zurich)
- Gagandeep Singh (ETH Zurich)
- Konstantinos Kanellopoulos (ETH Zurich)
- Saugata Ghose (University of Illinois Urbana-Champaign)
- A. Ercument Cicek (Bilkent University)
- Levent Onural (Bilkent University)
- Ziv Bar-Joseph (Carnegie Mellon University)
- Oznur Tastan (Sabanci University)