List of publications in reverse chronological order.
2023
ISMB/ECCB
RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, and Onur Mutlu
In Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB), Jul 2023
Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping time, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
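As a rough illustration of the quantization-and-hashing idea described above, the Python sketch below maps a window of normalized signal events to coarse buckets and hashes the bucket sequence, so that slightly different signals for the same DNA content collide on the same hash value. The bucket width, window length, and function names are illustrative assumptions, not RawHash's actual parameters or interface.

import hashlib

def quantize(event_means, bucket_width=0.35):
    # Map each normalized event mean to a coarse bucket so that small
    # signal variations fall into the same bucket (illustrative width).
    return tuple(int(round(e / bucket_width)) for e in event_means)

def hash_events(event_means, bucket_width=0.35):
    # Hash a window of quantized events; similar raw signals yield the
    # same hash value and can therefore be matched with one hash lookup.
    buckets = quantize(event_means, bucket_width)
    return hashlib.blake2b(repr(buckets).encode(), digest_size=8).hexdigest()

# Two slightly different signal windows for the same DNA content quantize
# to the same buckets and thus produce the same hash value.
ref_window  = [0.91, -0.42, 1.28, 0.07, -1.10]
read_window = [0.91, -0.40, 1.30, 0.05, -1.12]
print(hash_events(ref_window) == hash_events(read_window))  # True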
@inproceedings{firtina_rawhash_2023,booktitle={Proceedings of the 31st Annual Conference on Intelligent Systems for Molecular Biology (ISMB) and the 22nd European Conference on Computational Biology (ECCB)},title={{RawHash}: {Enabling} {Fast} and {Accurate} {Real}-{Time} {Analysis} of {Raw} {Nanopore} {Signals} for {Large} {Genomes}},doi={10.1093/bioinformatics/btad272},author={Firtina, Can and Ghiasi, Nika Mansouri and Lindegger, Joel and Singh, Gagandeep and Cavlak, Meryem Banu and Mao, Haiyu and Mutlu, Onur},month=jul,year={2023},}
DAC
Accelerating Genome Analysis via Algorithm-Architecture Co-Design
Onur Mutlu and Can Firtina
In Proceedings of the 60th Design Automation Conference (DAC), Jul 2023
High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS technologies presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis.
This paper provides a brief review of the recent advancements in accelerating genome analysis, covering the opportunities and challenges associated with the acceleration of the key steps of the genome analysis pipeline. Our analysis highlights the importance of integrating multiple steps of genome analysis using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies and techniques to address the growing demands of genomic data generation and analysis.
@inproceedings{mutlu_accelerating_2023,booktitle={Proceedings of the 60th Design Automation Conference (DAC)},title={Accelerating Genome Analysis via Algorithm-Architecture Co-Design},author={Mutlu, Onur and Firtina, Can},year={2023},month=jul,}
APBC
AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes
Jeremie S. Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Nastaran Hajinazar, Mohammed Alser, Can Alkan, and Onur Mutlu
In The 21st Asia Pacific Bioinformatics Conference (APBC), Apr 2023
As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable more sensitive read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. AirLift source code and a README describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
@inproceedings{kim_airlift_2023,booktitle={The 21st Asia Pacific Bioinformatics Conference (APBC)},address={Changsha, Hunan, China},title={{AirLift}: {A} {Fast} and {Comprehensive} {Technique} for {Remapping} {Alignments} between {Reference} {Genomes}},doi={10.1101/2021.02.16.431517},author={Kim, Jeremie S. and Firtina, Can and Cavlak, Meryem Banu and Cali, Damla Senol and Hajinazar, Nastaran and Alser, Mohammed and Alkan, Can and Mutlu, Onur},month=apr,year={2023},}
APBC
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser, Can Firtina, Joel Lindegger, Mohammad Sadrosadati, Nika Mansouri Ghiasi, Can Alkan, and Onur Mutlu
In The 21st Asia Pacific Bioinformatics Conference (APBC), Apr 2023
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely-applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall’s key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling, and the highly accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall 1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, 2) maintains high accuracy in downstream analysis, 3) precisely filters out up to 94.71% of off-target reads, and 4) achieves better performance, sensitivity, and generality compared to prior works. We freely open-source TargetCall to aid future research in pre-basecalling filtering at https://github.com/CMU-SAFARI/TargetCall.
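The control-flow sketch below illustrates the pre-basecalling filtering idea in Python: a cheap, noisy basecaller labels each read as on-target or off-target, and only on-target raw signals reach the slow, accurate basecaller. All callables are placeholders supplied by the caller; this is not TargetCall's actual API.

def pre_basecalling_filter(raw_signals, target_reference,
                           light_basecall, is_similar, accurate_basecall):
    # A fast, noisy basecaller labels each read; only raw signals whose
    # noisy reads look on-target reach the slow, accurate basecaller.
    on_target_signals = []
    for signal in raw_signals:
        noisy_read = light_basecall(signal)            # cheap, noisy basecall
        if is_similar(noisy_read, target_reference):   # "Similarity Check"-style step
            on_target_signals.append(signal)
    return [accurate_basecall(signal) for signal in on_target_signals]

# Toy demo with stand-in callables (exact substring check instead of real mapping).
reads = pre_basecalling_filter(
    raw_signals=["sig1", "sig2"],
    target_reference="ACGTACGT",
    light_basecall=lambda s: "ACGT" if s == "sig1" else "TTTT",
    is_similar=lambda read, ref: read in ref,
    accurate_basecall=lambda s: "ACGT",
)
print(reads)  # ['ACGT']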
@inproceedings{cavlak_targetcall_2023,booktitle={The 21st Asia Pacific Bioinformatics Conference (APBC)},address={Changsha, Hunan, China},title={TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering},author={Cavlak, Meryem Banu and Singh, Gagandeep and Alser, Mohammed and Firtina, Can and Lindegger, Joel and Sadrosadati, Mohammad and Ghiasi, Nika Mansouri and Alkan, Can and Mutlu, Onur},year={2023},month=apr,}
BLEND: A Fast, Memory-Efficient and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis
Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, and Onur Mutlu
In NAR Genomics and Bioinformatics, Mar 2023
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×–83.9× (on average 19.3×), has a lower memory footprint by 0.9×–14.1× (on average 3.8×), and finds higher quality overlaps leading to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×–4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
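A minimal Python sketch of the SimHash-based fuzzy seed matching idea: a seed is treated as a set of items (here, its overlapping k-mers, an illustrative choice), and similar sets produce identical or near-identical fingerprints, so fuzzy matches can be found with a hash lookup. The bit width and the seed-to-set decomposition are assumptions, not BLEND's exact design.

import hashlib

def simhash(items, bits=32):
    # Classic SimHash: hash each item, accumulate +1/-1 per bit position,
    # and keep the sign. Similar sets yield identical or nearby fingerprints.
    counts = [0] * bits
    for item in items:
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def seed_items(seed, k=4):
    # Represent a seed as the set of its overlapping k-mers (an assumption).
    return {seed[i:i + k] for i in range(len(seed) - k + 1)}

a = simhash(seed_items("ACGTTGCAACGT"))
b = simhash(seed_items("ACGTTGCATCGT"))  # one base differs
print(a == simhash(seed_items("ACGTTGCAACGT")))  # identical seeds: same value
print(bin(a ^ b).count("1"))  # similar seeds: small Hamming distance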
@article{firtina_blend_2023,title={{BLEND}: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis},volume={5},url={https://academic.oup.com/nargab/article/5/1/lqad004/6993940},doi={10.1093/nargab/lqad004},number={1},journal={NAR Genomics and Bioinformatics},author={Firtina, Can and Park, Jisung and Alser, Mohammed and Kim, Jeremie S and Cali, Damla Senol and Shahroodi, Taha and Ghiasi, Nika Mansouri and Singh, Gagandeep and Kanellopoulos, Konstantinos and Alkan, Can and Mutlu, Onur},month=mar,year={2023},pages={lqad004},}
2022
Utopia: Efficient Address Translation using Hybrid Virtual-to-Physical Address Mapping
The conventional virtual-to-physical address mapping scheme enables a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which incurs significantly high address translation latency and translation-induced interference in the memory hierarchy, especially in data-intensive workloads. Restricting the address mapping so that a virtual address can map to only a specific set of physical addresses can significantly reduce the overheads associated with the conventional address translation by making use of compact and more efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases memory under-utilization. In this work, we propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to co-exist in a system. The key idea of Utopia is to manage the physical memory using two types of physical memory segments: restrictive segments and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme to map the virtual addresses to only a specific set of physical addresses and enable faster address translation using compact and efficient translation structures. A flexible segment is similar to the conventional address mapping scheme and provides full virtual-to-physical address mapping flexibility. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference whenever a flexible address mapping is not necessary. Our evaluation using 11 data-intensive workloads shows that Utopia improves performance by 32% on average in single-core workloads over the baseline four-level radix-tree page table design.
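A toy Python model of the hybrid mapping idea: a restrictive segment maps a virtual page to one of a few hash-determined physical frames and needs only a compact tag check, while a flexible segment falls back to a conventional page table. The segment size, hash function, and set associativity are illustrative assumptions, not Utopia's actual design.

class HybridTranslator:
    # Toy model: a restrictive segment maps a virtual page number (vpn) to
    # one of a few hash-determined frames (compact tag check), while a
    # flexible segment keeps a conventional vpn -> pfn table.
    def __init__(self, restrictive_frames=1024, ways=4):
        self.ways = ways
        self.sets = restrictive_frames // ways
        self.restrictive = [[None] * ways for _ in range(self.sets)]
        self.flexible = {}

    def map_restrictive(self, vpn):
        s = hash(vpn) % self.sets
        for w in range(self.ways):
            if self.restrictive[s][w] is None:
                self.restrictive[s][w] = vpn
                return s * self.ways + w   # frame number fixed by the hash
        return None                        # set full: caller falls back to flexible

    def translate(self, vpn):
        s = hash(vpn) % self.sets          # cheap, page-walk-free lookup
        for w in range(self.ways):
            if self.restrictive[s][w] == vpn:
                return s * self.ways + w
        return self.flexible.get(vpn)      # conventional (flexible) mapping

t = HybridTranslator()
pfn = t.map_restrictive(42)
print(t.translate(42) == pfn)  # True: translated without a page table walk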
@article{kanellopoulos_utopia_2022,title={Utopia: Efficient Address Translation using Hybrid Virtual-to-Physical Address Mapping},journal={arXiv},author={Kanellopoulos, Konstantinos and Bera, Rahul and Stojiljkovic, Kosta and Firtina, Can and Ausavarungnirun, Rachata and Hajinazar, Nastaran and Park, Jisung and Vijaykumar, Nandita and Mutlu, Onur},year={2022},month=nov,doi={10.48550/ARXIV.2211.12205},}
A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers
Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases (i.e., A, C, G, T) using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for every subsequent step in genome analysis. Currently, basecallers are developed mainly based on deep learning techniques to provide high sequencing accuracy without considering the compute demands of such tools. We observe that state-of-the-art basecallers are slow, inefficient, and memory-hungry as researchers have adapted deep learning models from other domains without specialization to the basecalling purpose. Our goal is to make basecalling highly efficient and fast by building the first framework for specializing and optimizing machine learning-based basecallers. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of QABAS and SkipClip by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. We show that QABAS and SkipClip can help researchers develop hardware-optimized basecallers that are superior to expert-designed models.
@article{singh_aframework_2022,title={A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers},journal={arXiv},author={Singh, Gagandeep and Alser, Mohammed and Khodamoradi, Alireza and Denolf, Kristof and Firtina, Can and Cavlak, Meryem Banu and Corporaal, Henk and Mutlu, Onur},year={2022},month=nov,doi={10.48550/ARXIV.2211.03079},}
GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping
Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally-costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline.
This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early-rejection of low-quality and unmapped reads to timely stop the execution of genome analysis for such reads, reducing inefficient computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to the state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
@inproceedings{mao_genpip_2022,booktitle={Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)},title={GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping},author={Mao, Haiyu and Alser, Mohammed and Sadrosadati, Mohammad and Firtina, Can and Baranwal, Akanksha and Cali, Damla Senol and Manglik, Aditya and Alserr, Nour Almadhoun and Mutlu, Onur},year={2022},month=oct,url={https://ieeexplore.ieee.org/document/9923847/},pages={710-726},doi={10.1109/MICRO56248.2022.00056},}
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely used in many bioinformatics applications to accurately identify similarities between biological sequences (e.g., DNA or protein sequences). pHMMs use a commonly adopted and highly accurate method, called the Baum-Welch algorithm, to calculate these similarities. However, the Baum-Welch algorithm is computationally expensive, and existing works provide either software- or hardware-only solutions for a fixed pHMM design. When we analyze the state-of-the-art works, we find that there is a pressing need for a flexible, high-performance, and energy-efficient hardware-software co-design to efficiently and effectively solve all the major inefficiencies in the Baum-Welch algorithm for pHMMs.
We propose ApHMM, the first flexible acceleration framework that can significantly reduce computational and energy overheads of the Baum-Welch algorithm for pHMMs. ApHMM leverages hardware-software co-design to solve the major inefficiencies in the Baum-Welch algorithm by 1) designing a flexible hardware to support different pHMMs designs, 2) exploiting the predictable data dependency pattern in an on-chip memory with memoization techniques, 3) quickly eliminating negligible computations with a hardware-based filter, and 4) minimizing the redundant computations. We implement our 1) hardware-software optimizations on a specialized hardware and 2) software optimizations for GPUs to provide the first flexible Baum-Welch accelerator for pHMMs. ApHMM provides significant speedups of 15.55x-260.03x, 1.83x-5.34x, and 27.97x compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms the state-of-the-art CPU implementations of three important bioinformatics applications, 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x-59.94x, 1.03x-1.75x, and 1.03x-1.95x, respectively.
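For reference, the sketch below shows the textbook forward pass that sits at the core of the Baum-Welch algorithm, extended with a simple value-based filter that zeroes negligible probabilities, loosely mirroring the idea of skipping negligible computations. The threshold and the tiny example HMM are illustrative; this is not ApHMM's hardware filter or memoization scheme.

import numpy as np

def forward_filtered(obs, trans, emit, init, threshold=1e-8):
    # Forward pass of an HMM with a simple value-based filter: forward
    # probabilities below the threshold are zeroed and effectively skipped.
    alpha = init * emit[:, obs[0]]
    for t in range(1, len(obs)):
        alpha = np.where(alpha < threshold, 0.0, alpha)  # filter step
        alpha = (alpha @ trans) * emit[:, obs[t]]
    return alpha.sum()  # likelihood of the observation sequence

# Tiny two-state HMM over a binary alphabet (illustrative numbers).
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit  = np.array([[0.7, 0.3], [0.1, 0.9]])
init  = np.array([0.5, 0.5])
print(forward_filtered([0, 1, 1, 0], trans, emit, init))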
@article{firtina_aphmm_2022,title={ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis},journal={arXiv},author={Firtina, Can and Pillai, Kamlesh and Kalsi, Gurpreet S. and Suresh, Bharathwaj and Cali, Damla Senol and Kim, Jeremie and Shahroodi, Taha and Cavlak, Meryem Banu and Lindegger, Joel and Alser, Mohammed and Luna, Juan Gómez and Subramoney, Sreenivas and Mutlu, Onur},year={2022},month=jul,doi={10.48550/ARXIV.2207.09765},}
FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies
@article{kim_fastremap_2022,title={{FastRemap}: {A} {Tool} for {Quickly} {Remapping} {Reads} between {Genome} {Assemblies}},issn={1367-4803},url={https://doi.org/10.1093/bioinformatics/btac554},doi={10.1093/bioinformatics/btac554},journal={Bioinformatics},author={Kim, Jeremie S and Firtina, Can and Cavlak, Meryem Banu and Senol Cali, Damla and Alkan, Can and Mutlu, Onur},year={2022},pages={btac554},month=aug,}
From Molecules to Genomic Variations: Accelerating Genome Analysis via Intelligent Algorithms and Architectures
@article{alser_molecules_2022,title={From {Molecules} to {Genomic} {Variations}: {Accelerating} {Genome} {Analysis} via {Intelligent} {Algorithms} and {Architectures}},issn={2001-0370},url={https://doi.org/10.1016/j.csbj.2022.08.019},doi={10.1016/j.csbj.2022.08.019},journal={Computational and Structural Biotechnology Journal},author={Alser, Mohammed and Lindegger, Joel and Firtina, Can and Almadhoun, Nour and Mao, Haiyu and Singh, Gagandeep and Gomez-Luna, Juan and Mutlu, Onur},month=aug,year={2022},}
Demeter: A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory
@article{shahroodi_demeter_2022,title={Demeter: {A} {Fast} and {Energy}-{Efficient} {Food} {Profiler} {Using} {Hyperdimensional} {Computing} in {Memory}},volume={10},doi={10.48550/ARXIV.2206.01932},url={https://doi.org/10.1109/ACCESS.2022.3195878},journal={IEEE Access},author={Shahroodi, Taha and Zahedi, Mahdi and Firtina, Can and Alser, Mohammed and Wong, Stephan and Mutlu, Onur and Hamdioui, Said},year={2022},pages={82493--82510},month=aug,}
SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
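The snippet below shows the generic minimizer-selection scheme that minimizer-based seeding (as in MinSeed) builds on: within every window of w consecutive k-mers, the smallest k-mer is kept as a seed. The values of k and w and the lexicographic ordering are illustrative choices, not SeGraM's parameters.

def minimizers(seq, k=5, w=4):
    # Keep, for every window of w consecutive k-mers, the lexicographically
    # smallest one; the selected (position, k-mer) pairs serve as seeds.
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        selected.add(min(kmers[start:start + w], key=lambda x: x[1]))
    return sorted(selected)

print(minimizers("ACGTACGTTGCAACGTA"))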
@inproceedings{cali_segram_2022,title={{SeGraM}: {A} {Universal} {Hardware} {Accelerator} for {Genomic} {Sequence}-to-{Graph} and {Sequence}-to-{Sequence} {Mapping}},url={https://dl.acm.org/doi/10.1145/3470496.3527436},doi={10.1145/3470496.3527436},booktitle={Proceedings of the 49th {Annual} {International} {Symposium} on {Computer} {Architecture} (ISCA)},author={Cali, Damla Senol and Kanellopoulos, Konstantinos and Lindegger, Joël and Bingöl, Zülal and Kalsi, Gurpreet S. and Zuo, Ziyi and Firtina, Can and Cavlak, Meryem Banu and Kim, Jeremie and Ghiasi, Nika Mansouri and Singh, Gagandeep and Gómez-Luna, Juan and Alserr, Nour Almadhoun and Alser, Mohammed and Subramoney, Sreenivas and Alkan, Can and Ghose, Saugata and Mutlu, Onur},year={2022},month=jun,pages={638--655},}
Packaging, Containerization, and Virtualization of Computational Omics Methods: Advances, Challenges, and Opportunities
Mohammed Alser, Sharon Waymost, Ram Ayyala, Brendan Lawlor, Richard J. Abdill, Neha Rajkumar, Nathan LaPierre, Jaqueline Brito, Andre M. Ribeiro-dos-Santos, Can Firtina, Nour Almadhoun Alserr, Varuni Sarwal, Eleazar Eskin, Qiyang Hu, Derek Strong, Byoung-Do Kim, Malak S. Abedalthagafi, Onur Mutlu, and Serghei Mangul
Omics software tools have reshaped the landscape of modern biology and become an essential component of biomedical research. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging, virtualization, and containerization are different approaches to satisfy this need by wrapping omics tools in additional software that makes the omics tools easier to install and use. Here, we systematically review practices across prominent packaging, virtualization, and containerization platforms. We outline the challenges, advantages, and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers, and system administrators. We also propose principles to make packaging, virtualization, and containerization of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.
@article{alser_packaging_2022,title={Packaging, containerization, and virtualization of computational omics methods: {Advances}, challenges, and opportunities},journal={arXiv},author={Alser, Mohammed and Waymost, Sharon and Ayyala, Ram and Lawlor, Brendan and Abdill, Richard J. and Rajkumar, Neha and LaPierre, Nathan and Brito, Jaqueline and Ribeiro-dos-Santos, Andre M. and Firtina, Can and Almadhoun Alserr, Nour and Sarwal, Varuni and Eskin, Eleazar and Hu, Qiyang and Strong, Derek and Kim, Byoung-Do and Abedalthagafi, Malak S. and Mutlu, Onur and Mangul, Serghei},year={2022},month=mar,doi={10.48550/ARXIV.2203.16261},}
GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis
Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome.
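A host-side Python sketch of the filtering split described above: reads that match the reference exactly (here by simple substring lookup, a deliberate simplification) can skip expensive approximate mapping, and only the remaining reads are passed on. A real in-storage design performs this inside the SSD with hardware accelerators; this standalone function only illustrates the idea.

def filter_reads(reads, reference):
    # Reads that occur verbatim in the reference skip expensive mapping;
    # only the remaining reads are forwarded to the full mapper.
    exact, needs_mapping = [], []
    for read in reads:
        (exact if read in reference else needs_mapping).append(read)
    return exact, needs_mapping

reference = "ACGTACGTTGCAACGTAGGT"
exact, needs_mapping = filter_reads(["GTTGCAAC", "ACGTTTTT"], reference)
print(len(exact), len(needs_mapping))  # 1 1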
@inproceedings{mansouri_ghiasi_genstore_2022,title={{GenStore}: {A} {High}-{Performance} in-{Storage} {Processing} {System} for {Genome} {Sequence} {Analysis}},url={https://dl.acm.org/doi/10.1145/3503222.3507702},doi={10.1145/3503222.3507702},booktitle={Proceedings of the 27th {ACM} {International} {Conference} on {Architectural} {Support} for {Programming} {Languages} and {Operating} {Systems} (ASPLOS)},publisher={Association for Computing Machinery},author={Mansouri Ghiasi, Nika and Park, Jisung and Mustafa, Harun and Kim, Jeremie and Olgun, Ataberk and Gollwitzer, Arvid and Senol Cali, Damla and Firtina, Can and Mao, Haiyu and Almadhoun Alserr, Nour and Ausavarungnirun, Rachata and Vijaykumar, Nandita and Alser, Mohammed and Mutlu, Onur},year={2022},month=mar,pages={635--654},}
2021
Modeling Economic Activities and Random Catastrophic Failures of Financial Networks via Gibbs Random Fields
The complicated economic behavior of entities in a population can be modeled as a Gibbs random field (GRF). Even with simple GRF models, which restrict direct statistical interactions to a small number of neighbors of an entity, real-life economic and financial activities may be effectively described. A computer simulator is developed to run empirical experiments to assess different coupling structures and parameters of the presented model, making it possible to test many economic and financial models and policies in terms of their transient and steady-state consequences.
@article{onural_modeling_2021,title={Modeling {Economic} {Activities} and {Random} {Catastrophic} {Failures} of {Financial} {Networks} via {Gibbs} {Random} {Fields}},volume={58},issn={1572-9974},url={https://link.springer.com/article/10.1007/s10614-020-10023-3},doi={10.1007/s10614-020-10023-3},number={2},journal={Computational Economics},author={Onural, Levent and Pınar, Mustafa Çelebi and Firtina, Can},month=aug,year={2021},pages={203--232},}
2020
GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis
Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116× and 3.9×, respectively, while reducing power consumption by 37× and 2.7×. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111× and 1.9×. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7× the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7× and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22-12501× and 9.3-400× speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548-582× and 67×. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
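For context, the snippet below is the textbook bit-parallel Bitap (Shift-And) algorithm for exact matching, the algorithm family that GenASM's modified Bitap builds on; GenASM extends it to approximate matching with far more parallelism and a smaller memory footprint. This is the classic formulation, not GenASM's modified algorithm.

def bitap_exact(text, pattern):
    # Bit-parallel Shift-And: bit i of `state` is set iff pattern[:i+1]
    # matches the text ending at the current position.
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    hits, state = [], 0
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)  # extend all partial matches
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)                     # full match ends here
    return hits

print(bitap_exact("ACGTACGTTGCAACGT", "ACGT"))  # [0, 4, 12]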
@inproceedings{cali_genasm_2020,address={Virtual},title={{GenASM}: {A} {High}-{Performance}, {Low}-{Power} {Approximate} {String} {Matching} {Acceleration} {Framework} for {Genome} {Sequence} {Analysis}},url={https://ieeexplore.ieee.org/document/9251930},doi={10.1109/MICRO50266.2020.00081},booktitle={Proceedings of the 53rd {International} {Symposium} on {Microarchitecture}},author={Senol Cali, Damla and Kalsi, Gurpreet S. and Bingöl, Zülal and Firtina, Can and Subramanian, Lavanya and Kim, Jeremie S. and Ausavarungnirun, Rachata and Alser, Mohammed and G{\'o}mez-Luna, Juan and Boroumand, Amirali and Nori, Anant and Scibisz, Allison and Subramoney, Sreenivas and Alkan, Can and Ghose, Saugata and Mutlu, Onur},year={2020},month=oct,pages={951--966},}
Apollo: A Sequencing-Technology-Independent, Scalable and Accurate Assembly Polishing Algorithm
Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Source code is available at https://github.com/CMU-SAFARI/Apollo.
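As a reference point for the decoding step mentioned above, the sketch below is a generic dense-matrix Viterbi decoder for a small HMM. Apollo's pHMM has a specific assembly-derived structure and is trained with the Forward-Backward algorithm first; the matrices and observations here are illustrative only.

import numpy as np

def viterbi(obs, trans, emit, init):
    # Standard log-space Viterbi decoding for a small, dense HMM.
    n_states, T = trans.shape[0], len(obs)
    delta = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(init) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans)  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emit  = np.array([[0.7, 0.3], [0.1, 0.9]])
init  = np.array([0.5, 0.5])
print(viterbi([0, 0, 1, 1, 1], trans, emit, init))  # most likely state path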
@article{firtina_apollo_2020,title={Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm},volume={36},issn={1367-4803},url={https://academic.oup.com/bioinformatics/article/36/12/3669/5804978},doi={10.1093/bioinformatics/btaa179},number={12},journal={Bioinformatics},author={Firtina, Can and Kim, Jeremie S. and Alser, Mohammed and Senol Cali, Damla and Cicek, A Ercument and Alkan, Can and Mutlu, Onur},month=jun,year={2020},pages={3669--3679},}
2018
Hercules: A Profile HMM-Based Hybrid Error Correction Algorithm for Long Reads
Choosing whether to use second- or third-generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases, researchers often combine both technologies, and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph- or alignment-based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile hidden Markov model with respect to the underlying platform’s error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms, and among those, it achieves the highest accuracy.
@article{firtina_hercules_2018,title={Hercules: a profile {HMM}-based hybrid error correction algorithm for long reads},volume={46},issn={0305-1048},url={https://academic.oup.com/nar/article/46/21/e125/5075030},doi={10.1093/nar/gky724},number={21},journal={Nucleic Acids Research},author={Firtina, Can and Bar-Joseph, Ziv and Alkan, Can and Cicek, A. Ercument},month=nov,year={2018},pages={e125--e125},}
2017
GLANET: Genomic Loci Annotation and Enrichment Tool
Genomic studies identify genomic loci representing genetic variations, transcription factor (TF) occupancy, or histone modification through next-generation sequencing (NGS) technologies. Interpreting these loci requires evaluating them with known genomic and epigenomic annotations. We present GLANET as a comprehensive annotation and enrichment analysis tool, which implements a sampling-based enrichment test that accounts for GC content and/or mappability biases, jointly or separately. GLANET annotates and performs enrichment analysis on these loci with a rich library. We introduce and perform novel data-driven computational experiments for assessing the power and Type-I error of its enrichment procedure, which show that GLANET has attained high statistical power and a well-controlled Type-I error rate. As a key feature, users can easily extend its library with new gene sets and genomic intervals. Other key features include assessment of the impact of single nucleotide variants (SNPs) on TF binding sites and regulation-based pathway enrichment analysis. GLANET can be run using its GUI or on the command line. GLANET’s source code is available at https://github.com/burcakotlu/GLANET. Tutorials are provided at https://glanet.readthedocs.org.
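A toy Python version of a sampling-based enrichment test in the spirit of the procedure described above: the observed overlap between query loci and an annotation is compared against overlaps of randomly placed, length-matched interval sets to obtain an empirical p-value. A real implementation would also match GC content and mappability; the uniform sampling here is a simplification.

import random

def overlap_count(queries, annotation):
    # Number of query intervals overlapping any annotation interval.
    return sum(any(q[0] < a[1] and a[0] < q[1] for a in annotation) for q in queries)

def sampling_enrichment_pvalue(queries, annotation, genome_len, n_samples=1000, seed=0):
    # Compare the observed overlap against overlaps of randomly placed,
    # length-matched interval sets (uniform sampling; no GC/mappability matching).
    rng = random.Random(seed)
    observed = overlap_count(queries, annotation)
    hits = 0
    for _ in range(n_samples):
        sampled = []
        for start, end in queries:
            length = end - start
            s = rng.randrange(genome_len - length)
            sampled.append((s, s + length))
        if overlap_count(sampled, annotation) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)  # empirical, add-one-corrected p-value

queries = [(100, 150), (480, 530), (900, 950)]
annotation = [(90, 200), (470, 560)]
print(sampling_enrichment_pvalue(queries, annotation, genome_len=10_000))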
@article{otlu_glanet_2017,title={{GLANET}: genomic loci annotation and enrichment tool},volume={33},issn={1367-4803},url={https://academic.oup.com/bioinformatics/article/33/18/2818/3852077},doi={10.1093/bioinformatics/btx326},number={18},journal={Bioinformatics},author={Otlu, Burçak and Firtina, Can and Keleş, Sündüz and Tastan, Oznur},month=sep,year={2017},pages={2818--2828},}
2016
On Genomic Repeats and Reproducibility
Here, we present a comprehensive analysis of the reproducibility of computational characterization of genomic variants using high-throughput sequencing data. We reanalyzed the same datasets twice, using the same tools with the same parameters, where we only altered the order of reads in the input (i.e., FASTQ file). Reshuffling caused the reads from repetitive regions to be mapped to different locations in the second alignment, and we observed similar results when we only applied a scatter/gather approach for read mapping, without prior shuffling. Our results show that some of the most common variation discovery algorithms do not handle the ambiguous read mappings accurately when random locations are selected. In addition, we observed that even when the exact same alignment is used, the GATK HaplotypeCaller generates slightly different call sets, which we pinpoint to the variant filtration step. We conclude that algorithms at each step of genomic variation discovery and characterization need to treat ambiguous mappings in a deterministic fashion to ensure full replication of results.
@article{firtina_genomic_2016,title={On genomic repeats and reproducibility},volume={32},issn={1367-4803},url={https://academic.oup.com/bioinformatics/article/32/15/2243/1743552},doi={10.1093/bioinformatics/btw139},number={15},journal={Bioinformatics},author={Firtina, Can and Alkan, Can},month=aug,year={2016},pages={2243--2247},}