Abstract
© 2023, Pergamon Press, Inc. The attached document (embargoed until 22/06/2025) is an author-produced version of a paper published in Expert Systems with Applications: An International Journal, uploaded in accordance with the publisher’s self-archiving policy. The final published version (version of record) is available online at the link. Some minor differences between this version and the final published version may remain. We suggest you refer to the final published version should you wish to cite from it.
Genome assembly is the computational process of merging short fragments of DNA (reads) into longer sequences called contigs. The rapid growth of high-throughput genome sequencing technologies and the large volumes of data they produce have, in recent years, shifted the genome assembly paradigm from shared-memory to distributed-memory systems. Among existing assembly algorithms, the iterative de Bruijn Graph (DBG) is a leading approach for assembling short reads. By exploiting the advantages of every k between kmin and kmax, this approach generates high-quality assemblies. However, this iteration slows assembly down, especially on larger data sets. RMI-DBG is an agile iterative de Bruijn Graph algorithm that combines the computational efficiency of de Bruijn Graph methods with the flexibility of overlap-based algorithms. In this paper, we propose a distributed iterative DBG model based on RMI-DBG, named DRMI-DBG. The proposed model addresses the problem of parallelizing de Bruijn Graph construction and processing on distributed-memory systems at each iteration of the algorithm. DRMI-DBG is a scalable iterative DBG framework that runs on a Hadoop cluster, using Spark (a batch processing engine) and Giraph (a distributed large-graph processing system). Experiments on a variety of real data sets show that DRMI-DBG is up to 4.8 times faster than the RMI-DBG algorithm and the IDBA-UD assembler, with comparable or better assembly quality. For further evaluation, the performance of the proposed model is compared with ScalaDBG, a state-of-the-art distributed assembler based on the multiple-k-value strategy.
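As a rough illustration of why the choice of k matters, the following minimal single-machine sketch builds one de Bruijn graph per k and counts branching nodes. It is not the DRMI-DBG or RMI-DBG implementation: the function names (`build_dbg`, `branching_nodes`, `sweep_k`) and the toy read are hypothetical, reverse complements and sequencing errors are ignored, and nothing is carried over from one k to the next.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a de Bruijn graph: every k-mer adds an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


def sweep_k(reads, k_min, k_max):
    """Build one graph per k in [k_min, k_max] and report its branching count."""
    return {k: branching_nodes(build_dbg(reads, k))
            for k in range(k_min, k_max + 1)}


if __name__ == "__main__":
    # Toy read (made up for illustration) containing the repeat "TGC".
    reads = ["ATGCGTGCA"]
    print(sweep_k(reads, 3, 5))
```

On this toy read, the graphs for k = 3 and k = 4 each contain one branching node caused by the repeat, while the graph for k = 5 has none. Larger k resolves repeats but needs longer overlaps between reads, which is the trade-off that motivates sweeping k from kmin to kmax.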
Original language | English |
---|---|
Article number | 233 |
Pages (from-to) | 120859 |
Number of pages | 28 |
Journal | Expert Systems with Applications |
Publication status | Published - 22 Jun 2023 |
Keywords
- Iterative de Bruijn Graph (DBG) algorithm
- Distributed graph assembly
- Spark
- Giraph
- Short read genome assembly