zbMATH — the first resource for mathematics

MULKSG: MULtiple K Simultaneous Graph assembly. (English) Zbl 1416.92120
Holmes, Ian (ed.) et al., Algorithms for computational biology. 6th international conference, AlCoB 2019, Berkeley, CA, USA, May 28–30, 2019. Proceedings. Cham: Springer. Lect. Notes Comput. Sci. 11488, 125-136 (2019).
Summary: This work shows how to parallelize multi K de Bruijn graph genome assembly simultaneously, removing the bottleneck of iterative multi K assembly. The expected execution time on a single node with 40 cores is variable, with the average execution time for the entire pipeline over 16 datasets tested being 1613 s for SPAdes vs. 1581 s for MULKSG, with the MULKSG graph creation and traversal averaging 15% faster than SPAdes. We implement a multi-node implementation for the graph creation and traversal portions of the assembly, showing the speedups in Fig. 4. We show that when implemented correctly with correction phases performed per graph in parallel, the expected outcome is very close to the original method, in some cases having less errors while keeping the same NGA50 and genome coverage %. We show this works in practice, implementing with the popular genome assembler SPAdes. Further, this algorithmic change gets rid of the single node sequential bottleneck on multi K genome assembly, allowing for the use of parallel error correction, graph building, graph correction, and graph traversal. We implement a parallel version of the assembly and show the statistics are the same as when run on a single node. The code is open source and can be found at https://github.com/cwright7101/mulksg.
For the entire collection see [Zbl 1416.92001].
92D10 Genetics and epigenetics
68W10 Parallel algorithms in computer science
05C90 Applications of graph theory
Full Text: DOI