
dc.contributor.advisor: Bongo, Lars Ailo
dc.contributor.author: Fagerli, Jarl
dc.date.accessioned: 2016-07-01T10:32:02Z
dc.date.available: 2016-07-01T10:32:02Z
dc.date.issued: 2016-05-31
dc.description.abstract: In light of recent years’ exploding data generation in life sciences, increasing downstream analysis capabilities is paramount to address the asymmetry of innovation in data creation contra processing capacities. Many contemporaneously used tools are sequential programs, ofttimes including convoluted dependencies leading to workflows crashing due to misconfiguration, detrimental to both development efforts and production, also inducing duplicate work upon re-execution. This thesis proposes a distributed and easy-to-use general framework for workflow creation and ad hoc parallelization of existing serial programs. In furtherance of reducing wall-clock time consumed by big data processing pipelines, its processing is horizontally scaled out, whilst supporting recovery and tool validation. COMBUSTI/O is a cloud- and HPC-ready framework for pipelined execution of unmodified third-party program binaries on Spark. It supports tool requirements of named input and output files, usage and redirection of standard streams, and combinations of these, as well as both coarse and fine granularity state recovery. Designed to run independently, its scalability is reduced to Spark and the underlying fault-tolerant big data frameworks. We evaluate COMBUSTI/O on real and synthetic workflows, demonstrating its propriety for facilitation of complex compute-intensive workflows, as well as its applicability for data-intensive and latency-sensitive workflows, and validate the coarse-grained recovery mechanism and its cost for the different flavors of workflows. We show stage recovery to be beneficial during development, for compute-intensive workflows, and for error-prone data-intensive workflows. Moreover, we show that the I/O overhead of COMBUSTI/O grows for data-intensive workflows, and that our remote tool execution is inexpensive. COMBUSTI/O is open-sourced at https://github.com/jarlebass/combustio, and currently used by SfB at the University of Tromsø. [en_US]
dc.identifier.uri: https://hdl.handle.net/10037/9361
dc.identifier.urn: URN:NBN:no-uit_munin_8919
dc.language.iso: eng [en_US]
dc.publisher: UiT Norges arktiske universitet [en_US]
dc.publisher: UiT The Arctic University of Norway [en_US]
dc.rights.accessRights: openAccess
dc.rights.holder: Copyright 2016 The Author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/3.0 [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) [en_US]
dc.subject.courseID: INF-3981
dc.subject: VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og – arbeid: 426 [en_US]
dc.subject: VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426 [en_US]
dc.title: COMBUSTI/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark [en_US]
dc.type: Master thesis [en_US]
dc.type: Mastergradsoppgave [en_US]



Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)