COMBUSTI/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark
Permanent link
https://hdl.handle.net/10037/9361
Date
2016-05-31
Type
Master thesis
Author
Fagerli, Jarl
Abstract
In light of recent years’ exploding data generation in life sciences, increasing
downstream analysis capabilities is paramount to address the asymmetry of
innovation in data creation contra processing capacities. Many contemporaneously
used tools are sequential programs, ofttimes including convoluted dependencies
leading to workflows crashing due to misconfiguration, detrimental to
both development efforts and production, also inducing duplicate work upon
re-execution.
This thesis proposes a distributed and easy-to-use general framework for workflow
creation and ad hoc parallelization of existing serial programs. In furtherance
of reducing wall-clock time consumed by big data processing pipelines,
its processing is horizontally scaled out, whilst supporting recovery and tool
validation. COMBUSTI/O is a cloud- and HPC-ready framework for pipelined
execution of unmodified third-party program binaries on Spark. It supports
tool requirements of named input and output files, usage and redirection of
standard streams, and combinations of these, as well as both coarse and fine
granularity state recovery. Designed to run independently, its scalability is reduced
to Spark and the underlying fault-tolerant big data frameworks.
We evaluate COMBUSTI/O on real and synthetic workflows, demonstrating its
propriety for facilitation of complex compute-intensive workflows, as well as its
applicability for data-intensive and latency-sensitive workflows, and validate
the coarse-grained recovery mechanism and its cost for the different flavors of
workflows. We show stage recovery to be beneficial during development, for
compute-intensive workflows, and for error-prone data-intensive workflows.
Moreover, we show that the I/O overhead of COMBUSTI/O grows for data-intensive
workflows, and that our remote tool execution is inexpensive.
COMBUSTI/O is open-sourced at https://github.com/jarlebass/combustio,
and currently used by SfB at the University of Tromsø.
Publisher
UiT The Arctic University of Norway
Copyright 2016 The Author(s)