dc.contributor.advisor: Johansen, Dag
dc.contributor.author: Viken Valvåg, Steffen
dc.date.accessioned: 2012-01-31T09:27:02Z
dc.date.available: 2012-01-31T09:27:02Z
dc.date.issued: 2012-01-30
dc.description.abstract [en]: MapReduce has become a widely employed programming model for large-scale data-intensive computations. Traditional MapReduce engines employ dynamic routing of data as a core mechanism for fault tolerance and load balancing. An alternative mechanism is static routing, which reduces the need to store temporary copies of intermediate data, but requires a tighter coupling between the components for storage and processing. The initial intuition motivating our work is that reading and writing less temporary data could improve performance, while the tight coupling of storage and processing could be leveraged to improve data locality. We therefore conjecture that a high-performance MapReduce engine can be based on static routing, while preserving the non-functional properties associated with traditional engines. To investigate this thesis, we design, implement, and experiment with Cogset, a distributed MapReduce engine that deviates considerably from the traditional design. We evaluate the performance of Cogset by comparing it to a widely used traditional MapReduce engine using a previously established benchmark. The results confirm our thesis that a high-performance MapReduce engine can be based on static routing, although analysis indicates that the reasons for Cogset's performance improvements are more subtle than expected. Through our work we develop a better understanding of static routing, its benefits and limitations, and its ramifications for a MapReduce engine. A secondary goal of our work is to explore how higher-level abstractions that are commonly built on top of MapReduce will interact with an execution engine based on static routing. Cogset is therefore designed with a generic, low-level core interface, upon which MapReduce is implemented as a relatively thin layer, as one of several supported programming interfaces.
At its core, Cogset provides a few fundamental mechanisms for reliable and distributed storage of data, and parallel processing of statically partitioned data. While this dissertation mainly focuses on how these capabilities are leveraged to implement a distributed MapReduce engine, we also demonstrate how two other higher-level abstractions were built on top of Cogset. These may serve as alternative access points for data-intensive applications, and illustrate how some of the lessons learned from Cogset can be applicable in a broader context.
dc.description.doctoraltype [en]: ph.d.
dc.description.popularabstract [en]: MapReduce is an established and popular programming model for data-intensive distributed computations. This thesis presents a new, experimental engine for MapReduce computations with an unconventional design based on static routing of data. The performance of our experimental engine is compared against a widely used traditional engine using a previously established benchmark, and we find large performance improvements with our engine. Based on this evaluation, we conclude that a high-performance MapReduce engine can be based on our unconventional design principles. We also show how higher-level abstractions for programming distributed systems can be built with our engine as an underlying foundation.
dc.description [en]: The papers of this thesis are not available in Munin:
1. Steffen Viken Valvåg and Dag Johansen: 'Oivos : simple and efficient distributed data processing' (2008). In Proceedings of the 2008 Tenth IEEE International Conference on High Performance Computing and Communications (HPCC 2008), pages 113–122. IEEE Computer Society. Available at http://dx.doi.org/10.1109/HPCC.2008.105
2. Steffen Viken Valvåg and Dag Johansen: 'Update Maps : a new abstraction for High-Throughput Batch processing' (2009). In Proceedings of the 2009 IEEE International Conference on Networking, Architecture, and Storage (NAS 2009), pages 431–438. IEEE Computer Society. Available at http://dx.doi.org/10.1109/NAS.2009.73
3. Steffen Viken Valvåg and Dag Johansen: 'Cogset : a unified engine for reliable storage and parallel processing' (2009). In Proceedings of the 2009 Sixth IFIP International Conference on Network and Parallel Computing (NPC 2009), pages 174–181. IEEE Computer Society. Available at http://dx.doi.org/10.1109/NPC.2009.23
4. Steffen Viken Valvåg, Dag Johansen, and Åge Kvalnes: 'Cogset vs. Hadoop : measurements and analysis' (2010). In Proceedings of the 2010 Second IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010), pages 768–775. IEEE Computer Society. Available at http://dx.doi.org/10.1109/CloudCom.2010.103
dc.identifier.isbn: 978-82-8236-054-8
dc.identifier.isbn: 978-82-8236-055-5
dc.identifier.uri: https://hdl.handle.net/10037/3817
dc.identifier.urn: URN:NBN:no-uit_munin_3527
dc.language.iso [en]: eng
dc.publisher [en]: Universitetet i Tromsø
dc.publisher [en]: University of Tromsø
dc.rights.accessRights: openAccess
dc.rights.holder: Copyright 2012 The Author(s)
dc.rights.uri [en_US]: https://creativecommons.org/licenses/by-nc-sa/3.0
dc.rights [en_US]: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.subject [en]: VDP::Mathematics and natural science: 400::Information and communication science: 420::Communication and distributed systems: 423
dc.subject [en]: MapReduce
dc.subject [en]: VDP::Mathematics and natural science: 400::Information and communication science: 420::Databases and multimedia systems: 428
dc.subject [en]: Parallel Databases
dc.subject [en]: VDP::Mathematics and natural science: 400::Information and communication science: 420::Theoretical computer science, programming languages and programming theory: 421
dc.subject [en]: Programming Models
dc.subject [en]: VDP::Mathematics and natural science: 400::Information and communication science: 420::Communication and distributed systems: 423
dc.subject [en]: Distributed Data Processing
dc.title [en]: Cogset : A High-Performance MapReduce Engine
dc.type [en]: Doctoral thesis
dc.type [en]: Doktorgradsavhandling


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)