Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem, such as availability, productivity, and fault tolerance.
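To make the kind of workload concrete, the sketch below shows a simple iterative analytics operation (batch gradient descent) written with PySpark. This is an illustration only, not code from the article; the input path, iteration count, and step size are hypothetical. Each pass of such a loop runs as a distributed map/reduce job, which is representative of the Spark analytics operations whose performance is compared against native MPI implementations.

```python
# Hypothetical sketch (not from the article): an iterative analytics operation
# in PySpark. The dataset path, iteration count, and step size are made up.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-analytics-sketch").getOrCreate()
sc = spark.sparkContext

def parse(line):
    # Each row: "x1,x2,...,xd,label"
    vals = np.array(line.split(","), dtype=float)
    return vals[:-1], vals[-1]

# Cache the parsed points so every iteration reuses the in-memory RDD.
points = sc.textFile("hdfs:///data/points.csv").map(parse).cache()
n = points.count()
d = len(points.first()[0])

w = np.zeros(d)
for _ in range(10):  # fixed number of iterations, illustrative only
    # One distributed map/reduce per iteration: sum the per-point gradients.
    grad = (points
            .map(lambda p: (p[0].dot(w) - p[1]) * p[0])
            .reduce(lambda a, b: a + b))
    w -= 0.1 * grad / n  # plain gradient step with an arbitrary learning rate

spark.stop()
```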
Authors
Michael Anderson