On Tuesday, MemVerge announced MemVerge Splash, an open source solution that allows shuffle data to be stored in an external storage system. MemVerge Splash is designed for Apache Spark users who want to improve the performance, flexibility and resiliency of the shuffle manager.
Traditionally, when shuffle data is stored remotely, network and storage bottlenecks can degrade system performance and stability. MemVerge Splash, working together with MemVerge's distributed system software, Distributed Memory Objects (DMO), addresses these issues with a high-performance in-memory storage and networking stack.
With MemVerge Splash, users can choose any external storage system as a remote shuffle service; separate the storage and network implementations from the shuffle procedure, so that different plugins can be applied for different storage systems and networks; separate storage from compute; and tolerate node failure.
MemVerge Splash allows shuffle data to be stored reliably by using pluggable storage and network backends and by maintaining a dedicated storage cluster. Splash also improves elasticity by allowing users to resize the computing cluster without interrupting their shuffle computation. This is particularly important when Kubernetes is used to schedule Spark tasks.
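To make the idea of a pluggable shuffle storage backend more concrete, here is a minimal Python sketch. It is an illustration only, not Splash's actual API: the names `ShuffleStorage`, `InMemoryStorage`, `SharedDirStorage`, `shuffle_write` and `shuffle_read` are all hypothetical, and integer keys are assumed so that partitioning stays deterministic. The point is that the map and reduce sides talk only to an interface, so the backend can be swapped without touching the shuffle logic.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class ShuffleStorage(ABC):
    """Hypothetical storage plugin interface (not Splash's real API)."""

    @abstractmethod
    def write(self, block_id: str, data: bytes) -> None:
        ...

    @abstractmethod
    def read(self, block_id: str) -> bytes:
        ...


class InMemoryStorage(ShuffleStorage):
    """Keeps blocks in a dict; stands in for an in-memory storage service."""

    def __init__(self):
        self._blocks = {}

    def write(self, block_id, data):
        self._blocks[block_id] = bytes(data)

    def read(self, block_id):
        return self._blocks[block_id]


class SharedDirStorage(ShuffleStorage):
    """Keeps one file per block; the directory stands in for a shared mount."""

    def __init__(self, root=None):
        # Assumption: a temporary directory plays the role of external storage.
        self._root = root or tempfile.mkdtemp(prefix="shuffle_")
        os.makedirs(self._root, exist_ok=True)

    def _path(self, block_id):
        return os.path.join(self._root, block_id)

    def write(self, block_id, data):
        with open(self._path(block_id), "wb") as f:
            f.write(data)

    def read(self, block_id):
        with open(self._path(block_id), "rb") as f:
            return f.read()


def shuffle_write(storage, map_id, records, num_partitions):
    """Map side: bucket (int_key, value) records and write one block per partition."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[key % num_partitions].append((key, value))
    for p, items in buckets.items():
        storage.write(f"shuffle_{map_id}_{p}", json.dumps(items).encode("utf-8"))


def shuffle_read(storage, num_maps, partition):
    """Reduce side: collect one partition's records from every map output."""
    out = []
    for m in range(num_maps):
        raw = storage.read(f"shuffle_{m}_{partition}").decode("utf-8")
        out.extend((k, v) for k, v in json.loads(raw))
    return out
```

Because the blocks live in the storage object rather than in the process that produced them, a reader can still fetch them after the writer has gone away, which is the property that lets a compute cluster shrink or lose nodes without recomputing the shuffle.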
"We engaged with the Spark community to identify their pain points and built MemVerge Splash with these in mind," said Charles Fan, founder and CEO of MemVerge. "There is no other solution currently on the market that can provide a complete solution to tackle the shuffle elasticity and performance problems like Splash. We welcome all users and developers to try and contribute to this new open source solution."
"We chose to work with MemVerge because of the company's deep understanding of big data applications and their ability to extract the most performance from the data," said Zhen Fan, senior technologist at JD.com. "Splash is an optimized shuffle manager for a large-scale Spark cluster. This solution improves shuffle performance and enables better tolerance of Spark node failures. With Splash, users can direct shuffle data to higher-performance external storage to avoid data loss when Spark nodes fail. This is especially useful for users who manage Spark clusters of thousands of nodes, such as JD.com."
MemVerge Splash works with MemVerge's DMO while remaining compatible with any third-party distributed storage system, such as HDFS or CephFS, and with any network stack. Splash also works with both on-premises and cloud deployments.
MemVerge's proprietary DMO technology provides a logical memory-storage convergence layer that leverages Intel's latest persistent memory technology, allowing data-intensive workloads to run at memory speed and to analyze and process large volumes of data in real time.
MemVerge Splash is now available and can be accessed online. Additionally, MemVerge is currently available via its beta program.