Monday, June 15, 2015

IBM Backs Apache Spark for Cloud Data Processing

IBM is putting its weight behind Apache Spark, an open source engine for large-scale data processing that is compatible with Hadoop data.

Apache Spark can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

IBM said Spark is potentially the most important new open source project in a decade that is being defined by data. As such, IBM plans to embed Spark into its Analytics and Commerce platforms, and to offer Spark as a service on IBM Cloud. The company said it will put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark.

“IBM has been a decades long leader in open source innovation. We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way,” said Beth Smith, General Manager, Analytics Platform, IBM Analytics. “Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation.”

http://www.ibm.com