Friday, September 25, 2015

Databricks: Apache Spark Outgrowing Hadoop

Standalone deployments of Spark now outnumber those on YARN as more users run Spark independently of Hadoop, according to a newly published survey of Spark users conducted by Databricks, the company founded by the creators of Apache Spark.

Databricks said that users running Spark in standalone mode (48 percent of respondents) outnumber those running Spark on YARN (40 percent of respondents). The survey also found that 51 percent of respondents run Spark on a public cloud.

Key findings from the survey include:

  • Spark is outgrowing Hadoop: The most common Spark deployments according to the community are: 48 percent standalone, 40 percent YARN within Hadoop and 11 percent Apache Mesos. Spark users who do not use any Hadoop components more than doubled in 2015 compared with 2014.
  • Streaming and advanced analytics uses rising: Spark is being used for an increasingly diverse set of applications, particularly by data scientists for machine learning, streaming and graph analysis use cases. In 2015, there are 56 percent more Spark Streaming users than in 2014. The production use of advanced analytics, like MLlib for machine learning and GraphX for graph processing, increased from 11 percent in 2014 to 15 percent in 2015. Seventy-five percent of Spark users are also using two or more Spark components, and 51 percent are using three or more.
  • Spark users are becoming more diverse:  Of those surveyed, 41 percent identified themselves as Data Engineers, while 22 percent of respondents identified themselves as Data Scientists. Spark users are solving a variety of problems in different languages -- Scala (71 percent), Python (58 percent), SQL (36 percent), Java (31 percent) and R (18 percent) -- and all within the same framework.
  • Spark's most popular use cases come to light: Fifty-two percent use Spark for data warehousing, 68 percent use it for business intelligence, 40 percent for processing application and system logs, 48 percent to build recommendation engines, 36 percent for user-facing services and 29 percent for fraud detection and security.
  • Spark is increasing access to big data: Ninety-one percent of those surveyed cite performance as a reason for adoption, while 77 percent cite ease of programming, 71 percent cite ease of deployment, 64 percent cite advanced analytics capabilities and 52 percent cite real-time streaming capabilities.

"The continued growth of Spark has been highly encouraging, as companies are going into production to obtain real business value, and they are doing so in a wide range of environments beyond Hadoop clusters," said Matei Zaharia, creator of Apache Spark and CTO of Databricks. "Databricks and our partners are 100 percent committed to the long-term growth of Spark and we'll continue to make improvements based on this survey data and our ongoing community feedback, to make the most complete big data analytics toolkit accessible to all businesses."

https://databricks.com