THUG hosts an Introduction to Apache Spark

Posted in Digital Transformation
By Laura Leslie on July 11, 2014

Pardis Noorzad, a data scientist and Rubikrew member, attended the Introduction to Apache Spark meetup this past Tuesday evening and came back with some interesting details on Spark, an open-source cluster computing framework for data analytics. With its 1.0 version officially released less than two months ago, the framework is drawing considerable attention in the tech world.

The guest speaker, Matei Zaharia, is an assistant professor at MIT and CTO of Databricks. Pardis was impressed with his talk and his Databricks Cloud demo. I asked Pardis to share a bit about her experience at the meetup, her opinions on Spark, and how she feels it could affect us here at Rubikloud.

Q) Compared to other meetups, why were you excited to attend the Intro to Apache Spark talk?

A) I have been looking forward to this meetup for some time. Spark is not easy to ignore. Matei’s talk was well organized and thorough. Thanks to THUG and BNotions for organizing the event.

Q) What were you most surprised to learn?

A) The Notebook and Dashboard functionality on Databricks Cloud does a great job of abstracting away the complexities of real-time, distributed data aggregation and analysis. The ease with which the demo tied its tasks together was very impressive: streaming tweets, training a model on the Wikipedia corpus to score words by their similarity to “FIFA”, using the trained model to filter the live stream, and finally plotting the filtered stream in real time.

Q) Generally, what is the biggest benefit you see coming from Spark?

A) I’m looking forward to learning more about Google’s Cloud Dataflow, but I doubt Google will make that cluster computing framework open source. Databricks Cloud is built on open-source Spark, which is said to be one of the most active open-source projects at Apache.

Q) How could this change the way you and others work?

A) As a startup, the faster we can run experiments that show, as a proof of concept (POC), the value in our customers’ data, the better. Generalized solutions like Spark that provide stream processing, graph processing and machine learning out of the box are ideal for our lean and agile strategy.

Q) What did you think about the speaker? What were your key takeaways from his talk?

A) Matei Zaharia is the ideal kind of data scientist: he combines the humility, knowledge and openness expected of an academic with the skill, insight and practicality of an industry engineer. His PhD thesis is definitely worth a read.

Q) Do you think we will be using Spark or something similar soon?

A) We use a variety of open-source frameworks combined with our proprietary software in our data pipeline. Developments in Spark are certainly exciting, and we hope to take advantage of its potential.