As Apache Spark enthusiasts and beer lovers, we Deezer folks attended the Spark Summit 2016 in Brussels. Here is a summary of what we learned in the different keynotes and talks that we thought might be worth sharing.
Simplifying using Spark 2.0
Matei Zaharia — Databricks CTO — walked us through the new features released with Spark 2.0. The main focus is on higher-level abstractions, notably the Dataset API, which helps (a short sketch follows this list):
- to map rows to typed classes instead of writing field-by-field mappings;
- to rely on the framework to optimize the DAGs instead of hand-writing low-level transformations;
- to get better performance via columnar storage optimizations (Parquet).
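As a minimal illustration of that typed API (the case class, column names and file path below are made up for the example, not taken from the talk), reading a Parquet file into a Dataset replaces manual field-by-field extraction:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema, used only for this sketch.
case class Track(id: Long, title: String, durationSec: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Columns are mapped to the case class fields by name:
    // no manual field-by-field extraction.
    val tracks = spark.read.parquet("/tmp/tracks.parquet").as[Track]

    // Transformations stay typed; Catalyst still optimizes the plan and the
    // columnar Parquet layout still allows pruning and filter push-down.
    val longTrackTitles = tracks.filter(_.durationSec > 300).map(_.title)

    longTrackTitles.show()
    spark.stop()
  }
}
```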
The next AMP Lab (Algorithms Machines People)
Ion Stoica — Executive Chairman at Databricks — presented the new academia-industry lab that will carry on AMPLab's work: RISE Lab (Real-time Intelligent Secure Execution), focused on real-time and decision-making systems.
There is not much to see for now, but there are still many resources on the AMPLab GitHub, and we hope RISE Lab will produce just as much material: https://github.com/amplab
MmmooOgle: From Big Data to Decisions for Dairy Cows
My favorite keynote at Spark Summit, since it was totally unexpected: Miel Hostens — a veterinarian — talked about his work measuring farm activity and building models based on all kinds of data (cow activity from sensors, weather…) in order to optimize milk production.
No spoiler, just watch the video.
Automatic checkpointing in Spark
Nimbus Goehausen — Software Engineer at Bloomberg — gave us an overview of one of their libraries used to improve the Spark development cycle. When running a long job, developers often spend their time waiting for the part of the code they are actually interested in to be executed. With spark-flow, users can save intermediate results of their job using checkpoints (a rough sketch of the idea follows below).
How does it work?
- a signature is computed for each RDD based on hash functions;
- if the signature has changed, the RDD is recomputed; otherwise the existing result is reused;
- the user needs to mark each checkpoint explicitly.
Github : https://github.com/bloomberg/spark-flow
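This is not spark-flow's actual API (see the repository above for that); the snippet below is only a rough sketch of the underlying idea, assuming the signature is derived from some description of the computation and that intermediate results are stored as Parquet:

```scala
import java.security.MessageDigest

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Rough sketch of signature-based checkpointing; NOT the spark-flow API.
object SignatureCheckpoint {

  // Signature of a step; here simply an MD5 hash of a textual description of
  // the computation (spark-flow derives it from the actual logic).
  private def signature(logic: String): String =
    MessageDigest.getInstance("MD5")
      .digest(logic.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  /** Reuse the saved result if the signature is unchanged, otherwise recompute and save it. */
  def checkpoint(spark: SparkSession, baseDir: String, logic: String)
                (compute: => DataFrame): DataFrame = {
    val path = new Path(s"$baseDir/${signature(logic)}")
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(path)) {
      spark.read.parquet(path.toString)   // signature unchanged: reuse the checkpoint
    } else {
      val df = compute
      df.write.parquet(path.toString)     // new or modified step: save the result
      df
    }
  }
}
```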
Dynamic on-the-fly modifications
Elena Lazovik — Research Scientist at TNO — covered the need to update deployed Spark applications without downtime, which can be a critical requirement, especially for streaming applications.
The different strategies
- using a parameter server to host constants and function arguments, which the workers request at runtime (sketched below)
- extending the Spark framework to allow hot swapping of the data source
- …
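This is not TNO's implementation; as a minimal sketch of the first strategy, the job below re-reads a tunable parameter from an external location at every micro-batch, so it can be changed without redeploying (the file path, port and threshold are invented for the example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.io.Source
import scala.util.Try

// Sketch of the "fetch parameters at runtime" strategy (not TNO's code):
// a tunable threshold is re-read from a driver-local file at every batch,
// so operators can change it while the streaming job keeps running.
object HotParameters {

  private def currentThreshold(path: String, default: Double): Double =
    Try {
      val src = Source.fromFile(path)
      try src.getLines().next().trim.toDouble finally src.close()
    }.getOrElse(default)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hot-parameters").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val values = ssc.socketTextStream("localhost", 9999)
      .flatMap(line => Try(line.trim.toDouble).toOption)

    values.foreachRDD { rdd =>
      // Re-read the parameter on the driver for each batch: no redeploy needed.
      val threshold = currentThreshold("/tmp/threshold.conf", default = 0.5)
      val kept = rdd.filter(_ > threshold).count()
      println(s"kept $kept values above threshold $threshold")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```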
Data-Aware Spark
Zoltan Zvara — researcher at the Hungarian Academy of Sciences — presented one of his projects, which helps Spark jobs stay balanced. The project consists of improving the "data awareness" of Apache Spark, i.e. its knowledge of how the data is actually partitioned.
How does it work?
- approximate the local key distribution and statistics on each worker (scalable sampling);
- send this information to the master;
- the master adjusts the partitions (driver responsibility).
Note that, because of the distribution computation, an Apache Spark application can slow down by 10% (in the worst cases); however, the overall stage runtime can be cut in half when the partitioning is done well.
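The researcher's work lives inside Spark itself; purely as an illustration of the sampling-then-rebalancing idea, here is a hand-rolled sketch in user code (the key names, thresholds and partition counts are invented):

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Rough illustration of "data-aware" partitioning, NOT the project's code:
// sample the key distribution, then give the heaviest keys their own partitions.
object SkewAwarePartitioning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-aware").setMaster("local[*]"))

    // Deliberately skewed toy data: one very hot key plus many small ones.
    val pairs = sc.parallelize(Seq.fill(100000)(("hot", 1)) ++ (1 to 1000).map(i => (s"k$i", 1)))

    // 1) Approximate the key distribution from a sample.
    val sampledCounts = pairs.sample(withReplacement = false, fraction = 0.01).countByKey()
    val hotKeys = sampledCounts.filter(_._2 > 100).keys.toSet

    // 2) Build a partitioner that isolates the hot keys in dedicated partitions.
    val basePartitions = 8
    val partitioner = new Partitioner {
      private val hot = hotKeys.toSeq.zipWithIndex.toMap
      override def numPartitions: Int = basePartitions + hot.size
      override def getPartition(key: Any): Int = {
        val k = key.toString
        hot.get(k).map(basePartitions + _)
          .getOrElse((k.hashCode & Integer.MAX_VALUE) % basePartitions)
      }
    }

    // 3) Use it for the shuffle so partitions end up better balanced.
    val balanced = pairs.reduceByKey(partitioner, _ + _)
    println(balanced.count())
    sc.stop()
  }
}
```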
SparkLint — a tool for monitoring, identifying and tuning inefficient Spark jobs
One of the takeaways of the Spark Summit was the SparkLint tool released by Groupon, which helps analyze how efficiently a Spark deployment is configured in order to optimize parallelization and resource allocation.
Objective
- Maximize core usage and minimize idle time (non-computation work, i.e. driver-node interaction)
- The measured CPU usage should get close to the allocated cores when a job is optimized (core-usage time series), meaning we are not over-allocating resources. If idle periods (grey areas) appear on the chart, the core-usage value will be low too (see the listener sketch below).
An efficient job looks like:
- core usage > 60%
- idle time < 1%
- CPU bound raised
Future features
- increased job & stage details
- recommendations
- auto-tuning
Github: https://github.com/groupon/sparklint
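SparkLint itself is attached and configured as described in its README; the snippet below is only a minimal sketch of the kind of SparkListener such tools build on, accumulating task run time so it can be compared with the allocated core time:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Minimal sketch (not SparkLint's code) of a listener that accumulates task
// run time; core usage ≈ total task time / (allocated cores * wall-clock time).
class CoreUsageListener extends SparkListener {

  val totalTaskTimeMs = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    if (info != null && info.finishTime > 0) {
      totalTaskTimeMs.addAndGet(info.finishTime - info.launchTime)
    }
  }
}
```

Such a listener can be registered through the standard spark.extraListeners configuration, passing its fully qualified class name.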
Mastering Spark unit testing
Unit testing is mandatory to deploy Spark jobs with confidence. Moreover, it can help you during the development phase: you can use tests to run experiments instead of relaunching the entire script each time. That's why Ted Malaska — data architect at Blizzard — explained to us how to write efficient unit tests for Spark.
Key points
- Run tests locally with a shared context (see the sketch after this list)
- Test data can be user-defined, sampled from production, or produced by generators
- Test the true data distribution on a dev environment or a Docker cluster
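This is not necessarily the exact setup shown in the talk; as a minimal sketch, a ScalaTest suite can share one local SparkSession across its tests and run on user-defined data (the tested logic below is invented for the example):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Minimal local-mode Spark test with a context shared across the suite.
class WordCountSpec extends FunSuite with BeforeAndAfterAll {

  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .appName("unit-tests")
      .master("local[2]")          // runs locally, no cluster needed
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) spark.stop()
  }

  test("counts words from a small user-defined dataset") {
    val counts = spark.sparkContext
      .parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()

    assert(counts("b") === 2)
  }
}
```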
Spark streaming at Bing scale
Kaarthik Sivashanmugam — Software Engineer at Microsoft — gave us a clear presentation about the new architecture of the stream-processing pipeline at Bing. Kaarthik explained how they managed to switch from a batch-processing data pipeline to a near-real-time one.
What did we take away?
They developed an open-source project, Mobius, which provides C# and F# bindings for Apache Spark. If you are a .NET developer, you can now write Apache Spark applications.
Github: https://github.com/Microsoft/Mobius
SparkOscope — enabling Spark optimization through cross-stack monitoring and visualization
Spark monitoring was one of the main subjects of this Spark Summit. Yiannis Gkoufas — Research Software Engineer at IBM — presented SparkOscope, which extends the native Spark UI in order to provide more metrics to the user.
What does SparkOscope bring in?
- OS-level metrics
- an enriched web UI that plots all the available metrics plus the new OS-level ones
Some other features were also announced as being in development:
- a pluggable storage mechanism (HBase, MongoDB…) instead of HDFS
- smart recommendations based on metric patterns
Github: https://github.com/ibm-research-ireland/sparkoscope
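SparkOscope plugs into Spark's metrics system and UI; without speaking for its own configuration, here is what Spark's generic pluggable metrics setup looks like in conf/metrics.properties, using the built-in CSV sink and JVM source:

```properties
# Standard Spark metrics configuration (conf/metrics.properties), shown as a
# generic example of the pluggable metrics system; not SparkOscope's own sink.
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Also expose JVM-level metrics for every executor.
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```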
Apache Spark at Deezer?
It was our first participation in a Spark Summit, and it confirmed our feeling that Apache Spark is going in the right direction, and going there fast.
Currently, Apache Spark is becoming the first option for ETL jobs at Deezer, and resource allocation and job optimization are our top priorities to ensure a reliable platform. It felt good to see that most Apache Spark users are facing the same situation.
Another big topic was, of course, Apache Spark 2.0, which sounds promising; even though adoption is not there yet, we feel it will improve code quality.
We were a bit disappointed not to hear more talks about deployment and infrastructure aspects (YARN, Mesos and Kubernetes).
By the way, are you interested in using Apache Spark on real use cases? We're hiring. 😉
Eloïse Gomez, Jullian Bellino & Nicolas Landier