Apache Spark Optimization Techniques Performance Tuning | Pepperdata

Pepperdata

Learn more about Spark Optimization in our Pepperdata Big Data Performance Report 2020: https://www.pepperdata.com/2020bigd...

Get Spark Performance Tuning Tips from a Veteran Field Engineer
https://www.pepperdata.com/blog/spark...

#sparkoptimization #bigdataperformancereport #pepperdata

00:00:01:15 00:00:08:12
Hello, this is Alex Pierce, field engineer with Pepperdata, and this is what you should know about Spark optimization.

00:00:12:15 00:00:46:21
Why Apache Spark? Several reasons. First of all: speed. Compared to traditional Hadoop ETL-type batch workloads, Apache Spark is approximately a hundred times faster for in-memory work and ten times faster on disk, due to the efficiency of its pipelined, distributed architecture. It is easy to use and available in many languages, including Java, Scala, R, SQL, and Python, which is now the most popular language for interfacing with Spark. Next, generality.

00:00:46:23 00:01:22:20
It has libraries, including access through SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and all of these libraries can be combined within a single application. Also, flexibility. It runs on many platforms, including the Hadoop YARN scheduler, Apache Mesos, Kubernetes, standalone, or in the cloud, and it provides access to many data sources: HDFS, S3, Alluxio, Cassandra, HBase, Hive, and other relational and nonrelational databases.

00:01:23:11 00:02:41:07
However, there are some challenges to Spark. One of the things Pepperdata has observed is that Spark jobs tend to fail more than other jobs. As you can see here in this chart, taken from our big data performance report, Spark is approximately four to seven times more likely to fail than other applications that we have observed within our customer base. So this is about Spark optimization. How do you do this? One of the most important parts is observability. In order to understand what needs to be optimized, you need to understand where the opportunities for optimization are and what needs to be changed. For example, look at memory utilization. A tool can tell you exactly how your memory could be optimized: maybe you need to use more memory because you're seeing garbage collection, or maybe you need to use less memory because you're asking for more than you are actually utilizing, thereby causing problems and queuing in multitenant environments. Spark is also sensitive to data skew. In a highly distributed, parallelized application such as Spark, data skew can be very painful, causing parts of your application to last much longer than they should and causing other compute resources to sit idle in the meantime.
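The memory trade-off described above (more memory when garbage collection is heavy, less when the request far exceeds actual use) can be sketched in plain Python. This is an illustrative sketch only, not Pepperdata's method and not a Spark API: the `ExecutorMemoryStats` record, the `memory_recommendation` function, and both thresholds are hypothetical names and values chosen for the example, and the input numbers would have to come from whatever observability tool is in use.

```python
from dataclasses import dataclass


@dataclass
class ExecutorMemoryStats:
    """Hypothetical per-executor metrics, as an observability tool might report them."""
    requested_mb: int        # memory requested for the executor
    peak_used_mb: int        # highest memory actually used during the run
    gc_time_fraction: float  # share of task time spent in garbage collection (0.0 to 1.0)


def memory_recommendation(stats, gc_threshold=0.10, waste_threshold=0.50):
    """Return 'increase', 'decrease', or 'keep' for the executor memory request.

    Both thresholds are assumptions for illustration, not recommended values.
    """
    # Heavy garbage collection suggests the executor is short on memory.
    if stats.gc_time_fraction > gc_threshold:
        return "increase"
    # Using well under the request wastes capacity other tenants could use.
    if stats.peak_used_mb < stats.requested_mb * waste_threshold:
        return "decrease"
    return "keep"


if __name__ == "__main__":
    print(memory_recommendation(ExecutorMemoryStats(8192, 7900, 0.25)))
    print(memory_recommendation(ExecutorMemoryStats(8192, 2048, 0.02)))
```

The garbage collection check comes first on purpose: an executor that is both over-provisioned on paper and spending a large share of its time in GC is more likely misconfigured than wasteful, so shrinking it would be the riskier suggestion.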

00:02:41:12 00:03:22:26
So being able to observe when there is data skew and take recommendations of what to do with this data skew is very important. So how do you measure success in optimizing your Spark workload? Observability is the key. You need to be able to say "hey, my applications are running without failures, my SLAs are being met consistently, and also my chosen observability tool no longer indicates there are problems with memory utilization, with data skew, or with other things where, while my application may work, it could work better and be a better tenant in the multitenant environment that most of us work in."
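One simple way to observe the data skew mentioned above is to compare the largest task in a stage with a typical one. The sketch below is a minimal illustration in plain Python under stated assumptions: `skew_ratio` and `is_skewed` are hypothetical helpers, the per-task sizes (record counts, bytes, or durations) are assumed to come from an observability tool, and the default threshold of 3.0 is arbitrary.

```python
from statistics import median


def skew_ratio(task_sizes):
    """Ratio of the largest task to the median task in one stage.

    task_sizes: per-task record counts, bytes, or durations (assumed input).
    A ratio near 1.0 means the work is spread evenly across tasks.
    """
    if not task_sizes:
        raise ValueError("task_sizes must not be empty")
    typical = median(task_sizes)
    largest = max(task_sizes)
    if typical == 0:
        # Most tasks did nothing; any non-empty task is extreme skew.
        return float("inf") if largest > 0 else 1.0
    return largest / typical


def is_skewed(task_sizes, threshold=3.0):
    """Flag a stage whose largest task is `threshold` times the median (illustrative cutoff)."""
    return skew_ratio(task_sizes) >= threshold


if __name__ == "__main__":
    sizes = [100, 110, 90, 105, 1000]  # one straggler task
    print(round(skew_ratio(sizes), 2), is_skewed(sizes))
```

The median is used instead of the mean so that a single oversized task does not inflate the baseline it is being compared against.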

00:03:23:21 00:03:48:21
If you want to learn more about how to optimize your big data clusters and gain true cloud optimization, go visit Pepperdata.com and download our Pepperdata Big Data Performance Report. This is going to tell you a lot about what we see in terms of how the market is utilizing big data and what you can do to improve your positioning and performance within that space. Thank you very much for your time.


Pepperdata Big Data Performance Report 2020
https://www.pepperdata.com/2020bigd...

Learn why enterprise clients use Pepperdata products and services: https://www.pepperdata.com/

Check out our complete blog series: https://www.pepperdata.com/blog/

/////////////////////////////////////////////////////////////////////////////////////////

Connect with us:
Visit Pepperdata Website: https://www.pepperdata.com/
Follow Pepperdata on LinkedIn: /pepperdata
Follow Pepperdata on Twitter: /pepperdata
Like Pepperdata on Facebook: /pepperdata

posted by ResulSetl7