
How To Run HDFS On Kubernetes To Speed Up Spark | Pepperdata


There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks.
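The locality idea described above can be sketched as a toy scheduler. This is not Spark's actual scheduling code; the function and node names are illustrative, assuming only that HDFS reports which nodes hold each input block's replicas:

```python
# Toy sketch of HDFS data locality (not Spark's real scheduler).
# Given each task's input-block replica locations, as the HDFS NameNode
# would report them, and the set of live executor nodes, prefer a node
# that already stores the data on its local disk.

def assign_tasks(block_locations, executor_nodes):
    """Map each task to a node, preferring node-local placement."""
    assignments = {}
    for task, replicas in block_locations.items():
        local = [n for n in replicas if n in executor_nodes]
        # Node-local when possible; otherwise fall back to any live node.
        assignments[task] = local[0] if local else sorted(executor_nodes)[0]
    return assignments

blocks = {
    "part-0": ["node-a", "node-b"],  # replica locations for this block
    "part-1": ["node-c"],
    "part-2": ["node-x"],            # no live executor holds this block
}
print(assign_tasks(blocks, {"node-a", "node-b", "node-c"}))
```

Reading the block from a node that already holds a replica avoids shipping it over the network, which is where the speedup discussed in this talk comes from.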

Learn why enterprise clients use Pepperdata products and services: https://www.pepperdata.com/

#hadoopdistributedfilesystem #kubernetes #pepperdata

Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.



More from the episode:

Hi everyone! My name is Kimoon Kim. I'm a software engineer working for Pepperdata. At Pepperdata we help people improve the performance of big data clusters. That's our main focus. But we are also interested in exploring new big data platforms.

In particular, Pepperdata and several other companies have been building a new big data stack on Kubernetes. Together, we made it possible to run Spark on Kubernetes. Now, we are adding HDFS, the Hadoop Distributed File System, to the big data stack on Kubernetes. Spark and HDFS should work closely together.

So, we had to fix a few issues in how they work together on Kubernetes, and I'm going to talk about that today. Here's the outline. First, I'll do a quick introduction to Kubernetes, if such a thing is possible. Then I'll talk about how we run Spark and HDFS on Kubernetes. There is a short demo. And finally, we'll discuss some issues that we ran into: HDFS data locality was somehow broken initially, and secure HDFS support was missing.

We fixed both, and I'll explain how. But first, what is Kubernetes? How many of you are familiar with Kubernetes? Okay, that's a surprisingly good number. Yay! Alright, so Kubernetes is new cluster management software open sourced by Google in 2014. It manages many computers and runs many programs using those computers.

In that sense it is similar to YARN or Mesos. What sets Kubernetes apart is that when it runs programs, it puts them in special runtime environments called Linux containers, such as Docker or rkt containers. Containers and Docker are very popular, so people wanted to run them not only on a single computer but across a cluster of computers, and Kubernetes does just that. It is actually designed and built for containers, based on Google's internal experience of running containers for ten years.
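As a concrete illustration of "runs programs in containers across a cluster," here is a minimal Kubernetes pod spec. The pod name and image tag are placeholders, not anything from the talk; Kubernetes would pick a node in the cluster and run this container there in its own isolated environment:

```yaml
# Minimal illustrative pod spec (names and image are examples).
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example
spec:
  containers:
  - name: main
    image: apache/spark:3.5.0   # any container image works here
    command: ["sleep", "3600"]  # stand-in for a real workload
```

Submitting this with `kubectl apply -f pod.yaml` is the cluster-wide analogue of `docker run` on a single machine.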

They had been doing that very secretly. Ever since Kubernetes was released, a lot of people joined the project and built a big community around it. Okay, that's cool. But what's the benefit of using containers? Many of us have had this bad experience: my Spark job suddenly failed with a ClassNotFoundException because someone installed a new version of Spark.

Or, you know, I had to change my Tomcat server port because someone else was running another Tomcat on the same host. Hey, stop touching my stuff, you'd say, especially when it has been working fine. I love Batman, but I'm not endorsing the violence here. There's got to be a better way, right? Containers solve this problem fundamentally.

They create more isolation layers between programs using the virtualization technologies that virtual machines are based on. But unlike virtual machines, containers are still very fast, because they picked a specific set of technologies that are efficient and lightweight. First, each program gets a virtual file system that contains an independent set of software packages. This is better known as a Docker image. This way, the other person can install new packages only in his container, without affecting my program at all...
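The "independent set of software packages" point can be sketched as a Dockerfile. This is a hypothetical example, not from the talk; the jar name is a placeholder. Everything installed here lives only in images built from this file, never on the host or in anyone else's container:

```dockerfile
# Hypothetical image: containers built from it get their own file system
# with exactly these packages, independent of the host's installed software.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y openjdk-11-jre-headless
COPY my-spark-job.jar /opt/job/
```

Two teams can run containers built from different Dockerfiles on the same node, each seeing its own Java version and libraries, which is exactly the isolation the speaker is describing.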


Check out our blog: https://www.pepperdata.com/blog/

/////////////////////////////////////////////////////////////////////////////////////////

Connect with us:
Visit Pepperdata Website: https://www.pepperdata.com/
Follow Pepperdata on LinkedIn:   / pepperdata  
Follow Pepperdata on Twitter:   / pepperdata  
Like Pepperdata on Facebook:   / pepperdata  
