Big Data Machine-Learning Models of Hadoop Cluster Behavior

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semistructured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.

Learn why Enterprise clients use Pepperdata products and Services: https://www.pepperdata.com/

#bigdataanalytics #applicationperformancemanagement #pepperdata

Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Big data has one or more of the following characteristics: high volume, high velocity, or high variety. For example, big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media. Much of it is generated in real time and at a very large scale.

Analysis of big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Businesses can use advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing to gain new insights from previously untapped data sources independently or together with existing enterprise data.

More on the episode:
I'm Shawn. I'm the CTO and cofounder of Pepperdata. I spent many, many years before then working in web search. Obliquely relevant is that, while I was working on web search at Yahoo, my group, the Yahoo Web Search Team, was actually the first deployment of Hadoop.

And back then we were happy if it stayed up on 10 nodes for a day. So, it’s better now. And when we get into the really technical details of what we're talking about, Shekhar from our engineering team is going to talk about that. And Shekhar actually has a Ph.D. in optimizing distributed systems.

So, ridiculously relevant. Just one very brief thing about what our company is: Pepperdata is the big data performance company. So, we're very, very focused on performance. Our software's been deployed on over 15,000 production nodes. And this is very relevant, because we're receiving a huge amount of telemetry from all these nodes, and that's what has actually allowed us to do the data science on top of this.

And you can see a selection of our customers, as is customary in such slides, at least the ones who let me tell you about them. So, what are we going to talk about? I'm going to talk about one particular aspect of cluster behavior: in particular, what happens with swapping, and then what I call the bad form of swapping, thrashing.

So, what are these in general? How specifically do they manifest in Hadoop? I'm going to talk about knowing when swapping is a good thing and when it's okay, and when it's the bad form, also known as NL, and I'll continue to simply use thrashing. And how can it be avoided in kind of a basic way?

And then, the thrust of our work: a more automatic, machine-learned approach to actually detect and avoid swapping. And then we'll talk a little bit about the results of that, wrap up, and make sure to have time for questions at the end. So, the first thing I have to define, just to get us on the same page: what is swapping?

So, swapping is what happens when your active need for RAM from all your user programs is greater than the amount of physical memory that you actually have. And so, when the active use of RAM is bigger, the OS will happily take a few of those pages that aren't actively being used and put them out to disk, and the pages that you do need will be pulled back in from disk.
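To make the idea concrete, here is a minimal sketch (not Pepperdata's method) of how swap activity could be measured on a single Linux host. It assumes the kernel's cumulative `pswpin` and `pswpout` page counters from `/proc/vmstat`; the function names and the 1,000 pages-per-second threshold are hypothetical, chosen only for illustration.

```python
from typing import Dict


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse the 'name value' lines of /proc/vmstat into a dict of counters."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before: Dict[str, int], after: Dict[str, int],
               seconds: float) -> Dict[str, float]:
    """Pages swapped in and out per second between two counter samples."""
    return {
        "in_per_s": (after["pswpin"] - before["pswpin"]) / seconds,
        "out_per_s": (after["pswpout"] - before["pswpout"]) / seconds,
    }


def looks_like_thrashing(rates: Dict[str, float],
                         threshold: float = 1000.0) -> bool:
    """Hypothetical rule of thumb: heavy paging in BOTH directions at once.

    The threshold is an illustrative assumption, not a value from the talk.
    """
    return rates["in_per_s"] > threshold and rates["out_per_s"] > threshold


def read_vmstat(path: str = "/proc/vmstat") -> Dict[str, int]:
    """Read the live counters on a Linux host."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

On a worker host, one might call `read_vmstat()` twice a few seconds apart and pass both samples to `swap_rates`. Pages going out alone can be harmless, since idle pages are just being moved to disk; sustained traffic in both directions is the pattern this sketch treats as suspicious.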

So, this is called swapping in and swapping out. In Hadoop, this certainly can happen, and it happens with some frequency, as we've observed. In Hadoop, of course, you've got your scheduler running on your resource manager that's throwing containers out to all the nodes. And you've got a plurality of worker hosts, typically running node managers...

Check out our blog: https://www.pepperdata.com/blog/

/////////////////////////////////////////////////////////////////////////////////////////

Connect with us:
Visit Pepperdata Website: https://www.pepperdata.com/
Follow Pepperdata on LinkedIn:   / pepperdata
Follow Pepperdata on Twitter:   / pepperdata
Like Pepperdata on Facebook:   / pepperdata