Virtualization separates resources and services from the underlying physical delivery environment, enabling you to create many virtual systems within a single physical system. One of the primary reasons that companies have implemented virtualization is to improve the performance and efficiency of processing a diverse mix of workloads.

The big data hypervisor

In an ideal world, you don’t want to worry about the underlying operating system and the physical hardware. A hypervisor is the technology responsible for ensuring that resource sharing takes place in an orderly and repeatable way.

The hypervisor sits at the lowest levels of the hardware environment and uses a thin layer of code to enable dynamic resource sharing. The hypervisor makes it seem like each operating system has the physical resources all to itself.

In the world of big data, you may need to support many different operating environments. The hypervisor becomes an ideal delivery mechanism for the technology components of the big data stack. The hypervisor lets you run the same application on lots of systems without having to physically copy that application onto each system.

As an added benefit, the hypervisor architecture can load different operating systems as though they were just another application. So, the hypervisor is a very practical way of getting things virtualized quickly and efficiently.

The guest operating systems are the operating systems running on the virtual machines. With virtualization technology, you can set up the hypervisor to split the physical computer’s resources. Resources can be split 50/50 or 80/20 between two guest operating systems, for example.
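
To make this concrete, here is a minimal Python sketch of the arithmetic behind such a split. It is purely illustrative: the host sizes are assumed values, and real hypervisors (KVM, ESXi, Hyper-V, and so on) expose the split through their own management tools.

    # Illustrative only: divide a host's physical resources between
    # two guest operating systems. Host sizes below are assumptions.
    HOST_CPUS = 16
    HOST_MEMORY_GB = 64

    def split_resources(share_a, share_b):
        """Return (vcpus, memory_gb) allocations for two guests."""
        assert abs(share_a + share_b - 1.0) < 1e-9, "shares must cover the host"

        def allocate(share):
            return round(HOST_CPUS * share), round(HOST_MEMORY_GB * share)

        return allocate(share_a), allocate(share_b)

    # An 80/20 split between two guests:
    for name, (vcpus, mem) in zip(("guest A", "guest B"),
                                  split_resources(0.8, 0.2)):
        print(f"{name}: {vcpus} vCPUs, {mem} GB RAM")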

The beauty of this arrangement is that the hypervisor does all the heavy lifting. The guest operating system doesn’t care that it’s running in a virtual partition; it thinks it has a computer all to itself.
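
In fact, about the only trace a guest may notice is a CPU feature flag. The short sketch below assumes a Linux guest (the /proc/cpuinfo path and the "hypervisor" flag are standard Linux behavior; everything else here is illustrative) and checks for that hint:

    # Illustrative sketch, assuming a Linux guest: virtualized CPUs
    # advertise a "hypervisor" feature bit, which the kernel lists
    # among the flags in /proc/cpuinfo.
    def running_under_hypervisor(cpuinfo_path="/proc/cpuinfo"):
        try:
            with open(cpuinfo_path) as f:
                return any("hypervisor" in line
                           for line in f if line.startswith("flags"))
        except OSError:
            return False  # no cpuinfo available (for example, non-Linux)

    print("virtualized" if running_under_hypervisor() else "bare metal or unknown")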

You find basically two types of hypervisors:

  • Type 1 hypervisors run directly on the hardware platform. They achieve higher efficiency because no host operating system sits between them and the hardware.

  • Type 2 hypervisors run on top of a host operating system. They are often used when you need to support a broad range of I/O devices.

Abstraction and big data virtualization

For IT resources and services to be virtualized, they must be separated from the underlying physical delivery environment. This act of separation is called abstraction. Abstraction is a key concept in big data: MapReduce and Hadoop are distributed computing environments in which the detail is abstracted away, so the developer or analyst does not need to be concerned with where the data elements are actually located.
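
As a sketch of what that abstraction looks like in practice, here is a word-count mapper and reducer in the style of Hadoop Streaming, written in Python. Notice what is missing: neither script names a host, a disk, or a data block, because the framework decides where the data lives and routes it accordingly.

    #!/usr/bin/env python3
    # mapper.py (illustrative sketch): emit a (word, 1) pair per word.
    # The framework feeds lines to standard input; the code never asks
    # where those lines are physically stored.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py (illustrative sketch): sum the counts for each word.
    # The framework guarantees that all pairs for a word arrive together,
    # sorted by key, no matter which node produced them.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{count}")
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")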

Abstraction minimizes the complexity of something by hiding the details and providing only the relevant information. For example, if you were going to pick up someone whom you’ve never met before, he might tell you where to meet him and what he will be wearing. He doesn’t need to tell you where he was born, how much money he has in the bank, his birth date, and so on.

That’s the idea with abstraction — it’s about providing a high-level specification rather than going into lots of detail about how something works.

Implement virtualization to work with big data

Virtualization helps make your IT environment smart enough to handle big data analysis. By optimizing all elements of your infrastructure, including hardware, software, and storage, you gain the efficiency needed to access, manage, and analyze large volumes of structured and unstructured data in a distributed environment.

Big data assumes distribution. In practice, any MapReduce implementation will work better in a virtualized environment. You need the capability to move workloads around based on requirements for compute power and storage.

Virtualization will enable you to tackle larger problems that have not yet been scoped. You may not know in advance how quickly you will need to scale.

Virtualization will enable you to support a variety of operational big data stores. For example, a graph database can be spun up as an image.
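
As a minimal sketch of that idea, the snippet below uses the Docker SDK for Python to start a Neo4j graph database from a prebuilt image. The image name, ports, container name, and password here are illustrative assumptions, not recommendations from the book.

    # Illustrative sketch: spin up a graph database from an image.
    # Assumes Docker is running and the SDK is installed (pip install docker).
    import docker

    client = docker.from_env()
    container = client.containers.run(
        "neo4j:latest",                       # prebuilt graph-database image
        detach=True,                          # run in the background
        ports={"7474/tcp": 7474,              # HTTP browser interface
               "7687/tcp": 7687},             # Bolt protocol for queries
        environment={"NEO4J_AUTH": "neo4j/example-password"},
        name="bigdata-graph-store",
    )
    print(f"Graph store running in container {container.short_id}")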

The most direct benefit of virtualization is better scale and performance for MapReduce engines. Each Map and Reduce task executes independently of the others. If the MapReduce engine is parallelized and configured to run in a virtual environment, you can reduce management overhead and allow for expansions and contractions in the task workloads.

MapReduce itself is inherently parallel and distributed. By encapsulating the MapReduce engine in a virtual container, you can run what you need whenever you need it. With virtualization, you increase your utilization of the assets you have already paid for by turning them into generic pools of resources.
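
To see why independent tasks matter, here is a minimal MapReduce-style word count in pure Python, using a process pool as a stand-in for a cluster of virtual machines. This is a sketch of the execution model, not of Hadoop itself: because each map task touches only its own chunk, the pool can grow or shrink without any change to the logic.

    # Illustrative sketch: each map task processes its chunk
    # independently, which is what lets a scheduler spread tasks
    # across virtual machines and expand or contract the pool.
    from collections import Counter
    from multiprocessing import Pool

    def map_task(chunk):
        """Map: count words within one independent chunk of text."""
        return Counter(chunk.split())

    def reduce_task(counters):
        """Reduce: merge the per-chunk counts into one result."""
        total = Counter()
        for partial in counters:
            total += partial
        return total

    if __name__ == "__main__":
        chunks = ["big data needs scale", "virtual machines scale out",
                  "big data big compute"]
        with Pool() as pool:          # size the pool to the resources you have
            partials = pool.map(map_task, chunks)
        print(reduce_task(partials).most_common(3))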

About the book authors:

Judith Hurwitz is an expert in cloud computing, information management, and business strategy. Alan Nugent has extensive experience in cloud-based big data solutions. Dr. Fern Halper specializes in big data and analytics. Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.
