The Importance of Virtualization to Big Data
Solving big data challenges requires the management of large volumes of highly distributed data stores along with the use of compute- and data-intensive applications. Virtualization provides the added level of efficiency to make big data platforms a reality. Although virtualization is technically not a requirement for big data analysis, software frameworks are more efficient in a virtualized environment.
Virtualization has three characteristics that support the scalability and operating efficiency required for big data environments:
Partitioning: In virtualization, many applications and operating systems are supported in a single physical system by partitioning the available resources.
Isolation: Each virtual machine is isolated from its host physical system and other virtualized machines. Because of this isolation, if one virtual instance crashes, the other virtual machines and the host system aren't affected. In addition, data isn’t shared between one virtual instance and another.
Encapsulation: A virtual machine can be represented as a single file, so you can identify it easily based on the services it provides.
Big data server virtualization
In server virtualization, one physical server is partitioned into multiple virtual servers. The hardware and resources of a machine — including the random access memory (RAM), CPU, hard drive, and network controller — can be virtualized into a series of virtual machines that each runs its own applications and operating system.
A virtual machine (VM) is a software representation of a physical machine that can execute or perform the same functions as the physical machine. A thin layer of software is actually inserted into the hardware that contains a virtual machine monitor, or hypervisor.
Server virtualization uses the hypervisor to provide efficiency in the use of physical resources. Of course, installation, configuration, and administrative tasks are associated with setting up these virtual machines.
Server virtualization helps to ensure that your platform can scale as needed to handle the large volumes and varied types of data included in your big data analysis. You may not know the extent of the volume needed before you begin your analysis. This uncertainty makes the need for server virtualization even greater, providing your environment with the capability to meet the unanticipated demand for processing very large data sets.
In addition, server virtualization provides the foundation that enables many of the cloud services used as data sources in a big data analysis. Virtualization increases the efficiency of the cloud that makes many complex systems easier to optimize.
Big data application virtualization
Application infrastructure virtualization provides an efficient way to manage applications in context with customer demand. The application is encapsulated in a way that removes its dependencies from the underlying physical computer system. This helps to improve the overall manageability and portability of the application.
In addition, the application infrastructure virtualization software typically allows for codifying business and technical usage policies to make sure that each of your applications leverages virtual and physical resources in a predictable way. Efficiencies are gained because you can more easily distribute IT resources according to the relative business value of your applications.
Application infrastructure virtualization used in combination with server virtualization can help to ensure that business service-level agreements are met. Server virtualization monitors CPU and memory usage, but does not account for variations in business priority when allocating resources.
Big data network virtualization
Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, you can create multiple virtual networks all utilizing the same physical implementation. This can be useful if you need to define a network for data gathering with a certain set of performance characteristics and capacity and another network for applications with different performance and capacity.
Virtualizing the network helps reduce these bottlenecks and improve the capability to manage the large distributed data required for big data analysis.
Big data processor and memory virtualization
Processor virtualization helps to optimize the processor and maximize performance. Memory virtualization decouples memory from the servers.
In big data analysis, you may have repeated queries of large data sets and the creation of advanced analytic algorithms, all designed to look for patterns and trends that are not yet understood. These advanced analytics can require lots of processing power (CPU) and memory (RAM). For some of these computations, it can take a long time without sufficient CPU and memory resources.
Big data and storage virtualization
Data virtualization can be used to create a platform for dynamic linked data services. This allows data to be easily searched and linked through a unified reference source. As a result, data virtualization provides an abstract service that delivers data in a consistent form regardless of the underlying physical database. In addition, data virtualization exposes cached data to all applications to improve performance.
Storage virtualization combines physical storage resources so that they are more effectively shared. This reduces the cost of storage and makes it easier to manage data stores required for big data analysis.