Numerous combinations of deployment and delivery models exist for big data in the cloud. For example, you can use a public cloud IaaS or a private cloud IaaS. So, what does this mean for big data, and why is the cloud a good fit for it? Well, big data requires distributed clusters of compute power, and that is how the cloud is architected.
In fact, a number of cloud characteristics make it an important part of the big data ecosystem:
Scalability: Scalability with regard to hardware refers to the capability to go from small to large amounts of processing power with the same architecture. With regard to software, it refers to the consistency of performance per unit of power as hardware resources increase. The cloud can scale to large data volumes.
Distributed computing, an integral part of the cloud model, really works on a “divide and conquer” plan. So if you have huge volumes of data, they can be partitioned across cloud servers. An important characteristic of IaaS is that it can dynamically scale. This means that if you wind up needing more resources than expected, you can get them. This ties into the concept of elasticity.
Elasticity: Elasticity refers to the capability to expand or shrink computing resources in real time, based on demand. One of the benefits of the cloud is that customers can access as much of a service as they need, when they need it. This can be helpful for big data projects in which you might need more computing resources than you planned for to deal with the data.
Resource pooling: Cloud architectures efficiently pool computing resources so that they can be shared among many users, which is what makes the cloud economically viable.
Self-service: With self-service, the user of a cloud resource is able to use a browser or a portal interface to acquire the resources needed, say, to run a huge predictive model. This is dramatically different from how you might gain resources from a data center, where you would have to request the resources from IT operations.
Often low up-front costs: If you use a cloud provider, up-front costs can often be reduced because you’re not buying huge amounts of hardware or leasing new space for dealing with your big data. Because cloud providers pass along the economies of scale of their environments, the cloud can look attractive.
Pay as you go: A typical billing option for a cloud provider is Pay as You Go, which means that you are billed for resources used based on instance pricing. This can be useful if you’re not sure what resources you need for your big data project.
Fault tolerance: Cloud service providers should have fault tolerance built into their architecture, providing uninterrupted services despite the failure of one or more of the system’s components.
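The “divide and conquer” idea described under scalability can be made concrete with a small sketch. The Python below is an illustration only, not any provider's API: the function names, the sample sensor records, and the choice of hashing keys with CRC32 are all assumptions made for the example. It splits records across a fixed number of hypothetical servers, lets each one compute a partial result, and then combines the partial results.

```python
import zlib
from collections import defaultdict


def partition(records, num_servers):
    """Assign each (key, value) record to a server by hashing its key.

    Hypothetical helper: a real cloud platform does this placement for you.
    """
    shards = defaultdict(list)
    for key, value in records:
        # crc32 is deterministic, so the same key always maps to the same server.
        shard_id = zlib.crc32(key.encode("utf-8")) % num_servers
        shards[shard_id].append((key, value))
    return shards


def local_sum(shard):
    """Work done independently on one server: sum the values it holds."""
    return sum(value for _, value in shard)


# Made-up sample data for illustration.
records = [("sensor-a", 3), ("sensor-b", 5), ("sensor-c", 7), ("sensor-a", 2)]

shards = partition(records, num_servers=3)
partials = [local_sum(shard) for shard in shards.values()]  # the "divide" step
total = sum(partials)                                       # the "conquer" step
```

Adding servers in this sketch only changes `num_servers`; the per-server work stays the same, which is the property that lets a partitioned workload grow with the data.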
Clearly, the very nature of the cloud makes it an ideal computing environment for big data. So how might you use big data together with the cloud? Here are some examples:
IaaS in a public cloud: In this scenario, you would be using a public cloud provider’s infrastructure for your big data services because you don’t want to use your own physical infrastructure. With IaaS, you can create virtual machines with almost limitless storage and compute power. You can pick the operating system you want, and you have the flexibility to dynamically scale the environment to meet your needs.
PaaS in a private cloud: PaaS is an entire infrastructure packaged so that it can be used to design, implement, and deploy applications and services in a public or private cloud environment. PaaS enables an organization to leverage key middleware services without having to deal with the complexities of managing individual hardware and software elements.
PaaS vendors are beginning to incorporate big data technologies such as Hadoop and MapReduce into their PaaS offerings. For example, you might want to build a specialized application to analyze vast amounts of medical data. The application would make use of real-time as well as non-real-time data. It’s going to require Hadoop and MapReduce for storage and processing.
SaaS in a hybrid cloud: Here you might want to analyze “voice of the customer” data from multiple channels. Many companies have come to realize that one of their most important data sources is what customers think and say about the company. Getting access to voice of the customer data can provide invaluable insights into behaviors and actions. Increasingly, customers are “vocalizing” on public sites.
The value of the customers’ input can be greatly enhanced by incorporating this public data into your analysis.
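For readers unfamiliar with MapReduce, which the PaaS example above relies on for processing, the following is a minimal in-memory sketch of the programming model in Python. It does not use Hadoop or any Hadoop API; the function names and the two sample notes are hypothetical, and a real job would run the map and reduce steps in parallel across a cluster rather than in one process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up sample text standing in for the medical data in the example.
notes = ["patient reports mild fever", "fever resolved patient discharged"]
counts = word_count(notes)
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is why the model suits the distributed clusters the cloud provides.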