Enterprise Architecture for Big Data
Put in perspective, the goal of designing an architecture for data analytics comes down to building a framework for capturing, sorting, and analyzing big data in order to discover actionable results.
There is no one correct way to design the architectural environment for big data analytics. However, most designs need to meet the following requirements to support the challenges big data can bring. These criteria can be distributed mainly over six layers and can be summarized as follows:
- Your architecture should include a big data platform for storage and computation, such as Hadoop or Spark, which is capable of scaling out.
- Your architecture should include large-scale software and big data tools capable of analyzing, storing, and retrieving big data. These can consist of the components of Spark or the components of the Hadoop ecosystem (such as Mahout and Apache Storm). You might also want to adopt a large-scale big data tool for the data scientists in your business to use. Such tools include Radoop from RapidMiner, IBM Watson, and many others.
- Your architecture should support virtualization. Virtualization is an essential element of cloud computing because it allows multiple operating systems and applications to run at the same time on the same server. Because of this capability, virtualization and cloud computing often go hand in hand. You might also adopt a private cloud in your architecture. A private cloud offers the same architecture as a public cloud, except that its services are restricted to a certain group of users, typically behind a firewall. Amazon Elastic Compute Cloud (Amazon EC2) is one of the major cloud computing services for businesses; it can scale as they grow, and it can be isolated in a private network through Amazon Virtual Private Cloud (VPC).
- Your architecture might have to offer real-time analytics if your enterprise is working with fast data (data that is flowing in streams at a fast rate). In that scenario, you need to consider an infrastructure that can derive insights from data in near real time without waiting for the data to be written to disk. For example, Apache Spark’s streaming library can be combined with other components to support analytics on fast data streams.
- Your architecture should account for big data security by creating a system of governance around granting access to the data and the results. The big data security architecture should be in line with the standard security practices and policies in your organization that govern access to data sources.
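To make the real-time requirement in the list above more concrete, here is a minimal sketch in plain Python of deriving a running insight from a stream while keeping the data in memory rather than writing it to disk first. It is not Spark code; the class name `SlidingWindowAverage`, the three-second window, and the sensor readings are all assumptions made up for illustration.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the last `window_seconds` of events, held in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the time window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


# Hypothetical sensor readings: (seconds since start, value).
stream = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
window = SlidingWindowAverage(window_seconds=3)
averages = [window.add(ts, value) for ts, value in stream]
print(averages)
```

A streaming engine applies the same idea at scale: each arriving record updates a windowed result immediately, so the insight is available as the data flows in.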
If you’re looking for a robust tool to help you get started on data analytics without the need for expertise in the algorithms and complexities behind building predictive models, then you should try KNIME, RapidMiner, or IBM Watson, among others.
Most of the preceding tools offer a comprehensive, ready-to-use toolbox with capabilities that can get you started. For example, RapidMiner has a large number of algorithms for the different stages of the predictive analytics lifecycle, so it provides a straightforward path to quickly combining and deploying analytics models.
With RapidMiner, you can quickly load and prepare your data, create and evaluate predictive models, use data processes in your applications and share them with your business users. With very few clicks, you can easily build a simple predictive analytics model.
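The load, prepare, model, and evaluate steps that such a tool automates can be sketched by hand in a few lines of Python. This is not RapidMiner's API; it is an illustrative sketch using a simple nearest-centroid classifier, and the customer records, field names, and the 6/2 train/test split are all hypothetical.

```python
# Hypothetical customer records as text: (monthly_spend, support_calls, churned).
raw_rows = [
    ("20.0", "5", "yes"), ("85.0", "0", "no"), ("30.0", "4", "yes"),
    ("90.0", "1", "no"), ("25.0", "6", "yes"), ("70.0", "1", "no"),
    ("35.0", "5", "yes"), ("80.0", "0", "no"),
]


def prepare(rows):
    """Convert text fields into numeric features and a boolean label."""
    return [((float(spend), float(calls)), label == "yes")
            for spend, calls, label in rows]


def centroid(points):
    """Average each feature across a list of feature tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def train(examples):
    """Build a nearest-centroid model: one mean point per class label."""
    return {
        label: centroid([features for features, y in examples if y == label])
        for label in (True, False)
    }


def predict(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: distance(model[label]))


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, features) == y for features, y in examples)
    return hits / len(examples)


data = prepare(raw_rows)                      # load and prepare
train_set, test_set = data[:6], data[6:]      # hold out the last two rows
model = train(train_set)                      # create the model
score = accuracy(model, test_set)             # evaluate on unseen rows
print(f"held-out accuracy: {score:.2f}")
```

In a graphical tool, each of these functions corresponds roughly to one operator that you would drop onto the canvas and connect to the next.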
RapidMiner can be used by both beginners and experts. RapidMiner Studio is open-source predictive analytics software with an easy-to-use graphical interface, where you can drag and drop components for data loading, data preprocessing, predictive analytics algorithms, and model evaluation to build your data analytics process.
RapidMiner was built to provide data scientists with a comprehensive toolbox that consists of more than a thousand different operations and algorithms. The data can be loaded quickly, regardless of whether your data source is in Excel, Access, MS SQL, MySQL, SPSS, Salesforce, or any other format that is supported by RapidMiner. In addition to data loading, predictive model building and model evaluation, this tool also provides you with data visualization tools that include adjustable self-organizing maps and 3-D graphs.
RapidMiner offers an open extension application programming interface (API) that allows you to integrate your own algorithms into any pipeline built in RapidMiner. It’s also compatible with many platforms and can run on major operating systems. There is an emerging online community of data scientists who use RapidMiner, where they can share their processes and ask and answer questions.
Another easy-to-use tool that is widely used in the analytics world is KNIME. KNIME stands for the Konstanz Information Miner. It’s an open-source data analytics tool that can help you build predictive models through a data pipelining concept. The tool offers drag-and-drop components for ETL (extraction, transformation, and loading) as well as components for predictive modeling and data visualization.
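The data pipelining concept can be illustrated with a small Python sketch in which each ETL step is a separate node and the output of one node feeds the next. This is not KNIME code; the function names, the CSV layout, and the `run_pipeline` helper are assumptions introduced only for illustration.

```python
import csv
import io
from functools import reduce


def extract(csv_text):
    """Extraction node: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transformation node: drop rows with no amount and normalize the fields."""
    return [
        {"region": row["region"].strip().lower(), "amount": float(row["amount"])}
        for row in rows
        if row["amount"].strip()
    ]


def load(rows):
    """Loading node: aggregate rows into an in-memory table keyed by region."""
    table = {}
    for row in rows:
        table[row["region"]] = table.get(row["region"], 0.0) + row["amount"]
    return table


def run_pipeline(data, nodes):
    """Pass data through each node in order, like connected nodes in a workflow."""
    return reduce(lambda result, node: node(result), nodes, data)


# Hypothetical input data with inconsistent casing and one missing amount.
source = "region,amount\nNorth,10.5\nsouth,4\nnorth,\nNorth ,2.5\n"
totals = run_pipeline(source, [extract, transform, load])
print(totals)
```

Because each node only depends on the output of the previous one, nodes can be swapped or reordered without touching the rest of the pipeline, which is what makes the drag-and-drop style workable.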
KNIME and RapidMiner are tools that you can arm your data science team with so that it can easily get started building predictive models. For an excellent use case on KNIME, check out the paper “The Seven Techniques for Dimensionality Reduction.”
RapidMiner Radoop is a product by RapidMiner that extends the predictive analytics toolbox of RapidMiner Studio to run on Hadoop and Spark environments. Radoop encapsulates MapReduce, Pig, Mahout, and Spark. After you define your workflows in Radoop, the instructions are executed in the Hadoop or Spark environment, so you don’t have to program predictive models yourself and can focus instead on model evaluation and the development of new models.
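To give a sense of the kind of programming that such a tool hides from you, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It does not use Hadoop or Radoop; the function names and the word-count task are assumptions chosen for illustration, and a real cluster would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records.
records = ["big data tools", "Big data platform"]
mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```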
For security, Radoop supports Kerberos authentication and integrates with Apache Ranger and Apache Sentry.