Characteristics of a Big Data Analysis Framework
Even though new sets of tools continue to become available to help you manage and analyze your big data environment more effectively, no single tool may give you everything you need. In addition, a range of technologies can support big data analysis requirements such as availability, scalability, and high performance. Some of these include big data appliances, columnar databases, in-memory databases, nonrelational databases, and massively parallel processing engines.
So, what are business users looking for when it comes to big data analysis? The answer to that question depends on the type of business problem they are trying to solve. Some important considerations as you select a big data analysis application framework include the following:
Support for multiple data types: Many organizations are incorporating, or expect to incorporate, all types of data as part of their big data deployments, including structured, semi-structured, and unstructured data.
Handle batch processing and/or real-time data streams: Analysis of real-time data streams supports an action orientation, while a decision orientation can be adequately served by batch processing. Some users will require both as their forms of analysis evolve.
Utilize what already exists in your environment: To get the right context, it may be important to leverage existing data and algorithms in the big data analysis framework.
Support NoSQL and other newer forms of accessing data: While organizations will continue to use SQL, many are also looking at newer forms of data access to support faster response times or faster times to decision.
Deliver low latency: If you’re going to be dealing with high data velocity, you’re going to need a framework that can support the requirements for speed and performance.
Provide cheap storage: Big data means potentially lots of storage — depending on how much data you want to process and/or keep.
Integrate with cloud deployments: The cloud can provide storage and compute capacity on demand. More and more companies are using the cloud as an analysis sandbox. Increasingly, companies are also integrating existing systems with cloud deployments in a hybrid model.
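The batch-versus-stream distinction above can be illustrated with a minimal sketch in plain Python (no particular big data framework assumed; the function and class names here are illustrative, not part of any product). A batch job computes its answer once, after all the data has arrived, while a streaming consumer keeps an up-to-date answer as each event arrives.

```python
# Generic illustration of batch vs. stream processing styles.
# Not tied to any specific framework; names are hypothetical.

def batch_average(readings):
    """Batch style: process the complete data set in one pass."""
    return sum(readings) / len(readings)

class StreamingAverage:
    """Stream style: update a running result as each event arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        """Fold one new event into the running average and return it."""
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [10.0, 20.0, 30.0, 40.0]

# Decision orientation: one answer after all the data is in.
print(batch_average(readings))

# Action orientation: an answer is available after every event.
stream = StreamingAverage()
for r in readings:
    current = stream.observe(r)
print(current)
```

Both styles converge on the same final value here; the difference is that the streaming version can trigger an action after any individual event, which is what makes it suitable for high-velocity data.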
While all these characteristics are important, the perceived and actual value of creating applications from a framework is quicker time to deployment. With all these capabilities in mind, consider a big data analysis application framework from a company called Continuity.
The Continuity AppFabric is a framework supporting the development and deployment of big data applications. The AppFabric itself is a set of technologies specifically designed to abstract away the vagaries of low-level big data technologies. The application builder is an Eclipse plug-in permitting the developer to build, test, and debug locally and in familiar surroundings.
AppFabric capabilities include the following:
Stream support for real-time analysis and reaction
Unified API, eliminating the need to write directly to low-level big data infrastructures
Query interfaces for simple results and support for pluggable query processors
Data sets representing queryable data and tables accessible from the Unified API
Reading and writing of data independent of input or output formats or underlying component specifics
Transaction-based event processing
Multimodal deployment to a single node or the cloud
This approach is going to gain traction for big data application development primarily because of the plethora of tools and technologies required to create a big data environment.
Lack of collaboration can be costly in many ways. Very often, people doing similar work are unaware of each other’s efforts, leading to duplicated work. Large organizations can benefit from tools that drive collaboration.
Another good example of an application framework is Open Chorus. In addition to supporting rapid development of big data analysis applications, it supports collaboration and provides many other features important to software developers, such as tool integration, version control, and configuration management.
Open Chorus is a project maintained by EMC Corporation and is available under the Apache 2.0 license. EMC also produces and supports a commercial version of Chorus. Both Open Chorus and Chorus have vibrant partner networks as well as a large set of individual and corporate contributors.
Open Chorus is a generic framework. Its leading feature is the capability to create a communal hub for sharing big data sources, insights, analysis techniques, and visualizations. Open Chorus provides the following:
Repository of analysis tools, artifacts, and techniques with complete versioning, change tracking, and archiving
Workspaces and sandboxes that are self-provisioned and easily maintained by community members
Visualizations, including heat maps, time series, histograms, and so on
Federated search of any and all data assets, including Hadoop, metadata, SQL repositories, and comments
Collaboration through social networking–like features encouraging discovery, sharing, and brainstorming
Extensibility for integration of third-party components and technologies
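As a small illustration of the visualization item in the list above, the computation behind a histogram (binning values into equal-width intervals and counting them) can be sketched in a few lines of plain Python. This is generic, illustrative code under the usual equal-width-bin convention, not Open Chorus functionality.

```python
# Generic histogram binning -- the computation underlying a histogram
# visualization. Illustrative only; not Open Chorus code.

def histogram(values, bin_count, low, high):
    """Count how many values fall into each of bin_count equal-width
    bins spanning [low, high]. Values outside the range are ignored."""
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for v in values:
        if low <= v < high:
            counts[int((v - low) / width)] += 1
        elif v == high:
            counts[-1] += 1  # include the upper edge in the last bin
    return counts

print(histogram([1, 2, 2, 3, 7, 9, 10], bin_count=5, low=0, high=10))
```

The same binned counts would typically be handed to a charting layer for display; the point here is only that the underlying aggregation is simple and easy to share or version alongside other analysis artifacts.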