Ensuring Data Science Strategy Success: Essential Technologies for a Modern Data Architecture

By Ulrika Jägare

Your data science strategy will have a higher likelihood of success if you take the time to implement a modern data architecture. The drive today is to refactor the enterprise technology platform to enable faster, easier, more flexible access to large volumes of precious data. This refactoring is no small undertaking and is usually sparked by a shifting set of key business drivers. Simply put, the data architectures that have dominated enterprise IT for nearly 30 years can no longer handle the workloads needed to drive data-driven businesses forward.

Organizations have long been constrained in their use of data by incompatible formats, limitations of traditional databases, and the inability to flexibly combine data from multiple sources. New technologies are now starting to deliver on the promise to change all that. Improving the deployment model of software is one major step to removing barriers to data usage. Greater data agility also requires more flexible databases and more scalable real-time streaming platforms. In fact, no fewer than seven foundational technologies are needed to deliver a flexible, real-time modern data architecture to the enterprise. These seven key technologies are described below.

Data science strategy: NoSQL databases

The relational database management system (RDBMS) has dominated the database market for nearly 30 years, yet the traditional relational database has proven less than adequate for ever-growing data volumes and the accelerated pace at which data must be handled. NoSQL databases — often glossed as “not only SQL,” because they are decidedly nonrelational — have been gaining ground because of their speed and ability to scale. They provide a mechanism for storing and retrieving data that is modeled in ways other than the tabular relations used in relational databases. Because of their speed, NoSQL databases are increasingly used in big data and real-time web applications.

NoSQL databases offer simplicity of design, simpler horizontal scaling to clusters of machines (a real problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (key-value, wide-column, graph, or document, for example) differ from those used by default in relational databases, making some operations faster in NoSQL. Whether a given NoSQL database is a good fit depends on the problem it must solve. The data structures used by NoSQL databases are also sometimes viewed as more flexible than relational database tables.
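To make the contrast between these data models concrete, here is an illustrative sketch of the same record in a key-value store and in a document store, using plain Python structures rather than any particular database product — the keys, fields, and values are all hypothetical:

```python
import json

# Key-value model: an opaque value looked up by a single key.
# The store knows nothing about what is inside the value.
kv_store = {
    "user:1001": '{"name": "Ada", "city": "London"}',
}

# Document model: nested, schema-flexible structures that can be
# queried by any field, including nested ones.
doc_store = [
    {"_id": 1001, "name": "Ada", "city": "London",
     "orders": [{"sku": "A-17", "qty": 2}]},
]

# A key-value lookup is a direct access by key...
value = json.loads(kv_store["user:1001"])

# ...while a document store can filter on field contents.
londoners = [d for d in doc_store if d["city"] == "London"]
```

The trade-off is visible even in this toy form: the key-value store is the simplest and fastest to shard across machines, while the document store supports richer queries at the cost of more work per operation.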

Data science strategy: Real-time streaming platforms

Responding to customers in real time is critical to the customer experience. It’s no mystery why consumer-facing industries — Business-to-Consumer (B2C) setups, in other words — have experienced massive disruption in the past ten years. It has everything to do with the ability of companies to react to the user in real time. Telling customers that you will have an offer ready in 24 hours is no good, because by then they will have already acted on the decision they made hours earlier. Moving to a real-time model requires event streaming.

Message-driven applications have been around for years, but today’s streaming platforms scale far better and at far lower cost than their predecessors. The recent advancement in streaming technologies opens the door to many new ways to optimize a business. Reacting to a customer in real-time is one benefit. Another aspect to consider is the benefits to development. By providing a real-time feedback loop to the development teams, event streams can also help companies improve product quality and get new software out the door faster.
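The core pattern behind such platforms can be sketched in a few lines. The following is a minimal in-memory illustration of topic-based publish/subscribe — the pattern behind systems like Apache Kafka, not their actual API — with invented topic and event names:

```python
from collections import defaultdict

class EventStream:
    """Toy event-streaming platform: append-only logs plus subscribers."""

    def __init__(self):
        self.topics = defaultdict(list)       # topic -> append-only event log
        self.subscribers = defaultdict(list)  # topic -> consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        self.topics[topic].append(event)      # persist the event to the log
        for cb in self.subscribers[topic]:    # push to consumers immediately
            cb(event)

# A consumer reacting to customer events as they arrive,
# rather than in a batch job hours later.
offers = []
stream = EventStream()
stream.subscribe("checkout", lambda e: offers.append(f"offer for {e['user']}"))
stream.publish("checkout", {"user": "ada", "total": 42.0})
```

Because every event is also appended to a log, the same stream that drives real-time reactions can later feed analytics or the development feedback loop described above.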

Data science strategy: Docker and containers

Docker is a program, usable as part of your data architecture, that performs operating-system-level virtualization, also known as containerization. First released in 2013 by Docker, Inc., Docker runs software packages called containers — a method of virtualization that packages an application’s code, configurations, and dependencies into building blocks for consistency, efficiency, productivity, and version control. Containers are isolated from one another, bundle their own application, tools, libraries, and configuration files, and can communicate with each other by way of well-defined channels.

All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. Containers are created from images that specify their precise content. A container image is a self-contained piece of software that includes everything that it needs in order to run, like code, tools, and resources.
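As a rough illustration, a container image is typically specified in a Dockerfile like the hypothetical one below for a small Python service (the base image, file names, and command are placeholders, not a prescribed setup):

```dockerfile
# Hypothetical image definition for a small Python service.
# Every dependency the container needs is declared here, so the
# resulting image runs identically on any Docker host.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "service.py"]
```

Building this file produces an image whose precise content is fixed; every container launched from it starts from the same code, tools, and resources.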

Containers hold significant benefits for both developers and operators as well as for the organization itself. The traditional approach to infrastructure isolation was static partitioning: the allocation of a separate, fixed slice of resources, like a physical server or a virtual machine, to each workload. Static partitions made it easier to troubleshoot issues, but at the significant cost of substantially underutilized hardware. Web servers, for example, would consume on average only about 10 percent of the total computational power available.

The great benefit of container technology is its ability to create a new type of isolation. Those who least understand containers might believe they can achieve the same benefits by using automation tools like Ansible, Puppet, or Chef, but in fact these technologies are missing vital capabilities. No matter how hard you try, those automation tools cannot create the isolation required to move workloads freely between different infrastructure and hardware setups. The same container can run on bare-metal hardware in an on-premises data center or in a virtual machine in the public cloud. No changes are necessary. That is what true workload mobility is all about.

Data science strategy: Container repositories

A container image repository is a collection of related container images, usually providing different versions of the same application or service. It’s critical to maintaining agility in your data architecture. Without a DevOps process with continuous delivery for building container images and a repository for storing them, each container would have to be built on every machine on which that container could run. With the repository, container images can be launched on any machine configured to read from that repository.

Where this gets even more complicated is when dealing with multiple data centers. If a container image is built in one data center, how do you move the image to another data center? Ideally, by leveraging a converged data platform, you will have the ability to mirror the repository between data centers. A critical detail here is that mirroring capabilities between on-premises and the cloud might be vastly different from those between your on-premises data centers. A converged data platform will solve this problem for you by offering those capabilities regardless of the physical or cloud infrastructure you use in your organization.

Data science strategy: Container orchestration

Instead of static hardware partitions, each container appears to be entirely its own private operating system. Unlike virtual machines, containers don’t require a static partition of compute and memory. This enables administrators to launch large numbers of containers on servers without having to worry so much about exact amounts of memory in their data architecture. With container orchestration tools like Kubernetes, it becomes easy to launch containers, kill them, move them, and relaunch them elsewhere in an environment.
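In Kubernetes, for example, launching and replicating containers comes down to declaring the desired state in a manifest. The hypothetical Deployment below (service name, image reference, and resource figures are all placeholders) asks for three replicas and lets the orchestrator decide where they run:

```yaml
# Hypothetical Kubernetes Deployment: three replicas of a containerized
# service; Kubernetes places them on whatever nodes have capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: profile-service
  template:
    metadata:
      labels:
        app: profile-service
    spec:
      containers:
      - name: profile-service
        image: registry.example.com/profile-service:1.0
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
```

Scaling up, scaling down, or relaunching elsewhere is then a matter of editing the declared state rather than provisioning machines.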

Assuming that you have the new infrastructure components in place (a document database such as MapR-DB or MongoDB, for example) and an event streaming platform (maybe MapR-ES or Apache Kafka) with an orchestration tool (perhaps Kubernetes) in place, what is the next step? You’ll certainly have to implement a DevOps process for coming up with continuous software builds that can then be deployed as Docker containers. The bigger question, however, is what you should actually deploy in those containers you’ve created. This brings us to microservices.

Data science strategy: Microservices

Microservices are a software development technique in which an application — and, by extension, your data architecture — is structured as a collection of services that

  • Are easy to maintain and test
  • Are loosely coupled
  • Are organized around business capabilities
  • Can be deployed independently

As such, microservices come together to form a microservice architecture, one that enables the continuous delivery/deployment of large, complex applications and also enables an organization to evolve its technology stack — the set of software that provides the infrastructure for a computer or a server. The benefit of breaking down an application into different, smaller services is improved modularity, which makes the application easier to understand, develop, and test, and more resilient to architecture erosion — the accumulation of violations of a system’s architecture that lead to significant problems in the system and contribute to its increasing fragility.
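To give a feel for how small such a service can be, here is a minimal sketch of a single-purpose profile service using only the Python standard library; the route shape, field names, and canned data are illustrative, not a real API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def get_profile(user_id):
    # In a real service this would query the service's own datastore;
    # here we return canned data to keep the sketch self-contained.
    return {"id": user_id, "name": "Ada"}

class ProfileHandler(BaseHTTPRequestHandler):
    """Handles one business capability: reading user profiles."""

    def do_GET(self):
        # e.g. GET /profiles/1001 -> last path segment is the user id
        user_id = self.path.rsplit("/", 1)[-1]
        body = json.dumps(get_profile(user_id)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run the service standalone:
# HTTPServer(("", 8080), ProfileHandler).serve_forever()
```

Because the service owns exactly one capability and exposes it over HTTP, it can be tested, deployed, and scaled on its own — the properties listed above.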

With a microservices data architecture, small autonomous teams can run in parallel to develop, deploy, and scale their respective services independently. It also allows the architecture of an individual service to emerge through continuous refactoring — a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior (thus ensuring that it continues to fit within the architectural setting).

The concept of microservices is nothing new. The difference today is that the enabling technologies like NoSQL databases, event streaming, and container orchestration can scale with the creation of thousands of microservices. Without these new approaches to data storage, event streaming, and infrastructure orchestration, large-scale microservices deployments would not be possible. The infrastructure needed to manage the vast quantities of data, events, and container instances would not be able to scale to the required levels.

Microservices are all about delivering agility. A service that is micro in nature generally consists of either a single function or a small group of related functions. The smaller and more focused the functional unit of work, the easier it will be to create, test, and deploy the service. These services must be decoupled, meaning you can make changes to any one service without affecting any other service. If this is not the case, you lose the agility promised by the microservices concept. Admittedly, the decoupling must not be absolute — microservices can, of course, rely on other services — but the reliance should be based on either REST APIs or event streams. (Using event streams allows you to leverage request-and-response topics so that you can easily keep track of the history of events; this approach is a major plus when it comes to troubleshooting, because the entire request flow and all the data in the requests can be replayed at any time.)
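The request/response-topic idea can be sketched briefly. In the hypothetical snippet below (topic names, offsets, and payloads are invented for illustration), every request and response is appended to a log, so the whole exchange can be replayed later from any offset:

```python
# Append-only request and response "topics", kept as plain lists here.
log = {"requests": [], "responses": []}

def handle(request):
    log["requests"].append(request)            # record the incoming request
    response = {"id": request["id"], "status": "ok"}
    log["responses"].append(response)          # record what we answered
    return response

handle({"id": 1, "op": "update-profile"})
handle({"id": 2, "op": "delete-profile"})

def replay(from_offset=0):
    # Re-run history from any offset for troubleshooting,
    # without touching live state.
    return [(req, {"id": req["id"], "status": "ok"})
            for req in log["requests"][from_offset:]]
```

Because the topics are append-only, a troubleshooting session can replay exactly what each service saw, in order, long after the original exchange.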

Data science strategy: Function as a service

Just as the microservices idea has attracted a lot of interest when it comes to data architecture, so has the rise of serverless computing — perhaps more accurately referred to as function as a service (FaaS). AWS Lambda is an example of a FaaS framework: it lets you run code without provisioning or managing servers, and you pay only for the computing time you consume.

FaaS enables the creation of microservices in such a way that the code can be wrapped in a lightweight framework built into a container, executed on demand based on some trigger, and then automatically load-balanced, thanks to the aforementioned lightweight framework. The main benefit of FaaS is that it allows the developer to focus almost exclusively on the function itself, making FaaS the logical conclusion of the microservices approach.

The triggering event is a critical component of FaaS. Without it, there’s no way for the functions to be invoked (and resources consumed) on demand. This ability to automatically invoke functions when needed is what makes FaaS truly valuable. Imagine, for a moment, that someone reading a user’s profile triggers an audit event, which invokes a function that notifies a security team. More specifically, the trigger might filter for only certain types of records that are marked as sensitive. It can be selective, in other words, which plays up the fact that, as a business function, it is completely customizable.
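The trigger-plus-filter pattern just described can be sketched in a few lines of Python. Everything here — event types, field names, and the audit example — is hypothetical, meant only to show the shape of the mechanism:

```python
# Registered triggers: (event type, filter predicate, function to invoke).
triggers = []

def on_event(event_type, fn, predicate=lambda e: True):
    """Register a function to run when a matching event occurs."""
    triggers.append((event_type, predicate, fn))

def emit(event):
    """Deliver an event; functions run on demand, only when they match."""
    for event_type, predicate, fn in triggers:
        if event["type"] == event_type and predicate(event):
            fn(event)

# Audit example: notify security only for sensitive profile reads.
notifications = []
on_event(
    "profile-read",
    lambda e: notifications.append(f"audit: {e['record']}"),
    predicate=lambda e: e.get("sensitive", False),  # selective trigger
)

emit({"type": "profile-read", "record": "u-1001", "sensitive": True})
emit({"type": "profile-read", "record": "u-1002"})  # filtered out, no cost
```

The second event never invokes the function, which is the point: resources are consumed only when the business rule says they should be.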

The magic behind a triggering service is really nothing more than working with events in an event stream. Certain types of events are used as triggers more often than others, but any event you want can be made into a trigger. The event could be a document update, which in turn might trigger an OCR process over the new document and add the text from the OCR process to a NoSQL database. The possibilities here are endless.

FaaS is also an excellent area for creative uses of machine learning — perhaps machine learning as a service or, more specifically, “a machine learning function aaS.” Consider that whenever an image is uploaded, it could be run through a machine learning framework for image identification and scoring. There’s no fundamental limitation here. A trigger event is defined, something happens, the event triggers the function, and the function does its job.

FaaS is already an important part of microservices adoption, but you must consider one major factor when approaching FaaS: vendor lock-in. The idea behind FaaS is that it’s designed to hide the specific storage mechanisms, the specific hardware infrastructure, and the software component orchestration — all great features, if you’re a software developer. But because of this abstraction, a hosted FaaS offering is one of the greatest vendor lock-in opportunities the software industry has ever seen. Because the APIs aren’t standardized, migrating from one FaaS offering in the public cloud to another is difficult without throwing away a substantial part of the work that has been performed. If FaaS is approached in a more methodical way — by leveraging events from a converged data platform, for example — it becomes easier to move between cloud providers.