How to Create a Modern Data Architecture For Your Data Science Strategy - dummies

By Ulrika Jägare

In many larger companies, the IT function is usually tasked with defining and building the data architecture, especially for data generated by internal IT systems. Often, however, data coming from external sources (customers, products, or suppliers) is stored and managed separately by the responsible business units. When that’s the case, you’re faced with the challenge of making sure that all parties share a common data architecture approach, one that enables all these different data types and user needs to come together by means of an efficient and enabling data pipeline.

Data architecture for data science strategy

This data pipeline is all about ensuring an end-to-end flow of data, where applied data management and governance principles strike a balance between user efficiency and compliance with relevant laws and regulations.

In smaller companies or modern data-driven enterprises, the IT function is usually highly integrated with the various business functions, which includes working closely with data engineers in the business units in order to minimize the gap between IT and the business functions. This approach has proven very efficient.

So, after you decide which function will set up and drive which part of the data architecture, it’s time to get started. Using the step-by-step guide provided in this list, you’ll be on your way to data-architecture perfection in no time:

  1. Identify your use cases as well as the necessary data for those use cases.

    The first step to take when starting to build your data architecture is to work with business users to identify the use cases and types of data that are either the most relevant or simply the highest priority at that time. Remember that the purpose of a good data architecture is to bring together the business and technology sides of the company to ensure that they’re working toward a common purpose. To find the most valuable data for your company, look for the data that could generate insights with high business impact. This data may reside within enterprise data environments and might have been there for some time, but perhaps the means and technologies to unearth such data and draw insights from it have been too expensive or insufficient. The availability of today’s open source technologies and cloud offerings enables enterprises to pull out such data and work with it in a much more cost-effective and simplified way.

  2. Set up data governance.

    It is of the utmost importance that you make data governance activities a priority. The process of identifying and ingesting data, as well as building models for your data, needs to ensure quality and relevance from a business perspective, and it should also include efficient control mechanisms as part of the system support. Responsibility for data must also be established, whether it concerns individual data owners or different data science functions.

  3. Build your data architecture for flexibility.

    The rule here is that you should build data systems designed to change, not ones designed to last. In practice, that means not building in a dependency on a particular technology or solution. If a new key solution or technology becomes available on the market, the architecture should be able to accommodate it. The types of data coming into enterprises can change, as can the tools and platforms that are put into place to handle them. The key is therefore to design a data environment that can accommodate such change.

  4. Decide on techniques for capturing data.

    You need to consider your techniques for acquiring data, and you especially need to make sure that your data architecture can at some point handle real-time data streaming, even if it isn’t an absolute requirement from the start. A modern data architecture needs to be built to support the analysis of data and its movement to decision makers when and where it’s needed.

    Focus on real-time data uploads from two perspectives: the need to facilitate real-time access to data (data that could be historical) as well as the requirement to support data from events as they’re occurring. For the first category, existing infrastructure such as data warehouses has a critical role to play. For the second, new approaches such as streaming analytics and machine learning are critical. Data may be coming from anywhere: transactional applications, sensors across various connected devices, mobile devices, telecommunications equipment, and who-knows-where-else. A modern data architecture needs to support data movement at all speeds, whether it’s sub-second speeds or with 24-hour latency.

  5. Apply the appropriate data security measures to your data architecture.

    Do not forget to build security into your data architecture. A modern data architecture recognizes that threats to data security are continually emerging, both externally and internally. These threats are constantly evolving and may be coming through email one month and through flash drives the next. Data managers and data architects are usually the most knowledgeable when it comes to understanding what is required for data security in today’s environments, so be sure to utilize their expertise.

  6. Integrate master data management.

    Make sure that you address master data management, the method used to define and manage the critical data of an organization to provide, with the help of data integration, a single point of reference. With an agreed-on and built-in master data management (MDM) strategy, your enterprise is able to have a single version of the truth that synchronizes data to applications accessing that data. The need for an MDM-based architecture is critical because organizations are constantly going through changes, including growth, realignments, mergers, and acquisitions. Enterprises often end up with data systems running in parallel, and critical records and information may be duplicated and overlap across these silos. MDM ensures that applications and systems across the enterprise have the same view of important data.

  7. Offer data as a service (aaS).

    This particular step is a relatively new approach, but it has turned out to be quite a successful component: make sure that your data architecture is able to position data as a service (aaS). Many enterprises have a range of databases and legacy environments, making it challenging to pull information from various sources. With the aaS approach, access is enabled through a virtualized data services layer that standardizes all data sources, regardless of device, application, or system. Data as a service is by definition a form of internal company cloud service, where data, along with different data management platforms, tools, and applications, is made available to the enterprise as reusable, standardized services. The potential advantage of data as a service is that processes and assets can be prepackaged based on corporate or compliance standards and made readily available within the enterprise cloud.

  8. Enable self-service capabilities.

    As the final step in building your data architecture, you should definitely invest in self-service environments. With self-service, business users can configure their own queries and get the data or analyses they want, or they can conduct their own data discovery without having to wait for their IT or data management departments to deliver the data. The route to self-service is providing front-end interfaces that are simply laid out and easy to use for your target audience. In the process, a logical service layer can be developed that can be reused across various projects, departments, and business units. IT could still have an important role to play in a self-service-enabled architecture, including aspects such as data pipeline operations (hardware, software, and cloud) and data governance control mechanisms, but it would have to spend less and less of its time and resources on fulfilling user requests that could be better formulated and addressed by the users themselves.
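
To make Steps 3, 6, and 7 a little more concrete, here is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not a reference implementation: the source classes (`CrmSource`, `BillingSource`), the `CustomerDataService` layer, the sample records, and the choice of a normalized email address as the master key are all hypothetical. The idea is that each system sits behind a small, technology-neutral interface (flexibility), overlapping records are merged into one master record (master data management), and consumers only ever talk to a single standardized service (data as a service).

```python
from __future__ import annotations

from typing import Iterable, Protocol


class DataSource(Protocol):
    """Technology-neutral contract: any system that can yield records as dicts."""

    def fetch(self) -> Iterable[dict]: ...


class CrmSource:
    """Hypothetical stand-in for a CRM system (hard-coded sample data)."""

    def fetch(self) -> Iterable[dict]:
        return [{"email": "Ann@Example.com ", "name": "Ann Lee", "phone": None}]


class BillingSource:
    """Hypothetical stand-in for a billing system (hard-coded sample data)."""

    def fetch(self) -> Iterable[dict]:
        return [
            {"email": "ann@example.com", "name": "A. Lee", "phone": "555-0100"},
            {"email": "bob@example.com", "name": "Bob Roy", "phone": None},
        ]


def master_key(record: dict) -> str:
    """Assumed matching rule: records with the same normalized email are one customer."""
    return record["email"].strip().lower()


class CustomerDataService:
    """Single access point that hides which systems the data comes from."""

    def __init__(self, sources: list[DataSource]):
        self._sources = sources

    def master_records(self) -> dict[str, dict]:
        """Merge duplicates across sources into one record per master key."""
        merged: dict[str, dict] = {}
        for source in self._sources:
            for record in source.fetch():
                key = master_key(record)
                golden = merged.setdefault(key, {"email": key})
                for field, value in record.items():
                    if field == "email":
                        continue
                    # Simple survivorship rule: the first non-empty value wins,
                    # so earlier sources take priority over later ones.
                    if value is not None and golden.get(field) is None:
                        golden[field] = value
        return merged

    def get_customer(self, email: str) -> dict | None:
        """Look up one master record by email, in any casing."""
        return self.master_records().get(email.strip().lower())


if __name__ == "__main__":
    service = CustomerDataService([CrmSource(), BillingSource()])
    print(service.get_customer("ANN@example.com"))
```

Because the service depends only on the `fetch` contract, a new system could in principle be added or an old one retired by changing the list passed to `CustomerDataService`, without touching the consumers. A real deployment would need far more careful matching and survivorship rules than "first non-empty value wins."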