Ulrika Jägare

Ulrika Jägare, M.Sc., is a Director at Ericsson AB. With a decade of experience in analytics and machine intelligence and 19 years in telecommunications, she has held leadership positions in R&D and product management. Ulrika was key to Ericsson's Machine Intelligence strategy and the recent launch of the Ericsson Operations Engine, a new data- and AI-driven operational model for network operations in telecommunications.

Articles From Ulrika Jägare

10 results
Data Science Techniques You Can Use for Successful Change Management

Article / Updated 06-30-2019

For your data science investment to succeed, the data science strategy you adopt should include well-thought-out strategies for managing the fundamental change that data science solutions impose on an organization. One effective and efficient way to tackle these data science challenges is by using data-driven change management techniques to drive the transformation itself — in other words, drive the change by "practicing what you preach." Here are some examples of how to do this in practice.

Using digital engagement tools for change management

For companies, there is a new generation of real-time employee opinion tools that are starting to replace old-fashioned employee opinion surveys. These tools can help you manage your data and tell you far more than what employees are thinking about once a year. In some companies, employees are surveyed weekly using a limited number of questions. The questions and models are constructed in such a way that management can follow fluctuations in important metrics as they happen, rather than once or twice a year. These tools have obvious relevance for change management and can help answer questions like these:

- Is a change being equally well received across locations?
- Are certain managers better than others at delivering messages to employees?

Assume that you have a large travel-and-tourism firm that is using one of these tools for real-time employee feedback. One data-driven approach to use in such a situation is to experiment with different change management strategies within selected populations in the company. After a few changes in the organization, you can use the data collected to identify which managers prove to be more effective in leading change than others. After that has been established, you can observe those managers to determine what they're doing differently in their change management approach. You can then share successful techniques with other managers.

This type of real-time feedback offers an opportunity to learn rapidly how communication events or engagement tactics have been received, thus optimizing your actions in days rather than in weeks, which is typical of traditional approaches. The data can then feed into a predictive model, helping you determine with precision which actions will help accelerate adoption of a new practice, process, or behavior by a given employee group.

You can find some commercial tools out there — CultureIQ polls, for example — that support this kind of data collection. These kinds of polls sample groups of employees daily or weekly via a smartphone app to generate real-time insights in line with whatever scope you have defined. Another tool, Waggl.com, has more advanced functionality, allowing you to have an ongoing conversation with employees about a change effort as well as allowing change managers to tie this dialogue to the progress of initiatives they're undertaking.

These different types of digital engagement tools can have a vast impact on change management programs, but the data stream they create could be even more important. The data that's generated can be used to build predictive models of change. Using and deploying these models on real transformation projects and then sharing your findings helps to ensure a higher success rate with data-driven change initiatives in the future.
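To make the idea of a predictive model of change adoption more concrete, here is a minimal sketch in Python. It assumes a hypothetical pulse-survey export with made-up columns (engagement, manager_score, comms_frequency, adopted); treat it as an illustration of the approach rather than a finished pipeline.

```python
# Minimal sketch: predict whether an employee group adopts a new practice,
# based on hypothetical weekly pulse-survey data. The file name and column
# names are illustrative assumptions, not a real schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

surveys = pd.read_csv("pulse_surveys.csv")  # hypothetical pulse-survey export

features = ["engagement", "manager_score", "comms_frequency"]
X_train, X_test, y_train, y_test = train_test_split(
    surveys[features], surveys["adopted"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# How well do the engagement signals predict adoption of the new practice?
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Which signals are associated with faster adoption? (Association, not causation.)
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

A model like this only suggests associations; pairing it with the kind of controlled experiments described above is what turns those associations into actionable change tactics.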
Applying social media analytics to identify stakeholder sentiment for change management

Change managers can also look beyond the boundaries of the enterprise for insights about the impact of change programs. Customers, channel partners, suppliers, and investors are all key stakeholders when it comes to change programs. They are also more likely than employees to comment on social media about changes a company is making, thus giving potentially vital insight into how they're responding.

Ernst & Young (now known as EY) is using a tool for social media analytics called SMAART, which can interpret sentiment within consumer and influencer groups. In a project for a pharmaceutical company, EY was able to isolate the specific information sources that drove positive and negative sentiment toward the client's brand. The company is now starting to apply these techniques to understand the external impact of change management efforts, and it's a simple leap to extend these techniques within the enterprise.

Advances in the linguistic analysis of texts mean that clues about behavior can now be captured from a person's word choices; even the use of articles and pronouns can help reveal how someone feels. Applying sentiment analysis tools to data in anonymized company email or the dialogue in tools like Waggl.com can give fresh insight into your organization's change readiness and the reactions of employees to different initiatives. And the insights gained from analyzing internal communication will be stronger when combined with external social media data.

Capturing reference data in change projects

Have you ever worked in an organization where different change management programs or projects were compared to one another in terms of how efficiently they made the change happen? Or one where a standard set of measurements was used across different change initiatives? No? Most people haven't. Why is it that organizations often seem obsessed with measuring fractional shifts in operational performance and with capturing data on sales, inventory turns, and manufacturing efficiency, but show no interest in tracking performance differences between change projects, beyond knowing which ones have met their goals?

Some people may claim that you can't compare change projects or change management within an organization; it would be like comparing apples to oranges. But that's not accurate: Different projects may have unique features, but you'll find more similarities than differences between different types of projects. Capturing information about the team involved, the population engaged in the change, how long it took to implement, what tactics were used, and so on is a good idea. It enables you to build a reference data set for future learning, reuse, and efficiency benchmarking. Remember that although it may not yield immediate benefit, as the overall data set grows it will make it easier to build accurate predictive models of organizational change going forward.

Using data science to select people for change roles

For quite a long time, companies have been using data-driven methods to select candidates for senior change management positions. And today some businesses, such as retailers, are starting to use predictive analytics for hiring frontline staff. Applying these tools when building a change team can both improve project performance significantly and help to build another new data set.
If every change leader and team member underwent testing and evaluation before a change project starts, that data could provide important variables to include as you search for an underlying model of what leads to a successful change program. This can even be extended to more informal change roles, allowing organizations to optimize selection based on what they know about successful personalities for these types of roles. Along these lines, the California start-up LEDR Technologies is pioneering techniques to predict team performance. It integrates data sources and uses them to help teams anticipate the challenges they may face with team dynamics so that the team can prevent them before they occur.

Automating change metrics

Picture a company or an organization that has a personalized dashboard it has developed in partnership with the firm's leadership team — one that reflects the company's priorities, competitive position, and future plans. These dashboards should also be used to offer insights related to the different transformation investments you've made. Keep in mind that much of the data that can act as interesting indicators for change management is already available today — it's just not being collected.

When a company builds a dashboard for identifying recruitment and attrition, it's teaching the executive team to use data to make people-related decisions. However, it can take quite some time to set it up correctly and iron out the bugs. Want a suggestion? Don't wait. Start building these types of dashboards as soon as possible and, where possible, automate them. Why the automation? Change dashboards are vulnerable to version control issues, human error, and internal politics. Automating data management and dashboard generation makes the process more transparent and helps you maintain data integrity.
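As a concrete illustration of automating the data behind such a dashboard, here is a minimal Python sketch. The input file, its columns (month, hires, exits, headcount), and the output path are assumptions made up for the example, not a real HR system.

```python
# Minimal sketch: regenerate change-dashboard metrics from an HR extract so the
# dashboard never depends on manually edited spreadsheets. File and column names
# are illustrative assumptions.
import pandas as pd

hr = pd.read_csv("hr_extract.csv", parse_dates=["month"])

monthly = hr.groupby(hr["month"].dt.to_period("M")).agg(
    hires=("hires", "sum"),
    exits=("exits", "sum"),
    headcount=("headcount", "last"),
)
monthly["attrition_rate"] = monthly["exits"] / monthly["headcount"]

# Write a dashboard-ready file; a scheduler (cron, Airflow, and so on) can rerun
# this script so the numbers stay current and free of manual copy-paste errors.
monthly.to_csv("change_dashboard_metrics.csv")
print(monthly.tail())
```

Rerunning a script like this on a schedule is one simple way to get the transparency and data integrity described above.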

Current Trends in Data

Article / Updated 06-30-2019

Big data was definitely the thing just a couple of years ago, but now there's much more of a buzz around the idea of data value — more specifically, how analysis can turn data into value. The following information examines some of the trends related to utilizing data to capture new value.

Data monetization

One trend in data that has taken hold is monetization. Monetizing data refers to how companies can utilize their domain expertise to turn the data they own or have access to into real, tangible business value or new business opportunities. Data monetization can refer to the act of generating measurable economic benefits from available data sources by way of analytics, or, less commonly, it may refer to the act of monetizing data services. In the case of data analytics, typically these benefits appear as revenue or cost savings, but they may also include market share or corporate market value gains.

One could argue that data monetization for increased company revenue or cost savings is simply the result of being a data-driven organization. Though that argument isn't totally wrong, company leaders are taking an increasing interest in exploring how data monetization can drive the innovation of entirely new business models in various business segments. One good example of how this process can work is when telecom operators sell data on the positions of rapidly forming clusters of users (picture the conclusion of a sporting event or a concert by the latest YouTube sensation) to taxi companies. This allows taxis to be proactively available in the right area when they will most likely be needed. This is a completely new type of business model and customer base for a traditional telecom operator, opening up new types of business and revenue based on available data.

Responsible AI

AI (artificial intelligence) has become one of the leading data trends in recent years. Responsible AI systems are characterized by transparency, accountability, and fairness, where users have full visibility into which data is being used and how. It also assumes that companies communicate the possible consequences of using the data, including both potential positive and negative impact. Responsible AI is also about generating customer and stakeholder trust by following communicated policies and principles over time, including the ability to maintain control over the AI system environment itself. Strategically designing your company's data science infrastructure and solutions with responsible AI in mind is not only wise, but could also turn out to be a real business differentiator going forward.

Just look at how the opposite approach, taken by Facebook and Cambridge Analytica, turned into a scandal that ended with Cambridge Analytica going out of business. You might remember that Cambridge Analytica gained access to the private and personal information of more than 50 million Facebook users in the US and then offered tools that could use that data to identify the personalities of American voters and influence their behavior. Facebook, rather than being hacked, was a willing participant in allowing its users' data to be used for other purposes without explicit user consent. The data included details on users' identities, friend networks, and "likes." The idea was to map personality traits based on what people had liked on Facebook, and then use that information to target audiences with digital ads.
Facebook has also been accused of spreading Russian propaganda and fake news, which, together with the Cambridge Analytica incident, has severely damaged the Facebook brand over the last couple of years. This type of severe privacy invasion has not only opened many people's eyes to how their data is being used but has also hurt the brands of the companies involved.

Cloud-based data architectures

Cloud-based computing is a data trend that is sweeping the business world. More and more companies are moving away from on-premises data infrastructure investments toward virtualized and cloud-based data architectures. The driving force behind this move is that traditional data environments are feeling the pressure of increasing data volumes and are unable to scale up and down to meet constantly changing demands. On-premises infrastructure simply lacks the flexibility to dynamically optimize and address the challenges of new digital business requirements. Re-architecting these traditional, on-premises data environments for greater access and scalability provides data platform architectures that seamlessly integrate data and applications from various sources. Using cloud-based compute and storage capacity enables a flexible layer of artificial intelligence and machine learning tools to be added as a top layer in the architecture, so that you can accelerate the value obtained from large amounts of data.

Computation and intelligence at the edge

Let's take a look at a truly edgy data trend. Edge computing describes a computing architecture in which data processing is done closer to where the data is created — Internet of Things (IoT) devices like connected luggage, drones, and connected vehicles like cars and bicycles, for example. There is a difference between pushing computation to the edge (edge compute) and pushing analytics or machine learning to the edge (edge analytics or machine learning at the edge). Edge compute can be executed as a separate task at the edge, allowing data to be preprocessed in a distributed manner before it's collected and transferred to a central or semi-centralized environment, where analytics methods or machine learning/artificial intelligence technologies are applied to achieve insights. Just remember that running analytics and machine learning at the edge requires some form of edge compute to also be in place to allow the insight and action to happen directly at the edge.

The trend toward executing more at the edge is mainly driven by factors such as connectivity limitations and low-latency use cases where millisecond response times are needed to perform an immediate analysis and make a decision (in the case of self-driving cars, for example). A final reason for executing more at the edge is bandwidth constraints on transferring data to a central point for analysis. Strategically, computing at the edge is an important aspect to consider from an infrastructure-design perspective, particularly for companies with significant IoT elements. When it comes to infrastructure design, it's also worth considering how the edge compute and intelligence solutions will work with the centralized (usually cloud-based) architecture. Many view cloud and edge as competing approaches, but cloud is a style of computing in which elastically scalable technology capabilities are delivered as a service, offering a supporting environment for the edge part of the infrastructure.
Not everything, however, can be solved at the edge; many use cases and needs are system- or network-wide and therefore need a higher-level aggregation in order to perform the analysis. Just performing the analysis at the edge might not give enough context to make the right decision. Those types of computational challenges and insights are best solved in a cloud-based, centralized model. The cloud setup can also be done in a decentralized manner, and these decentralized instances are referred to as cloud edge. For a larger setup on a regional or global scale, the decentralized model can be used to support edge implementations at the IoT device level in a certain country, or to support a telecom operator in its efforts to include all connected devices in the network. This is useful for keeping response times low and for not moving raw data across country borders.

Digital twins

This particular trend in data will have you seeing double. A digital twin refers to a digital representation of a real-world entity or system — a digital view of a city's telecommunications network built up from real data, for example. Digital twins in the context of IoT projects are a promising area that is now driving the interest in the field, and it's most likely an area that will grow significantly over the next three to five years. Well-designed digital twins are assets that have the potential to significantly improve enterprise control and decision-making going forward.

Digital twins integrate artificial intelligence, machine learning, and analytics with data to create living digital simulation models that update and change as their physical counterparts change. A digital twin continuously learns and updates itself from multiple sources to represent its near real-time status, working condition, or position. Digital twins are linked to their real-world counterparts and are used to understand the state of the system, respond to changes, improve operations, and add value. Digital twins start out as simple digital views of the real system and then evolve over time, improving their ability to collect and visualize the right data, apply the right analytics and rules, and respond in ways that further your organization's business objectives. But you can also use a digital twin to run predictive models or simulations that find patterns in the data building up the digital twin that might lead to problems. Those insights can then be used to prevent a problem proactively. Adding automated abilities to the digital-twin concept, so that decisions can be made based on predefined and preapproved policies, would be a great capability to add to any operational perspective — managing an IoT system such as a smart city, for example.

Blockchain

Blockchain is a trend in data that holds promise for future innovations. The blockchain concept has evolved from a digital currency infrastructure into a platform for digital transactions. A blockchain is a growing list of records (blocks) that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp, and transaction data. By design, a blockchain is resistant to modification of the data. It's an open and public ledger that can record transactions between two parties efficiently and in a verifiable and permanent way.
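The hash-linked structure just described can be illustrated with a few lines of Python. This is only a sketch of the data structure (no consensus protocol, networking, or mining), with made-up transaction data.

```python
# Minimal sketch of a hash-linked chain of blocks: each block stores a timestamp,
# transaction data, and the hash of the previous block, so editing an old block
# invalidates every block that follows it. Illustrative only.
import hashlib
import json
import time


def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def add_block(chain: list, transactions: list) -> None:
    previous_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "timestamp": time.time(),
        "transactions": transactions,
        "previous_hash": previous_hash,
    })


def chain_is_valid(chain: list) -> bool:
    # Recompute each link; a retroactive edit breaks the rest of the chain.
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


ledger: list = []
add_block(ledger, [{"from": "alice", "to": "bob", "amount": 5}])
add_block(ledger, [{"from": "bob", "to": "carol", "amount": 2}])
print(chain_is_valid(ledger))   # True
ledger[0]["transactions"][0]["amount"] = 500
print(chain_is_valid(ledger))   # False: tampering is detected
```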
A blockchain is also a decentralized and distributed digital ledger that is used to record transactions across many computers, so that a record cannot be altered retroactively without altering all subsequent blocks. Blockchain technologies offer a significant step away from current centralized, transaction-based mechanisms and can work as a foundation for new digital business models for both established enterprises and start-ups. Although the hype surrounding blockchains was originally focused on the financial services industry, blockchains have many potential areas of usage, including government, healthcare, manufacturing, identity verification, and supply chain. Although blockchain holds long-term promise and will undoubtedly create disruption, its promise has yet to be proven in reality: Many of the associated technologies are too immature to use in a production environment and will remain so for the next two to three years.

Conversational platforms

Conversational AI is a form of artificial intelligence that allows people to communicate with applications, websites, and devices in everyday, humanlike natural language via voice, text, touch, or gesture input. For users, it allows fast interaction using their own words and terminology. For enterprises, it offers a way to build a closer connection with customers via personalized interaction and to receive a huge amount of vital business information in return.

This trend in data will most likely drive the next paradigm shift in how humans interact with the digital world. The responsibility for translating intent shifts from humans to machines. The platform takes a question or command from the user and then responds by executing some function, presenting some content, or asking for additional input. Over the next few years, conversational interfaces will become a primary design goal for user interaction and will be delivered in dedicated hardware, core OS features, platforms, and applications. Check out the following list for some potential areas where one could benefit from applying conversational platforms by way of bots:

- Informational: Chatbots that aid in research, informational requests, and status requests of different types
- Productivity: Bots that can connect customers to commerce, support, advisory, or consultative services
- B2E (business-to-employee): Bots that enable employees to access data, applications, resources, and activities
- Internet of Things (IoT): Bots that enable conversational interfaces for various device interactions, like drones, appliances, vehicles, and displays

Using these different types of conversational platforms, you can expect increased productivity (because staff can concentrate on the most valuable interactions), a 24/7 automated workforce, increased customer loyalty and satisfaction, new insights into customer interactions, and reduced operational expenses. Conversational platforms have now reached a tipping point in terms of understanding language and basic user intent, but they still aren't good enough to fully take off. The challenge conversational platforms face is that users must communicate in a structured way, and this is often a frustrating experience in real life.
A primary differentiator among conversational platforms is the robustness of their models and the application programming interfaces (APIs) and event models used to access, invoke, and orchestrate third-party services to deliver complex outcomes.
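As a toy illustration of the loop a conversational platform runs (take an utterance, resolve the intent, execute a function, respond), here is a minimal Python sketch. Real platforms rely on trained language-understanding models and the APIs mentioned above; the intents and keywords here are invented for the example.

```python
# Toy sketch of intent resolution in a conversational interface. The intents,
# keywords, and canned responses are made-up examples.
import re

INTENTS = {
    "order_status": {"keywords": {"order", "status", "delivery", "track"},
                     "handler": lambda: "Your latest order is out for delivery."},
    "opening_hours": {"keywords": {"open", "hours", "close"},
                      "handler": lambda: "We are open 09:00-18:00 on weekdays."},
}


def respond(utterance: str) -> str:
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    # Pick the intent whose keywords overlap most with the user's words.
    best_intent, best_overlap = None, 0
    for name, intent in INTENTS.items():
        overlap = len(words & intent["keywords"])
        if overlap > best_overlap:
            best_intent, best_overlap = name, overlap
    if best_intent is None:
        return "Sorry, I did not understand that. Could you rephrase?"
    return INTENTS[best_intent]["handler"]()


print(respond("What is the status of my order?"))
print(respond("When do you open tomorrow?"))
```

The keyword matching stands in for what a production platform would do with a statistical language-understanding model, which is exactly where the robustness differences between platforms show up.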

The Ethics of Artificial Intelligence

Article / Updated 06-29-2019

So, what does artificial intelligence (AI) ethics actually refer to, and which areas are important to address to generate trust around your data and algorithms? Well, there are many aspects to this concept, but there are five cornerstones to rely on when it comes to the ethics of artificial intelligence:

- Unbiased data, teams, and algorithms. This refers to the importance of managing inherent biases that can arise from the development team composition if there isn't good representation across gender, race, and other backgrounds. Data and training methods must be clearly identified and addressed through the AI design. Gaining insights and potentially making decisions based on a model that is in some way biased (a tendency toward gender inequality or racist attitudes, for example) isn't something you want to happen.
- Algorithm performance. The outcomes from AI decisions should be aligned with stakeholder expectations, so that the algorithm performs at a desired level of precision and consistency and doesn't deviate from the model objective. When models are deployed in their target environment in a dynamic manner and continue to train and optimize their performance, they will adjust to new data patterns and preferences and might start deviating from the original goal. Setting sufficient policies to keep the model training on target is therefore vital.
- Resilient infrastructure. Make sure that the data used by the AI system components and the algorithm itself are secured from unauthorized access, corruption, and/or adversarial attack.
- Usage transparency and user consent. A user must be clearly notified when interacting with an AI and must be offered an opportunity to select a level of interaction or reject that interaction completely. This also refers to the importance of obtaining user consent for data captured and used. The introduction of the General Data Protection Regulation (GDPR) in the EU has prompted discussions in the US calling for similar measures, meaning that awareness of the stakes involved in personal information, as well as the need to protect that information, is slowly improving. So, even if the data is collected in an unbiased manner and models are built in an unbiased setup, you could still end up with ethically challenging situations (or even break the law) if you're using personal data without the right permissions.
- Explainable models. This refers to the need for AI's training methods and decision criteria to be easily understood, documented, and readily available for human assessment and validation. It refers to situations where care has been taken to ensure that an algorithm, as part of an intelligent machine, produces actions that can be trusted and easily understood by humans. The opposite of AI explainability is when the algorithm is treated as a black box, where even the designer of the algorithm cannot explain why the AI arrived at a specific insight or decision.

An additional ethical consideration, which is more technical in nature, relates to the reproducibility of results outside of the lab environment. AI is still immature, and most research and development is exploratory by nature. There is still little standardization in place for machine learning/artificial intelligence. De facto rules for AI development are emerging, but slowly, and they are still very much community driven.
Therefore, you must ensure that any results from an algorithm are actually reproducible — meaning that you get the same results in the real target environment as in the lab, and also across different target environments (between different operators within the telecommunications sector, for example).

How to ensure trustworthy artificial intelligence

If the data you need access to in order to realize your business objectives can be considered ethically incorrect, how do you manage that? It's easy enough to say that applications should not collect data about race, gender, disabilities, or other protected classes. But the fact is that if you do not gather that type of data, you'll have trouble testing whether your applications are in fact fair to minorities. Machine learning algorithms that learn from data will become only as good as the data they're running on. Unfortunately, many algorithms have proven to be quite good at figuring out their own proxies for race and other classes, in ways that run counter to what many would consider proper human ethical thinking. Your application would not be the first system that turns out to be unfair despite the best intentions of its developers. But, to be clear, at the end of the day your company will be held responsible for the performance of its algorithms, and (hopefully) bias-related legislation in the future will be stricter than it is today.

If a company isn't following laws and regulations or ethical boundaries, the financial cost could be significant — and perhaps even worse, people could lose trust in the company altogether. That could have serious consequences, ranging from customers abandoning the brand to employees losing their jobs to people going to jail. To avoid these types of scenarios, you need to put ethical principles into practice, and for that to happen, employees must be allowed and encouraged to be ethical in their daily work. They should be able to have conversations about what ethics actually means in the context of the business objectives and what costs to the company can be weathered in their name. They must also be able to at least discuss what would happen if a solution cannot be implemented in an ethically correct manner. Would such a realization be enough to terminate it?

Data scientists generally find it important to share best practices and scientific papers at conferences, write blog posts, and develop open source technologies and algorithms. However, problems such as how to obtain informed consent aren't discussed quite as often. It's not that the problems aren't recognized or understood; they're merely seen as less worthy of discussion. Rather than let such a mindset persist, companies should actively encourage (rather than just allow) more discussions about fairness, the proper use of data, and the harm that can be done by the inappropriate use of data.

Recent scandals involving computer security breaches have shown the consequences of sticking your head in the sand: Many companies that never took the time to implement good security practices and safeguards are now paying for that neglect with damage to their reputations and their finances. It is important to exercise the same due diligence now accorded to security matters when thinking about issues like fairness, accountability, and unintended consequences of your data use. It will never be possible to predict all unintended consequences of such usage, and, yes, the ability to foresee the future is limited.
But plenty of unintended consequences could easily have been foreseen. (Facebook's Year in Review feature, which seemed to go out of its way to remind Facebook users of deaths in the family and other painful events, is a prime example.) Mark Zuckerberg's famous motto, "Move fast and break things," is unacceptable if it hasn't been thought through in terms of what is likely to break.

Company leaders should insist that they be allowed to ponder such aspects — and stop the production line whenever something goes wrong. This idea dates back to Toyota's Andon manufacturing method: Any assembly line worker can stop the line if they see something going wrong. The line doesn't restart until the problem is fixed. Workers don't have to fear consequences from management for stopping the line; they are trusted, and are expected to behave responsibly. What would it mean if you could do this with product features or AI/ML algorithms? If anyone at Facebook could have said, "Wait, we're getting complaints about Year in Review" and pulled it out of production, Facebook would now be in a much better position from an ethical perspective. Of course, it's a big, complicated company, with a big, complicated product. But so is Toyota, and it worked there.

The issue lurking behind all these concerns is, of course, corporate culture. Corporate environments can be hostile to anything other than short-term profitability. However, in a time when public distrust and disenchantment are running at an all-time high, ethics is turning into a good corporate investment. Upper-level management is only starting to see this, and changes to corporate culture won't happen quickly, but it's clear that users want to deal with companies that treat them and their data responsibly, not just as potential profit or as engagements to be maximized. The companies that will succeed with AI ethics are the ones that create space for ethics within their organizations. This means allowing data scientists, data engineers, software developers, and other data professionals to "do ethics" in practical terms. It isn't a question of hiring trained ethicists and assigning them to teams; it's about living ethical values every single day, not just talking about them. That's what it means to "do good data science."

Introducing ethics by design for artificial intelligence and data science

What's the best way to approach implementing AI ethics by design? Might there be a checklist available to use? Now that you mention it, there is one, and you'll find it in the United Kingdom. The government there has launched a data ethics framework, featuring a data ethics workbook. As part of the initiative, it has isolated seven distinct principles around AI ethics. The workbook is built up around a number of open-ended questions designed to probe your compliance with these principles. Admittedly, it's a lot of questions — 46, to be exact, which is rather too many for a data scientist to continuously keep track of and incorporate efficiently into a daily routine. For such questions to be truly useful, then, they need to be embedded not only in the development ways of working but also in the data science infrastructure and systems support. It isn't merely a question of making it possible as a practical matter to follow ethical principles in daily work and to prove how the company is ethically compliant — the company must also stand behind these ambitions and embrace them as part of its code of conduct.
However, when a company talks about adding AI ethics to its code of conduct, the value doesn't come from the pledge itself, but rather emerges from the process people undergo in developing it. People who work with data are now starting to have discussions on a broad scale that would never have taken place just a decade ago. But discussions alone won't get the hard work done. It is vital not just to talk about how to use data ethically but also to actually use data ethically. Principles must be put into practice!

Here's a shorter list of questions to consider as you and your data science teams work together to gain a common and general understanding of what is needed to address AI ethical concerns:

- Hacking: To what extent is an intended AI technology vulnerable to hacking, and thus potentially vulnerable to being abused?
- Training data: Have you tested your training data to ensure that it is fair and representative?
- Bias: Does your data contain possible sources of bias?
- Team composition: Does the team composition reflect a diversity of opinions and backgrounds?
- Consent: Do you need user consent to collect and use the data? Do you have a mechanism for gathering consent from users? Have you explained clearly what users are consenting to?
- Compensation: Do you offer reimbursement if people are harmed by the results of your AI technology?
- Emergency brake: Can you shut down this software in production if it's behaving badly?
- Transparency and fairness: Do the data and AI algorithms used comply with corporate values for technology such as moral behavior, respect, fairness, and transparency? Have you tested for fairness with respect to different user groups?
- Error rates: Have you tested for different error rates among diverse user groups? (A sketch of such a check follows this list.)
- Model performance: Do you monitor model performance to ensure that your software remains fair over time? Can it be trusted to perform as intended, not just during the initial training or modeling but also throughout its ongoing "learning" and evolution?
- Security: Do you have a plan to protect and secure user data?
- Accountability: Is there a clear line of accountability to an individual, and clarity on how the AI operates, the data that it uses, and the decision framework that is applied?
- Design: Did the AI design consider local and macro social impact, including its impact on the financial, physical, and mental well-being of humans and our natural environment?
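To show what the error-rate question can look like in practice, here is a minimal Python sketch that compares a model's error rate across user groups. The evaluation file and its columns (group, label, prediction) are assumptions made up for the example, and a real fairness review needs richer metrics and human judgment.

```python
# Minimal sketch: compare error rates across user groups on hold-out predictions.
# File and column names are illustrative assumptions.
import pandas as pd

results = pd.read_csv("model_evaluation.csv")
results["error"] = (results["prediction"] != results["label"]).astype(int)

by_group = results.groupby("group")["error"].mean()
overall = results["error"].mean()

print("Overall error rate:", round(overall, 3))
print(by_group)

# Deliberately crude alarm: flag any group whose error rate is more than
# 1.5 times the overall rate as something to investigate before deployment.
flagged = by_group[by_group > 1.5 * overall]
if not flagged.empty:
    print("Investigate these groups before deployment:")
    print(flagged)
```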

10 Mistakes to Avoid When Investing in Data Science

Article / Updated 06-29-2019

Although you must focus on your data science strategy objectives in order to succeed with them, it doesn't hurt to also learn from others' mistakes. Here you find a list of ten data science challenges that many companies tackle in the wrong way. Each item not only describes what you should aim to avoid when it comes to data science, but also points you in the direction of the right approach to the situation.

Don't tolerate top management's ignorance of data science

A fundamental misunderstanding occurs in the area of data science regarding the target group for data science training. The common view is that as long as the skill set of the data scientists themselves is improved, or of the software engineers who are training to become data scientists, you are spot-on. However, by adopting that approach, the company runs the significant risk of alienating the data science team from the rest of the organization. Managers and leaders are often forgotten. If managers don't understand or trust the work done by the data scientists, the outcome won't be utilized in the organization and insights won't be put into action.

So, the main question to ask is how to secure full utilization of the data science investment if the results cannot be interpreted by management. This is one of the most common mistakes committed by companies today, and the fact is that there's also little training and coaching available for line management and for leaders. But without some level of understanding of data science at the management level, how can the right strategy be put in place, and how can you expect management to dare to use the statistical results to make substantive decisions? Without management understanding of data science, it's not only difficult to capture the full business opportunity for the company; it might also lead to further alienation of the data science team or to termination of the team altogether.

Don't believe that AI is magic

Data science is all about data, statistics, and algorithms. There's nothing magic about it — the machine does what it's told to do. However, the notion that the machine can learn causes some to think that it has the full ability to learn by itself. To some extent, that is correct — the machine can learn — but only within the boundaries you set up for it. (No magic, in other words!) A machine cannot solve problems by itself, unless it is allowed to develop such a design. But that's advanced technology and not today's reality.

Overestimating what artificial intelligence can do for your company can really set you off on the wrong track, building up expectations that can never be met. This could lead to severe consequences both within the company and externally, with impacts not just in terms of trust and reliability but also in terms of financial performance. As important as it is not to underestimate the potential of artificial intelligence, you should also avoid the opposite extreme, where its potential is overestimated. Let's repeat: Artificial intelligence isn't magic. Yes, it's called artificial intelligence, but a more accurate term would be algorithmic intelligence. Why? Because at the end of the day, very advanced mathematics is applied to huge amounts of data, with the ability to dynamically interact with a defined environment in real time.
Don't approach data science as a race to the death between man and machine

Some people tend to believe that task automation, driven by machine learning predictions, truly means the end of humans in the workplace. That prediction isn't one that everyone believes in. However, the presence of AI does mean a significant change in competence and skill sets, as well as a change in which job roles will be relevant and which types of responsibilities will be the focus in the workplace. Like the introduction of the Internet in the workplace, introducing artificial intelligence in a more mainstream format will change what jobs are and how they're performed. There will be a lot less hands-on work, even in the software business. And yes, machines will most probably do a lot of the basic software development going forward, which means that people in the hardware-related industry will not be the only ones replaced. At the end of the day, basically all humans will be impacted as machine learning/artificial intelligence and automation capabilities and capacity expand and evolve beyond what is possible to do today.

However, this also means that humans can move on to perform other tasks that are different from the ones we do today — managing and monitoring models and algorithms and their performance, for example, or setting priorities and acting as a human fallback solution in cooperation with the machine. Other typical human tasks might be managing legal concerns related to data, evaluating ethical aspects of algorithm-based decision-making, or driving standardization in data science. You could say that the new human tasks will be focused on managing the machines that manage the original tasks — tasks that were previously perceived to be either boring and repetitive or too complex to execute at all.

This "putting man against machine" business isn't the way to approach your data science implementation. Allowing the narrative to be framed that way may scare your employees and even prompt them to leave the company, which isn't what you want. Your employees are valuable assets that you need in the next stages as well, but perhaps in new roles and with newly acquired skill sets. Embrace what machine learning/artificial intelligence technology can do for a specific line of business. Company leaders who understand how to utilize these techniques in a balanced approach between man and machine, augmenting the total performance and letting the company evolve beyond its current business, are the leaders whose companies will succeed.

Don't underestimate the potential of AI

As strange as it may seem, some companies just don't understand how transformative artificial intelligence really is. They refuse to see the fundamental shift that is already starting to transform society, and cannot see artificial intelligence as anything other than just another software technique or a set of new programming languages. The key here is to (a) take the time to truly understand what data science is really all about and (b) not be afraid to accept help from experts to identify and explain the strategic potential for your specific business. Because the area of data science is complex, it requires domain expertise and experience in terms of both the development of a strategy and its implementation. It also requires the ability to read and interpret where the market is moving in this area.
By underestimating the impact that artificial intelligence can have on your business, you run the risk of significantly limiting the future expansion of your company. Later, once the true potential is really understood, you will find yourself entering the game too late and equipped with the wrong skill set. You may finally be put out of business by competitors that saw the potential much earlier and therefore invested earlier and smarter in artificial intelligence.

Don't underestimate the needed data science skill set

A typical sign of a company underinvesting in data science is when you find small, isolated islands of data science competence spread out in different parts of a large company. In smaller companies, you see a similar symptom when a small but competent data science team is working on the most important project in the company, but the only one who realizes its importance is an outsider like yourself. Both of these examples are signs that top management has not understood the potential of data science. They have simply noticed that something is happening in this area in the market and are following a trend to make sure that data science doesn't pass them by. If the awareness and competency level of management doesn't improve, the area will continue to be underinvested, distributed in a way that cannot reach critical mass, and therefore rendered incapable of being scaled up at a later stage.

Don't think that a dashboard is the end objective of data science

It may sound strange, to someone knowledgeable in data science, that anyone could think the main outcome of data science is a dashboard. Rest assured, however, that this is a common misunderstanding. It isn't only wrong — it's also one of the main reasons that many companies fail with their data science investment. At many companies, management tends to think that the main purpose of analytics and artificial intelligence is to use all that big data that has been pumped into the expensive data lake to automate tasks and report on progress. Given such a mindset, it should come as no surprise that the main focus of management would be to use these techniques to answer their questions with statistically proven methods that produce results that can be visualized in a nice-looking dashboard. For someone new to the field of data science, that might actually seem like a good approach. Unfortunately, they would be wrong.

To be absolutely clear, the main objective of analytics and machine learning/artificial intelligence isn't simply to do what you've always done, but using more machines. The idea is to be able to move beyond what you're able to do today and tackle new frontiers. If the only end goal were to create a dashboard in order to answer some questions posed by a manager, there would be no need to create a data-driven organization. The idea is that, in a data-driven organization, it all starts with the data, not with the manager and the dashboard. The starting point is what the data indicates that you need to look at, analyze, understand, and act on. Analysis should be predictive, in order for the organization to be proactive and for its actions to be preventive. The role of the dashboard should be to surprise you with new insights and make you discover new questions you should be asking — not to answer the questions you've already come up with. It should enable teams to monitor and learn from ongoing preventive actions.
The dashboard should also support human or machine discovery of potential trends and forecasts in order to make long-term strategic decisions. In the real world, the steps needed to design a dashboard tend to end up being the most important tasks to discuss and focus on. Often, dashboards end up driving everything that is done in the data science implementation program, totally missing the point about keeping an open and exploratory approach to the data. This tends to happen because the dashboard is the simplest and most concrete deliverable to understand and hold on to in this new, complex, and constantly changing environment. In this sense, it acts like a crutch for those unwilling or unable to grasp the full potential of a data-driven business. You run the great risk of missing the whole point of being data driven when your starting point is all about designing the dashboard and laying down all the questions from the start. By doing so, you assume that you already know which questions are important. But how can you be sure of that? In a society and a market now undergoing huge transformations, if you don't look at the data first and let the algorithms do the work of finding the patterns and deviations hiding there, you might end up looking at entirely the wrong problem for your business.

Don't forget about the ethical aspects of AI

What does artificial intelligence ethics actually refer to, and why is it of the utmost importance? Well, there are many aspects surrounding the idea of ethics in AI, many of which can have a severe impact on artificial intelligence results. One obvious but important ethical consideration is the need to avoid machine bias in the algorithms — biases where human preconceptions of race, gender, class, or other discriminatory aspects are unconsciously built into the models and algorithms. Usually, people tend to believe that they don't have biased opinions, but the truth is that everyone has them to some degree. People tend to lean in one direction, subconsciously or not. Modeling that tendency into self-learning algorithms can have severe consequences for the performance of the company's algorithms.

One example that comes to mind involves an innovative, online, artificial-intelligence-driven beauty contest. The algorithm had learned to search for the ten most beautiful women in the US, using only digital photos of women. But when studying the results of the contest, it became clear that something must have gone wrong: All ten of the most beautiful women selected by the algorithm were white, blonde, and blue-eyed. When the algorithm was studied again, it turned out that the training set had a majority of white, blonde, and blue-eyed women in it, which taught the machine that this was the desired look.

Other aspects in addition to machine bias include areas such as the use of personal information, the reproducibility of results outside the lab environment, and the explainability of AI insights or decisions. It's also worth noting that this last aspect is now covered by law within the GDPR (General Data Protection Regulation) in the EU. Ethical considerations are for our own human protection as machine intelligence evolves over time. You must think about such aspects early on.
It's not only a fundamental aspect to consider as part of your data science investment; it's also hugely important to consider right from the start, when designing your business models, architecture, infrastructure, ways of working, and the teams themselves. Not wanting to break the law is of course important, but securing a sustainable and trustworthy evolution of artificial intelligence in your business is far more important.

Don't forget to consider the legal rights to the data

When becoming data driven, one of the most common mistakes is forgetting to make a proper analysis of which data is needed. Even if the main ambition of your data science investment is internal efficiency and data-driven operations, this is still a fundamental area to address. Once the data need is analyzed, it's not unusual to discover that you need other types of data than you originally thought. It might be data other than just the internally generated data you own. An example might be faults found in your products or services, or perhaps performance-related data. It could even be the more sensitive type of data that falls under the category of privacy data, related to how your products or services are being used by your customers.

Data privacy is an area that's getting more and more attention, both in society, with consumers' enhanced awareness of how their data is being used, and in terms of new laws and regulations on data. One concrete example is the General Data Protection Regulation (GDPR), introduced in 2018 within the EU with significant penalties for violators. Although you might not have any plans for monetizing your data or building new products based on it, the whole rights issue is still central — even when all you want to do is analyze the data in order to better understand your business, enhance and innovate the current portfolio, or just improve the efficiency of your operations. No matter what your reasons are for using the data, you still need the legal rights in place in order to use it! It's absolutely vital to address this early on as part of the development of your data strategy. If you don't, you might end up either violating the laws regulating data usage and ownership or being stuck, unable to sell your fantastic new product or service because it uses data you aren't entitled to use.

Don't ignore the scale of change needed

If you don't take the time to properly sketch out the different change scenarios for your business when introducing a data science strategy, you will most likely fail. The fundamental shift needed for the company to become truly data, analytics, and machine driven is significant and should not be underestimated.
The most common mistakes in data science related to managing change are listed here:

- Underestimating the scope of the change and not taking seriously enough what has to happen
- Failing to recognize that business models are sure to be impacted when introducing data science
- Approaching customers with a value argument based on introducing data science techniques without explicitly explaining what the customer value is
- Leaving pricing models unchanged, so that they reflect only the lowered cost and not the increased value
- Focusing single-mindedly on cost efficiency when it comes to business operational changes
- Neither measuring nor understanding operational improvements
- Carrying out organizational changes on so small a scale that everything stays the same in practice, ensuring that the actual change never occurs
- Building the cost and dimensioning model on old and outdated criteria, thereby ensuring that the model won't capture the new values
- Failing to see the change that data science imposes on the company and not understanding that change from an ecosystem perspective
- Underestimating the need for communication related to the change

Don't forget the measurements needed to prove the value of data science

A common mistake is to forget to introduce baseline measurements before the data science investment is made and implemented. Most of the focus in these cases tends to be on the future measurements and the results targeted by the investment. This is usually because of resistance toward investing in measurements of the current situation, because it's being abandoned for the new strategy. Unfortunately, this means that the company will lack the ability to statistically prove the value of the investment in the next step. Don't fall into that trap! It could truly backfire on the entire strategic ambition when top management or even the board of directors asks what the value of this major investment was. Financially, you might of course be able to justify the investment at a high level; however, it would be difficult to prove the value of individual parts. Efficiency gains such as speed, agility, automation level, and process reactiveness versus proactiveness are values that are more difficult to prove and put a number on if you haven't secured a measurement baseline before executing your data science strategy.
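To make the baseline point concrete, here is a minimal Python sketch that compares a key performance indicator measured before and after a data science rollout. The file names and the handling_time_hours column are assumptions made up for the example; without the "before" measurements collected up front, no comparison of this kind is possible later.

```python
# Minimal sketch: test whether a KPI improved after the rollout, against a
# baseline captured beforehand. File and column names are illustrative assumptions.
import pandas as pd
from scipy import stats

before = pd.read_csv("kpi_before_rollout.csv")["handling_time_hours"]
after = pd.read_csv("kpi_after_rollout.csv")["handling_time_hours"]

print(f"Baseline mean: {before.mean():.1f} h, post-rollout mean: {after.mean():.1f} h")

# Welch's t-test: is the difference larger than random variation would explain?
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05 and after.mean() < before.mean():
    print("The reduction is statistically significant at the 5% level.")
else:
    print("No statistically significant reduction demonstrated yet.")
```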

Data Science Careers: The Roles in a Data Science Team

Article / Updated 06-29-2019

In the past couple of years, an avalanche of different data science careers and roles has overwhelmed the market, and for someone who has little or no experience in the field, it's hard to get a general understanding of how these roles differ and which core skills are actually required. The fact is that these different data science careers and roles are often given different titles but tend to refer to the same or similar jobs — admittedly, sometimes with overlapping responsibilities. This crazy-quilt of job titles and job responsibilities is yet another area in data science that is in need of more standardization.

Before attempting some hard-and-fast role definitions of these data science careers, then, let's start by sketching out the different task sets you'd typically find on a data science team. The idea here is to scope the high-level competence areas that need to be covered on a data science team, regardless of who actually carries out which task. The three main competence areas are mathematics/statistics, computer science, and business domain knowledge. This is the easy part, because there's general agreement on which competencies are required for a successful and efficient data science team — though you still need to define the roles and areas of responsibility for each team member. The definitions of data science careers that you find here aim to give you a general understanding of the most important roles you'll need on your data science team. Just remember that variants may apply, depending on your own specific setup and strategic focus.

Data scientist

In general terms, a data scientist produces mathematical models for the purposes of prediction. And, because the development and interpretation of mathematical models requires deep technical knowledge, most data scientists have graduate-level training in computer science, mathematics, or statistics. Data scientists also need strong programming skills in order to effectively leverage the range of available software tools. Aside from being technically savvy, data scientists need critical thinking skills, based on common sense as well as on a thorough understanding of a company's business objectives, in order to produce high-quality models.

Sometimes, a role referred to as data analyst is set apart from the data scientist role. In such cases, the data analyst is like the Sherlock Holmes of the data science team: they focus on collecting and interpreting data as well as analyzing patterns and trends in the data, from which they draw conclusions in a business context. The data analyst must master languages like R, Python, SQL, and C, and, just like the data scientist, the skills and talents needed for this role are diverse and span the entire spectrum of tasks in the data science process. And, to top it all off, a data analyst must demonstrate a healthy I-can-figure-it-out attitude. It's really up to you to decide whether you want to have all your company's data scientists take up the tasks associated with a data analyst or whether you want to set up the data analyst as a separate role.

Within the role of the data scientist, you'll find another, more traditional role hidden away: the statistician. In historical terms, the statistician was the leader when it came to data and the insights it could provide. Although often forgotten or replaced by fancier-sounding job titles, the statistician role represents what the data science field stands for: getting useful insights from data.
With their strong background in statistical theories and methodologies, and a logically oriented mindset, statisticians harvest the data and turn it into information and knowledge. They can handle all sorts of data. What's more, thanks to their quantitative background, modern statisticians are often able to quickly master new technologies and use them to boost their analytical capabilities. A statistician brings the rigor of mathematics to the table, along with insights that can radically transform businesses. Data engineer The role of the data engineer is fundamental for data science. Without data, there cannot be any data science, and the job of data scientists is a) quite impossible if the requisite data isn't available and b) definitely daunting if the data is available but only on an inconsistent basis. The problem of inconsistency is frequently faced by data scientists, who often complain that too much of their time is spent on data acquisition and cleaning. That's where the data engineer comes in: This person's role is to create consistent and easily accessible data pipelines for consumption by data scientists. In other words, data engineers are responsible for the mechanics of data ingestion, processing, and storage, all of which should be invisible to the data scientists. If you're dealing with small data sets, data engineering essentially consists of entering some numbers into a spreadsheet. When you operate at a more impressive scale, data engineering becomes a sophisticated discipline in its own right. Someone on your team looking for a data science career will need to take responsibility for the tricky engineering aspects of delivering data that the rest of your staff can work with. Data engineers don't need to know anything about machine learning (ML) or statistics to be successful. They don't even need to be inside the core data science team, but could be part of a larger, separate data engineering team that supplies data to all data science teams. However, you should never place your data engineers and data scientists too far apart from one another organizationally. If these roles are separated into different organizations, with potentially different priorities, the data science team's productivity can suffer heavily. Data science methods are quite experimental and iterative in nature, which means that it must be possible to continuously modify data sets as the analysis and algorithm development progress. For that to happen, data scientists need to be able to rely on a prompt response from the data engineers if trouble arises. Without that rapid response, you run the risk of slowing down a data science team's productivity. Machine learning engineer Data scientists build mathematical models, and data engineers make data available to data scientists as the "raw material" from which mathematical models are derived. To complete the picture, these models must first be deployed (put into operation, in other words), and, second, they must be able to act on the insights gained from data analysis in order to produce business value. This task is the purview of the machine learning engineer. The machine learning engineer role is a software engineering role, with the difference that the ML engineer has considerable expertise in data science. This expertise is required because ML engineers bridge the gap between the data scientists and the broader software engineering organization. 
With ML engineers dedicated to model deployment, the data scientists are free to continually develop and refine their models. Variants are always a possibility when setting up a data science team. For example, the ML engineer's deployment responsibilities are often also handled by the data scientist role. Depending on the importance of the operational environment for your specific business, it can make more or less sense to separate this role from the data scientist responsibilities. It is, again, up to you to decide how to allocate this responsibility within the team. Data architect A data architecture is a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its data systems. The person charged with designing, creating, deploying, and managing an organization's data architecture is called a data architect, and they definitely need to be accounted for on the data science team. Data architects define how the data will be stored, consumed, protected, integrated, and managed by different data entities and IT systems, as well as by any applications using or processing that data in some way. A data architect usually isn't a permanent member of a single data science team, but rather serves several data science teams, working closely with each team to ensure efficiency and high productivity. Business analyst The business analyst often comes from a different background than the rest of the team. Though often less technically oriented, business analysts make up for it with their deep knowledge of the different business processes running through the company — operational processes (the sales process), management processes (the budget process), and supporting processes (the hiring process). The business analyst masters the skill of linking data insights to actionable business insights and can use storytelling techniques to spread the message across the entire organization. This person often acts as the intermediary between the "business guys" and the "techies." Software engineer The main role of a software engineer on a data science team is to bring more structure to the data science work so that it becomes more applied and less experimental in nature. The software engineer has an important role in terms of collaborating with the data scientists, data architects, and business analysts to ensure alignment between the business objectives and the actual solution. You could say that a software engineer is responsible for bringing a software engineering culture into the data science process. That is a massive undertaking, and it involves tasks such as automating the data science team infrastructure, ensuring continuous integration and version control, automating testing, and developing APIs to help integrate data products into various applications. Domain expert It takes a lot of conversations to make data science work. Data scientists can't do it on their own. Success in data science requires a multiskilled project team with data scientists and domain experts working closely together. The domain expert brings the technical understanding of her area of expertise, sometimes combined with a thorough business understanding of that area as well. That expertise usually includes familiarity with the basics of data analysis, which means that domain experts can support many roles on the data science team. 
However, the domain expert usually isn't a permanent member of a data science team; more often than not, that person is brought in for specific tasks, like validating data or providing analysis or insight from an expert perspective. Sometimes the domain expert is allocated for longer periods to a certain team, depending on the task and focus. Sometimes one or several domain experts are assigned to support multiple teams at the same time. Characteristics of a great data scientist, the foundational data science career There's a lot of promise connected with the data scientist role. The problem is not only that the perfect data scientist doesn't exist, but also that the truly skilled ones are few and difficult to get hold of in the current marketplace. So, what should you be doing instead of searching for the perfect data scientist? The focus should be on finding someone with the ability to solve the specific problems your company is focusing on — or, to be even more specific, what your own data science team is focusing on. It's not about hiring the perfect data scientist and hoping that they're going to do all the things that you need done, now and in the future. Instead, it's better to hire someone with the specific skills needed to meet the clearly defined organizational objectives you know of today. For instance, think about whether your need is more related to ad hoc data analysis or product development. Companies that have a greater need for ad hoc data insights should look for data scientists with a flexible and experimental approach and an ability to communicate well with the business side of the organization. On the other hand, if product development is more important in relation to the problems you're trying to solve, you should look for candidates with strong software engineering skills and a firm base in the engineering process, in combination with analytical skills. If you're hoping to find a handy checklist of all the critical skills that you should be looking for when hiring a data scientist, you'll be sorely disappointed. The fact is, not even a basic description of the important traits the role should possess is agreed on across the industry. There are many opinions and ideas about it, but again the lack of standardization is troublesome. So, what makes coming up with a simple checklist of the needed tool sets, competencies, and technical skills so difficult? For one, data science careers are evolving fast, and tools and techniques that were important to master last year might be less important this year. Therefore, staying in tune with the evolution of the field and continually learning new methods, tools, and techniques is the key in this space. Another reason it's difficult to specify a concrete checklist of skills is that the most critical skill sets actually sit outside the core data science area — they qualify more as soft skills, like interpersonal communication and projecting the right attitude. Just picture a Venn diagram of the skills, traits, and attitudes a data scientist needs: the variety of skill sets and mindset traits that a perfect data scientist must master is almost ridiculous. 
So, bearing in mind that specifying the competencies needed for a data scientist is more a question of attitude and mindset in combination with a certain skill set, here's a list of characteristics that define a good data scientist: Business understanding: A data scientist should be able to translate a problem from business language into a testable hypothesis — that is, understand what the business person describes, translate it into technical terms, and present a potential solution in that context. Impactful versus interesting: Data scientists must be able to resist the temptation to always prioritize the interesting problems when there might be problems that are more important to solve because of the major business impact such solutions would have. Curiosity: Having an intellectual curiosity and the ability to break a problem down into a clear set of hypotheses that can be tested is a major plus. Attention to detail: As a data scientist, pay attention to details from a technical perspective. A model that is only nearly right isn't good enough, and building an advanced technical algorithm takes time and dedication to detail. Quick learner: The data scientist must be able to learn quickly, because the data science space changes rapidly — not only technologies and methodologies, but also new tools and open-source models that become available and ready to build on. Agile mindset: Stay flexible and agile in terms of what is possible, how problems are approached, how solutions are investigated, and how problems are solved. Experimentation mindset: The data scientist must not be afraid to fail or to test assumptions that might turn out to be wrong in order to find the most successful way forward. Communication: A data scientist must be able to tell a story and describe the problem in focus or the opportunity that he's aiming for, as well as explain what the finished models can do and what they actually enable. Of course, there are additional skills of interest, such as statistics, machine learning, and programming, but remember that you do not need one person to fit all categories here. First of all, you should be looking for data scientists who possess the most important skills that meet your needs. However, in the search for that top-notch data scientist, remember that the list above could also be used for hiring a complementary team of data scientists that together possesses the skills and mindset needed. After your team of data scientists is in place, encourage their professional development and lifelong learning. Many data scientists have an academic mindset and a willingness to experiment, but in the pursuit of a perfect solution, they sometimes get lost among all the data and the problems they're trying to solve. Therefore, it's important that they stay connected with the team, though you should allow enough independence so that they can continue to publish white papers, contribute to open source, or pursue other meaningful activities in their field.

Tips for Developing a Data Strategy: Managing Your Data Appropriately

Article / Updated 06-29-2019

After your company's objectives have become clearer, your CDO, as part of an overall data science strategy, needs to create a business-driven data strategy fleshed out with a significant level of detail. In addition, that person needs to define the scope of the desired data-driven culture and mindset for your company and move to drive that culture forward. Here, you discover what a CDO needs to keep in mind in order to accomplish these tasks, as well as an example of a data strategy scope. Data science: Caring for your data One key aspect in any data strategy involves caring for your data as if it were your lifeblood — because it is. You need to address data quality and integration issues as key factors of your data strategy, and you need to align your data governance programs with your organizational goals, making sure you define all strategies, policies, processes, and standards in support of those goals. Organizations should assess their current state and develop plans to achieve an appropriate level of maturity in terms of data governance over a specific period. It's important to recognize that data governance is never complete; by necessity, it evolves, just as corporate needs and goals, technology, and legal and regulatory aspects do. Governance programs can range from establishing company-level, business-driven data and information programs for data integrators, to establishing customized, segment-based programs for the business optimizers and market disruptors/innovators. However, even the best strategy can falter if the business culture isn't willing to change. Data integrators flourish in an evidence-based operational environment where data and research are used to establish a data-driven culture, whereas business optimizers and market disruptors/innovators need to adopt a "fail-fast" agile software development culture in order to increase speed-to-market and innovation. Data science: Democratizing the data As important as it is to understand the value of the data your company has access to, it's equally important to make sure that the data is easily available to those who need to work with it. That's what democratizing your data really means. Given its importance, you should strive to make sure that this democratization occurs throughout your organization. The fact of the matter is, everyone in your company makes business decisions every single day, and those decisions need to be grounded in a thorough understanding of all available data. It has become obvious that data-driven decisions are better decisions, so why wouldn't you choose to provide people with access to the data they need in order to make better decisions? Although most people can understand the need for data democratization, it isn't at all uncommon for a company's data strategy to instead focus on locking up the data — just to be on the safe side. Nothing, however, could be more devastating for the value realization of the data for your business than adopting a bunker mentality about data. The way to start generating internal and external value from your data is to use it, not lock it up. Even adopting a radical approach of a totally open data environment internally is better than being too restrictive in terms of how data is made available and shared in the company. Data science: Driving data standardization A third key component in any data strategy is to standardize in order to scale quickly and efficiently. Data standardization is an important component for success — one that should not be underestimated. 
A company cannot hope to achieve goals that assume a 360-degree view of all customers, underpinned by the correct data, without a common set of data definitions and structures across the company and its customers. TM Forum, a nonprofit industry association for service providers and their suppliers in the telecommunications industry, developed something it calls the Information Framework (SID) in concert with professionals from the communications and information industries working collaboratively to provide a universal information and data model. (The SID part of the name comes from Shared Information Data model.) The benefits of this common model come from its ability to significantly support increased standardization around data in the telecommunications space and include aspects such as: Faster time to market for new products and services Cheaper data and systems integration Less data management time Reduced cost and support when implementing multiple technologies Organizations have long recognized the need to seek standardization in their transactional data structures, but they need to realize the importance of seeking standardization in their analytical data structures as well. Traditional analytics and business intelligence setups continue to use data warehouses and data marts as their primary data repositories, and yes, they are still highly valuable to data-driven organizations, but enabling dynamic big data analytics and machine learning/artificial intelligence solutions requires a different structure in order to be effective. Data science: Structuring the data strategy The act of creating a data strategy is a chance to generate data conversations, educate executives, and identify exciting new data-enabled opportunities for the organization. In fact, the process of creating a data strategy may generate political support, changes in culture and mindset, and new business objectives and priorities that are even more valuable than the data strategy itself. But what should the data strategy actually include? The list below gives you an idea. Data-centric vision and business objectives, including user scenarios Strategic data principles, including treating data as an asset Guidelines for data security, data rights, and ethical considerations Data management principles, including data governance and data quality Data infrastructure principles regarding data architecture, data acquisition, data storage, and data processing Data scope, including priorities over time Don't mix up the data strategy with the data science strategy. The main difference is that the data strategy is focused on the strategic direction and principles for the data and is a subset of the data science strategy. The data science strategy includes the data strategy, but also aspects such as organization, people, culture and mindset, data science competence and roles, managing change, measurements, and commercial implications for the company portfolio.

What is a CDO?

Article / Updated 06-29-2019

What is a CDO? CDO stands for chief data officer, a title that describes someone in an organization who oversees the overall data science strategy from conception to execution. The chief data officer is responsible for determining how data will be collected, processed, analyzed, and used as part of the overall business strategy. To describe the scope of a CDO, you first need to determine how the position relates to that of a chief analytics officer (CAO). Although the CDO and the CAO are two distinct roles, the two positions are customarily held by the same person, or else only one role — the CDO role — is used; when the roles are combined into one, it is sometimes also referred to as a CDAO role. However, in situations where these two roles are separate and held by two different functions, the main difference can be summarized by the title itself: data versus analytics. If the CDO is about data enablement, the CAO role is about how you drive insights from that data — in other words, how you make the data actionable. The CAO is much more likely to have a data science background, and the CDO, a data engineering one. Let's clarify that both the CDO and CAO positions are essentially carve-outs from the traditional CIO job in the IT domain. In the case of the CDO role, the CIO may well have welcomed eliminating some of these responsibilities. However, when it comes to the portion of the CIO role that is about IT cost for new data assets, the CIO can be deeply challenged by the new realities of big data. Both the CDO and CAO would need to argue for initially storing huge amounts of data, even if its value isn't immediately evident. These aspects pose a difficult but important change in mindset for the CIO role, one that probably would not have been recognized the same way without the introduction of the CDO and CAO roles. When it comes down to the practical implementation of this role, it's all about securing an efficient end-to-end setup and execution of the overall data science strategy across the company. Which setup is the most effective for your company will depend on your line of business and how you're organized. Just remember to keep these two roles working closely together, including the teams that are attached to the roles. Separation between data engineering teams and data science teams is not advisable, especially since data science depends on a strong common foundation built on both. The teams may have a different focus, but they need to work closely together in an iterative way to achieve the speed, flexibility, and results expected by the business stakeholders. In cases where the CDO role is the only role in a company — where CDO and CAO responsibilities have been merged, in other words — the mandate of the CDO role is usually described in terms of a business mandate and a technology mandate. The business mandate refers mainly to driving areas such as: Establishing a company-wide data science strategy Ensuring the adoption of a dominant data culture within the company Building trust and legitimizing the usage of data 
Driving data usage for competitive advantage Enabling data-driven business opportunities Ensuring that principles for legal, security, and ethical compliance are upheld When it comes to the technology mandate, the following aspects are usually included: Establishing a data architecture Securing efficient data governance Building an infrastructure that enables explorative and experimental data science Promoting the continuous evolution of data science methods and techniques Designing principles for legal, security, and ethical aspects Securing efficient data and model life-cycle management Notice that there are three distinct services a merged CDO/CAO must manage. This list gives a sense of what each service entails: Data foundation services include areas such as managing data provenance and data stewardship, data architecture definition, data standards, and data governance, as well as risk management and various types of compliance. Data democratization services refer to areas such as establishing a data-driven organizational culture through the business validation of data initiatives, making non-sensitive data available to all employees (data democratization), as well as proper evaluation of available data. Data enrichment services include areas such as deriving and creating value from data through applying various analytical and machine-based methods and techniques, exploring and experimenting with data, as well as ensuring a smart and efficient data lake/data pipeline setup that supports value realization of the data science investment. Why a CDO is needed In addition to exploring revenue opportunities, developing acquisition strategies, and formulating customer data policies, the chief data officer is charged with explaining the strategic value of data and its important role as a business asset and revenue driver to executives, employees, and customers. Chief data officers are successful when they establish authority, secure budget and resources, and monetize their organization's information assets. The role of the CDO is relatively new and evolving quickly, but one convenient way of looking at this role is to regard this person as the main defender and chief steward of an organization's data assets. Organizations have a growing stake in aggregating data and using it to make better decisions. As such, the CDO is tasked with using data to automate business processes, better understand customers, develop better relationships with partners, and, ultimately, sell more products and services faster. A number of recent analyses of market trends claim that by 2020, 50 percent of leading organizations will have CDOs with similar levels of strategy influence and authority as CIOs. CDOs can establish a leadership role by aligning their priorities with those of their organizations. To a great extent, the role is about change management. CDOs first need to define the role and manage expectations by considering the resources made available to them. Despite the recent buzz around the concept of CDOs, in practice it has proven to be difficult for them to secure anything other than moderate budgets and limited resources when reporting into existing business units, like IT. Moreover, with usually only a handful of personnel, the CDO group must operate virtually by tagging onto, and inserting itself into, existing projects and initiatives throughout the organization. This, of course, isn't an optimal setup when it comes to proving the value of the CDO function. 
For the CDO function to truly pay off, you need to break up the silos and optimize the company structures around the data. It's all about splitting up the scope of responsibility for your IT department so that you can separate out the data assets from the technology assets and let the CDO take ownership of the data and information part, as well as the full data science cycle when there is no CAO role appointed. Role of the CDO The main task of a chief data officer is to transform the company culture to one that embraces an insight- and data-driven approach. The value of this should not be underestimated. Establishing a data-first mentality pushes managers at all levels to treat data as an asset. When managers start asking for data in new ways and view data science competence as a core skill set, they will drive a new focus and priority across all levels of the company. Changing a company's cultural mindset is no easy task — it takes more than just a few workshops and a series of earnest directives from above to get the job done. The idea that one must treat data as an asset needs to be firmly anchored in the upper management layer — hence the importance of the CDO role. Let this list of common CDO mandates across various industries serve as an inspiration for what can fit in your company. A CDO can Establish a data-driven culture with effective data governance. As part of that project, it is also vital to gain the trust of the various business units so that a company-wide sense of data ownership can be established. The idea here is to foster, not hinder, the efficient use of data. Drive data stewardship by implementing useful data management principles and standards according to an agreed-on data strategy. It is also important to industrialize efficient data-quality management, since ensuring data quality throughout the data life cycle requires substantial system support. Influence decision-making throughout the company, supported by quality data that allows for analytics and insights that can be trusted. Influence return on investment (ROI) through data enrichment and an improved understanding of customer needs. The idea here is to assist the business in delivering superior customer experiences by using data in all applicable ways. Encourage continuous data-driven innovation through experimentation and exploration of data, including making sure that the data infrastructure enables this to be done effectively and efficiently. As in most roles, you always face a set of challenges impacting the level of success that can be achieved. Just by being aware of these challenges, you'll be better able to avoid them or at least have strategies in place to deal with them if, or when, they arise. The following list summarizes some of the most common challenges related to the CDO role: Assigning business meaning to data: A CDO must make sure that data is prioritized, processed, and analyzed in the right business context in order to generate valuable and actionable insights. One aspect of this is the timing of the insight generation: if generating the insight takes too long from a business-usefulness perspective, the insight, rather than steering the business, would merely confirm what just happened. Establishing and improving data governance: The area of data governance is crucial for keeping data integrity during the life cycle of the data. It's not just about managing access rights to the data, but very much about managing data quality and trustworthiness. 
As soon as manual tasks are part of data processing activities, you run the risk of introducing errors or bias into the data sets, making analysis and insights derived from the data less reliable. Automation-driven data processing is therefore a vital part of improving data governance. Promoting a culture of data sharing: In practice, it is the common data science function that drives the data science activities across the company, because it promotes data sharing from day to day. However, it's also important to have a strong spokesperson in management who enforces an understanding and acceptance of data sharing. The main focus should be to establish that you won't be able to derive value from data unless the data is used and shared. Locking in data by limiting access and usage is the wrong way to go — an open data policy within the company should be the starting point. With that in place, you can then limit access to sensitive data, while ensuring that such limitations are well motivated and cannot instead be addressed through anonymization or other means. Building new revenue streams, enriching and leveraging data-as-a-service: A person in the CDO role also promotes and supports innovation related to data monetization. This is an inspiring task, but not always an easy one. Driving new business solutions that require data-driven business models and potentially completely new delivery models might inspire a lot of fear and resistance in the company and among management in general. Remember that new data monetization ideas might challenge existing business models and be seen as a threat rather than as new and promising business potential. "Be mindful and move slowly" is a good approach. Using examples from other companies or other lines of business can also prove effective in gaining trust and support from management for new data monetization ideas. Delivering Know Your Customer (KYC) in a real and tangible fashion: Utilizing data in such a way that it can enable data-driven sales is a proactive and efficient way to strengthen customer relationships and prove how knowledgeable the company is. However, there should be a balance here in how data is obtained and used: The last thing you want is for your customers to feel intruded on. You want the company to be perceived as proactive and forward-leaning with an innovative drive that is looking out for its customers, not as a company that invades its customers' private sphere and uses their data to gain an advantage in negotiations. The CDO must master that balance and find a way to walk that line strategically — a line that can be drawn quite differently, depending on your business objectives and line of business. Fixing legacy data infrastructure issues while investing in the future of data science: This challenge is tricky to handle. You can't just switch from old legacy infrastructures (often focused on data transactions and reporting) to the new, often cloud-based infrastructures focused on handling data enablement and monetization in completely new and different ways. There has to be a transition period, and during that period you have to deal with maintaining the legacy infrastructure, even when it's costly and feels like an unnecessary burden. At the same time, management is expecting fast and tangible results, based on the major investments needed. 
But be aware that the longer you have the two infrastructures working in parallel, the harder it is to truly get people to change their mindset and behavior toward the new data-driven approach realized through the new infrastructure investments. Neither will you see any real savings, because you need to cost-manage both the legacy environments and the new environments. Even if it proves difficult, try to drive this swap with an ambitious timeline, keeping in mind that there’s no going back.

Assessing and Improving Data Quality for Your Data Science Strategy

Article / Updated 06-17-2019

At the core of your data science strategy is data quality. If you are hoping to glean useful insights from your data, it needs to be of high quality. Keep reading to discover how you can assess and improve your data quality to ensure success for your data science strategy. Assessing data quality for data science Another fundamental part of data understanding involves gaining a detailed view of the quality of the data as soon as possible. Many businesses consider data quality and its impact too late — well past the time when it could have had a significant effect on the project's success. By integrating data quality with operational applications, organizations can reconcile disparate data, remove inaccuracies, standardize on common metrics, and create a strategic, trustworthy, and valuable data asset that enhances decision making. Also, if an initial analysis suggests that the data quality is insufficient, steps can be taken to make improvements. One way to refine the data is by removing unusable parts; another way is to correct poorly formatted parts. Start by asking yourself questions such as these: Is the data complete? Does it cover all required cases? Is it correct, or does it contain errors? If there are errors, how common are they? Does the data have missing values? If so, how are they represented, where do they occur, and how common are they? A more structured list of data quality checkpoints includes steps such as these: Check data coverage (whether all possible values are represented, for example). Verify that the meaning of attributes and the available values fit together. For example, if you are analyzing data on geographical locations for retail stores, is the value captured as latitude and longitude, or as the name of the regional area each store is placed in? Identify missing attributes and blank fields. Classify the meaning of missing or wrong data, and double-check attributes with different values but similar meanings. Check for inconsistencies in the spelling and formatting of values (situations where the same value sometimes starts with a lowercase letter and sometimes with an uppercase letter, for example). Without consistent naming conventions and numerical formats, data correlation and analysis will not be possible across data sets. Review deviations and decide whether any of them qualify as mere noise (outliers) or indicate an interesting phenomenon (pattern). Check for noise and inconsistencies between data sources. If you detect data quality problems as part of the quality check, you need to define and document the possible solutions. Focus on attributes that seem to go against common sense; visualization plots, histograms, and other ways of visualizing and exploring data are great ways to reveal possible data inconsistencies. It may also be necessary to exclude low-quality or useless data entirely in order to perform the needed analysis. Imagine a table formatted to show an overview of a data set. Tables like these are a good way to get a first overview of your data from a quality perspective, because they use descriptive statistics to quickly detect extreme values in terms of things like minimum values, maximum values, median, mean, and standard deviation. Such a table also allows you to analyze the key values to make sure that they are 100% unique and do not include any duplicated or missing values. 
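To make this kind of quality overview concrete, here is a minimal sketch of how you might produce a similar summary yourself with pandas. The file name and column names are hypothetical stand-ins for your own data source; the calls to describe, isna, and duplicated are standard pandas functionality.

```python
import pandas as pd

# Hypothetical customer extract; replace with your own data source.
df = pd.read_csv("customers.csv")

# Descriptive statistics per numeric column: count, mean, std, min, max, quartiles.
print(df.describe())

# Missing values per column, so you can see where blanks and gaps occur.
print(df.isna().sum())

# Duplicated key values -- a customer ID should be 100 percent unique.
duplicates = df[df.duplicated(subset=["customer_id"], keep=False)]
print(f"{len(duplicates)} rows share a customer_id with another row")
```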
If you are studying data related to your customers, for example, you want to make sure that a customer does not occur twice due to a spelling error — or is missing from the list altogether! Now picture a graphical visualization of the same data that focuses on just one column: country. By looking at the data from a country perspective, you can validate the data distribution in another way, and possibly detect inconsistencies or missing values that were difficult to spot in the overview. In this specific example, the tool used actually has a functionality called Pattern, which indicates when data values deviate from the norm. Ask any thriving organization the secret to its success and you'll get many answers: a solid data strategy, say, or calculated risks in combination with careful budgeting. These are all good business practices that come from the same place: a solid foundation of high-quality data. When you have accurate, up-to-date data driving your business, you're not just breaking even — you're breaking records. A data quality assessment process is essential to ensure reliable analytical outcomes. This process still depends on human supervision, because it's impossible to determine whether something is a defect based on the data alone. Improving data quality for data science So, what do you do, practically speaking, if you realize that your data quality is really bad? Use the four-step approach outlined below to get started on highlighting the gaps and defining a road map to implement the needed improvements in data quality. Scope: Define the data quality problem and describe the business issue related to the quality problem. Specify data details and business processes, products, and/or services impacted by the data quality issues. Explore: Conduct interviews with key stakeholders on data quality needs and expectations, as well as data quality problems. Review data quality processes and tool support (if any) across business functions and identify the resources needed for the quality improvement activity. Analyze: Assess current practices against industry best practices for data quality and align them with the findings from the exploration phase. Recommend: Develop a road map for improving the data quality process and define what the technical architecture for the data quality process should look like, incorporating your findings of what is not working today and what is essential from a business perspective. If you want to take extra measures to ensure the success of your data science strategy, make sure you are implementing a modern data architecture.
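Returning to the country-column check described above: if you don't have a profiling tool with a Pattern-style function at hand, a simple frequency count can serve as a rough stand-in for spotting values that deviate from the norm. The file name, column name, and threshold are again hypothetical.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical extract, as before

# Frequency of each country value, including missing entries.
counts = df["country"].value_counts(dropna=False)
print(counts)

# Rare values are often typos or formatting inconsistencies ("sweden" vs. "Sweden").
rare = counts[counts < 5]
print("Values that deviate from the norm:")
print(rare)
```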

How to Create a Modern Data Architecture For Your Data Science Strategy

Article / Updated 06-12-2019

In many larger companies, the IT function is usually tasked with defining and building the data architecture, especially for data generated by internal IT systems. It is often the case, however, that data coming from external sources — customers, products, or suppliers — is stored and managed separately by the responsible business units. When that's the case, you're faced with the challenge of making sure that all of them share a common data architecture approach, one that enables all these different data types and user needs to come together by means of an efficient and enabling data pipeline. This data pipeline is all about ensuring an end-to-end flow of data, where the applied data management and governance principles focus on a balance between user efficiency and compliance with relevant laws and regulations. In smaller companies or modern data-driven enterprises, the IT function is usually highly integrated with the various business functions, which includes working closely with data engineers in the business units in order to minimize the gap between IT and the business functions. This approach has proven very efficient. So, after you decide which function will set up and drive which part of the data architecture, it's time to get started. Using the step-by-step guide provided in this list, you'll be on your way to data-architecture perfection in no time: Identify your use cases as well as the necessary data for those use cases. The first step to take when starting to build your data architecture is to work with business users to identify the use cases and the types of data that are either the most relevant or simply the most prioritized at that time. Remember that the purpose of a good data architecture is to bring together the business and technology sides of the company to ensure that they're working toward a common purpose. To find the most valuable data for your company, you should look for the data that could generate insights with high business impact. This data may reside within enterprise data environments and might have been there for some time, but perhaps the means and technologies to unearth such data and draw insights from it have been too expensive or insufficient. The availability of today's open source technologies and cloud offerings enables enterprises to pull out such data and work with it in a much more cost-effective and simplified way. Set up data governance. It is of the utmost importance that you make data governance activities a priority. The process of identifying and ingesting data, as well as building models for your data, needs to ensure quality and relevance from a business perspective and should also include efficient control mechanisms as part of the system support. Responsibility for data must also be established, whether it concerns individual data owners or different data science functions. Build your data architecture for flexibility. The rule here is that you should build data systems designed to change, not ones designed to last. A key rule for any data architecture these days is to not build in dependency on a particular technology or solution. If a new key solution or technology becomes available on the market, the architecture should be able to accommodate it. The types of data coming into enterprises can change, as do the tools and platforms that are put into place to handle them. The key is therefore to design a data environment that can accommodate such change. Decide on techniques for capturing data. 
You need to consider your techniques for acquiring data, and you especially need to make sure that your data architecture can at some point handle real-time data streaming, even if it isn't an absolute requirement from the start. A modern data architecture needs to be built to support the movement and analysis of data to decision makers when and where it's needed. Focus on real-time data uploads from two perspectives: the need to facilitate real-time access to data (data that could be historical) as well as the requirement to support data from events as they're occurring. For the first category, existing infrastructure such as data warehouses has a critical role to play. For the second, new approaches such as streaming analytics and machine learning are critical. (A minimal streaming-consumer sketch appears after this list.) Data may be coming from anywhere — transactional applications, sensors across various connected devices, mobile devices, telecommunications equipment, and who-knows-where-else. A modern data architecture needs to support data movement at all speeds, whether at sub-second speeds or with 24-hour latency. Apply the appropriate data security measures to your data architecture. Do not forget to build security into your data architecture. A modern data architecture recognizes that threats to data security are continually emerging, both externally and internally. These threats are constantly evolving and may be coming through email one month and through flash drives the next. Data managers and data architects are usually the most knowledgeable when it comes to understanding what is required for data security in today's environments, so be sure to utilize their expertise. Integrate master data management. Make sure that you address master data management, the method used to define and manage the critical data of an organization to provide, with the help of data integration, a single point of reference. With an agreed-on and built-in master data management (MDM) strategy, your enterprise is able to have a single version of the truth that synchronizes data to the applications accessing that data. The need for an MDM-based architecture is critical because organizations are consistently going through changes, including growth, realignments, mergers, and acquisitions. Enterprises often end up with data systems running in parallel, where critical records and information may be duplicated and overlap across these silos. MDM ensures that applications and systems across the enterprise have the same view of important data. Offer data as a service (aaS). This particular step is a relatively new approach, but it has turned out to be quite a successful component — make sure that your data architecture is able to position data as a service (aaS). Many enterprises have a range of databases and legacy environments, making it challenging to pull information from various sources. With the aaS approach, access is enabled through a virtualized data services layer that standardizes all data sources, regardless of device, application, or system. Data as a service is by definition a form of internal company cloud service, where data — along with different data management platforms, tools, and applications — is made available to the enterprise as reusable, standardized services. The potential advantage of data as a service is that processes and assets can be prepackaged based on corporate or compliance standards and made readily available within the enterprise cloud. Enable self-service capabilities. 
As the final step in building your data architecture, you should definitely invest in self-service environments. With self-service, business users can configure their own queries and get the data or analyses they want, or they can conduct their own data discovery without having to wait for their IT or data management departments to deliver the data. The route to self-service is providing front-end interfaces that are simply laid out and easy to use for your target audience. In the process, a logical service layer can be developed that can be reused across various projects, departments, and business units. IT could still have an important role to play in a self-service-enabled architecture, including aspects such as data pipeline operations (hardware, software, and cloud) and data governance control mechanisms, but it would have to spend less and less of its time and resources on fulfilling user requests that could be better formulated and addressed by the user themselves.
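To make the data-capture step above a bit more concrete, here is a minimal sketch of an event-stream consumer. It assumes a Python environment with the kafka-python client installed and uses a hypothetical topic name and broker address; Apache Kafka appears here purely as an illustration, and any comparable streaming platform would serve the same conceptual purpose.

```python
from kafka import KafkaConsumer  # kafka-python; one of several possible client libraries
import json

# Hypothetical topic and broker address -- adjust to your own streaming platform.
consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # React to the event as it happens, for example by updating a feature store,
    # triggering a model-scoring call, or routing the record to storage.
    print(f"received event at offset {message.offset}: {event}")
```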

Ensuring Data Science Strategy Success: Essential Technologies for a Modern Data Architecture

Article / Updated 06-12-2019

Your data science strategy will have a higher likelihood of success if you take the time to implement a modern data architecture. The drive today is to refactor the enterprise technology platform to enable faster, easier, more flexible access to large volumes of precious data. This refactoring is no small undertaking and is usually sparked by a shifting set of key business drivers. Simply put, the data architectures that have dominated enterprise IT for nearly 30 years can no longer handle the workloads needed to drive data-driven businesses forward. Organizations have long been constrained in their use of data by incompatible formats, limitations of traditional databases, and the inability to flexibly combine data from multiple sources. New technologies are now starting to deliver on the promise to change all that. Improving the deployment model of software is one major step to removing barriers to data usage. Greater data agility also requires more flexible databases and more scalable real-time streaming platforms. In fact, no fewer than seven foundational technologies are needed to deliver a flexible, real-time modern data architecture to the enterprise. These seven key technologies are described below. Data science strategy: NoSQL databases The relational database management system (RDBMS) has dominated the database market for nearly 30 years, yet the traditional relational database has been shown to be less than adequate in handling the ever-growing data volumes and the accelerated pace at which data must be handled. NoSQL databases — "no SQL" because they're decidedly nonrelational — have been taking over because of their speed and ability to scale. They provide a mechanism for storage and retrieval of data that is modeled in ways other than the tabular relations used in relational databases. Because of their speed, NoSQL databases are increasingly used in big data and real-time web applications. NoSQL databases offer a simplicity of design, simpler horizontal scaling to clusters of machines (a real problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (key-value, wide column, graph, or document, for example) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as more flexible than relational database tables. Data science strategy: Real-time streaming platforms Responding to customers in real time is critical to the customer experience. It's no mystery why consumer-facing industries — Business-to-Consumer (B2C) setups, in other words — have experienced massive disruption in the past ten years. It has everything to do with the ability of companies to react to the user in real time. Telling a customer that you will have an offer ready in 24 hours is no good, because they will have already executed the decision they made 23 hours ago. Moving to a real-time model requires event streaming. Message-driven applications have been around for years, but today's streaming platforms scale far better and at far lower cost than their predecessors. The recent advancement in streaming technologies opens the door to many new ways to optimize a business. Reacting to a customer in real time is one benefit. Another aspect to consider is the benefits to development. 
By providing a real-time feedback loop to the development teams, event streams can also help companies improve product quality and get new software out the door faster. Data science strategy: Docker and containers Docker is a computer program that you can use as part of your data architecture to perform operating-system-level virtualization, also known as containerization. First released in 2013 by Docker, Inc., Docker is used to run software packages called containers, a method of virtualization that packages an application's code, configurations, and dependencies into building blocks for consistency, efficiency, productivity, and version control. Containers are isolated from each other and bundle their own application, tools, libraries, and configuration files, and they can communicate with each other by way of well-defined channels. All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. Containers are created from images that specify their precise content. A container image is a self-contained piece of software that includes everything that it needs in order to run, like code, tools, and resources. Containers hold significant benefits for both developers and operators as well as for the organization itself. The traditional approach to infrastructure isolation was static partitioning: the allocation of a separate, fixed slice of resources, like a physical server or a virtual machine, to each workload. Static partitions made it easier to troubleshoot issues, but at the significant cost of delivering substantially underutilized hardware. Web servers, for example, would consume on average only about 10 percent of the total computational power available. The great benefit of container technology is its ability to create a new type of isolation. Those who least understand containers might believe they can achieve the same benefits by using automation tools like Ansible, Puppet, or Chef, but in fact these technologies are missing vital capabilities. No matter how hard you try, those automation tools cannot create the isolation required to move workloads freely between different infrastructure and hardware setups. The same container can run on bare-metal hardware in an on-premises data center or in a virtual machine in the public cloud. No changes are necessary. That is what true workload mobility is all about. Data science strategy: Container repositories A container image repository is a collection of related container images, usually providing different versions of the same application or service. It's critical to maintaining agility in your data architecture. Without a DevOps process with continuous deliveries for building container images and a repository for storing them, each container would have to be built on every machine on which that container could run. With a repository, container images can be launched on any machine configured to read from that repository. Where this gets even more complicated is when dealing with multiple data centers. If a container image is built in one data center, how do you move the image to another data center? Ideally, by leveraging a converged data platform, you will have the ability to mirror the repository between data centers. A critical detail here is that mirroring capabilities between on-premises environments and the cloud might be vastly different than between your on-premises data centers. 
A converged data platform will solve this problem for you by offering those capabilities regardless of the physical or cloud infrastructure you use in your organization. Data science strategy: Container orchestration Instead of static hardware partitions, each container appears to be entirely its own private operating system. Unlike virtual machines, containers don't require a static partition of computation and memory. This enables administrators to launch large numbers of containers on servers without having to worry so much about exact amounts of memory in their data architecture. With container orchestration tools like Kubernetes, it becomes easy to launch containers, kill them, move them, and relaunch them elsewhere in an environment. Assuming that you have the new infrastructure components in place — a document database such as MapR-DB or MongoDB, an event streaming platform such as MapR-ES or Apache Kafka, and an orchestration tool such as Kubernetes — what is the next step? You'll certainly have to implement a DevOps process for coming up with continuous software builds that can then be deployed as Docker containers. The bigger question, however, is what you should actually deploy in those containers you've created. This brings us to microservices. Data science strategy: Microservices Microservices are a software development technique that structures an application as a collection of services that are easy to maintain and test, are loosely coupled, are organized around business capabilities, and can be deployed independently. As such, microservices come together to form a microservice architecture, one that enables the continuous delivery/deployment of large, complex applications and also enables an organization to evolve its technology stack — the set of software that provides the infrastructure for a computer or a server. The benefit of breaking down an application into different, smaller services is that it improves modularity, which makes the application easier to understand, develop, and test, and more resilient to architecture erosion — the violations of a system's architecture that lead to significant problems in the system and contribute to its increasing fragility. With a microservices architecture, small autonomous teams can work in parallel to develop, deploy, and scale their respective services independently. It also allows the architecture of an individual service to emerge through continuous refactoring — a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior (thus ensuring that it continues to fit within the architectural setting). The concept of microservices is nothing new. The difference today is that enabling technologies like NoSQL databases, event streaming, and container orchestration can scale with the creation of thousands of microservices. Without these new approaches to data storage, event streaming, and infrastructure orchestration, large-scale microservices deployments would not be possible. The infrastructure needed to manage the vast quantities of data, events, and container instances would not be able to scale to the required levels. Microservices are all about delivering agility. A service that is micro in nature generally consists of either a single function or a small group of related functions. 
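To give a feel for how small such a functional unit can be, here is a minimal sketch of a single-purpose service written with Flask. The pricing endpoint, plan names, and port are hypothetical, and the framework choice is an assumption rather than anything this article prescribes.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# A hypothetical in-memory lookup; a real service would call a database or cache.
PRICES = {"basic": 10.0, "premium": 25.0}

@app.route("/price/<plan>")
def get_price(plan: str):
    """Single business capability: return the price for a subscription plan."""
    if plan not in PRICES:
        return jsonify(error="unknown plan"), 404
    return jsonify(plan=plan, price=PRICES[plan])

if __name__ == "__main__":
    # Run standalone; in production this would be containerized and orchestrated.
    app.run(host="0.0.0.0", port=8080)
```

Packaged into a container image and registered in a repository, a service like this can be deployed, scaled, moved, and replaced independently of every other service — which is exactly the agility the microservices approach is after.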
Data science strategy: Function as a service

Just as the microservices idea has attracted a lot of interest when it comes to data architecture, so has the rise of serverless computing — perhaps more accurately referred to as function as a service (FaaS). AWS Lambda is an example of a FaaS framework: it lets you run code without provisioning or managing servers, and you pay only for the computing time you consume. FaaS enables the creation of microservices in such a way that the code can be wrapped in a lightweight framework, built into a container, executed on demand based on some trigger, and then automatically load-balanced, thanks to that lightweight framework. The main benefit of FaaS is that it allows the developer to focus almost exclusively on the function itself, making FaaS the logical conclusion of the microservices approach.

The triggering event is a critical component of FaaS. Without it, there's no way for the functions to be invoked (and resources consumed) on demand. This ability to automatically invoke functions when needed is what makes FaaS truly valuable. Imagine, for a moment, that someone reading a user's profile triggers an audit event, which in turn invokes a function that must run to notify a security team. More specifically, maybe the function filters out only the types of records that are marked as prompting a trigger. It can be selective, in other words, which plays up the fact that, as a business function, it is completely customizable.

The magic behind a triggering service is really nothing more than working with events in an event stream. Certain types of events are used as triggers more often than others, but any event you want can be made into a trigger. The event could be a document update, which might trigger running an OCR process over the new document and then adding the resulting text to a NoSQL database. The possibilities here are endless. FaaS is also an excellent area for creative uses of machine learning — perhaps machine learning as a service or, more specifically, "a machine learning function aaS." Consider that whenever an image is uploaded, it could be run through a machine learning framework for image identification and scoring. There's no fundamental limitation here. A trigger event is defined, something happens, the event triggers the function, and the function does its job.
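Here is a minimal sketch of what such a triggered function might look like, using the Python handler convention that AWS Lambda expects; the event shape, the record type "profile_read", and the notification stub are hypothetical illustrations rather than details from the text.

    # Hypothetical audit function, invoked whenever profile-read events arrive.
    def notify_security_team(record):
        # Placeholder: a real function might publish to a topic or call an alerting API.
        print("audit alert:", record)

    def lambda_handler(event, context):
        # Be selective: keep only the record types marked as prompting a trigger.
        flagged = [r for r in event.get("records", [])
                   if r.get("type") == "profile_read"]
        for record in flagged:
            notify_security_team(record)
        return {"alerts_sent": len(flagged)}

The developer writes only the function body; provisioning, scaling, and load balancing are handled by the FaaS framework.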
FaaS is already an important part of microservices adoption, but you must consider one major factor when approaching FaaS: vendor lock-in. The idea behind FaaS is that it's designed to hide the specific storage mechanisms, the specific hardware infrastructure, and the software component orchestration — all great features, if you're a software developer. But because of this abstraction, a hosted FaaS offering is one of the greatest vendor lock-in opportunities the software industry has ever seen. Because the APIs aren't standardized, migrating from one FaaS offering in the public cloud to another is difficult without throwing away a substantial part of the work that has already been performed. If FaaS is approached in a more methodical way — by leveraging events from a converged data platform, for example — it becomes easier to move between cloud providers.
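To illustrate that more portable, event-driven approach, here is a hedged sketch of a trigger loop that consumes events from a stream and invokes an ordinary Python function. It assumes a Kafka-compatible event stream and the kafka-python client; the topic name, broker address, and handler are hypothetical.

    # Hypothetical cloud-agnostic trigger: events on a stream invoke a plain function.
    import json
    from kafka import KafkaConsumer

    def handle_document_update(event):
        # Stand-in for the business function (OCR, scoring, auditing, and so on).
        print("processing document", event.get("doc_id"))

    consumer = KafkaConsumer(
        "document-updates",               # hypothetical trigger topic
        bootstrap_servers="broker:9092",  # assumed broker address
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Each message acts as the trigger; the function itself knows nothing
    # about which cloud or data center the stream runs in.
    for message in consumer:
        handle_document_update(message.value)

Because the trigger logic depends only on the event stream, the same function can run wherever the stream is mirrored, whether on-premises or in any public cloud.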
