Data Science Articles
Data science is what happens when you let the world's brightest minds loose on a big dataset. It gets crazy. Our articles will walk you through what data science is and what it does.
Articles From Data Science
Article / Updated 09-24-2024
Both linear and logistic regression see a lot of use in data science but are commonly used for different kinds of problems. You need to know and understand both types of regression to perform a full range of data science tasks. Of the two, logistic regression is harder to understand in many respects because it necessarily uses a more complex equation model. The following information gives you a basic overview of how linear and logistic regression differ.

The equation model

Any discussion of the difference between linear and logistic regression must start with the underlying equation model. The equation for linear regression is straightforward:

y = a + bx

You may see this equation in other forms, and you may see it called ordinary least squares regression, but the essential concept is always the same. Depending on the source you use, some of the equations used to express logistic regression can become downright terrifying unless you're a math major. However, the start of this discussion can use one of the simplest views of logistic regression:

p = f(a + bx)

Here, the output, p, is equal to the logistic function, f, applied to two model parameters, a and b, and one explanatory variable, x. When you look at this particular model, you see that it really isn't all that different from the linear regression model, except that you now feed the result of the linear regression through the logistic function to obtain the required curve. The output (dependent variable) is a probability ranging from 0 (not going to happen) to 1 (definitely will happen), or a categorization that says something is either part of the category or not part of the category. (You can also perform multiclass categorization, but focus on the binary response for now.) The best way to view the difference between linear regression output and logistic regression output is to say the following:

Linear regression is continuous. A continuous value can take any value within a specified interval (range) of values. For example, no matter how closely the height of two individuals matches, you can always find someone whose height fits between those two individuals. Examples of continuous values include:
Height
Weight
Waist size

Logistic regression is discrete. A discrete value has specific values that it can assume. For example, a hospital can admit only a specific number of patients in a given day. You can't admit half a patient (at least, not alive). Examples of discrete values include:
Number of people at the fair
Number of jellybeans in the jar
Colors of automobiles produced by a vendor

The logistic function

Of course, now you need to know about the logistic function. You can find a variety of forms of this function as well, but here's the easiest one to understand:

f(x) = e^x / (e^x + 1)

You already know about f, which is the logistic function, and x equals the algorithm you want to use, which is a + bx in this case. That leaves e, which is the base of the natural logarithm, an irrational number that rounds to 2.718 (you can look up a better approximation of the full value if you need one). Another way you see this function expressed is

f(x) = 1 / (1 + e^-x)

Both forms are correct, but the first form is easier to use. Consider a simple problem in which a, the y-intercept, is 0, and b, the slope, is 1. The example uses x values from –6 to 6.
Consequently, the first f(x) value would look like this when calculated (all values are rounded):

f(-6) = e^-6 / (1 + e^-6)
      = 0.00248 / (1 + 0.00248)
      = 0.002474

As you might expect, an x value of 0 would result in an f(x) value of 0.5, and an x value of 6 would result in an f(x) value of 0.9975. Obviously, a linear regression would show different results for precisely the same x values. If you calculate and plot all the results from both logistic and linear regression using the following code, you receive a plot like the one below.

import matplotlib.pyplot as plt
%matplotlib inline
from math import exp

x_values = range(-6, 7)
lin_values = [(0 + 1*x) / 13 for x in range(0, 13)]
log_values = [exp(0 + 1*x) / (1 + exp(0 + 1*x)) for x in x_values]

plt.plot(x_values, lin_values, 'b-^')
plt.plot(x_values, log_values, 'g-*')
plt.legend(['Linear', 'Logistic'])
plt.show()

This example relies on list comprehension to calculate the values because it makes the calculations clearer. The linear regression uses a different numeric range because you must normalize the values to appear in the 0 to 1 range for comparison. This is also why you divide the calculated values by 13. The exp(x) call used for the logistic regression raises e to the power of x, e^x, as needed for the logistic function.

The model discussed here is simplified, and some math majors out there are probably throwing a temper tantrum of the most profound proportions right now. The Python or R package you use will actually take care of the math in the background, so really, what you need to know is how the math works at a basic level so that you can understand how to use the packages. This section provides what you need to use the packages. However, if you insist on carrying out the calculations the old way, chalk to chalkboard, you'll likely need a lot more information.

The problems that logistic regression solves

You can separate logistic regression into several categories. The first is simple logistic regression, in which you have one dependent variable and one independent variable, much as you see in simple linear regression. However, because of how you calculate the logistic regression, you can expect only two kinds of output:

Classification: Decides between two available outcomes, such as male or female, yes or no, or high or low. The outcome is dependent on which side of the line a particular data point falls.

Probability: Determines the probability that something is true or false. The values true and false can have specific meanings. For example, you might want to know the probability that a particular apple will be yellow or red based on the presence of yellow and red apples in a bin.

Fit the curve

As part of understanding the difference between linear and logistic regression, consider this grade prediction problem, which lends itself well to linear regression. In the following code, you see the effect of trying to use logistic regression with that data:

x1 = range(0, 9)
y1 = (0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98)

plt.scatter(x1, y1, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

The example has undergone a few changes to make it easier to see precisely what is happening. It relies on the same data that was converted from questions answered correctly on the exam to a percentage.
If you have 100 questions and you answer 25 of them correctly, you have answered 25 percent (0.25) of them correctly. The values are normalized to fall between 0 and 1. As you can see from the image above, the linear regression follows the data points closely. The logistic regression doesn't. However, logistic regression often is the correct choice when the data points naturally follow the logistic curve, which happens far more often than you might think. You must use the technique that fits your data best, which means using linear regression in this case.

A pass/fail example

An essential point to remember is that logistic regression works best for probability and classification. Consider that points on an exam ultimately predict passing or failing the course. If you get a certain percentage of the answers correct, you pass, but you fail otherwise. The following code considers the same data used for the example above, but converts it to a pass/fail list. When a student gets at least 70 percent of the questions correct, success is assured.

y2 = [0 if x < 0.70 else 1 for x in y1]

plt.scatter(x1, y2, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

This is an example of how you can use list comprehensions in Python to obtain a required dataset or data transformation. The list comprehension for y2 starts with the continuous data in y1 and turns it into discrete data. Note that the example uses precisely the same equations as before. All that has changed is the manner in which you view the data, as you can see below. Because of the change in the data, linear regression is no longer the option to choose. Instead, you use logistic regression to fit the data. Take into account that this example really hasn't done any sort of analysis to optimize the results. The logistic regression fits the data even better if you do so.
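As noted above, the Python or R package you choose takes care of the optimization step in the background. As a rough sketch of that step (assuming scikit-learn is installed; the fitted coefficients are whatever the library estimates, not values from this article), you could fit the same pass/fail data like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Same exam data as above: score index and pass/fail outcome.
x1 = np.arange(0, 9).reshape(-1, 1)     # scikit-learn expects a 2-D feature array
y1 = (0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98)
y2 = [0 if score < 0.70 else 1 for score in y1]

model = LogisticRegression()
model.fit(x1, y2)

print("intercept (a):", model.intercept_[0])
print("slope (b):", model.coef_[0][0])
print("predicted pass/fail:", model.predict(x1))
print("pass probabilities:", model.predict_proba(x1)[:, 1])

The library searches for the a and b values that best separate the passing and failing students, which is the optimization step the hand-built example skips.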
Cheat Sheet / Updated 04-12-2024
A wide range of tools is available, designed to help businesses big and small take advantage of the data science revolution. Among the most essential of these tools are Microsoft Power BI, Tableau, SQL, and the R and Python programming languages.
Article / Updated 12-01-2023
Getting the most out of your unstructured data is an essential task for any organization these days, especially when you consider the disparate storage systems, applications, and user locations involved. So it's not an accident that data orchestration is the term that brings everything together.

Bringing all your data together shares similarities with conducting an orchestra. Instead of combining the violin, oboe, and cello, this brand of orchestration combines distributed data types from different places, platforms, and locations, working as a cohesive entity presented to applications or users anywhere.

That matters because, historically, accessing high-performance data outside of your computer network was inefficient. Because the storage infrastructure existed in a silo, systems like HPC Parallel (which lets users store and access shared data across multiple networked storage nodes), Enterprise NAS (which allows large-scale storage and access to other networks), and Global Namespace (which virtually simplifies network file systems) were limited when it came to sharing. Because each operated independently, the data within each system was siloed, which made collaborating on data sets across multiple locations a problem. Collaboration was possible, but too often you lost the ability to have high performance. This trade-off limited what was possible: an IT architecture that supported both high performance and collaboration with data sets from different storage silos typically became an either/or decision. You were forced to choose one but never both.

What is data orchestration?

Data orchestration is the automated process of taking siloed data from multiple data storage systems and locations and combining and organizing it into a single namespace. A high-performance file system can then place data in the edge service, data center, or cloud service most optimal for the workload.

The recent rise of data analytics applications and artificial intelligence (AI) capabilities has accelerated the use of data across different locations and even different organizations. In the next data cycle, organizations will need both high performance and agility with their data to compete and thrive in a competitive environment. That means data no longer has a 1:1 relationship with the applications and compute environment that generated it. It needs to be used, analyzed, and repurposed with different AI models and alternate workloads, and across a remote, collaborative environment.

Hammerspace's technology makes data available to different foundational models, remote applications, decentralized compute clusters, and remote workers to automate and streamline data-driven development programs, data insights, and business decision making. This capability enables a unified, fast, and efficient global data environment for the entire workflow — from data creation to processing, collaboration, and archiving across edge devices, data centers, and public and private clouds.

Control of enterprise data services for governance, security, data protection, and compliance can now be implemented globally at a file-granular level across all storage types and locations. Applications and AI models can access data stored in remote locations while using automated orchestration tools to provide high-performance local access when needed for processing. Organizations can grow their talent pools with access to team members no matter where they reside.
Decentralizing the data center

Data collection has become more prominent, and the traditional system of centralized data management has limitations. Issues of centralized data storage can limit the amount of data available to applications. Then there are the high infrastructure costs when multiple applications are needed to manage and move data, multiple copies of data are retained in different storage systems, and more headcount is needed to manage the complex, disconnected infrastructure environment. Such setbacks suggest that the data center is no longer the center of data and that storage system constraints should no longer define data architectures.

Hammerspace specializes in decentralized environments, where data may need to span two or more sites and possibly one or more cloud providers and regions, and/or where a remote workforce needs to collaborate in real time. It enables a global data environment by providing a unified, parallel global file system.

Enabling a global data environment

Hammerspace completely revolutionizes previously held notions of how unstructured data architectures should be designed, delivering the performance needed across distributed environments to:

Free workloads from data silos.
Eliminate copy proliferation.
Provide direct data access through local metadata to applications and users, no matter where the data is stored.

This technology allows organizations to take full advantage of the performance capabilities of any server, storage system, and network anywhere in the world, supporting the entire workflow from data creation to processing, collaboration, and archiving across edge devices, data centers, and public and private clouds.

The days of enterprises struggling with a siloed, distributed, and inefficient data environment are over. It's time to start expecting more from data architectures with automated data orchestration. Find out how by downloading Unstructured Data Orchestration For Dummies, Hammerspace Special Edition.
Article / Updated 07-27-2023
In growth, you use testing methods to optimize your web design and messaging so that it performs at its absolute best with the audiences to which it's targeted. Although testing and web analytics methods are both intended to optimize performance, testing goes one layer deeper than web analytics. You use web analytics to get a general idea about the interests of your channel audiences and how well your marketing efforts are paying off over time. After you have this information, you can then go in deeper to test variations on live visitors in order to gain empirical evidence about what designs and messaging your visitors actually prefer. Testing tactics can help you optimize your website design or brand messaging for increased conversions in all layers of the funnel. Testing is also useful when optimizing your landing pages for user activations and revenue conversions. Checking out common types of testing in growth When you use data insights to increase growth for e-commerce businesses, you're likely to run into the three following testing tactics: A/B split testing, multivariate testing, and mouse-click heat map analytics. An A/B split test is an optimization tactic you can use to split variations of your website or brand messaging between sets of live audiences in order to gauge responses and decide which of the two variations performs best. A/B split testing is the simplest testing method you can use for website or messaging optimization. Multivariate testing is, in many ways, similar to the multivariate regression analysis that I discuss in Chapter 5. Like multivariate regression analysis, multivariate testing allows you to uncover relationships, correlations, and causations between variables and outcomes. In the case of multivariate testing, you're testing several conversion factors simultaneously over an extended period in order to uncover which factors are responsible for increased conversions. Multivariate testing is more complicated than A/B split testing, but it usually provides quicker and more powerful results. Lastly, you can use mouse-click heat map analytics to see how visitors are responding to your design and messaging choices. In this type of testing, you use the mouse-click heat map to help you make optimal website design and messaging choices to ensure that you're doing everything you can to keep your visitors focused and converting. Landing pages are meant to offer visitors little to no options, except to convert or to exit the page. Because a visitor has so few options on what he can do on a landing page, you don't really need to use multivariate testing or website mouse-click heat maps. Simple A/B split tests suffice. Data scientists working in growth hacking should be familiar with (and know how to derive insight from) the following testing applications: Webtrends: Offers a conversion-optimization feature that includes functionality for A/B split testing and multivariate testing. Optimizely: A popular product among the growth-hacking community. You can use Optimizely for multipage funnel testing, A/B split testing, and multivariate testing, among other things. Visual Website Optimizer: An excellent tool for A/B split testing and multivariate testing. Testing for acquisitions Acquisitions testing provides feedback on how well your content performs with prospective users in your assorted channels. You can use acquisitions testing to help compare your message's performance in each channel, helping you optimize your messaging on a per-channel basis. 
If you want to optimize the performance of your brand's published images, you can use acquisition testing to compare image performance across your channels as well. Lastly, if you want to increase your acquisitions through increases in user referrals, use testing to help optimize your referrals messaging for the referrals channels. Acquisition testing can help you begin to understand the specific preferences of prospective users on a channel-by-channel basis. You can use A/B split testing to improve your acquisitions in the following ways: Social messaging optimization: After you use social analytics to deduce the general interests and preferences of users in each of your social channels, you can then further optimize your brand messaging along those channels by using A/B split testing to compare your headlines and social media messaging within each channel. Brand image and messaging optimization: Compare and optimize the respective performances of images along each of your social channels. Optimized referral messaging: Test the effectiveness of your email messaging at converting new user referrals. Testing for activations Activation testing provides feedback on how well your website and its content perform in converting acquired users to active users. The results of activation testing can help you optimize your website and landing pages for maximum sign-ups and subscriptions. Here's how you'd use testing methods to optimize user activation growth: Website conversion optimization: Make sure your website is optimized for user activation conversions. You can use A/B split testing, multivariate testing, or a mouse-click heat map data visualization to help you optimize your website design. Landing pages: If your landing page has a simple call to action that prompts guests to subscribe to your email list, you can use A/B split testing for simple design optimization of this page and the call-to-action messaging. Testing for retentions Retentions testing provides feedback on how well your blog post and email headlines are performing among your base of activated users. If you want to optimize your headlines so that active users want to continue active engagements with your brand, test the performance of your user-retention tactics. Here's how you can use testing methods to optimize user retention growth: Headline optimization: Use A/B split testing to optimize the headlines of your blog posts and email marketing messages. Test different headline varieties within your different channels, and then use the varieties that perform the best. Email open rates and RSS view rates are ideal metrics to track the performance of each headline variation. Conversion rate optimization: Use A/B split testing on the messaging within your emails to decide which messaging variety more effectively gets your activated users to engage with your brand. The more effective your email messaging is at getting activated users to take a desired action, the greater your user retention rates. Testing for revenue growth Revenue testing gauges the performance of revenue-generating landing pages, e-commerce pages, and brand messaging. Revenue testing methods can help you optimize your landing and e-commerce pages for sales conversions. Here's how you can use testing methods to optimize revenue growth: Website conversion optimization: You can use A/B split testing, multivariate testing, or a mouse-click heat map data visualization to help optimize your sales page and shopping cart design for revenue-generating conversions. 
Landing page optimization: If you have a landing page with a simple call to action that prompts guests to make a purchase, you can use A/B split testing for design optimization.
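Every A/B split test described above eventually comes down to comparing the conversion rates of two variations. The testing tools mentioned earlier do this analysis for you, but as a minimal sketch of what they compute, here is one common approach, a two-proportion z-test; the visitor and conversion counts are invented for illustration:

from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for an A/B conversion comparison."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided, normal approximation
    return z, p_value

# Hypothetical example: variation A converts 120 of 2,400 visitors,
# variation B converts 156 of 2,380 visitors.
z, p = two_proportion_z_test(120, 2400, 156, 2380)
print(f"z = {z:.2f}, p = {p:.4f}")

If the p-value falls below whatever threshold you've agreed on (0.05 is a common choice), you have reasonable evidence that the difference between the variations isn't just noise.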
Cheat Sheet / Updated 07-24-2023
Blockchain technology is much more than just another way to store data. It's a radical new method of storing validated data and transaction information in an indelible, trusted repository. Blockchain has the potential to disrupt business as we know it, and in the process, provide a rich new source of behavioral data. Data analysts have long found valuable insights from historical data, and blockchain can expose new and reliable data to drive business strategy. To best leverage the value that blockchain data offers, become familiar with blockchain technology and how it stores data, and learn how to extract and analyze this data.
Article / Updated 07-24-2023
In 2008, Bitcoin was the only blockchain implementation. At that time, Bitcoin and blockchain were synonymous. Now hundreds of different blockchain implementations exist. Each new blockchain implementation emerges to address a particular need, and each one is unique. However, blockchains tend to share many features with other blockchains. Before examining blockchain applications and data, it helps to look at their similarities. Check out this article to learn how blockchains work.

Categorizing blockchain implementations

One of the most common ways to evaluate blockchains is to consider the underlying data visibility, that is, who can see and access the blockchain data. And just as important, who can participate in the decision (consensus) to add new blocks to the blockchain? The three primary blockchain models are public, private, and hybrid.

Opening blockchain to everyone

Nakamoto's original blockchain proposal described a public blockchain. After all, blockchain technology is all about providing trusted transactions among untrusted participants. Sharing a ledger of transactions among nodes in a public network provides a classic untrusted network. If anyone can join the network, you have no criteria on which to base your trust. It's almost like throwing a $20 bill out your window and trusting that only the person you intend will pick it up.

Public blockchain implementations, including Bitcoin and Ethereum, depend on a consensus algorithm that makes it hard to mine blocks but easy to validate them. Proof of Work (PoW) is the most common consensus algorithm in use today for public blockchains, but that may change; a toy proof-of-work sketch appears a little later in this article. Ethereum is in the process of transitioning to the Proof of Stake (PoS) consensus algorithm, which requires less computation and depends on how much blockchain currency a node holds. The idea is that a node with more blockchain currency would be affected negatively if it participates in unethical behavior. The higher the stake you have in something, the greater the chance that you'll care about its integrity.

Because public blockchains are open to anyone (anyone can become a node on the network), no permission is needed to join. For this reason, a public blockchain is also called a permissionless blockchain. Public (permissionless) blockchains are most often used for new apps that interact with the public in general. A public blockchain is like a retail store, in that anyone can walk into the store and shop.

Limiting blockchain access

The opposite of a public blockchain is a private blockchain, such as Hyperledger Fabric. In a private blockchain, also called a permissioned blockchain, the entity that owns and controls the blockchain grants and revokes access to the blockchain data. Because most enterprises manage sensitive or private data, private blockchains are commonly used because they can limit access to that data. The blockchain data is still transparent and readily available but is subject to the owning entity's access requirements.

Some have argued that private blockchains violate data transparency, the original intent of blockchain technology. Although private blockchains can limit data access (and go against the philosophy of the original blockchain in Bitcoin), limited transparency also allows enterprises to consider blockchain technology for new apps in a private environment. Without the private blockchain option, the technology likely would never be considered for most enterprise applications.
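Before moving on to hybrid blockchains, here is the toy proof-of-work sketch promised above. It is not how Bitcoin or Ethereum actually mine blocks (real networks use far higher difficulty targets and richer block structures); it only shows why finding a valid nonce takes work while checking one takes a single hash:

import hashlib

def mine(block_data, difficulty=4):
    """Search for a nonce whose hash starts with `difficulty` zeros (slow)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def validate(block_data, nonce, difficulty=4):
    """Checking a proposed nonce takes just one hash (fast)."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce, digest = mine("block 42: Alice pays Bob 5 tokens")
print(nonce, digest)
print(validate("block 42: Alice pays Bob 5 tokens", nonce))

Each extra leading zero multiplies the expected search time by 16, which is roughly how public networks tune how hard mining is while keeping validation cheap.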
Combining the best of both worlds

A classic blockchain use case is a supply chain app, which manages a product from its production all the way through its consumption. The beginning of the supply chain is when a product is manufactured, harvested, caught, or otherwise provisioned to send to an eventual customer. The supply chain app then tracks and manages each transfer of ownership as the product makes its way to the physical location where the consumer purchases it. Supply chain apps manage product movement, process payment at each stage in the movement lifecycle, and create an audit trail that can be used to investigate the actions of each owner along the supply chain. Blockchain technology is well suited to support the transfer of ownership and maintain an indelible record of each step in the process.

Many supply chains are complex and consist of multiple organizations. In such cases, data suffers as it is exported from one participant, transmitted to the next participant, and then imported into their data system. A single blockchain would simplify the export/transport/import cycle and auditing. An additional benefit of blockchain technology in supply chain apps is the ease with which a product's provenance (a trace of owners back to its origin) is readily available.

Many of today's supply chains are made up of several enterprises that enter into agreements to work together for mutual benefit. Although the participants in a supply chain are business partners, they do not fully trust one another. A blockchain can provide the level of transactional and data trust that the enterprises need. The best solution is a semi-private blockchain – that is, the blockchain is public for supply chain participants but not to anyone else. This type of blockchain (one that is owned by a group of entities) is called a hybrid, or consortium, blockchain. The participants jointly own the blockchain and agree on policies to govern access.

Describing basic blockchain type features

Each type of blockchain has specific strengths and weaknesses. Which one to use depends on the goals and target environment. You have to know why you need blockchain and what you expect to get from it before you can make an informed decision as to what type of blockchain would be best. The best solution for one organization may not be the best solution for another. The table below shows how blockchain types compare and why you might choose one over the other.

Differences in Types of Blockchain

Feature     | Public                  | Private                                        | Hybrid
Permission  | Permissionless          | Permissioned (limited to organization members) | Permissioned (limited to consortium members)
Consensus   | PoW, PoS, and so on     | Authorized participants                        | Varies; can use any method
Performance | Slow (due to consensus) | Fast (relatively)                              | Generally fast
Identity    | Virtually anonymous     | Validated identity                             | Validated identity

The primary differences between each type of blockchain are the consensus algorithm used and whether participants are known or anonymous. These two concepts are related. An unknown (and therefore completely untrusted) participant will require an environment with a more rigorous consensus algorithm. On the other hand, if you know the transaction participants, you can use a less rigorous consensus algorithm.

Contrasting popular enterprise blockchain implementations

Dozens of blockchain implementations are available today, and soon there will be hundreds. Each new blockchain implementation targets a specific market and offers unique features.
There isn't room in this article to cover even a fair number of blockchain implementations, but you should be aware of some of the most popular. Remember that you'll be learning about blockchain analytics in this book. Although organizations of all sizes are starting to leverage the power of analytics, enterprises were early adopters and have the most mature approach to extracting value from data. The What Matrix website provides a comprehensive comparison of top enterprise blockchains. Visit whatmatrix.com for up-to-date blockchain information.

Following are the top enterprise blockchain implementations and some of their strengths and weaknesses (ranking is based on the What Matrix website):

Hyperledger Fabric: The flagship blockchain implementation from the Linux Foundation. Hyperledger is an open-source project backed by a diverse consortium of large corporations. Hyperledger's modular architecture and rich support make it the highest-rated enterprise blockchain.

VeChain: Currently more popular than Hyperledger, having the highest number of enterprise use cases among products reviewed by What Matrix. VeChain includes support for two native cryptocurrencies and states that its focus is on efficient enterprise collaboration.

Ripple Transaction Protocol: A blockchain that focuses on financial markets. Instead of appealing to general use cases, Ripple caters to organizations that want to implement financial transaction blockchain apps. Ripple was the first commercially available blockchain focused on financial solutions.

Ethereum: The most popular general-purpose, public blockchain implementation. Although Ethereum is not technically an enterprise solution, it's in use in multiple proof-of-concept projects.

The preceding list is just a brief overview of a small sample of blockchain implementations. If you're just beginning to learn about blockchain technology in general, start out with Ethereum, which is one of the easier blockchain implementations to learn. After that, you can progress to another blockchain that may be better aligned with your organization. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
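If you do start with Ethereum, your first look at live blockchain data can be as small as the following sketch. It assumes the web3 Python package (v6-style API) and an Ethereum RPC endpoint of your own; the URL shown is only a placeholder:

from web3 import Web3

# Placeholder endpoint -- substitute the URL of your own node or provider.
w3 = Web3(Web3.HTTPProvider("https://YOUR-ETHEREUM-RPC-ENDPOINT"))

latest = w3.eth.get_block("latest")   # fetch the most recent block
print("block number:", latest.number)
print("transactions in block:", len(latest.transactions))

Reading a block this way is the starting point for most of the analytics work described elsewhere in these articles: once you can pull blocks and transactions, you can load them into whatever analysis tools you already use.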
Article / Updated 06-09-2023
If statistics has been described as the science of deriving insights from data, then what’s the difference between a statistician and a data scientist? Good question! While many tasks in data science require a fair bit of statistical know how, the scope and breadth of a data scientist’s knowledge and skill base is distinct from those of a statistician. The core distinctions are outlined below. Subject matter expertise: One of the core features of data scientists is that they offer a sophisticated degree of expertise in the area to which they apply their analytical methods. Data scientists need this so that they’re able to truly understand the implications and applications of the data insights they generate. A data scientist should have enough subject matter expertise to be able to identify the significance of their findings and independently decide how to proceed in the analysis. In contrast, statisticians usually have an incredibly deep knowledge of statistics, but very little expertise in the subject matters to which they apply statistical methods. Most of the time, statisticians are required to consult with external subject matter experts to truly get a firm grasp on the significance of their findings, and to be able to decide the best way to move forward in an analysis. Mathematical and machine learning approaches: Statisticians rely mostly on statistical methods and processes when deriving insights from data. In contrast, data scientists are required to pull from a wide variety of techniques to derive data insights. These include statistical methods, but also include approaches that are not based in statistics — like those found in mathematics, clustering, classification, and non-statistical machine learning approaches. Seeing the importance of statistical know-how You don't need to go out and get a degree in statistics to practice data science, but you should at least get familiar with some of the more fundamental methods that are used in statistical data analysis. These include: Linear regression: Linear regression is useful for modeling the relationships between a dependent variable and one or several independent variables. The purpose of linear regression is to discover (and quantify the strength of) important correlations between dependent and independent variables. Time-series analysis: Time series analysis involves analyzing a collection of data on attribute values over time, in order to predict future instances of the measure based on the past observational data. Monte Carlo simulations: The Monte Carlo method is a simulation technique you can use to test hypotheses, to generate parameter estimates, to predict scenario outcomes, and to validate models. The method is powerful because it can be used to very quickly simulate anywhere from 1 to 10,000 (or more) simulation samples for any processes you are trying to evaluate. Statistics for spatial data: One fundamental and important property of spatial data is that it’s not random. It’s spatially dependent and autocorrelated. When modeling spatial data, avoid statistical methods that assume your data is random. Kriging and krige are two statistical methods that you can use to model spatial data. These methods enable you to produce predictive surfaces for entire study areas based on sets of known points in geographic space. Working with clustering, classification, and machine learning methods Machine learning is the application of computational algorithms to learn from (or deduce patterns in) raw datasets. 
Clustering is a particular type of machine learning — unsupervised machine learning, to be precise, meaning that the algorithms must learn from unlabeled data, and as such, they must use inferential methods to discover correlations. Classification, on the other hand, is called supervised machine learning, meaning that the algorithms learn from labeled data. The following descriptions introduce some of the more basic clustering and classification approaches:

k-means clustering: You generally deploy k-means algorithms to subdivide data points of a dataset into clusters based on nearest mean values. To determine the optimal division of your data points into clusters, such that the distance between points in each cluster is minimized, you can use k-means clustering.

Nearest neighbor algorithms: The purpose of a nearest neighbor analysis is to search for and locate either a nearest point in space or a nearest numerical value, depending on the attribute you use for the basis of comparison.

Kernel density estimation: An alternative way to identify clusters in your data is to use a density smoothing function. Kernel density estimation (KDE) works by placing a kernel — a weighting function that is useful for quantifying density — on each data point in the data set, and then summing the kernels to generate a kernel density estimate for the overall region.

Keeping mathematical methods in the mix

Lots gets said about the value of statistics in the practice of data science, but applied mathematical methods are seldom mentioned. To be frank, mathematics is the basis of all quantitative analyses. Its importance should not be understated. The two following mathematical methods are particularly useful in data science.

Multi-criteria decision making (MCDM): MCDM is a mathematical decision modeling approach that you can use when you have several criteria or alternatives that you must simultaneously evaluate when making a decision.

Markov chains: A Markov chain is a mathematical method that chains together a series of randomly generated variables that represent the present state in order to model how changes in present state variables affect future states.
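As a small illustration of the last item in the list above, here is a minimal sketch of a two-state Markov chain; the states and transition probabilities are invented for the example:

import random

# Hypothetical weather model: each row gives the probability of moving
# from the current state to each possible next state.
states = ["sunny", "rainy"]
transitions = {
    "sunny": [0.8, 0.2],   # sunny -> sunny, sunny -> rainy
    "rainy": [0.4, 0.6],   # rainy -> sunny, rainy -> rainy
}

def simulate(start, steps):
    """Walk the chain: each next state depends only on the current one."""
    state = start
    path = [state]
    for _ in range(steps):
        state = random.choices(states, weights=transitions[state])[0]
        path.append(state)
    return path

print(simulate("sunny", 10))

The defining property is visible in the code: the next state is drawn using only the current state's row of probabilities, with no memory of how the chain got there.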
Article / Updated 06-09-2023
Blockchain technology alone cannot provide rich analytics results. For all that blockchain is, it can't magically provide more data than other technologies. Before selecting blockchain technology for any new development or analytics project, clearly justify why such a decision makes sense. If you already depend on blockchain technology to store data, the decision to use that data for analysis is a lot easier to justify. Here, you examine some reasons why blockchain-supported analytics may allow you to leverage your data in interesting ways.

Leveraging newly accessible decentralized tools to analyze blockchain data

You'll want to learn how to manually access and analyze blockchain data, because it's important to understand how to exercise granular control over your data throughout the analytics process. Even so, higher-level tools make the task easier. The growing number of decentralized data analytics solutions means more opportunities to build analytics models with less effort. Third-party tools may reduce the amount of control you have over the models you deploy, but they can dramatically increase analytics productivity.

The following list of blockchain analytics solutions is not exhaustive and is likely to change rapidly. Take a few minutes to conduct your own internet search for blockchain analytics tools. You'll likely find even more software and services:

Endor: A blockchain-based AI prediction platform that has the goal of making the technology accessible to organizations of all sizes. Endor is both a blockchain analytics protocol and a prediction engine that integrates on-chain and off-chain data for analysis.

Crystal: A blockchain analytics platform that integrates with the Bitcoin and Ethereum blockchains and focuses on cryptocurrency transaction analytics. Different Crystal products cater to small organizations, enterprises, and law enforcement agencies.

OXT: The most focused of the three products listed, OXT is an analytics and visualization explorer tool for the Bitcoin blockchain. Although OXT doesn't provide analytics support for a variety of blockchains, it attempts to provide a wide range of analytics options for Bitcoin.

Monetizing blockchain data

Today's economy is driven by data, and the amount of data being collected about individuals and their behavior is staggering. Think of the last time you accessed your favorite shopping site. Chances are, you saw an ad that you found relevant. Those targeted ads seem to be getting better and better at figuring out what would interest you. The capability to align ads with user preferences depends on an analytics engine acquiring enough data about the user to reliably predict products or services of interest.

Blockchain data can represent the next logical phase of data's value to the enterprise. As more and more consumers realize the value of their personal data, interest is growing in the capability to control that data. Consumers now want to control how their data is being used and demand incentives or compensation for the use of their data. Blockchain technology can provide a central point of presence for personal data and the ability for the data's owner to authorize access to that data.

Removing personal data from common central data stores, such as Google and Facebook, has the potential to revolutionize marketing and advertising. Smaller organizations could access valuable marketing information by asking permission from the data owner as opposed to the large data aggregators.
Circumventing big players such as Google and Facebook could reduce marketing costs and allow incentives to flow directly to individuals. There is a long way to go to move away from current personal data usage practices, but blockchain technology makes it possible. This process may be accelerated by emerging regulations that protect individual rights to control private data. For example, the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both strengthen an individual's ability to control access to, and use of, their personal data.

Exchanging and integrating blockchain data effectively

Much of the value of blockchain data is in its capability to relate to off-chain data. Most blockchain apps refer to some data stored in off-chain repositories. It doesn't make sense to store every type of data in a blockchain. Reference data, which is commonly data that gets updated to reflect changing conditions, may not be a good candidate for storing in a blockchain. Blockchain technology excels at recording value transfers between owners. All applications define and maintain additional information that supports and provides details for transactions but doesn't directly participate in transactions. Such information, such as product descriptions or customer notes, may make more sense to store in an off-chain repository.

Any time blockchain apps rely on on-chain and off-chain data, integration methods become a concern. Even if your app uses only on-chain data, it is likely that analytics models will integrate with off-chain data. For example, owners in blockchain environments are identified by addresses. These addresses have no context external to the blockchain. Any association between an address and a real-world identity is likely stored in an off-chain repository.

Another example of the need for off-chain data is when analyzing aircraft safety trends. Perhaps your analysis correlates blockchain-based incident and accident data with weather conditions. Although each blockchain transaction contains a timestamp, you'd have to consult an external weather database to determine prevailing weather conditions at the time of the transaction.

Many examples of the need to integrate off-chain data with on-chain transactions exist. Part of the data acquisition phase of any analytics project is to identify data sources and access methods. In a blockchain analytics project, that process means identifying the off-chain data you need to satisfy the goals of your project and how to get that data. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
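To make the weather example above concrete, here is a minimal pandas sketch of the kind of timestamp join that ties on-chain transactions to off-chain observations; both tables are invented for illustration:

import pandas as pd

# Hypothetical on-chain incident transactions (timestamps taken from block data).
incidents = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-01 09:15", "2023-03-01 14:40"]),
    "aircraft": ["N123AB", "N456CD"],
}).sort_values("timestamp")

# Hypothetical off-chain weather observations.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-01 09:00", "2023-03-01 14:00"]),
    "conditions": ["fog", "clear"],
}).sort_values("timestamp")

# Attach the most recent weather reading at or before each transaction.
joined = pd.merge_asof(incidents, weather, on="timestamp", direction="backward")
print(joined)

The same pattern, matching an on-chain timestamp or address to whatever off-chain table gives it context, shows up in most blockchain analytics projects.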
Cheat Sheet / Updated 06-05-2023
Tableau is not a single application but rather a collection of applications that create a best-in-class business intelligence platform. You may want to dive right in and start trying to create magnificent visualizations, but there are a few concepts you should know about to refine your data and optimize visualizations. You’ll need to determine whether your data set requires data cleansing. In that case, you’ll utilize Tableau Prep. If you want to collaborate and share your data, reports, and visualizations, you’ll use either Tableau Cloud or Tableau Server. Central to the Tableau solution suite is Tableau Desktop; it’s at the heart of the creative engine for virtually all users at some point in time to create visualization renderings from workbooks, dashboards, and stories. Keep reading for tips about data layout and cleansing data in Tableau Prep.
Article / Updated 08-04-2022
A common question from management when first considering data analytics and again in the specific context of blockchain is “Why do we need this?” Your organization will have to answer that question, in general, and you’ll need to explain why building and executing analytics models on your blockchain data will benefit your organization. Without an expected return on investment (ROI), management probably won't authorize and fund any analytics efforts. The good news is that you aren’t the pioneer in blockchain analytics. Other organizations of all sizes have seen the value of formal analysis of blockchain data. Examining what other organizations have done can be encouraging and insightful. You’ll probably find some fresh ideas as you familiarize yourself with what others have accomplished with their blockchain analytics projects. Here, you learn about ten ways in which blockchain analytics can be useful to today’s (and tomorrow’s) organizations. Blockchain analytics focuses on analyzing what happened in the past, explaining what's happening now, and even preparing for what's expected to come in the future. Analytics can help any organization react, understand, prepare, and lower overall risk. Accessing public financial transaction data The first blockchain implementation, Bitcoin, is all about cryptocurrency, so it stands to reason that examining financial transactions would be an obvious use of blockchain analytics. If tracking transactions was your first thought of how to use blockchain analytics, you’d be right. Bitcoin and other blockchain cryptocurrencies used to be viewed as completely anonymous methods of executing financial transactions. The flawed perception of complete anonymity enticed criminals to use the new type of currency to conduct illegal business. Since cryptocurrency accounts aren’t directly associated with real-world identities (at least on the blockchain), any users who wanted to conduct secret business warmed up to Bitcoin and other cryptocurrencies. When law enforcement noticed the growth in cryptocurrency transactions, they began looking for ways to re-identify transactions of interest. It turns out that with a little effort and proper legal authority, it isn’t that hard to figure out who owns a cryptocurrency account. When a cryptocurrency account is converted and transferred to a traditional account, many criminals are unmasked. Law enforcement became an early adopter of blockchain analytics and still uses models today to help identify suspected criminal and fraudulent activity. Chainalysis is a company that specializes in cryptocurrency investigations. Their product, Chainalysis Reactor, allows users to conduct cryptocurrency forensics to connect transactions to real-world identities. The image shows the Chainalysis Reactor tool. But blockchain technology isn’t just for criminals, and blockchain analytics isn’t just to catch bad guys. The growing popularity of blockchain and cryptocurrencies could lead to new ways to evaluate entire industries, P2P transactions, currency flow, the wealth of nation-states, and a variety of other market valuations with this new area of analysis. For example, Ethereum has emerged as a major avenue of fundraising for tech startups, and its analysis could lend a deeper look into the industry. Connecting with the Internet of Things (IoT) The Internet of Things (IoT) is loosely defined as the collection of devices of all sizes that are connected to the internet and operate at some level with little human interaction. 
IoT devices include doorbell cameras, remote temperature sensors, undersea oil leak detectors, refrigerators, and vehicle components. The list is almost endless, as is the number of devices connecting to the internet. Each IoT device has a unique identity and produces and consumes data. All of these devices need some entity that manages data exchange and the device’s operation. Although most IoT devices are autonomous (they operate without the need for external guidance), all devices eventually need to request or send data to someone. But that someone doesn’t have to be a human. Currently, the centralized nature of traditional IoT systems reduces their scalability and can create bottlenecks. A central management entity can handle only a limited number of devices. Many companies working in the IoT space are looking to leverage the smart contracts in blockchain networks to allow IoT devices to work more securely and autonomously. These smart contracts are becoming increasingly attractive as the number of IoT devices exceeds 20 billion worldwide in 2020. The figure below shows how IoT has matured from a purely centralized network in the past to a distributed network (which still had some central hubs) to a vision of the future without the need for central managers. The applications of IoT data are endless, and if the industry does shift in this direction, knowing and understanding blockchain analytics will be necessary to truly unlock its potential. Using blockchain technology to manage IoT devices is only the beginning. Without the application of analytics to really understand the huge volume of data IoT devices will be generating, much of the value of having so many autonomous devices will be lost. Ensuring data and document authenticity The Lenovo Group is a multinational technology company that manufactures and distributes consumer electronics. During a business process review, Lenovo identified several areas of inefficiency in their supply chain. After analyzing the issues, they decided to incorporate blockchain technology to increase visibility, consistency, and autonomy, and to decrease waste and process delays. Lenovo published a paper, “Blockchain Technology for Business: A Lenovo Point of View,” detailing their efforts and results. In addition to describing their supply chain application of blockchain technology in their paper, Lenovo cited examples of how the New York Times uses blockchain to prove that photos are authentic. They also described how the city of Dubai is working to have all its government documents on blockchain by the end of 2020 in an effort to crack down on corruption and the misuse of funds. In the era of deep fakes, manipulated photos and consistently evolving methods of corruption and misappropriation of funds, blockchain can help identify cases of data fraud and misuse. Blockchain’s inherent transparency and immutability means that data cannot be retroactively manipulated to support a narrative. Facts in a blockchain are recorded as unchangeable facts. Analytics models can help researchers understand how data of any type originated, who the original owner was, how it gets amended over time, and if any amendments are coordinated. Controlling secure document integrity As just mentioned, blockchain technology can be used to ensure document authenticity, but it can be used also to ensure document integrity. 
In areas where documents must not be altered, such as the legal and healthcare industries, blockchain can help make documents and changes to them transparent and immutable, as well as increase the power the owner of the data has to control and manage it. Documents do not have to be stored in the blockchain to benefit from the technology. Documents can be stored in off-chain repositories, with a hash stored in a block on the blockchain. Each transaction (required to write to a new block) contains the owner's account and a timestamp of the action. The integrity of any document at a specific point in time can be validated simply by comparing the on-chain hash with the calculated hash value of the document. If the hash values match, the document has not changed since the blockchain transaction was created. (A small hash-comparison sketch appears at the end of this article.)

The company DocStamp has implemented a novel use for blockchain document management. Using DocStamp, shown below, anyone can self-notarize any document. The document owner maintains control of the document while storing a hash of the document on an Ethereum blockchain.

Services such as DocStamp provide the capability to ensure document integrity using blockchain technology. However, assessing document integrity and its use is up to analytics models. The DocStamp model is not generally recognized by courts of law to be as strong as a traditional notary. For that to change, analysts will need to provide model results that show how the approach works and how blockchain can help provide evidence that document integrity is ensured.

Tracking supply chain items

In the Lenovo blockchain paper, the author described how Lenovo replaced printed paperwork in its supply chain with processes managed through smart contracts. The switch to blockchain-based process management greatly decreased the potential for human error and removed many human-related process delays. Replacing human interaction with electronic transactions increased auditability and gave all parties more transparency in the movement of goods. The Lenovo supply chain became more efficient and easier to investigate.

Blockchain-based supply chain solutions are one of the most popular ways to implement blockchain technology. Blockchain technology makes it easy to track items along the supply chain, both forward and backward. The capability to track an item makes it easy to determine where an item is and where that item has been. Tracing an item's provenance, or origin, makes root cause analysis possible. Because the blockchain keeps all history of movement through the supply chain, many types of analysis are easier than they would be with traditional data stores, which can overwrite data.

The US Food and Drug Administration is working with several private firms to evaluate the use of blockchain technology in supply chain applications to identify, track, and trace prescription drugs. Analysis of the blockchain data can provide evidence for identifying counterfeit drugs and the delivery paths criminals use to get those drugs to market.

Empowering predictive analytics

You can build several models that allow you to predict future behavior based on past observations. Predictive analytics is often one of the goals of an organization's analytics projects. Large organizations may already have a collection of data that supports prediction. Smaller organizations, however, probably lack enough data to make accurate predictions. Even large organizations would still benefit from datasets that extend beyond their own customers and partners.
In the past, a common approach to acquiring enough data for meaningful analysis was to purchase data from an aggregator. Each data acquisition request costs money, and the data you receive may still be limited in scope. The prospect of using public blockchains has the potential to change the way we all access public data. If a majority of supply chain interactions, for example, use a public blockchain, that data is available to anyone — for free. As more organizations incorporate blockchains into their operations, analysts could leverage the additional data to empower more companies to use predictive analytics with less reliance on localized data. Analyzing real-time data Blockchain transactions happen in real time, across intranational and international borders. Not only are banks and innovators in financial technology pursuing blockchain for the speed it offers to transactions, but data scientists and analysts are observing blockchain data changes and additions in real time, greatly increasing the potential for fast decision-making. To view how dynamic blockchain data really is, visit the Ethviewer Ethereum blockchain monitor’s website. The following image shows the Ethviewer website. Each small circle in the blob near the lower-left corner of the web page is a distinct transaction waiting to make it into a new block. You can see how dynamic the Ethereum blockchain is — it changes constantly. And when the blockchain changes, so does the blockchain data that your models use to provide accurate results. Supercharging business strategy Companies big and small — marketing firms, financial technology giants, small local retailers, and many more — can fine-tune their strategies to keep up with, and even get ahead of, shifts in the market, the economy, and their customer base. How? By utilizing the results of analytics models built on the organization’s blockchain data. The ultimate goal for any analytics project is to provide ROI for the sponsoring organization. Blockchain analytics projects provide a unique opportunity to provide value. New blockchain implementations are only recently becoming common in organizations, and now is the time to view those sources of data as new opportunities to provide value. Analytics can help identify potential sources of ROI. Managing data sharing Blockchain technology is often referred to as a disruptive technology, and there is some truth to that characterization. Blockchain does disrupt many things. In the context of data analytics, blockchain changes the way analysts acquire at least some of their data. If a public or consortium blockchain is the source for an analytics model, it's a near certainty that the sponsoring organization does not own all the data. Much of the data in a non-private blockchain comes from other entities that decided to place the data in a shared repository, the blockchain. Blockchain can aid in the storage of data in a distributed network and make that data easily accessible to project teams. Easy access to data makes the whole analytics process easier. There still may be a lot of work to do, but you can always count on the facts that blockchain data is accessible and it hasn’t changed since it was written. Blockchain makes collaboration among data analysts and other data consumers easier than with more traditional data repositories. Standardizing collaboration forms Blockchain technology empowers analytics in more ways than just providing access to more data. 
Regardless of whether blockchain technology is deployed in the healthcare, legal, government, or other organizational domain, blockchain can lead to more efficient process automation. Also, blockchain’s revolutionary approach to how data is generated and shared among parties can lead to better and greater standardization in how end users populate forms and how other data gets collected. Blockchains can help encourage adherence to agreed-upon standards for data handling. The use of data-handling standards will greatly decrease the amount of time necessary for data cleaning and management. Because cleansing data commonly requires a large time investment in the analytics process, standardization through the use of blockchain can make it easier to build and modify models with a short time-to-market.
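Circling back to the document-integrity idea described earlier in this article, the on-chain versus off-chain comparison it mentions is essentially a hash check. Here is a minimal sketch, assuming you already have the document and the hash that was recorded in a blockchain transaction; the path and hash values below are placeholders:

import hashlib

def file_sha256(path):
    """Hash the off-chain document exactly as it exists right now."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Placeholder values: the stored document and the hash that was written
# into a blockchain transaction when the document was notarized.
document_path = "contract_v1.pdf"
on_chain_hash = "<hash recorded in the blockchain transaction>"

if file_sha256(document_path) == on_chain_hash:
    print("Document unchanged since the blockchain transaction was created.")
else:
    print("Document differs from the version recorded on-chain.")

An analytics model that needs to assert document integrity can run exactly this comparison at scale, flagging any document whose current hash no longer matches the one anchored on the blockchain.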