Data Science Articles
Data science is what happens when you let the world's brightest minds loose on a big dataset. It gets crazy. Our articles will walk you through what data science is and what it does.
Articles From Data Science
Cheat Sheet / Updated 05-22-2025
The field of SAS and SAS programming has evolved over nearly 50 years, leading to the development of various shorthand techniques. These techniques may not be immediately apparent to new SAS users, but they become clear with learning and practice. Use the tips here to get a head-start and accelerate your initiation to SAS.
Article / Updated 09-24-2024
Both linear and logistic regression see a lot of use in data science but are commonly used for different kinds of problems. You need to know and understand both types of regression to perform a full range of data science tasks. Of the two, logistic regression is harder to understand in many respects because it necessarily uses a more complex equation model. The following information gives you a basic overview of how linear and logistic regression differ.

The equation model

Any discussion of the difference between linear and logistic regression must start with the underlying equation model. The equation for linear regression is straightforward:

y = a + bx

You may see this equation in other forms, and you may see it called ordinary least squares regression, but the essential concept is always the same. Depending on the source you use, some of the equations used to express logistic regression can become downright terrifying unless you're a math major. However, the start of this discussion can use one of the simplest views of logistic regression:

p = f(a + bx)

Here, the output, p, is equal to the logistic function, f, applied to two model parameters, a and b, and one explanatory variable, x. When you look at this particular model, you see that it really isn't all that different from the linear regression model, except that you now feed the result of the linear regression through the logistic function to obtain the required curve. The output (dependent variable) is a probability ranging from 0 (not going to happen) to 1 (definitely will happen), or a categorization that says something is either part of the category or not part of the category. (You can also perform multiclass categorization, but focus on the binary response for now.) The best way to view the difference between linear regression output and logistic regression output is to say the following:

Linear regression is continuous. A continuous value can take any value within a specified interval (range) of values. For example, no matter how closely the height of two individuals matches, you can always find someone whose height fits between those two individuals. Examples of continuous values include height, weight, and waist size.

Logistic regression is discrete. A discrete value can assume only specific values. For example, a hospital can admit only a whole number of patients in a given day; you can't admit half a patient (at least, not alive). Examples of discrete values include the number of people at the fair, the number of jellybeans in the jar, and the colors of automobiles produced by a vendor.

The logistic function

Of course, now you need to know about the logistic function. You can find a variety of forms of this function as well, but here's the easiest one to understand:

f(x) = e^x / (e^x + 1)

You already know about f, which is the logistic function, and x is the value you feed into it, which is a + bx in this case. That leaves e, the base of the natural logarithm, which has an irrational value of approximately 2.718 for the sake of this discussion. Another way you see this function expressed is

f(x) = 1 / (1 + e^-x)

Both forms are correct, but the first form is easier to use here. Consider a simple problem in which a, the y-intercept, is 0, and b, the slope, is 1. The example uses x values from -6 to 6.
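To make the formula concrete, here's a minimal sketch that applies the logistic function to the linear model a + bx. The function name and default parameter values are illustrative assumptions chosen to match the example that follows, not part of any particular package.

from math import exp

def logistic(x, a=0, b=1):
    # Apply the logistic function to the linear model a + b*x.
    z = a + b * x
    return exp(z) / (exp(z) + 1)

# With a = 0 and b = 1, the endpoints of the -6..6 range give:
print(logistic(-6))  # roughly 0.0025
print(logistic(0))   # exactly 0.5
print(logistic(6))   # roughly 0.9975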
Consequently, the first f(x) value (for x = -6) would look like this when calculated (all values are rounded):

(1) e^-6 / (1 + e^-6)
(2) 0.00248 / (1 + 0.00248)
(3) 0.002474

As you might expect, an x value of 0 would result in an f(x) value of 0.5, and an x value of 6 would result in an f(x) value of 0.9975. Obviously, a linear regression would show different results for precisely the same x values. If you calculate and plot all the results from both logistic and linear regression using the following code, you receive a plot comparing the two curves.

import matplotlib.pyplot as plt
%matplotlib inline
from math import exp

x_values = range(-6, 7)
lin_values = [(0 + 1*x) / 13 for x in range(0, 13)]
log_values = [exp(0 + 1*x) / (1 + exp(0 + 1*x)) for x in x_values]

plt.plot(x_values, lin_values, 'b-^')
plt.plot(x_values, log_values, 'g-*')
plt.legend(['Linear', 'Logistic'])
plt.show()

This example relies on list comprehension to calculate the values because it makes the calculations clearer. The linear regression uses a different numeric range because you must normalize the values to appear in the 0 to 1 range for comparison. This is also why you divide the calculated values by 13. The exp(x) call used for the logistic regression raises e to the power of x, e^x, as needed for the logistic function.

The model discussed here is simplified, and some math majors out there are probably throwing a temper tantrum of the most profound proportions right now. The Python or R package you use will actually take care of the math in the background, so really, what you need to know is how the math works at a basic level so that you can understand how to use the packages. This section provides what you need to use the packages. However, if you insist on carrying out the calculations the old way, chalk to chalkboard, you'll likely need a lot more information.

The problems that logistic regression solves

You can separate logistic regression into several categories. The first is simple logistic regression, in which you have one dependent variable and one independent variable, much as you see in simple linear regression. However, because of how you calculate the logistic regression, you can expect only two kinds of output:

Classification: Decides between two available outcomes, such as male or female, yes or no, or high or low. The outcome is dependent on which side of the line a particular data point falls.

Probability: Determines the probability that something is true or false. The values true and false can have specific meanings. For example, you might want to know the probability that a particular apple will be yellow or red based on the presence of yellow and red apples in a bin.

Fit the curve

As part of understanding the difference between linear and logistic regression, consider this grade prediction problem, which lends itself well to linear regression. In the following code, you see the effect of trying to use logistic regression with that data:

x1 = range(0, 9)
y1 = (0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98)

plt.scatter(x1, y1, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

The example has undergone a few changes to make it easier to see precisely what is happening. It relies on the same data that was converted from questions answered correctly on the exam to a percentage.
If you have 100 questions and you answer 25 of them correctly, you have answered 25 percent (0.25) of them correctly. The values are normalized to fall between 0 and 1. As you can see from the resulting plot, the linear regression follows the data points closely. The logistic regression doesn't. However, logistic regression often is the correct choice when the data points naturally follow the logistic curve, which happens far more often than you might think. You must use the technique that fits your data best, which means using linear regression in this case.

A pass/fail example

An essential point to remember is that logistic regression works best for probability and classification. Consider that points on an exam ultimately predict passing or failing the course. If you get a certain percentage of the answers correct, you pass, but you fail otherwise. The following code considers the same data used for the example above, but converts it to a pass/fail list. When a student gets at least 70 percent of the questions correct, success is assured.

y2 = [0 if x < 0.70 else 1 for x in y1]

plt.scatter(x1, y2, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

This is an example of how you can use list comprehensions in Python to obtain a required dataset or data transformation. The list comprehension for y2 starts with the continuous data in y1 and turns it into discrete data. Note that the example uses precisely the same equations as before. All that has changed is the manner in which you view the data, as the new plot shows. Because of the change in the data, linear regression is no longer the option to choose. Instead, you use logistic regression to fit the data. Take into account that this example really hasn't done any sort of analysis to optimize the results. The logistic regression fits the data even better if you do so.
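The examples above hard-code the model parameters; in practice, a library estimates a and b for you. Here's a minimal sketch, assuming scikit-learn is available (any comparable package works), that fits a logistic regression to the pass/fail data from the example above and returns a predicted probability. The 0.65 query score is purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Exam scores (as fractions correct) and the derived pass/fail labels.
y1 = [0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98]
X = np.array(y1).reshape(-1, 1)                    # one feature per student
passed = np.array([0 if s < 0.70 else 1 for s in y1])

model = LogisticRegression()                       # the package handles the math
model.fit(X, passed)

# Estimated probability of passing for a student who scored 65 percent.
print(model.predict_proba([[0.65]])[0, 1])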
Cheat Sheet / Updated 04-12-2024
A wide range of tools is available to help businesses big and small take advantage of the data science revolution. Among the most essential of these tools are Microsoft Power BI, Tableau, SQL, and the R and Python programming languages.
Article / Updated 12-01-2023
Getting the most out of your unstructured data is an essential task for any organization these days, especially when considering the disparate storage systems, applications, and user locations. So, it's not an accident that data orchestration is the term that brings everything together.

Bringing all your data together shares similarities with conducting an orchestra. Instead of combining the violin, oboe, and cello, this brand of orchestration combines distributed data types from different places, platforms, and locations, working as a cohesive entity presented to applications or users anywhere.

That's because, historically, accessing high-performance data outside of your computer network was inefficient. Because the storage infrastructure existed in a silo, systems like HPC Parallel (which lets users store and access shared data across multiple networked storage nodes), Enterprise NAS (which allows large-scale storage and access to other networks), and Global Namespace (which virtually simplifies network file systems) were limited when it came to sharing. Because each operated independently, the data within each system was siloed, making it a problem to collaborate with data sets across multiple locations. Collaboration was possible, but too often you lost the ability to have high performance. This Boolean logic decreased potential: an IT architecture that supported both high performance and collaboration with data sets from different storage silos typically became an either/or decision. You were forced to choose one but never both.

What is data orchestration?

Data orchestration is the automated process of taking siloed data from multiple data storage systems and locations and combining and organizing it into a single namespace. A high-performance file system can then place data in the edge service, data center, or cloud service most optimal for the workload.

The recent rise of data analytic applications and artificial intelligence (AI) capabilities has accelerated the use of data across different locations and even different organizations. In the next data cycle, organizations will need both high performance and agility with their data to compete and thrive in a competitive environment. That means data no longer has a 1:1 relationship with the applications and compute environment that generated it. It needs to be used, analyzed, and repurposed with different AI models and alternate workloads, and across a remote, collaborative environment.

Hammerspace's technology makes data available to different foundational models, remote applications, decentralized compute clusters, and remote workers to automate and streamline data-driven development programs, data insights, and business decision making. This capability enables a unified, fast, and efficient global data environment for the entire workflow — from data creation to processing, collaboration, and archiving across edge devices, data centers, and public and private clouds.

Control of enterprise data services for governance, security, data protection, and compliance can now be implemented globally at a file-granular level across all storage types and locations. Applications and AI models can access data stored in remote locations while using automated orchestration tools to provide high-performance local access when needed for processing. Organizations can grow their talent pools with access to team members no matter where they reside.
Decentralizing the data center

Data collection has become more prominent, and the traditional system of centralized data management has limitations. Centralized data storage can limit the amount of data available to applications. Then there are the high infrastructure costs when multiple applications are needed to manage and move data, multiple copies of data are retained in different storage systems, and more headcount is needed to manage the complex, disconnected infrastructure environment. Such setbacks suggest that the data center is no longer the center of data, and storage system constraints should no longer define data architectures.

Hammerspace specializes in decentralized environments, where data may need to span two or more sites and possibly one or more cloud providers and regions, and/or where a remote workforce needs to collaborate in real time. It enables a global data environment by providing a unified, parallel global file system.

Enabling a global data environment

Hammerspace completely revolutionizes previously held notions of how unstructured data architectures should be designed, delivering the performance needed across distributed environments to:

Free workloads from data silos.
Eliminate copy proliferation.
Provide direct data access through local metadata to applications and users, no matter where the data is stored.

This technology allows organizations to take full advantage of the performance capabilities of any server, storage system, and network anywhere in the world. The days of enterprises struggling with a siloed, distributed, and inefficient data environment are over. It's time to start expecting more from data architectures with automated data orchestration. Find out how by downloading Unstructured Data Orchestration For Dummies, Hammerspace Special Edition, here.
Article / Updated 07-27-2023
In growth, you use testing methods to optimize your web design and messaging so that it performs at its absolute best with the audiences to which it's targeted. Although testing and web analytics methods are both intended to optimize performance, testing goes one layer deeper than web analytics. You use web analytics to get a general idea about the interests of your channel audiences and how well your marketing efforts are paying off over time. After you have this information, you can then go in deeper to test variations on live visitors in order to gain empirical evidence about what designs and messaging your visitors actually prefer.

Testing tactics can help you optimize your website design or brand messaging for increased conversions in all layers of the funnel. Testing is also useful when optimizing your landing pages for user activations and revenue conversions.

Checking out common types of testing in growth

When you use data insights to increase growth for e-commerce businesses, you're likely to run into the three following testing tactics: A/B split testing, multivariate testing, and mouse-click heat map analytics.

An A/B split test is an optimization tactic you can use to split variations of your website or brand messaging between sets of live audiences in order to gauge responses and decide which of the two variations performs best. A/B split testing is the simplest testing method you can use for website or messaging optimization. (A minimal sketch of how you might evaluate an A/B split test's results appears at the end of this article.)

Multivariate testing is, in many ways, similar to the multivariate regression analysis that I discuss in Chapter 5. Like multivariate regression analysis, multivariate testing allows you to uncover relationships, correlations, and causations between variables and outcomes. In the case of multivariate testing, you're testing several conversion factors simultaneously over an extended period in order to uncover which factors are responsible for increased conversions. Multivariate testing is more complicated than A/B split testing, but it usually provides quicker and more powerful results.

Lastly, you can use mouse-click heat map analytics to see how visitors are responding to your design and messaging choices. In this type of testing, you use the mouse-click heat map to help you make optimal website design and messaging choices to ensure that you're doing everything you can to keep your visitors focused and converting.

Landing pages are meant to offer visitors little to no options, except to convert or to exit the page. Because a visitor has so few options on what he can do on a landing page, you don't really need to use multivariate testing or website mouse-click heat maps. Simple A/B split tests suffice.

Data scientists working in growth hacking should be familiar with (and know how to derive insight from) the following testing applications:

Webtrends: Offers a conversion-optimization feature that includes functionality for A/B split testing and multivariate testing.
Optimizely: A popular product among the growth-hacking community. You can use Optimizely for multipage funnel testing, A/B split testing, and multivariate testing, among other things.
Visual Website Optimizer: An excellent tool for A/B split testing and multivariate testing.

Testing for acquisitions

Acquisitions testing provides feedback on how well your content performs with prospective users in your assorted channels. You can use acquisitions testing to help compare your message's performance in each channel, helping you optimize your messaging on a per-channel basis.
If you want to optimize the performance of your brand's published images, you can use acquisition testing to compare image performance across your channels as well. Lastly, if you want to increase your acquisitions through increases in user referrals, use testing to help optimize your referrals messaging for the referrals channels. Acquisition testing can help you begin to understand the specific preferences of prospective users on a channel-by-channel basis.

You can use A/B split testing to improve your acquisitions in the following ways:

Social messaging optimization: After you use social analytics to deduce the general interests and preferences of users in each of your social channels, you can then further optimize your brand messaging along those channels by using A/B split testing to compare your headlines and social media messaging within each channel.
Brand image and messaging optimization: Compare and optimize the respective performances of images along each of your social channels.
Optimized referral messaging: Test the effectiveness of your email messaging at converting new user referrals.

Testing for activations

Activation testing provides feedback on how well your website and its content perform in converting acquired users to active users. The results of activation testing can help you optimize your website and landing pages for maximum sign-ups and subscriptions. Here's how you'd use testing methods to optimize user activation growth:

Website conversion optimization: Make sure your website is optimized for user activation conversions. You can use A/B split testing, multivariate testing, or a mouse-click heat map data visualization to help you optimize your website design.
Landing pages: If your landing page has a simple call to action that prompts guests to subscribe to your email list, you can use A/B split testing for simple design optimization of this page and the call-to-action messaging.

Testing for retentions

Retentions testing provides feedback on how well your blog post and email headlines are performing among your base of activated users. If you want to optimize your headlines so that active users want to continue active engagements with your brand, test the performance of your user-retention tactics. Here's how you can use testing methods to optimize user retention growth:

Headline optimization: Use A/B split testing to optimize the headlines of your blog posts and email marketing messages. Test different headline varieties within your different channels, and then use the varieties that perform the best. Email open rates and RSS view rates are ideal metrics to track the performance of each headline variation.
Conversion rate optimization: Use A/B split testing on the messaging within your emails to decide which messaging variety more effectively gets your activated users to engage with your brand. The more effective your email messaging is at getting activated users to take a desired action, the greater your user retention rates.

Testing for revenue growth

Revenue testing gauges the performance of revenue-generating landing pages, e-commerce pages, and brand messaging. Revenue testing methods can help you optimize your landing and e-commerce pages for sales conversions. Here's how you can use testing methods to optimize revenue growth:

Website conversion optimization: You can use A/B split testing, multivariate testing, or a mouse-click heat map data visualization to help optimize your sales page and shopping cart design for revenue-generating conversions.
Landing page optimization: If you have a landing page with a simple call to action that prompts guests to make a purchase, you can use A/B split testing for design optimization.
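To make the A/B split testing idea concrete, here's a minimal sketch in Python of how you might check whether the difference between two page variations is meaningful. The visitor and conversion counts are hypothetical, and the chi-square test is just one reasonable choice; none of the tools listed above are assumed.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical results: converted vs. did not convert, per variation.
results = np.array([[120, 1880],   # variation A (2,000 visitors)
                    [156, 1844]])  # variation B (2,000 visitors)

chi2, p_value, dof, expected = chi2_contingency(results)

print(f"Conversion rate A: {120 / 2000:.1%}")
print(f"Conversion rate B: {156 / 2000:.1%}")
print(f"p-value: {p_value:.4f}")  # a small p-value suggests a real difference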
Cheat Sheet / Updated 07-24-2023
Blockchain technology is much more than just another way to store data. It's a radical new method of storing validated data and transaction information in an indelible, trusted repository. Blockchain has the potential to disrupt business as we know it, and in the process, provide a rich new source of behavioral data. Data analysts have long found valuable insights from historical data, and blockchain can expose new and reliable data to drive business strategy. To best leverage the value that blockchain data offers, become familiar with blockchain technology and how it stores data, and learn how to extract and analyze this data.
Article / Updated 07-24-2023
In 2008, Bitcoin was the only blockchain implementation. At that time, Bitcoin and blockchain were synonymous. Now hundreds of different blockchain implementations exist. Each new blockchain implementation emerges to address a particular need, and each one is unique. However, blockchains tend to share many features with other blockchains. Before examining blockchain applications and data, it helps to look at their similarities. Check out this article to learn how blockchains work.

Categorizing blockchain implementations

One of the most common ways to evaluate blockchains is to consider the underlying data visibility, that is, who can see and access the blockchain data. And just as important, who can participate in the decision (consensus) to add new blocks to the blockchain? The three primary blockchain models are public, private, and hybrid.

Opening blockchain to everyone

Nakamoto's original blockchain proposal described a public blockchain. After all, blockchain technology is all about providing trusted transactions among untrusted participants. Sharing a ledger of transactions among nodes in a public network provides a classic untrusted network. If anyone can join the network, you have no criteria on which to base your trust. It's almost like throwing a $20 bill out your window and trusting that the only person who picks it up is the one you intended.

Public blockchain implementations, including Bitcoin and Ethereum, depend on a consensus algorithm that makes it hard to mine blocks but easy to validate them. Proof of Work (PoW) is the most common consensus algorithm in use today for public blockchains, but that may change. Ethereum is in the process of transitioning to the Proof of Stake (PoS) consensus algorithm, which requires less computation and depends on how much blockchain currency a node holds. The idea is that a node with more blockchain currency would be affected negatively if it participates in unethical behavior. The higher the stake you have in something, the greater the chance that you'll care about its integrity.

Because public blockchains are open to anyone (anyone can become a node on the network), no permission is needed to join. For this reason, a public blockchain is also called a permissionless blockchain. Public (permissionless) blockchains are most often used for new apps that interact with the public in general. A public blockchain is like a retail store, in that anyone can walk into the store and shop.

Limiting blockchain access

The opposite of a public blockchain is a private blockchain, such as Hyperledger Fabric. In a private blockchain, also called a permissioned blockchain, the entity that owns and controls the blockchain grants and revokes access to the blockchain data. Because most enterprises manage sensitive or private data, private blockchains are commonly used because they can limit access to that data. The blockchain data is still transparent and readily available but is subject to the owning entity's access requirements.

Some have argued that private blockchains violate data transparency, the original intent of blockchain technology. Although private blockchains can limit data access (and go against the philosophy of the original blockchain in Bitcoin), limited transparency also allows enterprises to consider blockchain technology for new apps in a private environment. Without the private blockchain option, the technology likely would never be considered for most enterprise applications.
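Whatever the permission model, the underlying data structure works the same way: each block embeds a hash of the previous block, which is what makes recorded transactions so hard to alter. The following toy sketch in Python illustrates that idea only; it is not the actual block format used by Bitcoin, Ethereum, or Hyperledger Fabric, and the field names are invented for illustration.

import hashlib
import json

def block_hash(block):
    # Hash the block's contents (a toy stand-in for a real block header).
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"index": 0, "prev_hash": "0" * 64, "data": "genesis"}]

def add_block(data):
    prev = chain[-1]
    chain.append({"index": prev["index"] + 1,
                  "prev_hash": block_hash(prev),
                  "data": data})

add_block("Alice pays Bob 5")
add_block("Bob pays Carol 2")

# Tampering with an earlier block breaks every later link in the chain.
chain[1]["data"] = "Alice pays Bob 500"
print(block_hash(chain[1]) == chain[2]["prev_hash"])  # False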
Combining the best of both worlds

A classic blockchain use case is a supply chain app, which manages a product from its production all the way through its consumption. The beginning of the supply chain is when a product is manufactured, harvested, caught, or otherwise provisioned to send to an eventual customer. The supply chain app then tracks and manages each transfer of ownership as the product makes its way to the physical location where the consumer purchases it. Supply chain apps manage product movement, process payment at each stage in the movement lifecycle, and create an audit trail that can be used to investigate the actions of each owner along the supply chain. Blockchain technology is well suited to support the transfer of ownership and maintain an indelible record of each step in the process.

Many supply chains are complex and consist of multiple organizations. In such cases, data suffers as it is exported from one participant, transmitted to the next participant, and then imported into their data system. A single blockchain would simplify the export/transport/import cycle and auditing. An additional benefit of blockchain technology in supply chain apps is the ease with which a product's provenance (a trace of owners back to its origin) is readily available.

Many of today's supply chains are made up of several enterprises that enter into agreements to work together for mutual benefit. Although the participants in a supply chain are business partners, they do not fully trust one another. A blockchain can provide the level of transactional and data trust that the enterprises need. The best solution is a semi-private blockchain, that is, a blockchain that is public for supply chain participants but not to anyone else. This type of blockchain (one that is owned by a group of entities) is called a hybrid, or consortium, blockchain. The participants jointly own the blockchain and agree on policies to govern access.

Describing basic blockchain type features

Each type of blockchain has specific strengths and weaknesses. Which one to use depends on the goals and target environment. You have to know why you need blockchain and what you expect to get from it before you can make an informed decision as to what type of blockchain would be best. The best solution for one organization may not be the best solution for another. The table below shows how blockchain types compare and why you might choose one over the other.

Differences in Types of Blockchain

Feature     | Public                  | Private                                        | Hybrid
Permission  | Permissionless          | Permissioned (limited to organization members) | Permissioned (limited to consortium members)
Consensus   | PoW, PoS, and so on     | Authorized participants                        | Varies; can use any method
Performance | Slow (due to consensus) | Fast (relatively)                              | Generally fast
Identity    | Virtually anonymous     | Validated identity                             | Validated identity

The primary differences between each type of blockchain are the consensus algorithm used and whether participants are known or anonymous. These two concepts are related. An unknown (and therefore completely untrusted) participant will require an environment with a more rigorous consensus algorithm. On the other hand, if you know the transaction participants, you can use a less rigorous consensus algorithm.

Contrasting popular enterprise blockchain implementations

Dozens of blockchain implementations are available today, and soon there will be hundreds. Each new blockchain implementation targets a specific market and offers unique features.
There isn't room in this article to cover even a fair number of blockchain implementations, but you should be aware of some of the most popular. Remember that you'll be learning about blockchain analytics in this book. Although organizations of all sizes are starting to leverage the power of analytics, enterprises were early adopters and have the most mature approach to extracting value from data. The What Matrix website provides a comprehensive comparison of top enterprise blockchains. Visit whatmatrix.com for up-to-date blockchain information.

Following are the top enterprise blockchain implementations and some of their strengths and weaknesses (ranking is based on the What Matrix website):

Hyperledger Fabric: The flagship blockchain implementation from the Linux Foundation. Hyperledger is an open-source project backed by a diverse consortium of large corporations. Hyperledger's modular architecture and rich support make it the highest rated enterprise blockchain.
VeChain: Currently more popular than Hyperledger, having the highest number of enterprise use cases among products reviewed by What Matrix. VeChain includes support for two native cryptocurrencies and states that its focus is on efficient enterprise collaboration.
Ripple Transaction Protocol: A blockchain that focuses on financial markets. Instead of appealing to general use cases, Ripple caters to organizations that want to implement financial transaction blockchain apps. Ripple was the first commercially available blockchain focused on financial solutions.
Ethereum: The most popular general-purpose, public blockchain implementation. Although Ethereum is not technically an enterprise solution, it's in use in multiple proof-of-concept projects.

The preceding list is just a brief overview of a small sample of blockchain implementations. If you're just beginning to learn about blockchain technology in general, start out with Ethereum, which is one of the easier blockchain implementations to learn. After that, you can progress to another blockchain that may be better aligned with your organization. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
Article / Updated 06-09-2023
If statistics has been described as the science of deriving insights from data, then what's the difference between a statistician and a data scientist? Good question! While many tasks in data science require a fair bit of statistical know-how, the scope and breadth of a data scientist's knowledge and skill base is distinct from those of a statistician. The core distinctions are outlined below.

Subject matter expertise: One of the core features of data scientists is that they offer a sophisticated degree of expertise in the area to which they apply their analytical methods. Data scientists need this so that they're able to truly understand the implications and applications of the data insights they generate. A data scientist should have enough subject matter expertise to be able to identify the significance of their findings and independently decide how to proceed in the analysis. In contrast, statisticians usually have an incredibly deep knowledge of statistics, but very little expertise in the subject matters to which they apply statistical methods. Most of the time, statisticians are required to consult with external subject matter experts to truly get a firm grasp on the significance of their findings, and to be able to decide the best way to move forward in an analysis.

Mathematical and machine learning approaches: Statisticians rely mostly on statistical methods and processes when deriving insights from data. In contrast, data scientists are required to pull from a wide variety of techniques to derive data insights. These include statistical methods, but also include approaches that are not based in statistics, like those found in mathematics, clustering, classification, and non-statistical machine learning approaches.

Seeing the importance of statistical know-how

You don't need to go out and get a degree in statistics to practice data science, but you should at least get familiar with some of the more fundamental methods that are used in statistical data analysis. These include:

Linear regression: Linear regression is useful for modeling the relationships between a dependent variable and one or several independent variables. The purpose of linear regression is to discover (and quantify the strength of) important correlations between dependent and independent variables.

Time-series analysis: Time-series analysis involves analyzing a collection of data on attribute values over time in order to predict future instances of the measure based on past observational data.

Monte Carlo simulations: The Monte Carlo method is a simulation technique you can use to test hypotheses, to generate parameter estimates, to predict scenario outcomes, and to validate models. The method is powerful because it can very quickly simulate anywhere from 1 to 10,000 (or more) samples for any process you're trying to evaluate. (A minimal sketch follows this list.)

Statistics for spatial data: One fundamental and important property of spatial data is that it's not random. It's spatially dependent and autocorrelated. When modeling spatial data, avoid statistical methods that assume your data is random. Kriging is a statistical interpolation method you can use to model spatial data; it enables you to produce predictive surfaces for entire study areas based on sets of known points in geographic space.
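Here's a minimal Monte Carlo sketch in Python that simulates 10,000 scenario outcomes and estimates the probability of an outcome of interest, in the spirit of the Monte Carlo bullet above. The scenario, distribution, and parameter values are made-up assumptions purely for illustration.

import numpy as np

rng = np.random.default_rng(seed=42)
n_simulations = 10_000

# Hypothetical scenario: monthly revenue = price * demand, with uncertain demand.
price = 20.0
demand = rng.normal(loc=500, scale=80, size=n_simulations)  # assumed distribution
revenue = price * np.clip(demand, 0, None)                  # demand can't go negative

print("Mean simulated revenue:", revenue.mean())
print("P(revenue > 11,000):", (revenue > 11_000).mean())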
Working with clustering, classification, and machine learning methods

Machine learning is the application of computational algorithms to learn from (or deduce patterns in) raw datasets. Clustering is a particular type of machine learning (unsupervised machine learning, to be precise), meaning that the algorithms must learn from unlabeled data, and as such, they must use inferential methods to discover correlations. Classification, on the other hand, is called supervised machine learning, meaning that the algorithms learn from labeled data. The following descriptions introduce some of the more basic clustering and classification approaches:

k-means clustering: You generally deploy k-means algorithms to subdivide the data points of a dataset into clusters based on nearest mean values. To determine the optimal division of your data points into clusters, such that the distance between points in each cluster is minimized, you can use k-means clustering.

Nearest neighbor algorithms: The purpose of a nearest neighbor analysis is to search for and locate either a nearest point in space or a nearest numerical value, depending on the attribute you use for the basis of comparison.

Kernel density estimation: An alternative way to identify clusters in your data is to use a density smoothing function. Kernel density estimation (KDE) works by placing a kernel (a weighting function that is useful for quantifying density) on each data point in the data set, and then summing the kernels to generate a kernel density estimate for the overall region.

Keeping mathematical methods in the mix

Lots gets said about the value of statistics in the practice of data science, but applied mathematical methods are seldom mentioned. To be frank, mathematics is the basis of all quantitative analyses. Its importance should not be understated. The two following mathematical methods are particularly useful in data science.

Multi-criteria decision making (MCDM): MCDM is a mathematical decision modeling approach that you can use when you have several criteria or alternatives that you must simultaneously evaluate when making a decision.

Markov chains: A Markov chain is a mathematical method that chains together a series of randomly generated variables that represent the present state in order to model how changes in present state variables affect future states. (A quick simulation sketch follows.)
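As a concrete illustration of the Markov chain idea, here's a short sketch in Python that simulates how the present state drives the next one. The two-state weather example and its transition probabilities are invented for illustration.

import numpy as np

states = ["sunny", "rainy"]
# Row i gives the assumed probabilities of moving from state i to each state.
transition = np.array([[0.8, 0.2],
                       [0.4, 0.6]])

rng = np.random.default_rng(seed=7)
current = 0  # start in the "sunny" state
path = [states[current]]

for _ in range(10):
    # The next state depends only on the current state.
    current = rng.choice(2, p=transition[current])
    path.append(states[current])

print(" -> ".join(path))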
Article / Updated 06-09-2023
Blockchain technology alone cannot provide rich analytics results. For all that blockchain is, it can't magically provide more data than other technologies. Before selecting blockchain technology for any new development or analytics project, clearly justify why such a decision makes sense. If you already depend on blockchain technology to store data, the decision to use that data for analysis is a lot easier to justify. Here, you examine some reasons why blockchain-supported analytics may allow you to leverage your data in interesting ways.

Leveraging newly accessible decentralized tools to analyze blockchain data

You'll want to learn how to manually access and analyze blockchain data, and it's important to understand how to exercise granular control over your data throughout the analytics process, but higher-level tools make the task easier. The growing number of decentralized data analytics solutions means more opportunities to build analytics models with less effort. Third-party tools may reduce the amount of control you have over the models you deploy, but they can dramatically increase analytics productivity.

The following list of blockchain analytics solutions is not exhaustive and is likely to change rapidly. Take a few minutes to conduct your own internet search for blockchain analytics tools. You'll likely find even more software and services:

Endor: A blockchain-based AI prediction platform that has the goal of making the technology accessible to organizations of all sizes. Endor is both a blockchain analytics protocol and a prediction engine that integrates on-chain and off-chain data for analysis.
Crystal: A blockchain analytics platform that integrates with the Bitcoin and Ethereum blockchains and focuses on cryptocurrency transaction analytics. Different Crystal products cater to small organizations, enterprises, and law enforcement agencies.
OXT: The most focused of the three products listed, OXT is an analytics and visualization explorer tool for the Bitcoin blockchain. Although OXT doesn't provide analytics support for a variety of blockchains, it attempts to provide a wide range of analytics options for Bitcoin.

Monetizing blockchain data

Today's economy is driven by data, and the amount of data being collected about individuals and their behavior is staggering. Think of the last time you accessed your favorite shopping site. Chances are, you saw an ad that you found relevant. Those targeted ads seem to be getting better and better at figuring out what would interest you. The capability to align ads with user preferences depends on an analytics engine acquiring enough data about the user to reliably predict products or services of interest.

Blockchain data can represent the next logical phase of data's value to the enterprise. As more and more consumers realize the value of their personal data, interest is growing in the capability to control that data. Consumers now want to control how their data is being used and demand incentives or compensation for the use of their data. Blockchain technology can provide a central point of presence for personal data and the ability for the data's owner to authorize access to that data. Removing personal data from common central data stores, such as Google and Facebook, has the potential to revolutionize marketing and advertising. Smaller organizations could access valuable marketing information by asking permission from the data owner as opposed to the large data aggregators.
Circumventing big players such as Google and Facebook could reduce marketing costs and allow incentives to flow directly to individuals. There is a long way to go to move away from current personal data usage practices, but blockchain technology makes it possible. This process may be accelerated by emerging regulations that protect individual rights to control private data. For example, the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both strengthen an individual's ability to control access to, and use of, their personal data.

Exchanging and integrating blockchain data effectively

Much of the value of blockchain data is in its capability to relate to off-chain data. Most blockchain apps refer to some data stored in off-chain repositories. It doesn't make sense to store every type of data in a blockchain. Reference data, which is commonly data that gets updated to reflect changing conditions, may not be a good candidate for storing in a blockchain. Blockchain technology excels at recording value transfers between owners. All applications define and maintain additional information that supports and provides details for transactions but doesn't directly participate in transactions. Such information, such as product descriptions or customer notes, may make more sense to store in an off-chain repository.

Any time blockchain apps rely on on-chain and off-chain data, integration methods become a concern. Even if your app uses only on-chain data, it is likely that analytics models will integrate with off-chain data. For example, owners in blockchain environments are identified by addresses. These addresses have no context external to the blockchain. Any association between an address and a real-world identity is likely stored in an off-chain repository. Another example of the need for off-chain data is when analyzing aircraft safety trends. Perhaps your analysis correlates blockchain-based incident and accident data with weather conditions. Although each blockchain transaction contains a timestamp, you'd have to consult an external weather database to determine prevailing weather conditions at the time of the transaction.

Many examples of the need to integrate off-chain data with on-chain transactions exist. Part of the data acquisition phase of any analytics project is to identify data sources and access methods. In a blockchain analytics project, that process means identifying the off-chain data you need to satisfy the goals of your project and how to get that data. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
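To illustrate the weather example above, here's a minimal sketch in Python with pandas that joins each on-chain transaction to the nearest off-chain weather observation. The timestamps, addresses, column names, and weather readings are entirely made-up assumptions, not real blockchain or weather data.

import pandas as pd

# Hypothetical on-chain incident transactions (timestamp plus owner address).
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-01 09:15", "2023-03-01 17:40"]),
    "address": ["addr_1", "addr_2"],
}).sort_values("timestamp")

# Hypothetical off-chain weather observations.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-01 09:00", "2023-03-01 18:00"]),
    "conditions": ["fog", "clear"],
}).sort_values("timestamp")

# Attach the nearest weather reading to each transaction by timestamp.
combined = pd.merge_asof(transactions, weather, on="timestamp", direction="nearest")
print(combined)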
Cheat Sheet / Updated 06-05-2023
Tableau is not a single application but rather a collection of applications that together create a best-in-class business intelligence platform. You may want to dive right in and start creating magnificent visualizations, but there are a few concepts you should know about to refine your data and optimize visualizations. You'll need to determine whether your data set requires data cleansing; if it does, you'll use Tableau Prep. If you want to collaborate and share your data, reports, and visualizations, you'll use either Tableau Cloud or Tableau Server. Central to the Tableau solution suite is Tableau Desktop, the creative engine that virtually all users turn to at some point to build visualizations from workbooks, dashboards, and stories. Keep reading for tips about data layout and cleansing data in Tableau Prep.