Data Science Articles
Data science is what happens when you let the world's brightest minds loose on a big dataset. It gets crazy. Our articles will walk you through what data science is and what it does.
Articles From Data Science
Cheat Sheet / Updated 09-12-2022
Tableau Desktop enables you to perform complex data analysis tasks and create powerful, interactive visualizations that communicate that analysis. In addition, Tableau allows you to share your analysis and visualizations across your organization, so everyone from coworkers to top management can dig into the data that matters to them. This truly is a tool that provides you with a huge competitive advantage, but you have to master its ins and outs. The following tips and techniques will help you in that task.
Article / Updated 08-04-2022
A common question from management when first considering data analytics and again in the specific context of blockchain is “Why do we need this?” Your organization will have to answer that question, in general, and you’ll need to explain why building and executing analytics models on your blockchain data will benefit your organization. Without an expected return on investment (ROI), management probably won't authorize and fund any analytics efforts. The good news is that you aren’t the pioneer in blockchain analytics. Other organizations of all sizes have seen the value of formal analysis of blockchain data. Examining what other organizations have done can be encouraging and insightful. You’ll probably find some fresh ideas as you familiarize yourself with what others have accomplished with their blockchain analytics projects. Here, you learn about ten ways in which blockchain analytics can be useful to today’s (and tomorrow’s) organizations.

Blockchain analytics focuses on analyzing what happened in the past, explaining what's happening now, and even preparing for what's expected to come in the future. Analytics can help any organization react, understand, prepare, and lower overall risk.

Accessing public financial transaction data

The first blockchain implementation, Bitcoin, is all about cryptocurrency, so it stands to reason that examining financial transactions would be an obvious use of blockchain analytics. If tracking transactions was your first thought of how to use blockchain analytics, you’d be right. Bitcoin and other blockchain cryptocurrencies used to be viewed as completely anonymous methods of executing financial transactions. The flawed perception of complete anonymity enticed criminals to use the new type of currency to conduct illegal business.
Since cryptocurrency accounts aren’t directly associated with real-world identities (at least on the blockchain), any users who wanted to conduct secret business warmed up to Bitcoin and other cryptocurrencies. When law enforcement noticed the growth in cryptocurrency transactions, they began looking for ways to re-identify transactions of interest. It turns out that with a little effort and proper legal authority, it isn’t that hard to figure out who owns a cryptocurrency account. When a cryptocurrency account is converted and transferred to a traditional account, many criminals are unmasked. Law enforcement became an early adopter of blockchain analytics and still uses models today to help identify suspected criminal and fraudulent activity.

Chainalysis is a company that specializes in cryptocurrency investigations. Their product, Chainalysis Reactor, allows users to conduct cryptocurrency forensics to connect transactions to real-world identities. The image shows the Chainalysis Reactor tool.

But blockchain technology isn’t just for criminals, and blockchain analytics isn’t just to catch bad guys. The growing popularity of blockchain and cryptocurrencies could lead to new ways to evaluate entire industries, P2P transactions, currency flow, the wealth of nation-states, and a variety of other market valuations with this new area of analysis. For example, Ethereum has emerged as a major avenue of fundraising for tech startups, and its analysis could lend a deeper look into the industry.

Connecting with the Internet of Things (IoT)

The Internet of Things (IoT) is loosely defined as the collection of devices of all sizes that are connected to the internet and operate at some level with little human interaction. IoT devices include doorbell cameras, remote temperature sensors, undersea oil leak detectors, refrigerators, and vehicle components. The list is almost endless, as is the number of devices connecting to the internet.
Each IoT device has a unique identity and produces and consumes data. All of these devices need some entity that manages data exchange and the device’s operation. Although most IoT devices are autonomous (they operate without the need for external guidance), all devices eventually need to request or send data to someone. But that someone doesn’t have to be a human.

Currently, the centralized nature of traditional IoT systems reduces their scalability and can create bottlenecks. A central management entity can handle only a limited number of devices. Many companies working in the IoT space are looking to leverage the smart contracts in blockchain networks to allow IoT devices to work more securely and autonomously. These smart contracts are becoming increasingly attractive as the number of IoT devices worldwide grows beyond the 20 billion mark it passed in 2020.

The figure below shows how IoT has matured from a purely centralized network in the past to a distributed network (which still had some central hubs) to a vision of the future without the need for central managers. The applications of IoT data are endless, and if the industry does shift in this direction, knowing and understanding blockchain analytics will be necessary to truly unlock its potential. Using blockchain technology to manage IoT devices is only the beginning. Without the application of analytics to really understand the huge volume of data IoT devices will be generating, much of the value of having so many autonomous devices will be lost.

Ensuring data and document authenticity

The Lenovo Group is a multinational technology company that manufactures and distributes consumer electronics. During a business process review, Lenovo identified several areas of inefficiency in their supply chain. After analyzing the issues, they decided to incorporate blockchain technology to increase visibility, consistency, and autonomy, and to decrease waste and process delays.
Lenovo published a paper, “Blockchain Technology for Business: A Lenovo Point of View,” detailing their efforts and results. In addition to describing their supply chain application of blockchain technology, Lenovo cited examples of how the New York Times uses blockchain to prove that photos are authentic. They also described how the city of Dubai is working to have all its government documents on blockchain by the end of 2020 in an effort to crack down on corruption and the misuse of funds.

In the era of deep fakes, manipulated photos, and constantly evolving methods of corruption and misappropriation of funds, blockchain can help identify cases of data fraud and misuse. Blockchain’s inherent transparency and immutability means that data cannot be retroactively manipulated to support a narrative. Facts in a blockchain are recorded as unchangeable facts. Analytics models can help researchers understand how data of any type originated, who the original owner was, how it gets amended over time, and whether any amendments are coordinated.

Controlling secure document integrity

As just mentioned, blockchain technology can be used to ensure document authenticity, but it can also be used to ensure document integrity. In areas where documents should not be alterable, such as the legal and healthcare industries, blockchain can help make documents and changes to them transparent and immutable, as well as increase the power the owner of the data has to control and manage it. Documents do not have to be stored in the blockchain to benefit from the technology. Documents can be stored in off-chain repositories, with a hash stored in a block on the blockchain. Each transaction (required to write to a new block) contains the owner’s account and a timestamp of the action. The integrity of any document at a specific point in time can be validated simply by comparing the on-chain hash with the calculated hash value of the document.
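A minimal sketch of that comparison in Python (the document contents and stored hash here are hypothetical; a real system would read the stored hash back from a blockchain transaction):

```python
import hashlib

def document_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the document's fingerprint."""
    return hashlib.sha256(data).hexdigest()

def integrity_intact(data: bytes, on_chain_hash: str) -> bool:
    """Recompute the document's hash and compare it with the on-chain value."""
    return document_hash(data) == on_chain_hash

# Hypothetical document and the hash recorded on-chain at notarization time.
original = b"Letter of credit #1047, issued 2020-01-15"
stored_hash = document_hash(original)

assert integrity_intact(original, stored_hash)           # unchanged document
assert not integrity_intact(original + b".", stored_hash)  # any edit breaks the match
```

In practice the hash, the owner’s account, and a timestamp live in the blockchain transaction, while the document itself stays in the off-chain repository.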
If the hash values match, the document has not changed since the blockchain transaction was created. The company DocStamp has implemented a novel use for blockchain document management. Using DocStamp, shown below, anyone can self-notarize any document. The document owner maintains control of the document while storing a hash of the document on an Ethereum blockchain.

Services such as DocStamp provide the capability to ensure document integrity using blockchain technology. However, assessing document integrity and its use is up to analytics models. The DocStamp model is not generally recognized by courts of law to be as strong as a traditional notary. For that to change, analysts will need to provide model results that show how the approach works and how blockchain can help provide evidence that document integrity is ensured.

Tracking supply chain items

In the Lenovo blockchain paper, the author described how Lenovo replaced printed paperwork in its supply chain with processes managed through smart contracts. The switch to blockchain-based process management greatly decreased the potential for human error and removed many human-related process delays. Replacing human interaction with electronic transactions increased auditability and gave all parties more transparency in the movement of goods. The Lenovo supply chain became more efficient and easier to investigate.

Blockchain-based supply chain solutions are one of the most popular ways to implement blockchain technology. Blockchain technology makes it easy to track items along the supply chain, both forward and backward. The capability to track an item makes it easy to determine where an item is and where that item has been. Tracing an item’s provenance, or origin, makes root cause analysis possible. Because the blockchain keeps the entire history of movement through the supply chain, many types of analysis are easier than with traditional data stores, which can overwrite data.
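The provenance tracing just described can be sketched in a few lines of Python; the transfer records below are hypothetical stand-ins for on-chain transfer events, assumed to be in chronological order:

```python
# Each record mimics an on-chain transfer event: (item_id, from_party, to_party).
TRANSFERS = [
    ("lot-42", "Factory", "Distributor"),
    ("lot-42", "Distributor", "Pharmacy"),
    ("lot-07", "Factory", "Wholesaler"),
]

def trace(item_id, transfers):
    """Reconstruct an item's chain of custody, from origin to current holder."""
    path = []
    for item, src, dst in transfers:
        if item == item_id:
            if not path:
                path.append(src)  # first record establishes the origin
            path.append(dst)
    return path

# trace("lot-42", TRANSFERS) walks the lot from Factory to Pharmacy.
```

Because the chain never overwrites history, the same records support both forward tracking and backward tracing to an item’s origin.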
The US Food and Drug Administration is working with several private firms to evaluate the use of blockchain technology in supply chain applications to identify, track, and trace prescription drugs. Analysis of the blockchain data can provide evidence for identifying counterfeit drugs and the delivery paths criminals use to get those drugs to market.

Empowering predictive analytics

You can build several models that allow you to predict future behavior based on past observations. Predictive analytics is often one of the goals of an organization’s analytics projects. Large organizations may already have a collection of data that supports prediction. Smaller organizations, however, probably lack enough data to make accurate predictions. Even large organizations would still benefit from datasets that extend beyond their own customers and partners.

In the past, a common approach to acquiring enough data for meaningful analysis was to purchase data from an aggregator. Each data acquisition request costs money, and the data you receive may still be limited in scope. The prospect of using public blockchains has the potential to change the way we all access public data. If a majority of supply chain interactions, for example, use a public blockchain, that data is available to anyone — for free. As more organizations incorporate blockchains into their operations, analysts could leverage the additional data to empower more companies to use predictive analytics with less reliance on localized data.

Analyzing real-time data

Blockchain transactions happen in real time, across intranational and international borders. Not only are banks and innovators in financial technology pursuing blockchain for the speed it offers to transactions, but data scientists and analysts are observing blockchain data changes and additions in real time, greatly increasing the potential for fast decision-making. To view how dynamic blockchain data really is, visit the Ethviewer Ethereum blockchain monitor’s website.
The following image shows the Ethviewer website. Each small circle in the blob near the lower-left corner of the web page is a distinct transaction waiting to make it into a new block. You can see how dynamic the Ethereum blockchain is — it changes constantly. And when the blockchain changes, so does the blockchain data that your models use to provide accurate results.

Supercharging business strategy

Companies big and small — marketing firms, financial technology giants, small local retailers, and many more — can fine-tune their strategies to keep up with, and even get ahead of, shifts in the market, the economy, and their customer base. How? By utilizing the results of analytics models built on the organization’s blockchain data. The ultimate goal for any analytics project is to provide ROI for the sponsoring organization. Blockchain analytics projects present a unique opportunity to provide value. New blockchain implementations are only recently becoming common in organizations, and now is the time to view those sources of data as new opportunities to provide value. Analytics can help identify potential sources of ROI.

Managing data sharing

Blockchain technology is often referred to as a disruptive technology, and there is some truth to that characterization. Blockchain does disrupt many things. In the context of data analytics, blockchain changes the way analysts acquire at least some of their data. If a public or consortium blockchain is the source for an analytics model, it's a near certainty that the sponsoring organization does not own all the data. Much of the data in a non-private blockchain comes from other entities that decided to place the data in a shared repository: the blockchain. Blockchain can aid in the storage of data in a distributed network and make that data easily accessible to project teams. Easy access to data makes the whole analytics process easier.
There still may be a lot of work to do, but you can always count on the fact that blockchain data is accessible and hasn’t changed since it was written. Blockchain makes collaboration among data analysts and other data consumers easier than with more traditional data repositories.

Standardizing collaboration forms

Blockchain technology empowers analytics in more ways than just providing access to more data. Regardless of whether blockchain technology is deployed in the healthcare, legal, government, or another organizational domain, it can lead to more efficient process automation. Also, blockchain’s novel approach to how data is generated and shared among parties can lead to better and greater standardization in how end users populate forms and how other data gets collected. Blockchains can help encourage adherence to agreed-upon standards for data handling. The use of data-handling standards greatly decreases the amount of time necessary for data cleaning and management. Because cleansing data commonly requires a large time investment in the analytics process, standardization through the use of blockchain can make it easier to build and modify models with a short time-to-market.
Article / Updated 08-04-2022
The main purpose of data analytics is to uncover hidden meaning in data. If it were easy to look at raw data and interpret what it means, there wouldn’t be a need for sophisticated data analytics. Although a well-trained analyst can look at a model’s mathematical output and make inferences about the data, those inferences aren’t always easy to explain to others. To clearly explain the results of most models’ output, you need to draw a picture.

Visualizing data isn’t just a nice thing to know; it's critical to conveying meaning to other people. Technical and non-technical people alike benefit from a good data visualization. Sometimes a bar chart most clearly explains data visually; other times a pie chart is better. Knowing how to visualize your data for the biggest effect is an important skill that improves with experience. One of the most critical parts of any analytics project is presenting the results. Choosing the right visualizations for presenting your results can make or break your presentation. In this article, you discover ten tips for visualizing data. These tips will help you assess your data and choose a visualization technique that will most clearly convey the story your data wants to tell.

Checking the landscape around you

Just as the great scientists of our age stand on the shoulders of the giants who came before them, you should take the opportunity to learn from existing visualizations. A quick Internet search on visualizing data will give you many ideas on what kinds of visualizations others have used, pointers on how they were done, and even some potential pitfalls. In many cases, you can visualize a specific type of data in several ways, and seeing how others have done it might give you some ideas. And if you’ve already created visualizations of your data, seeing someone else's approach might inspire you to improve your work. To get started, look at an example from the king of data, Google.
This image shows a visualization of the Ethereum blockchain from BigQuery, Google’s big data analytics platform. You can read about BigQuery and its blockchain visualizations. Regardless of the source, taking time to look over how others have visualized their data can be both instructive and enlightening.

Leveraging the blockchain community

Many analysts and data scientists of all skill levels are online and willing to help point aspiring data visualizers to the right datasets and tools. Stack Overflow, Reddit (and appropriate subreddits, such as the ones for data visualization and predictive analysis), and Kaggle are all great places to network online, ask questions, and learn how to build first-rate visualizations quickly. Many tools have active communities. Don’t ignore the value of asking questions of people who are more experienced than you. Chances are, they had lots of questions at some point in the past as well. User communities are great places to learn.

This image shows the results of searching for the term “techniques for visualizing data” on Stack Overflow. The image you see below shows the community and subreddit results of searching for “visualizing data” on Reddit. The following image shows the Kaggle website. You’ll find lots of resources on Stack Overflow, Reddit, and Kaggle, and all are worth bookmarking for later reference.

Make friends with network visualizations

One of the many data visualizations in computer science is the directed acyclic graph (DAG). DAGs have many uses, and it's easy to dive deep in a short period of time. For our use, let’s stick with a simple explanation of DAGs. A DAG, also sometimes called a network graph, is a directed graph of vertices and edges. Vertices are generally states, and edges are transitions from one state to another. If you’re wondering how DAGs relate to blockchain data, remember that blockchain technology excels at handling transfers of ownership.
You can represent a blockchain transaction as two vertices (the from account and the to account) and an edge (the amount of the transfer). Using a DAG (network graph), you can visually show how assets are transferred from one account to another. Network graphs make it possible to visualize any transfer, such as in a supply chain blockchain. Visualizing data using network graphs isn’t new. For example, the GIGRAPH application makes it easy to turn spreadsheet data into a network graph. You could do the same thing with any type of blockchain data. The following image shows an example of a network graph generated from tabular data in an Excel spreadsheet.

Recognize subjectivity when visualizing blockchain data

Whenever you engage in cryptocurrency or other blockchain data analysis and visualization, you should recognize that legacy systems often calculate value differently than new systems, especially new systems that incorporate cryptocurrency-based transactions. The value of transactions, and of the currency itself, is subject to at least some degree of subjectivity. For instance, it's common to claim that blockchain transaction fees are far cheaper than the real-life processing fees they would replace. This may be true today, but if the value of cryptocurrency changes dramatically with respect to fiat currency, the relative values may change as well. A blockchain transaction fee today may seem very low, but worldwide financial turmoil coupled with a global strengthening of trust in cryptocurrency could invert today’s value perception. When you analyze, and especially when you visualize, make sure you deal with any ambiguity that relative valuation may cause and communicate it clearly to the audience of your visualizations. Likewise, if your visualizations are built on any assumptions or constraints, be sure to note those as well. You want your visualizations to stand on their own as much as possible, not open to wildly different interpretations by the audience.
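The transaction-as-graph idea described earlier can be sketched without any graph library; the accounts and amounts below are hypothetical:

```python
from collections import defaultdict

# Hypothetical transfers: (from_account, to_account, amount).
transfers = [
    ("0xA", "0xB", 5.0),
    ("0xB", "0xC", 2.5),
    ("0xA", "0xD", 1.0),
]

# Build a directed graph as an adjacency list: vertex -> [(vertex, edge weight)].
graph = defaultdict(list)
for src, dst, amount in transfers:
    graph[src].append((dst, amount))

def downstream(account, graph):
    """All accounts reachable from `account` by following transfer edges."""
    seen, stack = set(), [account]
    while stack:
        node = stack.pop()
        for nxt, _amt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

The same adjacency structure can then be handed to a plotting tool (or a library such as D3.js) to draw the network graph itself.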
Use scale, text, and the information you need to visualize your data

Blockchain analysis is a data-rich environment, so you need to make sure you don’t overwhelm your audience with too much information. Providing too many nodes or colors or excessively specific visual markers can make visualizations confusing, which defeats the point of using visuals. Determining what is “too much” is a bit of an art form. In general, use your best judgment, include only the information you need, and present it clearly. Tableau Gurus published a nice article on how to avoid clutter in your visuals. The data visualization recommendations in this article are timeless and worth incorporating into your own work. The suggestions are simple and straightforward. The following image shows an example suggestion from Tableau Gurus for simplifying visualizations.

If your data is either isolated to a narrow band in your visualization or varies widely, consider changing the scale. Decreasing the scale can cause narrowly depicted data to show more variance, and a log scale can show relative changes more clearly. If your data doesn’t tell a story clearly, try changing its scale to see if that exposes interesting information.

Consider frequent updates for volatile blockchain data

Although it's true that data in a blockchain block never changes, new blocks are added every few minutes or seconds. Regardless of when you execute an analytics model on blockchain data, the volatility of the blockchain makes your analysis stale almost immediately. New transactions are submitted in a nearly continuous stream, and any of those transactions could affect your models. Your choice is to either frequently update your model and its associated datasets so they stay relatively current with the live blockchain, or clearly state the highest block represented in your model. The latter approach tends to be easier but can be more confusing for your audience.
Just reminding your audience that a model is based on outdated data generally doesn’t communicate the potential risk of relying on old data. In most cases, frequent updates mean more accurate results. To get an idea of the dynamic nature of blockchains, visit Ethviewer, a real-time Ethereum blockchain monitor shown below. You don’t have to look at the Ethviewer web page long to get an appreciation of how quickly transactions are submitted and make it into a new block.

Get ready for big data

Blockchain analysis gives analysts access to massive amounts of information. If you want to successfully analyze and visualize large sets of data in compelling ways, both your visualization tools and the hardware that runs them must be capable of handling the load. Hadoop is one of the most popular options for big-data analysis. On the visualization side, Jupyter, Tableau, D3.js, and Google Charts can help. A little research into the right tools goes a long way. As far as hardware, make sure your CPU and memory are up to the task — you’ll want at least a quad-core CPU and 16 GB of RAM. You can run analytics on big data with less, but your performance might suffer. Visit the following websites to get more information on visualization tools that are ready to handle big-data analysis:

Jupyter: This extremely useful toolset supports visualizations of datasets from small to extremely large. Learn about the products from the Jupyter Project; you’ll be glad you did.

Tableau: Tableau is a market leader in big data analysis and visualization. This product is mature and integrates with most large-scale data-handling and high-performance processing platforms. For an enterprise-class analytics framework, Tableau is hard to beat.

Google Charts: The Google Charts website says it all: “Google chart tools are powerful, simple to use, and free.”

D3.js: The Data-Driven Documents JavaScript library (D3.js) provides the capability to visualize big data using many techniques in JavaScript programs.
If you’re using JavaScript to build analytics models, D3.js should be on your evaluation list.

Protect privacy in your data visualizations

In today’s hyper-regulated and privacy-sensitive business environment, you must ensure that you're using a large enough dataset or partitions to avoid the possibility of associating any unique individual with the data your audience views. To make matters worse, even large datasets or partitions may not be enough to protect privacy. Sophisticated re-identification capabilities can infer unique identities from what seems to be a minimal amount of data. In addition to taking care to preserve privacy when you build datasets, your models must also be built to preserve privacy in the results they produce.

Blockchain might seem immune to privacy issues because no real-life identities are associated with transactions. But Peter Szilagyi, a core Ethereum developer, has talked about various sites capable of creating links between a user’s IP address and an Ethereum transaction address. Although the ability he describes has generally been blocked in many apps, other attacks on privacy will arise. As with all data analysis and visualization efforts, it’s better to be safe than sorry. Always pay attention to privacy as you build datasets and the models that analyze your data.

Let your data visualizations tell your story

Any time you attempt to digest a large amount of data and present results, it’s easy to overwhelm your audience with too much information and complex visualizations. Just as important as creating easy-to-understand visualizations is ensuring that they contribute to what you are trying to say. This point is true for any visualizations, not just those associated with blockchain. Keep in mind the big picture you’re creating. Go back to the beginning of your analytics project. Remind yourself of the original goals of the project.
Then, as you work toward building visualizations for each model, revisit the goals for each model. As long as each visualization conveys the message you want to convey and meets one or more of the project’s goals, you've created a useful visualization. Only include useful visuals. Extra visuals, no matter how flashy they may be, detract from the project’s primary goal. Stay focused on what you've been asked to do.

Challenge yourself!

Blockchain is an emerging technology, and its uses are still being discovered and fleshed out. Keep up with the latest research, papers, and competitions on sites such as Kaggle to keep your analysis and visualization skills sharp. Take online courses on visualization topics and tools, and just keep learning! Remember that if a picture really is worth a thousand words, strive to use those thousand words better with each new project. Want to learn more? Check out this article to learn what makes a good data visualization.
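The earlier advice about using large enough partitions is the idea behind k-anonymity: every combination of quasi-identifying attributes should be shared by at least k records before the data is shown to an audience. A minimal sketch in Python, with hypothetical records:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical records keyed by ZIP code and age band.
records = [
    {"zip": "10001", "age_band": "30-39", "spend": 120},
    {"zip": "10001", "age_band": "30-39", "spend": 340},
    {"zip": "10002", "age_band": "40-49", "spend": 95},
]

# The ("10002", "40-49") group has only one member, so this data fails k=2.
```

If the check fails, you can generalize the quasi-identifiers (wider age bands, shorter ZIP prefixes) until every group reaches size k.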
Article / Updated 08-04-2022
The information age offers many new opportunities and just as many (if not more) challenges. The vast amount of data available to organizations of all types empowers advanced decision-making and raises new questions of privacy and ethics. Whether you are undertaking a blockchain data analytics project or engaging with data in any other way, there are certain regulations and data privacy laws you should be aware of.

Consumer protection groups have long been voicing concerns about how personal data is being used. In response to discovered abuses and the recognition of potential future abuses, governing bodies around the world have passed regulations and legislation to limit how data is collected and used. Although collecting a few pieces of information about a customer may seem innocent, it doesn’t take long for accumulated data to paint a picture of an individual’s personal characteristics and behavior. Knowing someone's past behavior makes it relatively easy to predict the person's future actions and choices. Predicting actions has value for marketing but also poses a danger to an individual’s privacy.

Classifying individuals in data

The concern is that personal data has been, and will continue to be, used to classify individuals based on their past behavior. Classifying individuals can be great for marketing and sales purposes. For example, any retailer that can identify engaged couples can target them with ads and coupons for wedding-related items. This type of targeted advertising is generally more productive than general marketing. The advertising budget can be focused on target markets that provide the greatest ROI. On the other hand, knowing too much about individuals may violate a person’s privacy. One instance of a privacy violation was a result of the Target Corporation’s astute data analysis. Target’s analysts were able to identify expectant mothers early in their pregnancy based on their changing purchasing habits.
When a new expectant mother was identified, Target would send unsolicited coupons for baby-related items. In one case, the coupons arrived in the mail before the mother had shared that she was pregnant; her family found out about the pregnancy from a retailer. Privacy is such a difficult issue because even legitimate actions can violate a person’s privacy.

Identifying criminals

Another aspect of privacy arises when criminals, or other individuals who deliberately want to operate anonymously, hide their identities from exposure. Privacy may be important to the general population, but it's a necessity for criminal activity. The ability to deny, or repudiate, some action is crucial to avoiding discovery and capture, and to any subsequent defense. Money laundering and fraud are two activities in which privacy and anonymity are desired to obfuscate illegal activity. On the other hand, law enforcement needs the ability to associate actions with individuals. That’s why laws exist that protect the general public but allow law enforcement to conduct investigations and identify alleged perpetrators. Protecting the privacy of law-abiding individuals while identifying criminals has become important across a spectrum of organizations. To enable law enforcement to deal with online privacy issues, legislative bodies have passed various laws to address those issues directly.

Common privacy laws

Here are a few of the most important privacy-related laws you’ll likely encounter and may be compelled to satisfy:

Children’s Online Privacy Protection Act (COPPA): Passed in 1998, COPPA requires parental or guardian consent before collecting or using private information about children under the age of 13.

Health Insurance Portability and Accountability Act (HIPAA): Passed in 1996, HIPAA modernized the flow of health care information and contains specific stipulations on protecting the privacy of personal health information (PHI).
- Family Educational Rights and Privacy Act (FERPA): Passed in 1974, FERPA protects access to educational information, including the privacy of student records.
- General Data Protection Regulation (GDPR): Passed in 2016 (and implemented in 2018), GDPR is a comprehensive European Union (EU) regulation protecting the private data of EU citizens. Every organization, regardless of location, must comply with GDPR to conduct business with EU citizens. EU citizens must retain control over their own data, its collection, and its use.
- California Consumer Privacy Act (CCPA): Passed in 2018, CCPA has been called "GDPR lite" because it includes many of the requirements of GDPR. CCPA requires any organization that conducts business with California residents to protect consumer data privacy.
- Anti-Money Laundering (AML): AML refers to a set of laws and regulations that assist law enforcement investigations by requiring financial transactions to be associated with validated identities. AML imposes requirements and procedures on financial institutions that make it very difficult to transfer money without leaving a clear audit trail.
- Know Your Customer (KYC): KYC laws and regulations work with AML to ensure that businesses expend reasonable effort to verify the identity of each customer and business partner. KYC helps discourage money laundering, bribery, and other financial crimes that rely on anonymity.

Want to learn more? Read our article to learn how to prevent data privacy disasters.
View ArticleArticle / Updated 08-04-2022
Although understanding the blockchain data available through transactions, events, and contract state is important, you must understand what that data represents before you can make much sense of it. An important part of any blockchain (or traditional) data analytics project is to align data with the real world. In a blockchain environment, that understanding starts with smart contracts.

Understanding smart contract functions

You can think of smart contracts as programs that contain data and the functions to manipulate that data. One way to understand smart contracts is to think of state data as nouns and functions as verbs. Associating smart contract elements with parts of speech helps clarify each element's purpose. You store data that represents something in the real world, such as an order, a product, or a letter of credit. Functions provide the actions that applications take on data, such as creating an order, createOrder(), shipping a product, shipProduct(), or requesting a letter of credit, requestLoC(). Data analytics focuses on extracting meaningful and actionable information from data. It is important to understand the data available to you, along with how that data was created and what real-world things and processes it represents. Smart contract functions provide the roadmap to how data gets added to the blockchain and what that data means.

Assessing smart contract event logs

One early step in any data analytics project is assessing your available data. In a blockchain environment, that step should include assessing any events related to the smart contracts you'll examine. One way to view events is as documentation of internal operations. These microtransaction artifacts often provide a level of granular data that you can't get anywhere else. Don't ignore the event logs — they may provide your best description of blockchain data and what it really represents.
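To make that assessment step concrete, here is a minimal Python sketch of cataloging decoded events before modeling. The event names and fields (OrderCreated, orderId, and so on) are hypothetical stand-ins for the entries a client library such as web3.py would return from a contract's logs:

```python
from collections import Counter

# Hypothetical decoded events, shaped like the dictionaries a blockchain
# client returns when you fetch a contract's logs. The names and values
# here are invented for illustration.
events = [
    {"event": "OrderCreated",   "blockNumber": 101, "args": {"orderId": 1}},
    {"event": "ProductShipped", "blockNumber": 102, "args": {"orderId": 1}},
    {"event": "OrderCreated",   "blockNumber": 103, "args": {"orderId": 2}},
    {"event": "LoCRequested",   "blockNumber": 103, "args": {"orderId": 2}},
]

def summarize_events(events):
    """Catalog which event types appear and how often -- a first
    assessment step before deciding which logs feed your model."""
    return Counter(e["event"] for e in events)

summary = summarize_events(events)
print(summary)  # e.g. OrderCreated appears twice in this toy log
```

A simple frequency catalog like this quickly shows which real-world actions (orders created, products shipped) your contract actually recorded, and which event types are too sparse to model.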
Ranking blockchain transaction and event data by its effect

After you have a catalog of the data available to you, rank each data item's importance by its effect. A data item has greater effect when it corresponds to some entity attribute or action in the real world. Data that represents a letter of credit's approval status change is likely more important than the field that records the page count of the letter of credit document. Not all data is equal. It is always up to you, the data analyst, to focus on the important data and not spend too much time on data with little value. Properly ranking data value by its effect is a learned skill, and one that takes practice. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
View ArticleArticle / Updated 08-04-2022
Blockchain technology is viewed as a disruptive technology because it promises to remove intermediaries and change the way business is conducted. That's a big promise, but it's achievable. Removing even some of the intermediaries in existing business processes has the potential to streamline and economize workflows at all levels. On the other hand, moving a business process to blockchain technology is not a simple switch. Widespread implementation of blockchain technology requires new business and software products that integrate with existing software and data. The challenge of moving from concept to deployment poses the greatest current difficulty for blockchain adoption.

Finding a good blockchain fit for your business

The first step in successfully implementing blockchain technology in any environment is finding a good-fit use case. It doesn't make any sense to jump into blockchain just because it's new and cool. It has to make sense for you and your organization. That statement sounds obvious, but you'd be surprised how many organizations want to chase the shiny object that is blockchain. Blockchain has many benefits, but three of the most common are data transparency, process disintermediation (removing middlemen), and persistent transaction history. The best-fit use cases for blockchain generally focus on one of these benefits. If you have to look hard at how blockchain technology can meet the needs of your organization, it may be best to wait until there is a clear need. The most successful blockchain implementations are those that start with clear goals that align with blockchain's strengths. For example, suppose a seafood supplier wants to trace their seafood back to the source to determine whether it was caught or harvested in the wild using humane and sustainable methods. A blockchain app would make it possible to manage seafood from the point of collection all the way to the consumer's purchase.
Any participant along the way, including the consumer, can scan a tag on the seafood and find out when and where it was originally caught. To increase the probability of a successful blockchain project, start with a clear description of how the technology aligns with project goals. Trying to fit blockchain to an ill-suited use case leads to frustration and ultimate failure.

Integrating blockchain technology with legacy artifacts

After you determine that blockchain is a good fit for your environment, the next step is to determine where it fits in the workflow. Unless you're building a new app and workflow, you'll have to integrate with existing software and infrastructure. If you are creating something new, the only considerations revolve around how your app stores the data it needs. Will you store everything on the blockchain? It may not make sense to do that. For example, blockchain does a great job of handling transactional data and keeping permanent audit trails of changes to data. Do you need that for customer information? You may find that only part of your app data should be stored on the blockchain. It may make more sense to store supporting data in off-chain data repositories. (Now that we're in the blockchain era, legacy databases are called off-chain repositories.) If this is the case, your app will have to integrate with both the blockchain and the off-chain repository. In many cases, people are integrating new blockchain functionality with legacy applications and data. This integration effort could include both introducing new blockchain functionality and moving existing functionality to a blockchain environment. Although this task may sound straightforward, integrating with legacy systems involves many subtle implications. Legacy systems define notions of identity, transaction scoping (defining how much work is accomplished in a single transaction), and performance expectations.
Some questions to consider:

- How will your new app associate legacy identities with blockchain accounts?
- How will you adhere to your existing application's notion of traditional transactions? If your application supports rolling back a transaction, how will your blockchain handle that?
- Will the legacy application's users have to wait for blockchain transactions, or will they be able to carry out work as they did before the blockchain implementation?
- Finally, will the integration of blockchain maintain sufficient performance, or will it slow down the legacy application?

Scaling blockchain to the enterprise

That last question leads into one of the biggest current obstacles to blockchain adoption. Scaling performance to enterprise levels is an ongoing pursuit that hasn't been completely resolved. Most enterprise applications use legacy database management systems to store and retrieve data. These data repositories have been around for decades and have become efficient at handling vast amounts of data. According to Changpeng Zhao (CEO of the cryptocurrency exchange Binance), a blockchain implementation must be able to support 40,000 transactions per second to be viable as a core technology in a global cryptocurrency exchange. Currently, only four popular blockchain implementations claim to be capable of more than 1,000 transactions per second: Futurepia, EOS, Ripple, and NEO. The most popular public blockchain, Ethereum, currently handles about 25 transactions per second, although future releases of Ethereum focus on raising transaction throughput substantially. The technology is getting better but has a long way to go before it's ready for the volume that enterprises require. Performance isn't the only limiting factor when assessing blockchain for the enterprise. Integration with legacy artifacts and the ease with which the blockchain infrastructure fits into the existing enterprise IT infrastructure are concerns as well.
Do all blockchain nodes require new virtual or physical hardware? Can the new nodes run on existing servers? What about network connectivity? Will existing network infrastructure support the new blockchain network? These are only a few of the many questions that enterprises must answer before deploying a blockchain integration project. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
View ArticleArticle / Updated 08-04-2022
Knowing how to access blockchain data and use it in analytics models is only the first step toward creating useful results. The next step is to actually do these tasks. Although you can develop models using a simple text editor, having the right tools will speed the process and make you far more productive. The right tool for each part of a blockchain data analytics project can dramatically increase the probability that your results will have value to your organization. No single tool, framework, or package works well in every blockchain situation. You must define your project's requirements, consider the resources available to you, and then select the best collection of tools for your analytics toolbox. Here, you learn about ten common tools that analysts use for blockchain analytics projects. This article includes an assortment of tools that address a wide range of requirements and will help you get a jumpstart toward delivering quality blockchain analytics results.

Develop blockchain data analytics models with Anaconda

Anaconda should be the first tool you download and install because of the many ways it makes analytics easier. You can get Anaconda for small teams or for enterprise analytics development and deployment. The team and enterprise Anaconda licenses aren't free, but in exchange for the licensing fee you get collaboration capabilities that make team analytics development easier, including tools to extract and organize data, prototype models, develop analytics solutions, and deploy those solutions. The Anaconda environment promotes "an integrated, end-to-end data experience," where analytics project team members can easily collaborate and share project artifacts. Anaconda Navigator, shown below, is the default user interface, but you can use the conda command-line interface if you prefer a text-based interface.
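If you prefer the command line, a typical conda session might look like the following sketch. The environment name and package list here are just examples; pick whatever fits your project:

```shell
# Create an isolated environment for the analytics project
conda create --name blockchain-analytics python=3.10

# Activate it and add common analytics libraries
conda activate blockchain-analytics
conda install pandas matplotlib jupyter

# List the installed packages to verify the environment
conda list
```

Keeping each project in its own environment means a library upgrade for one model can't silently break another.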
In the image above, note that only some tools are installed. When you install Anaconda, the install process checks your computer to see whether any of the tools in Anaconda Navigator are already installed. Any tools that are recommended as part of the Anaconda environment but haven't been installed yet have an Install button under their icons. To install a new tool, just click or tap its Install button. Anaconda is far more than a collection of tools. One of its most valuable aspects is that it automatically installs many of the analytics libraries you'll use when building models. And if highly productive tools and pre-installed libraries aren't enough, Anaconda also provides many entry points to product documentation and tutorials to help you get up to speed in record time. If you choose only one tool to supercharge your analytics projects, choose Anaconda.

Write code in Visual Studio Code

When writing software for nearly any environment (in nearly any language), try the Visual Studio Code integrated development environment (IDE). Visual Studio Code, commonly called VS Code, is a freely available code editor and IDE from Microsoft that includes support for debugging, task execution, and version control. Microsoft provides VS Code for Windows, Linux, and macOS. Although technically a lightweight alternative to the flagship product, Visual Studio IDE, VS Code brings a ton of functionality to the table. VS Code is free for private and commercial use and gives developers a great environment for developing code. In addition to being free, VS Code is extremely functional and developer friendly. It has its own marketplace with hundreds of free extensions that provide support for multiple languages (syntax checking and inline help), handle different file formats, and integrate with many other tools. If you use VS Code and want an additional feature, there's a good chance you can find an extension that does what you want.
The following image shows VS Code in the editor window. This version of VS Code includes a Python extension, so VS Code automatically checks any Python code for syntax errors. Because you don't see any red squiggly underlines in the image, the code shown is syntactically correct. Although other good IDEs for code development are available, VS Code is one of the most popular choices among software developers, which is why it's one of the default tools in Anaconda Navigator.

Prototype blockchain data analytics models with Jupyter

Jupyter Notebook and JupyterLab are popular products from Project Jupyter, an open-source and open-standards group dedicated to providing interactive programming support for many languages. Jupyter Notebook and JupyterLab are both included in the default Anaconda Navigator due to their popularity with data analysts and machine-learning model developers. Both tools are web applications that allow developers and analysts to build and populate models in a shared environment. Jupyter tools are popular choices when learning about data analytics and machine learning because their online design makes it easy to share code and data, called notebooks, with others. Anyone who wants to share a model, data, or examples can just share a notebook. This next image shows the kmeans.py Python program in Jupyter Notebook. Building on the popularity of Jupyter Notebook, JupyterLab is the next generation of Jupyter's web interface for notebooks, code, and data. The image below shows the kmeans.py Python program in JupyterLab. Jupyter products support over 40 languages.

Develop blockchain data models in the R language with RStudio

Throughout this book, you learn about building analytics models with the Python language. But Python isn't the only language commonly used to build analytics models. The R language is another popular language for data modeling and analysis.
Like Python, R can import many libraries, called packages in R, that provide access to hundreds of analytics functions. One of the most popular IDEs for working with the R language is RStudio. You can use VS Code for R development, but RStudio is a strong alternative and a favorite of R developers. In fact, you can use RStudio for both R and Python code development. RStudio is available as a standalone IDE and as a web-based server interface; both are open-source products. RStudio also offers a range of professional for-fee products designed for teams of analysts and developers who need collaboration features. The following image shows an R program that analyzes a dataset of income records by zip code. The RStudio IDE displays the R code, console messages, a list of items in memory, and the final visual output. Before you install RStudio, you must install the R language. If you launch RStudio and get a message that R needs to be installed, you forgot to install the R language first.

Interact with blockchain data with web3.py

You need a blockchain client to interact with data stored in your blockchain. Each blockchain implementation is different, but the overall concepts are similar. After you learn how to access and analyze data from one blockchain implementation, mapping that knowledge to another environment is relatively easy. You can use the web3.py Ethereum blockchain client to access blockchain data. You'll need this critical library to examine and extract the blockchain data required by your analytics models. This image shows the web3.py project website and several options you can use to install the web3.py library. But web3.py isn't the only option. A few client options exist for the Ethereum blockchain, and a quick Internet search will show you multiple options for other blockchains.
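Under the hood, clients such as web3.py talk to an Ethereum node using JSON-RPC over HTTP. As a rough, standard-library-only sketch of that wire format, the following builds the kind of request a client issues to fetch the latest block (the endpoint URL in the comment is a placeholder for your own node):

```python
import json

def make_rpc_request(method, params, request_id=1):
    """Build a JSON-RPC 2.0 request body -- the wire-level call that a
    client library such as web3.py constructs and sends for you."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }

# eth_getBlockByNumber: "latest" block; False means return transaction
# hashes only, not full transaction objects
payload = make_rpc_request("eth_getBlockByNumber", ["latest", False])
body = json.dumps(payload)
print(body)

# To actually send it, you would POST `body` to a node endpoint, e.g.:
# urllib.request.urlopen(urllib.request.Request(
#     "http://127.0.0.1:8545", data=body.encode(),
#     headers={"Content-Type": "application/json"}))
```

Seeing the raw request makes it clear why switching from a local Ganache node to a hosted endpoint is usually just a change of URL.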
Extract blockchain data to a database

Throughout this book you learn how to identify blockchain data of interest and extract that data for use in analytics models. In some cases, you might need to extract blockchain data first and explore it later. Because you may not know what data you'll need up front, you may find it more efficient to extract blockchain data to an off-chain repository for later analysis. By extracting blockchain data and storing it in a high-performance database management system, you can decrease data access times. You can write your own extraction code, but several generic products are already available to extract blockchain data and store it in a database.

Extracting blockchain data with EthereumDB

EthereumDB is an open-source product that extracts Ethereum blockchain data and stores it in a SQLite database. EthereumDB is a quick and simple method for extracting summary data, transaction details, and block information into separate relational database tables. You can use EthereumDB as is or as a tutorial on how to extract Ethereum blockchain data.

Storing blockchain data in a database using Ethereum-etl

Ethereum-etl is another open-source product you can use to extract Ethereum blockchain data. Ethereum-etl is more complex and flexible than EthereumDB. Using Ethereum-etl, you can output extracted data to text files or database tables. You also have a wider range of blockchain data you can extract, including block data, token transfers, and event logs. If you want to tailor the data you extract from an Ethereum blockchain, Ethereum-etl is a good option to explore.

Access Ethereum networks at scale with Infura

All examples in this book use local blockchains provided by Ganache. Although Ganache is a great tool for learning blockchain concepts and developing your own blockchain code, it isn't a live blockchain network. Real analytics projects will need to interact with real blockchain networks.
Your organization may implement its own blockchain network; if not, you'll need to interact with Ethereum's mainnet or some other public blockchain. Interacting with a public blockchain comes with some constraints and obstacles. First, to get to all of a blockchain's data, you need to connect to a full node. Running a full blockchain node requires an investment in infrastructure. Specifically, you need to dedicate disk space to store the blockchain data, a device to run the blockchain client, and sufficient network access to download all the blockchain data initially and then to process new blocks. Interacting with one blockchain may be feasible, but as you add more public blockchains to your data universe, the infrastructure requirements may become untenable. One common solution to this growing infrastructure investment is to use someone else's infrastructure, and one of the most popular services for Ethereum blockchain access is Infura. An Infura account provides API access over HTTPS and WebSockets to multiple Ethereum networks, as well as to InterPlanetary File System (IPFS) resources. Using Infura takes one large obstacle (setting up your own Ethereum node) off the table and lets you focus on building analytics models. The next image shows Infura's architecture for accessing Ethereum and IPFS resources.

Analyze very large blockchain datasets in Python with Vaex

Regardless of where you get your data, there is likely to be lots of it. One common obstacle to operationalizing data analytics models is the size of the datasets you need to analyze. Most model types increase in accuracy with more data, but at some point, datasets become so large that they are difficult to manage. Even though your organization's infrastructure may have lots of servers with lots of memory, you may not always be able to provision huge amounts of resources every time you need to run a model.
To scale models to available hardware, many developers and analysts run models on partitions of their data or employ distributed processing. Partitioning your data can cut out important information, and distributing analytics can take a lot of work. However, another choice is available. Vaex is an open-source library that implements out-of-core dataframes, which allows you to write code that explores and visualizes datasets far bigger than your computer's memory. With Vaex, shown below, you can run analytics models on datasets hundreds of gigabytes in size, even on a laptop computer!

Examine blockchain data

One of the most important early steps in any analytics project is to identify the data your models need. You must take inventory of the data available to you and then explore sources for other data that your models require. When working in blockchain environments, the most common tool used to examine available data is a blockchain explorer. Most blockchain explorers are web applications that provide an easy interface for accessing data stored in a blockchain. Many blockchain explorer options are available, and each blockchain implementation has its own options. Here, you discover three popular options for exploring data on Ethereum and Bitcoin blockchains.

Explore Ethereum with Etherscan.io

Etherscan.io is the most popular blockchain explorer for Ethereum networks. Using Etherscan.io, you can explore blockchain data from Ethereum's mainnet or any of the most popular Ethereum test networks. You can look at blocks, transactions, event logs, or any data related to your selected network. Etherscan.io makes it easy to examine your blockchain data and identify the source data your models require. The following image shows the main Etherscan.io web page.

Peruse multiple blockchains with Blockchain.com

Some blockchain explorers support access to multiple blockchain networks.
For example, Block Explorer from Blockchain.com provides visibility similar to Etherscan.io's, but for more blockchain network types. Block Explorer provides an interface to block data from the mainnets of Bitcoin, Bitcoin Cash, and Ethereum, as well as the testnets for Bitcoin and Bitcoin Cash. This next image shows the main Block Explorer interface for the Bitcoin network.

View cryptocurrency details with ColossusXT

Some blockchain explorers, such as ColossusXT, focus on cryptocurrency transactions. Instead of providing generic block access, ColossusXT identifies blocks that contain specific cryptocurrency transactions. If your analytics queries focus on cryptocurrency transactions, ColossusXT may help you find the data you need. The image below shows the ColossusXT main interface for Bitcoin cryptocurrency transactions.

Preserve privacy in blockchain analytics with MADANA

A core concern for handling data, including in the context of analytics projects, is maintaining compliance with privacy regulations. Privacy is a growing concern among governing bodies, and the old, naive perception that encryption alone enforces privacy has been shown to be false. Privacy isn't about the data; privacy is about the individual. Data analytics queries often provide aggregate results that simplify classification or prediction. If your models enable the audience to associate an individual with your results, you've violated that individual's privacy. To avoid publishing results that might inadvertently leak granular data that could identify an individual, you have two main options. The first is to apply good privacy-preserving techniques to your models, which means learning about k-anonymity, l-diversity, t-closeness, and differential privacy. The second is to use a framework such as MADANA, which does much of this for you. MADANA provides a framework that helps you protect confidentiality and privacy.
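To give a flavor of the first option, here is a toy Python sketch of measuring k-anonymity over a handful of records. The field names and values are invented for illustration; real quasi-identifiers depend on your dataset:

```python
from collections import Counter

# Toy records: quasi-identifiers that, combined, could single someone out
records = [
    {"zip": "30301", "age_band": "30-39"},
    {"zip": "30301", "age_band": "30-39"},
    {"zip": "30301", "age_band": "30-39"},
    {"zip": "30302", "age_band": "40-49"},
]

def k_anonymity(records, quasi_identifiers):
    """Smallest group size over the quasi-identifier combinations.
    A dataset is k-anonymous if every combination appears >= k times."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())

k = k_anonymity(records, ["zip", "age_band"])
print(k)  # the ("30302", "40-49") record is unique, so k == 1
```

A result of k == 1 means at least one person is uniquely identifiable from the published fields, so the dataset would need generalization or suppression before release.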
If compliance is a concern for your organization, a framework like MADANA can help you stay compliant without having to design privacy-preserving models yourself. The image below shows the MADANA website, with some of its benefits. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
View ArticleArticle / Updated 08-04-2022
Blockchain has traditionally been associated with cryptocurrency, but its use cases extend far beyond that. Many good examples of blockchain use cases exist even today, and the possibilities abound. Here, you look at just a few of those use cases. See if you can think of a good blockchain use case in your own organization.

Using blockchain technology to manage physical items in cyberspace

One of the earliest large-scale blockchain use cases was the management of supply chains. Managing products from the original producer all the way to the consumer is expensive and time consuming. Even with today's product-tracking applications, it can be difficult for consumers to know much about the products they consume. Some products, such as electronics and appliances, may have descriptive tags that identify places and times of manufacture, but most products we consume don't provide that type of information. Implementing supply chain management on a blockchain provides multiple benefits. The first is transparency. Producers, consumers, and anyone in between can see how each product traveled from the place it was manufactured or acquired to where it was finally purchased, and how long it took to get there. Inspectors and regulatory auditors can ensure that each participant in the supply chain met required standards. This increased transparency occurs while eliminating unnecessary middlemen: each transfer in the process occurs between active participants, not brokers. Proper tracking of physical products in the blockchain depends on accurately associating each physical product with its digital identifier. For example, a flyer recently checked his bag on a commercial airline. The agent, busily engaged in a conversation with another agent, swapped the tags of the flyer's bag and another traveler's bag. The other traveler's tag was attached to the flyer's bag, and vice versa.
When the flyer arrived, the airline discovered that the flyer's bag, with the other person's tag attached, had flown to Mexico. Always remember that the blockchain only represents the physical world — it isn't the physical world.

Using the blockchain to handle sensitive information

Health care has become one of the most popular topics of conversation, ranging from politics to research to spending. It seems that everyone is interested in increasing the quality of health care while reducing its cost. The availability of large amounts of digital data has made advances in health care possible. Researchers can analyze large amounts of data to explore new treatment plans, increase the overall effectiveness of existing drugs and procedures, and identify cost-saving opportunities. This type of data analysis is possible only with access to vast amounts of patient medical history. The main problem for researchers is that a patient's electronic health record (EHR) is likely stored as fragments across multiple practices and databases. Although ongoing efforts to combine these records exist, privacy is a growing concern (we're back to the trust problem) and progress is slow. EHR management is a good fit for a blockchain app. Storing a patient's EHR in an Ethereum blockchain can remove the silos of fragmented data without requiring trust in each entity that provides or modifies parts of the EHR. Storing the EHR in this way also helps clarify the billing and payment for medical services. With a comprehensive medical procedure history all in one place, medical service providers and insurance companies can see the same view of a patient's treatment, which makes it easier to figure out what should be billed. Another advantage that blockchain apps can provide in the health care domain is in managing pharmaceuticals. Blockchain EHRs give medical practitioners a full history and current snapshot of a patient's prescription medications.
It also allows researchers, auditors, and even pharmaceutical manufacturers to examine the effects and possible side effects of their products. Having EHRs available, yet protected, can provide valuable information to increase the quality of health care services.

Using blockchain technology to conduct financial transactions

Financial services are interactions that involve some exchange of currency. The currency can be legal tender, also called fiat currency, or it can be cryptocurrency, such as Bitcoin or Ethereum's default currency, ether (ETH). Blockchain apps do a great job of handling pure currency exchanges, or exchanging some currency for a product or service. Financial services may center on handling payments, but the many transactions that involve money have more nuances. Another rich field for blockchain in the financial services domain is real estate transactions. As with banking transactions, Ethereum makes it possible to conduct transactions without a broker. Buyers and sellers can exchange currency for legal title directly. Smart contracts can validate all aspects of the transaction as it occurs. The steps that normally require an attorney or a loan processor can happen automatically. A buyer can transfer funds to purchase a property after legal requirements are met, such as validating the title's availability and filing required government documents. The seller receives payment for the property at the same time the title transfers to the buyer. Want to learn more? Check out our Blockchain Cheat Sheet.
View ArticleCheat Sheet / Updated 04-27-2022
A predictive analytics project combines execution of details with big-picture thinking. These handy tips and checklists will help keep your project on the rails and out of the woods.
View Cheat SheetCheat Sheet / Updated 04-25-2022
Data science affects many different technologies in a profound manner. Our society runs on data today, so you can’t do many things that aren’t affected by it in some way. Even the timing of stoplights depends on data collected by the highway department. Your food shopping experience depends on data collected from Point of Sale (POS) terminals, surveys, farming data, and sources you can’t even begin to imagine. No matter how you use data, this cheat sheet will help you use it more effectively.
View Cheat Sheet