
Data Science For Dummies

By Lillian Pierson | Published: 09-15-2021

Monetize your company’s data and data science expertise without spending a fortune hiring independent strategy consultants.

What if there was one simple, clear process for ensuring that all your company’s data science projects achieve a high return on investment? What if you could validate your ideas for future data science projects and select the one that’s most primed for profitability while also moving your company closer to its business vision? There is.

Industry-acclaimed data science consultant Lillian Pierson shares her proprietary STAR Framework, a simple, proven process for leading profitable data science projects.

Not sure what data science is yet? Don’t worry! Parts 1 and 2 of Data Science For Dummies cover all the bases for you. Already a data science expert? Then you won’t want to miss the data science strategy and data monetization gems shared from Part 3 onward.

Data Science For Dummies demonstrates:

  • The only process you’ll ever need to lead profitable data science projects
  • Secret, reverse-engineered data monetization tactics that no one’s talking about
  • The shocking truth about how simple natural language processing can be
  • How to beat the crowd of data professionals by cultivating your own unique blend of data science expertise 

Whether you’re new to the data science field or already a decade in, you’re sure to learn something new and incredibly valuable from Data Science For Dummies. Discover how to generate massive business wins from your company’s data by picking up your copy today.

Articles From Data Science For Dummies

Data Science For Dummies Cheat Sheet

Cheat Sheet / Updated 09-24-2021

"Data science" is the big buzzword these days, and most folks who have come across the term realize that data science is a powerful force that is in the process of revolutionizing scores of major industries. Not many folks, however, are aware of the range of tools currently available that are designed to help big businesses and small take advantage of the data science revolution. Take a peek at these tools and see how they fit in to the broader context of data science.

Data Journalism: Collecting Data for Your Story

Article / Updated 04-18-2017

A data-journalism piece is only as good as the data that supports it. To publish a compelling story, you must find compelling data on which to build. That isn't always easy, but it's easier if you know how to use scraping and autofeeds to your advantage.

Scraping data

Web scraping involves setting up automated programs that scour the Internet and extract the exact, custom datasets you need, so you don't have to compile them by hand. The data you generate from this process is commonly called scraped data.

Most data journalists scrape source data for their stories because it's the most efficient way to get datasets for unique stories. Datasets that are easily accessible have usually already been exploited and mined by teams of data journalists who were looking for stories. To generate unique data sources for your data-driven story, scrape the data yourself. If you find easy-to-access data, beware: Most of the stories in that dataset have probably already been told by a journalist who discovered the data before you.

To illustrate how you'd use data scraping in data journalism, imagine the following example: You're a data journalist living in a U.S. state that directly borders Mexico. You've heard rumors that the local library's selection of Spanish-language children's books is woefully inadequate. You call the library, but its staff fear negative publicity and won't share any statistics with you on the topic.

Because the library won't budge on its data sharing, you're forced to scrape the library's online catalog to get the source data you need to support the story. Your scraping tool is customized to iterate over all possible searches and keep track of the results. After scraping the site, you discover that only 25 percent of the library's children's books are Spanish-language books, even though Spanish speakers make up 45 percent of the primary-school population. Is this difference significant enough to form the basis of a story? Maybe, maybe not. To dig a little deeper and possibly discover a reason behind the difference, you decide to scrape the catalog once a week for several weeks and then compare patterns of borrowing. When you find that a larger proportion of the Spanish-language books are being checked out, you have evidence that there is, indeed, high demand for children's books in Spanish. This finding, coupled with the results of your earlier site scrape, gives you all the support you need to craft a compelling article around the issue.
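To make the library example concrete, here's a minimal scraping sketch in Python using the requests and BeautifulSoup libraries. The catalog URL, query parameters, and the div.record selector are all hypothetical placeholders; a real scraper would be adapted to the actual markup of the catalog you're working with (and should respect the site's robots.txt and terms of use).

```python
# A minimal catalog-scraping sketch. The endpoint, parameters, and CSS
# selector below are hypothetical -- adapt them to the real site's markup.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://catalog.example-library.org/search"  # hypothetical endpoint

def count_results(language: str) -> int:
    """Return the number of children's books listed for a given language."""
    response = requests.get(
        BASE_URL,
        params={"audience": "children", "language": language},
        timeout=30,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumes each catalog record is rendered as <div class="record">.
    return len(soup.select("div.record"))

spanish = count_results("spa")
total = count_results("")  # empty language filter returns all records
print(f"Spanish-language share of children's books: {spanish / total:.0%}")
```

Repeating a run like this on a weekly schedule, and logging the counts each time, gives you the borrowing-pattern comparison described above.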
Setting up data alerts

To generate hot stories, data journalists need access to the freshest data releases coming from the most credible organizations. To stay on top of which datasets are being released where, data journalists subscribe to alert systems that send them notifications every time potentially important data is released. These alert systems often issue notifications via RSS feeds or via email. It's also possible to set up a custom application like DataStringer to send push notifications when significant modifications or updates are made to source databases.

After you subscribe to data alerts and form a solid idea of the data-release schedule, you can begin planning for data releases in advance. For example, if you're doing data journalism in the business-analytics niche and know that a particularly interesting quarterly report will be released in one week, you can use the time before its release to plan how you'll analyze the data when it becomes available.

Many times, after you're alerted to an important new data release, you still need to scrape the source site to get that data. In particular, if you're pulling data from a government department, you're likely to need to scrape the source site. Although most government organizations in Western countries are legally obligated to release data, they aren't required to release it in a format that's readily consumable. Don't expect them to make it easy for you to get the data you need to tell a story about their operations.
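As a companion sketch, here's one way to poll an RSS data-release feed in Python with the feedparser library. The feed URL and watch keywords are hypothetical; point them at the data portals you actually follow.

```python
# A minimal RSS-alert sketch using feedparser. The feed URL and keyword
# list are hypothetical placeholders for the portals you monitor.
import feedparser

FEED_URL = "https://data.example.gov/feeds/releases.rss"  # hypothetical feed
KEYWORDS = ("employment", "quarterly report")

def new_releases():
    """Yield (title, link) for feed entries that mention a watched keyword."""
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        title = entry.title.lower()
        if any(keyword in title for keyword in KEYWORDS):
            yield entry.title, entry.link

for title, link in new_releases():
    print(f"New dataset worth a look: {title} -> {link}")
```

Run on a schedule (cron, for example), a poller like this approximates the push notifications that dedicated alert systems provide.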

Data Journalism: How to Develop, Tell, and Present the Story

Article / Updated 04-18-2017

By thinking through the how of a story, you put yourself in a position to craft better data-driven stories. Looking at your data objectively and considering factors like how it was created helps you discover interesting insights to include in your story. Knowing how to quickly find stories in potential data sources helps you sift through a staggering array of options. And how you present your data-driven story determines much about how well that story is received by your target audience.

You could have done everything right: taken the time to get to know your audience, boiled your story down so that it says exactly what you intend, published it at just the right time, crafted your story around why people care, and even published it to just the right venue. But if your data visualization looks bad, or if your story layout makes it difficult for readers to quickly gather useful information, your positive response rates are likely to be low.

Integrating how as a source of data and story context

Think about how your data was generated, because that line of thinking often leads to more interesting and compelling storylines. Before drawing up a final outline for your story, brainstorm about how your source data was generated. If you find startling or attention-grabbing answers that are relevant to your story, consider introducing them in your writing or data visualization.

Finding stories in your data

If you know how to quickly and skillfully find stories in datasets, you can save time when exploring the array of stories your datasets offer. Quickly analyzing, understanding, and evaluating the stories in datasets requires solid data analysis and visualization skills. With those skills, you can quickly decide which datasets to keep and which to discard, and then find the most interesting, relevant stories in the datasets you select to support your story.

Presenting a data-driven story

How you present your data-driven story determines much about whether it succeeds or fails with your target audience. Should you use an infographic? A chart? A map? Should your visualization be static or interactive? You have to weigh countless considerations when deciding how best to present your story.

Data Journalism: Why the Story Matters

Article / Updated 04-18-2017

The human capacity to question and understand why things are the way they are is a clear delineation point between the human species and other highly cognitive mammals. Answers to questions about why help you make better-informed decisions, better structure the world around you, and develop reasoning beyond what you need for mere survival. In data journalism, as in all other types of business, answers to the question why help you predict how people and markets will respond, so you know how to proceed to achieve the most probable success. Knowing why your story matters helps you write and present it in a way that achieves the most favorable outcome: presumably, that your readers enjoy and take tremendous value from consuming your content.

Asking why in order to generate and augment a storyline

No matter what topic you're crafting a story around, it's incredibly important to generate a storyline around the wants and needs of your target audience. After you know who your audience is and what needs they most often try to satisfy by consuming content, use that knowledge to craft your storyline. If you want to write a story and design a visualization that precisely target the needs and wants of your readership, take the time to pinpoint why people would be interested in your story, and then create a story that meets that desire in as many ways as possible.

Why your audience should care

People care about things that matter to them and that affect their lives. Generally, people want to feel happy and safe. They want fulfilling relationships and good status among their peers. They like to learn things, particularly things that help them earn more money. They like possessions that bring them comfort, status, and security. They like to feel good about themselves and what they do. This is all part of human nature.

These desires summarize why people care about anything, from the readers of your story to the person down the street: People care because something fills one of their core desires. Consequently, if your goal is to publish a high-performing, well-received data journalism piece, craft it in a way that fulfills one or two core desires of your target readership.

The Where in Data Journalism

Article / Updated 04-18-2017

Data and stories are always more relevant to some places than others. From where is a story derived, and where is it going? If you keep these questions in mind, the publications you develop will be more relevant to their intended audience. The where aspect of data journalism is a bit ambiguous because it can refer to a geographic location, a digital location, or both.

Where is the story relevant?

Focus on where your story is most relevant so that you can craft the most compelling story by reporting on the most relevant trends. If your story is location independent (you're reporting on a trend that isn't tied to any particular place), you want to use the data sources that most clearly demonstrate the trend you're reporting on. Likewise, if you're reporting a story that's tied to a specific geographic location, you probably want to report statistics generated from the regional areas that demonstrate the greatest extremes, either the greatest fluctuations or the greatest differences in the values of the parameters you're reporting on.

Sometimes you find multiple geographic or digital locations that exemplify extreme trends and unusual outliers; in other words, you find more than one excellent information source. In these cases, consider using all of them by creating and presenting a data mashup: a combination of two or more data sources that are analyzed together to provide readers with a more complete view of the situation at hand.

Where should the story be published?

Another important question to consider in data journalism is, "Where do you intend to publish your story?" This where can be a geographic place, a particular social media platform, or a series of digital platforms associated with a particular brand (Facebook, Twitter, Pinterest, and Instagram accounts, as well as blogs, all tied together to stream content from one branded source). Just as you need a firm grasp on who your audience is, you should clearly understand the implications of where your publication is distributed. Spelling out where you'll publish helps you conceptualize to whom you're publishing, what you should publish, and how you should present it. If your goal is to craft high-performing data journalism articles, your headlines and storylines should cater to the interests of the people subscribed to the channels in which you're distributing. Because the collective interests of the people on each channel may differ slightly, adapt your work to those differences before posting.

The When in Data Journalism

Article / Updated 04-18-2017

As the old adage goes, timing is everything. It's a valuable skill to know how to refurbish old data so that it's interesting to a modern readership. Likewise, in data journalism, it's imperative to keep an eye on contextual relevance and know when is the optimal time to craft and publish a particular story.

When as the context of your story

If you want to craft a data journalism piece that garners real respect and attention from your target audience, consider when (over what time period) your data is relevant. Stale, outdated data usually doesn't make breaking news, and unfortunately you can find tons of old data out there. But if you're skillful with data, you can create data mashups that take trends in old datasets and present them in ways that are interesting to your present-day readership. For example, take gender-based trends in 1940s employment data and do a mashup (an integration, comparison, or contrast) of that data with employment trends from the past five years. You could then use the combined dataset to support a truly dramatic story about how much, or how little, things have changed, depending on the angle you're after (a sketch of this kind of mashup appears at the end of this article).

Returning once again to the issue of ethical responsibility in journalism: As a data journalist, you walk a fine line between finding datasets that most persuasively support your storyline and finding facts that prop up a factually challenged story you're trying to push. Journalists have an ethical responsibility to convey an honest message to their readers. When building a case to support your story, don't take things too far; in other words, don't take the information into the realm of fiction. A million facts could be presented in countless ways to support any story you're looking to tell. Your story should be based in reality, not be some divisive or fabricated tale you promote because you think your readers will like it.

You may sometimes have trouble finding interesting or compelling datasets to support your story. In these situations, look for ways to create data mashups that tie your less-interesting data to data that's extremely interesting to your target audience, and use the combined dataset as the basis for your data-driven story.

When does the audience care the most?

If your goal is to publish a data journalism piece that goes viral, you certainly want to consider the story's timeliness: When would be the prime time to publish an article on this particular topic? For obvious reasons, you won't do well publishing a story in 2017 about who won the 1984 election for U.S. president; everyone knows, and no one cares. Likewise, if a huge present-day media scandal has already piqued the interest of your readership, it's not a bad idea to ride the tailwinds of that media hype and publish a related story. The story would likely perform well, if it's interesting. As a recent example, you could have created a data journalism piece on Internet users' privacy assumptions, and breaches thereof, and published it in the days just after news of the Edward Snowden/NSA controversy broke. Keeping a relevant and timely publishing schedule is one way to ensure that your stories garner the attention they need to keep you employed.
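Here's a minimal sketch of such a mashup in Python with pandas, assuming two hypothetical CSV files (employment_1940s.csv and employment_recent.csv) that each contain an occupation column and the share of women employed in it.

```python
# A minimal data-mashup sketch using pandas. File names and column names
# are hypothetical; substitute the historical and current datasets you found.
import pandas as pd

hist = pd.read_csv("employment_1940s.csv")     # columns: occupation, pct_women
recent = pd.read_csv("employment_recent.csv")  # columns: occupation, pct_women

# Join the two eras on occupation so each row compares then versus now.
mashup = hist.merge(recent, on="occupation", suffixes=("_1940s", "_recent"))
mashup["change"] = mashup["pct_women_recent"] - mashup["pct_women_1940s"]

# The biggest shifts (up or down) are the most promising story leads.
print(mashup.sort_values("change", key=abs, ascending=False).head(10))
```

The occupations with the largest absolute change support the "how much things have changed" angle; the ones near zero support the "how little things have changed" angle.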

Bringing Data Journalism to Life: The Black Budget

Article / Updated 04-18-2017

The Washington Post story "The Black Budget" is an incredible example of data science in journalism. When former NSA contractor Edward Snowden leaked a trove of classified documents, he unleashed a storm of controversy not only among the public but also among the data journalists tasked with analyzing the documents for stories. The challenge for data journalists in this case was to discover and disclose data insights that were relevant to the public without compromising the safety of ordinary citizens.

Among the documents leaked by Snowden was the so-called Black Budget for fiscal year 2013, a 178-page, line-by-line breakdown of the funds earmarked for 16 U.S. federal intelligence agencies. Through "The Black Budget," the American public learned that $52.6 billion in taxpayer money had been spent on mostly covert federal intelligence services in 2013 alone.

The Washington Post did a phenomenal job in its visual presentation of the data. The opening title is a somber visual pun: The words The Black Budget are written in a huge black box contrasted only with gray and white, a layout that visually implies the serious and murky nature of the subject matter. The only touch of color is a navy blue that conjures a vaguely military image and barely contrasts with the black. This limited palette continues throughout the visual presentation of the data. Washington Post data journalists used unusual blocky data graphics (an unsettling, strangely horizontal hybrid of a pie chart, a bar graph, and a tree map) to hint at the surreptitious and dangerous nature of the topic, as well as the shady manner in which the information was obtained.

The data graphics in the piece exhibit a low data-to-ink ratio; in other words, only a little information is conveyed with a lot of screen space. Although a low data-to-ink ratio normally indicates bad design, here it effectively hints that mountains of data lie underneath the layers being shown, layers that remain undisclosed so as not to endanger intelligence sources and national security. Traditional infographic elements in the piece include stark, light-gray seals of the top five intelligence agencies, only three of which the average person would have ever seen. Simple bar charts outline funding trends, and people-shaped icons represent the army of personnel involved in intelligence gathering.

A lot of thought went into the collection, analysis, and presentation of this story. The ensemble is an unsettling, yet overwhelmingly informative, piece of data journalism. Although this sort of journalism was in its infancy even just a decade ago, the data and tools required for this type of work are now widely available, letting journalists quickly develop high-quality data journalism articles.

The What in Data Journalism

Article / Updated 04-18-2017

The what in data journalism refers to the gist of the story. In all forms of journalism, a journalist absolutely must be able to get straight to the point: Keep it clear, concise, and easy to understand. When crafting data visualizations to accompany your data journalism piece, make sure the visual story can be discerned at a moment's glance; if it takes longer than that, the visualization isn't focused enough. The same principle applies to your writing. No one wants to wade through loads of words trying to figure out what you're saying. Readers appreciate it when you make their lives easier by keeping your narrative clear, direct, and to the point. The more people have to work to understand your content, the less they tend to like it. If you want to give readers information they enjoy consuming, keep your writing and data visualizations focused.

Data Journalism: Who Is the Audience of Your Data?

Article / Updated 04-18-2017

When most people think of data, questions about who (as it relates to the data) don't readily come to mind. In data journalism, however, answers to questions about who are profoundly important to the success of any data-driven story. You must consider who created and maintains the sources of your datasets to determine whether those datasets are a credible basis for a story. And if you want to write a story that appeals to your target readership, you must consider who comprises that readership and what the most pressing needs and concerns of those people are.

Who made the data

The answer to the question "Who made your data?" is the most fundamental and important answer to any of the five W questions of journalism: the who, what, when, where, and why. No story can pass the litmus test unless it's built upon highly credible sources. If your sources aren't valid and accurate, you could spend countless hours producing what, in the end, amounts to a worthless story. Be scrupulous about knowing who made your data, because you need to be able to validate those sources' accuracy and credibility. You definitely don't want to go public with a story generated from noncredible sources; if anyone questions the story's validity, you have no ground to stand on. News is only as good as its source, so protect your own credibility by reporting on data from credible sources only. It's also important to use as many relevant data sources as you can acquire, to avoid bias or accusations of cherry-picking.

If you want to create a meaningful, credible data-driven story that attracts maximum attention from your audience, you can use the power and clout of reputable data sources to make your stories and headlines that much more compelling. In any data journalism piece you publish, it's critical to disclose your data sources. You don't have to provide a live web link back to those sources, but you should at least state where you found your information, in case people want to investigate further on their own.

Who comprises the audience

Research your target audience and get to know their interests, reading preferences, and even aesthetic preferences (for choosing the best images to include in your story) before planning your story, so that you can craft something of maximum interest and usefulness to them. You can present the same interesting, high-quality story in countless different lights, with some lights beaming in a much more compelling way than others. To present your story in the way that most attracts readers' attention, spend some serious time researching your target audience and evaluating which presentation styles work well with readers of that group.

One way to get to know your readers is to gather data on stories that have performed well with that audience in the recent past. If you search social bookmarking sites (StumbleUpon, for example, or Digg or Delicious) or just mine some Twitter data, you can quickly generate a list of headlines that perform well with your target audience. Get in there and search for content on the same topic as yours, and identify which headlines perform best (which have the highest engagement counts, in other words). After you have a list of related headlines that perform well with your target audience, note any similarities between them, identify the specific keywords or hashtags that get the most user engagement, and leverage those as the main draws to generate interest in your article.

Lastly, examine the emotional value of the headlines: the emotional pull that draws people in to read the piece. Speaking of emotions, news articles generally satisfy at least one of the following core human desires:

  • Knowledge: Often, but not always, closely tied to a desire for profit.
  • Safety: The desire to protect one's property, income, and well-being, or that of friends and family.
  • Personal property: A person's innate desire to have things that bring comfort, safety, security, and status.
  • Self-esteem: People are sometimes interested in topics that help them feel good about themselves. These topics often include philanthropy, charity, service, and grassroots causes for social change.

Ask yourself what primary desires your headlines promise to satisfy, and then craft your headlines to appeal most strongly to that desire. Try to determine what type of article performs best with your target audience, or what your target audience most strongly seeks when looking for new content to consume. With that information in hand, target your writing and headlines precisely at a core desire of your target audience.
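If you've already collected headlines and their engagement counts, ranking hashtags by total engagement takes only a few lines of Python. This sketch uses hypothetical sample data; in practice, you'd populate the list from a platform API or your own scrape.

```python
# A minimal sketch of ranking hashtags by engagement. The sample headlines
# below are hypothetical stand-ins for data mined from social platforms.
import re
from collections import Counter

# Each tuple: (headline text, engagement count such as shares + likes).
headlines = [
    ("Why #opendata is changing local news", 1200),
    ("5 charts that explain the housing market #dataviz", 950),
    ("Inside the city budget: a #dataviz deep dive", 700),
]

engagement_by_tag = Counter()
for text, engagement in headlines:
    for tag in re.findall(r"#\w+", text.lower()):
        engagement_by_tag[tag] += engagement

# Hashtags that attract the most engagement are candidates for your own headline.
for tag, total in engagement_by_tag.most_common(5):
    print(f"{tag}: {total}")
```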

Using Spatial Statistics to Predict for Environmental Variation across Space

Article / Updated 04-18-2017

By their very nature, environmental variables are location-dependent: They change with changes in geospatial location. The purpose of modeling environmental variables with spatial statistics is to enable accurate spatial predictions so that you can use those predictions to solve problems related to the environment. Spatial statistics is distinguished from natural-resource modeling in that it focuses on predicting how changes in space affect environmental phenomena. Naturally, the time variable is considered as well, but spatial statistics is all about using statistics to model the inner workings of spatial phenomena. The difference is in the manner of approach.

Addressing environmental issues with spatial predictive analytics

You can use spatial statistics to model environmental variables across space and time so that you can predict changes in environmental variables across space. The following list describes the types of environmental issues that you can model and predict using spatial statistical modeling:

  • Epidemiology and environmental human health: Disease patterns and distributions
  • Meteorology: Weather phenomena
  • Fire science: The spread of a fire (by channeling your inner Smokey the Bear!)
  • Hydraulics: Aquifer conductivity
  • Ecology: Microorganism distribution across a sedimentary lake bottom

If your goal is to build a model that you can use to predict how changes in space will affect environmental variables, spatial statistics can help you do it.

Describing the data science that's involved

Because spatial statistics involves modeling the x-, y-, and z-parameters that comprise spatial datasets, the statistics involved can get rather interesting and unusual. Spatial statistics is, more or less, a marriage of GIS spatial analysis and advanced predictive analytics. The following list describes a few data science processes that are commonly deployed when using statistics to build predictive spatial models:

  • Spatial statistics: Spatial statistics often involves kriging and krige, as well as variogram analysis. The terms "kriging" and "krige" denote different things: Kriging methods are a set of statistical estimation algorithms that curve-fit known point data to produce a predictive surface for an entire study area, whereas krige represents an automatic implementation of kriging algorithms in which you use simple default parameters to generate predictive surfaces. A variogram is a statistical tool that measures how different spatial data becomes as the distance between data points increases; in other words, the variogram is a measure of "spatial dissimilarity" (see the sketch at the end of this article). When you krige, you use variogram models with internally defined parameters to generate interpolative, predictive surfaces.
  • Statistical programming: Probability distributions, time series analyses, regression analyses, and Monte Carlo simulations, among other processes.
  • Clustering analysis: Processes can include nearest-neighbor algorithms, k-means clustering, and kernel density estimation.
  • GIS technology: GIS technology pops up a lot in this chapter, but that's to be expected, because its spatial analysis and map-making offerings are incredibly flexible.
  • Coding requirements: Programming for a spatial statistics project could entail R, SPSS, SAS, MATLAB, or SQL, among other languages.

Addressing environmental issues with spatial statistics

A great example of using spatial statistics to generate predictions for location-dependent environmental variables can be seen in the recent work of Dr. Pierre Goovaerts. Dr. Goovaerts uses advanced statistics, coding, and his authoritative subject-matter expertise in agricultural engineering, soil science, and epidemiology to uncover correlations between spatial disease patterns, mortality, environmental toxin exposure, and sociodemographics. In one of Dr. Goovaerts's recent projects, he used spatial statistics to model and analyze data on groundwater arsenic concentrations, location, geologic properties, weather patterns, topography, and land cover. Through his recent environmental data science studies, he discovered that the incidence of bladder, breast, and prostate cancers is spatially correlated with long-term arsenic exposure.

With respect to data science technologies and methodologies, Dr. Goovaerts commonly implements the following:

  • Spatial statistical programming: Once again, kriging and variogram analysis top the list.
  • Statistical programming: Least squares regression and Monte Carlo simulation (a random simulation method) are central to Dr. Goovaerts's work.
  • GIS technologies: If you want map-making functionality and spatial data analysis methodologies, you're going to need GIS technologies.
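To make the variogram idea concrete, here's a minimal sketch of an empirical semivariogram computed in plain NumPy from hypothetical sample points. Real projects typically rely on dedicated kriging packages, but the core calculation is just this: bin point pairs by separation distance and take half the mean squared difference of their values in each bin.

```python
# A minimal empirical-semivariogram sketch in NumPy. The sample points below
# are hypothetical stand-ins for, say, arsenic measurements at well locations.
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(200, 2))              # point locations (x, y)
z = np.sin(xy[:, 0] / 20) + rng.normal(0, 0.1, 200)  # measured values

# Pairwise distances and squared value differences for every point pair.
diff_xy = xy[:, None, :] - xy[None, :, :]
dist = np.sqrt((diff_xy ** 2).sum(axis=-1))
sq_diff = (z[:, None] - z[None, :]) ** 2

# Upper triangle only, so each pair is counted once.
iu = np.triu_indices(len(z), k=1)
dist, sq_diff = dist[iu], sq_diff[iu]

# Bin pairs by separation distance; the semivariance in each bin is half the
# mean squared difference. Semivariance that rises with distance means nearby
# points are more alike than distant ones -- the "spatial dissimilarity"
# described above.
bins = np.linspace(0, 50, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (dist >= lo) & (dist < hi)
    if mask.any():
        gamma = 0.5 * sq_diff[mask].mean()
        print(f"lag {lo:4.0f}-{hi:4.0f}: semivariance = {gamma:.3f}")
```

Fitting a model curve (spherical, exponential, and so on) to these binned values is what supplies the internally defined parameters that kriging then uses to build its predictive surface.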
