Seeing What You Need to Know When Getting Started in Data Science

By Lillian Pierson

Part of Data Science For Dummies Cheat Sheet

Big data is traditionally defined as data that has incredible volume, velocity, and variety. Conventional database technologies aren’t capable of handling big data; more innovative data-engineered solutions are required. To evaluate whether your project qualifies as a big data project, consider the following criteria:

  • Volume: Between 1 terabyte/year and 10 petabytes/year

  • Velocity: Between 30 kilobytes/second and 30 gigabytes/second

  • Variety: Combined sources of unstructured, semi-structured, and structured data
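
As a rough illustration, the three criteria above can be turned into a simple checklist in code. This is only a minimal sketch: the names ProjectProfile and check_big_data_criteria are hypothetical, the unit sizes assume decimal prefixes (1 terabyte = 10^12 bytes), and the variety test assumes that "combined sources" means at least two of the three kinds of data. The ranges are applied literally, so a project above the upper bound of a range is not flagged.

```python
from dataclasses import dataclass, field

# Unit sizes in bytes. Decimal prefixes are an assumption; the criteria
# do not say whether decimal or binary units are meant.
KB = 10**3
GB = 10**9
TB = 10**12
PB = 10**15

# Ranges quoted in the criteria list.
VOLUME_RANGE = (1 * TB, 10 * PB)      # bytes per year
VELOCITY_RANGE = (30 * KB, 30 * GB)   # bytes per second
KNOWN_SOURCE_TYPES = {"structured", "semi-structured", "unstructured"}


@dataclass
class ProjectProfile:
    """Hypothetical summary of a project's data characteristics."""
    bytes_per_year: float
    bytes_per_second: float
    source_types: set = field(default_factory=set)


def check_big_data_criteria(profile):
    """Return a dict mapping each criterion name to True or False."""
    vol_lo, vol_hi = VOLUME_RANGE
    vel_lo, vel_hi = VELOCITY_RANGE
    return {
        "volume": vol_lo <= profile.bytes_per_year <= vol_hi,
        "velocity": vel_lo <= profile.bytes_per_second <= vel_hi,
        # Assumption: "combined sources" means two or more kinds of data.
        "variety": len(profile.source_types & KNOWN_SOURCE_TYPES) >= 2,
    }


if __name__ == "__main__":
    # Example: 5 TB/year, 2 MB/second, two kinds of sources.
    example = ProjectProfile(5 * TB, 2 * 10**6, {"structured", "unstructured"})
    print(check_big_data_criteria(example))
```

A project that meets all three checks is a reasonable candidate for big data tooling; one that meets none can probably be served by a conventional database.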

Data science and data engineering are not the same

Hiring managers tend to confuse the roles of data scientist and data engineer. While it is possible to find someone who does a little of both, each field is incredibly complex. It’s unlikely that you’ll find someone with robust skills and experience in both areas. For this reason, it’s important to be able to identify what type of specialist is most appropriate for helping you achieve your specific goals. The descriptions below should help you do that.

  • Data scientists: Data scientists use coding, quantitative methods (mathematical, statistical, and machine learning), and highly specialized expertise in their study area to derive solutions to complex business and scientific problems.

  • Data engineers: Data engineers use skills in computer science and software engineering to design systems that handle and manipulate big data sets, and to solve the problems that arise in doing so.

Data science and business intelligence are also not the same

Business-centric data scientists and business analysts who do business intelligence are like cousins. Both types of specialist use data to achieve the same business goals, but their approaches, technologies, and functions are different. The descriptions below spell out the differences between the two roles.

  • Business intelligence (BI): BI solutions are generally built using datasets generated internally, that is, from within an organization rather than from outside it. Common tools and technologies include online analytical processing (OLAP); extract, transform, and load (ETL); and data warehousing. Although BI sometimes involves forward-looking methods like forecasting, these methods are based on simple mathematical inferences from historical or current data.

  • Business-centric data science: Business-centric data science solutions are built using datasets that are both internal and external to an organization. Common tools, technologies, and skillsets include cloud-based analytics platforms, statistical and mathematical programming, machine learning, data analysis using Python and R, and advanced data visualization. Business-centric data scientists use advanced mathematical or statistical methods to analyze and generate predictions from vast amounts of business data.