Looking at the Mechanics Involved in Doing Data Science - dummies

By Lillian Pierson

If you’re truly interested in data science, it’s worth making the effort to master Python, which is widely regarded as one of the easiest programming languages to learn for data science. Python is a general-purpose language with object-oriented features that is well suited to data processing, analysis, and visualization.

Python is one of the more popular programming languages. That’s because it’s relatively easy to master and because it allows users to accomplish several tasks with only a few lines of code. The following is a list of three Python libraries that are most useful and relevant in the practice of data science.

  • NumPy: The NumPy package is at the root of almost all numerical computations in Python. That’s because NumPy offers users a way to create multi-dimensional array objects in Python.

  • SciPy: SciPy is built on top of, and extends the capabilities of, the NumPy package. It is a collection of mathematical algorithms and higher-level functions that you can use for vector quantization, statistics, n-dimensional image operations, numerical integration, interpolation, sparse matrices and sparse linear algebra, linear solvers, optimization, signal processing, and many other scientific computing tasks.

  • Matplotlib: Matplotlib is built on top of NumPy. Use the Matplotlib library when you want to create visual representations of your dataset or data analysis findings.
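
To make the NumPy description above more concrete, here is a minimal sketch of a multi-dimensional array and a couple of the numerical operations it supports. The values and the variable names (data, scaled, column_means) are made up for illustration and do not come from any particular dataset.

```python
import numpy as np

# A 2-D array: 3 rows (observations) by 2 columns (features).
# The numbers are invented purely for illustration.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

print(data.shape)  # (number of rows, number of columns)
print(data.ndim)   # number of dimensions

# Vectorized arithmetic applies to every element without an explicit loop.
scaled = data * 10

# Reductions can run along an axis: axis=0 collapses the rows,
# which yields one mean per column.
column_means = data.mean(axis=0)
print(column_means)
```

Because the whole array is processed in one call, computations like these usually take far fewer lines than the equivalent hand-written loops.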

Working with R

For those not in the know, R is an open source, free statistical software system that’s widely adopted across the data science sector. Yes, it’s not as easy to learn as Python, but it can be much more powerful for certain types of advanced statistical analyses. It also has particularly advanced data visualization capabilities. The following is a list of three R packages that are particularly useful in the practice of data science.

  • forecast: The forecast package contains various forecasting functions that you can use for ARIMA models or for other types of univariate time series forecasts.

  • mlogit: A multinomial logit model is one in which observations of a known class are used to “train” the software so that it can identify classes of other observations whose classes are unknown. If you want to carry out multinomial logistic regression in R, you can use the mlogit package.

  • ggplot2: The ggplot2 package is the fundamental data visualization package in R. It lets you create many different types of data graphics, including histograms, scatterplots, bar charts, box plots, and density plots. It also offers a wide variety of design options, including choices in colors, layout, transparency, and line density.

Using SQL in a data science context

Structured Query Language (SQL) is a standardized language that you can use to quickly and efficiently query, update, modify, add, or remove data in large and complex relational databases. It’s helpful in data science when you need to do some quick querying and data manipulation.

  • Querying data and filtering records: In SQL, you use the SELECT statement to query a dataset. If you add a WHERE clause, you limit the query output to only the records that meet the criteria you’ve specified.

  • Aggregating data: If you want to aggregate your data using SQL, you can use the GROUP BY clause to group records by shared attribute values, typically together with an aggregate function such as COUNT, SUM, or AVG to summarize each group.
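
The two SQL operations above can be sketched with Python's built-in sqlite3 module, which runs SQL against an in-memory database. The sales table, its columns (region, product, amount), and every value in it are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory SQLite database holding a small, made-up "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "widget", 120.0),
        ("East", "gadget", 80.0),
        ("West", "widget", 200.0),
        ("West", "gadget", 50.0),
    ],
)

# Querying and filtering: SELECT picks the columns,
# WHERE keeps only the rows that meet the condition.
large_sales = conn.execute(
    "SELECT region, product, amount FROM sales "
    "WHERE amount > 100 ORDER BY amount"
).fetchall()

# Aggregating: GROUP BY forms one group per region,
# and SUM reduces each group to a single total.
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

conn.close()

print(large_sales)
print(totals_by_region)
```

The same SELECT, WHERE, and GROUP BY syntax carries over to most other relational databases, although data types and some functions vary from system to system.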

Keeping coding to a minimum

If you’re not up for coding things for yourself, you can try to complete a project using off-the-shelf software applications instead. You can use the following two desktop applications to perform advanced data science tasks without having to learn to code.

  • Microsoft Excel: Although it’s a somewhat simple software application, Microsoft Excel can be rather useful in the practice of data science. If you want to do a quick spot-check for trends and outliers in your dataset, you can use Excel filters, conditional formatting, and charting options to get the job done fast. Excel pivot tables are another great option if you need to quickly reformat and summarize your data tables. Lastly, if you want to automate data manipulation or analysis tasks within Excel, you can use Excel macros to get the job done.

  • KNIME: KNIME is data-mining software that you can use for code-free predictive analytics. The software is simple enough that even data science beginners can use it, but it offers plug-ins to extend capabilities for the needs of more advanced users. KNIME analytics are useful for doing things like upsell and cross-sell, customer churn reduction, sentiment analysis, and social network analysis.