# Python for Data Science For Dummies, 2nd Edition

#### Table of Contents

**Introduction** 1

About This Book 1

Foolish Assumptions 3

Icons Used in This Book 4

Beyond the Book 4

Where to Go from Here 5

**Part 1: Getting Started with Data Science and Python** 7

**Chapter 1: Discovering the Match between Data Science and Python** 9

Defining the Sexiest Job of the 21st Century 11

Considering the emergence of data science 12

Outlining the core competencies of a data scientist 12

Linking data science, big data, and AI 13

Understanding the role of programming 14

Creating the Data Science Pipeline 14

Preparing the data 15

Performing exploratory data analysis 15

Learning from data 15

Visualizing 15

Obtaining insights and data products 16

Understanding Python’s Role in Data Science 16

Considering the shifting profile of data scientists 16

Working with a multipurpose, simple, and efficient language 17

Learning to Use Python Fast 18

Loading data 19

Training a model 19

Viewing a result 19

**Chapter 2: Introducing Python’s Capabilities and Wonders** 21

Why Python? 22

Grasping Python’s Core Philosophy 23

Contributing to data science 23

Discovering present and future development goals 24

Working with Python 25

Getting a taste of the language 25

Understanding the need for indentation 26

Working at the command line or in the IDE 27

Performing Rapid Prototyping and Experimentation 31

Considering Speed of Execution 32

Visualizing Power 33

Using the Python Ecosystem for Data Science 35

Accessing scientific tools using SciPy 35

Performing fundamental scientific computing using NumPy 36

Performing data analysis using pandas 36

Implementing machine learning using Scikit-learn 36

Going for deep learning with Keras and TensorFlow 37

Plotting the data using matplotlib 38

Creating graphs with NetworkX 38

Parsing HTML documents using Beautiful Soup 38

**Chapter 3: Setting Up Python for Data Science** 39

Considering the Off-the-Shelf Cross-Platform Scientific Distributions 40

Getting Continuum Analytics Anaconda 40

Getting Enthought Canopy Express 41

Getting WinPython 42

Installing Anaconda on Windows 42

Installing Anaconda on Linux 46

Installing Anaconda on Mac OS X 47

Downloading the Datasets and Example Code 48

Using Jupyter Notebook 49

Defining the code repository 50

Understanding the datasets used in this book 57

**Chapter 4: Working with Google Colab** 59

Defining Google Colab 60

Understanding what Google Colab does 60

Considering the online coding difference 61

Using local runtime support 63

Getting a Google Account 63

Creating the account 64

Signing in 64

Working with Notebooks 65

Creating a new notebook 65

Opening existing notebooks 66

Saving notebooks 68

Downloading notebooks 71

Performing Common Tasks 71

Creating code cells 71

Creating text cells 72

Creating special cells 73

Editing cells 74

Moving cells 75

Using Hardware Acceleration 75

Executing the Code 76

Viewing Your Notebook 76

Displaying the table of contents 77

Getting notebook information 77

Checking code execution 78

Sharing Your Notebook 79

Getting Help 80

**Part 2: Getting Your Hands Dirty with Data** 81

**Chapter 5: Understanding the Tools** 83

Using the Jupyter Console 84

Interacting with screen text 84

Changing the window appearance 86

Getting Python help 87

Getting IPython help 89

Using magic functions 90

Discovering objects 91

Using Jupyter Notebook 93

Working with styles 93

Restarting the kernel 94

Restoring a checkpoint 95

Performing Multimedia and Graphic Integration 96

Embedding plots and other images 96

Loading examples from online sites 96

Obtaining online graphics and multimedia 96

**Chapter 6: Working with Real Data** 99

Uploading, Streaming, and Sampling Data 100

Uploading small amounts of data into memory 101

Streaming large amounts of data into memory 102

Generating variations on image data 103

Sampling data in different ways 104

Accessing Data in Structured Flat-File Form 105

Reading from a text file 106

Reading CSV delimited format 107

Reading Excel and other Microsoft Office files 109

Sending Data in Unstructured File Form 111

Managing Data from Relational Databases 113

Interacting with Data from NoSQL Databases 115

Accessing Data from the Web 116

**Chapter 7: Conditioning Your Data** 121

Juggling between NumPy and pandas 122

Knowing when to use NumPy 122

Knowing when to use pandas 122

Validating Your Data 124

Figuring out what’s in your data 124

Removing duplicates 126

Creating a data map and data plan 126

Manipulating Categorical Variables 129

Creating categorical variables 130

Renaming levels 131

Combining levels 132

Dealing with Dates in Your Data 133

Formatting date and time values 134

Using the right time transformation 135

Dealing with Missing Data 136

Finding the missing data 136

Encoding missingness 137

Imputing missing data 138

Slicing and Dicing: Filtering and Selecting Data 139

Slicing rows 140

Slicing columns 140

Dicing 141

Concatenating and Transforming 142

Adding new cases and variables 142

Removing data 144

Sorting and shuffling 145

Aggregating Data at Any Level 146

**Chapter 8: Shaping Data** 149

Working with HTML Pages 150

Parsing XML and HTML 150

Using XPath for data extraction 151

Working with Raw Text 153

Dealing with Unicode 153

Stemming and removing stop words 153

Introducing regular expressions 155

Using the Bag of Words Model and Beyond 158

Understanding the bag of words model 159

Working with n-grams 161

Implementing TF-IDF transformations 162

Working with Graph Data 165

Understanding the adjacency matrix 165

Using NetworkX basics 166

**Chapter 9: Putting What You Know in Action** 169

Contextualizing Problems and Data 170

Evaluating a data science problem 171

Researching solutions 173

Formulating a hypothesis 174

Preparing your data 175

Considering the Art of Feature Creation 175

Defining feature creation 175

Combining variables 176

Understanding binning and discretization 177

Using indicator variables 177

Transforming distributions 178

Performing Operations on Arrays 178

Using vectorization 179

Performing simple arithmetic on vectors and matrices 179

Performing matrix vector multiplication 180

Performing matrix multiplication 181

**Part 3: Visualizing Information** 183

**Chapter 10: Getting a Crash Course in MatPlotLib** 185

Starting with a Graph 186

Defining the plot 186

Drawing multiple lines and plots 187

Saving your work to disk 188

Setting the Axis, Ticks, Grids 189

Getting the axes 189

Formatting the axes 190

Adding grids 191

Defining the Line Appearance 192

Working with line styles 193

Using colors 194

Adding markers 195

Using Labels, Annotations, and Legends 197

Adding labels 198

Annotating the chart 198

Creating a legend 199

**Chapter 11: Visualizing the Data** 201

Choosing the Right Graph 202

Showing parts of a whole with pie charts 202

Creating comparisons with bar charts 203

Showing distributions using histograms 205

Depicting groups using boxplots 206

Seeing data patterns using scatterplots 208

Creating Advanced Scatterplots 209

Depicting groups 209

Showing correlations 211

Plotting Time Series 212

Representing time on axes 212

Plotting trends over time 214

Plotting Geographical Data 216

Using an environment in Notebook 217

Getting the Basemap toolkit 218

Dealing with deprecated library issues 218

Using Basemap to plot geographic data 220

Visualizing Graphs 221

Developing undirected graphs 222

Developing directed graphs 224

**Part 4: Wrangling Data** 227

**Chapter 12: Stretching Python’s Capabilities** 229

Playing with Scikit-learn 230

Understanding classes in Scikit-learn 230

Defining applications for data science 231

Performing the Hashing Trick 234

Using hash functions 235

Demonstrating the hashing trick 235

Working with deterministic selection 239

Considering Timing and Performance 240

Benchmarking with timeit 241

Working with the memory profiler 244

Running in Parallel on Multiple Cores 247

Performing multicore parallelism 248

Demonstrating multiprocessing 248

**Chapter 13: Exploring Data Analysis** 251

The EDA Approach 252

Defining Descriptive Statistics for Numeric Data 253

Measuring central tendency 254

Measuring variance and range 255

Working with percentiles 256

Defining measures of normality 257

Counting for Categorical Data 259

Understanding frequencies 259

Creating contingency tables 261

Creating Applied Visualization for EDA 261

Inspecting boxplots 262

Performing t-tests after boxplots 263

Observing parallel coordinates 264

Graphing distributions 265

Plotting scatterplots 266

Understanding Correlation 268

Using covariance and correlation 268

Using nonparametric correlation 270

Considering the chi-square test for tables 271

Modifying Data Distributions 272

Using different statistical distributions 272

Creating a Z-score standardization 273

Transforming other notable distributions 273

**Chapter 14: Reducing Dimensionality** 275

Understanding SVD 276

Looking for dimensionality reduction 277

Using SVD to measure the invisible 279

Performing Factor Analysis and PCA 280

Considering the psychometric model 280

Looking for hidden factors 281

Using components, not factors 282

Achieving dimensionality reduction 282

Squeezing information with t-SNE 283

Understanding Some Applications 285

Recognizing faces with PCA 285

Extracting topics with NMF 289

Recommending movies 291

**Chapter 15: Clustering** 295

Clustering with K-means 297

Understanding centroid-based algorithms 298

Creating an example with image data 299

Looking for optimal solutions 301

Clustering big data 304

Performing Hierarchical Clustering 305

Using a hierarchical cluster solution 307

Using a two-phase clustering solution 308

Discovering New Groups with DBScan 310

**Chapter 16: Detecting Outliers in Data** 313

Considering Outlier Detection 314

Finding more things that can go wrong 315

Understanding anomalies and novel data 316

Examining a Simple Univariate Method 317

Leveraging on the Gaussian distribution 319

Making assumptions and checking out 320

Developing a Multivariate Approach 322

Using principal component analysis 322

Using cluster analysis for spotting outliers 324

Automating detection with Isolation Forests 325

**Part 5: Learning from Data** 327

**Chapter 17: Exploring Four Simple and Effective Algorithms** 329

Guessing the Number: Linear Regression 329

Defining the family of linear models 330

Using more variables 331

Understanding limitations and problems 333

Moving to Logistic Regression 334

Applying logistic regression 335

Considering when classes are more 336

Making Things as Simple as Naïve Bayes 337

Finding out that Naïve Bayes isn’t so naïve 339

Predicting text classifications 340

Learning Lazily with Nearest Neighbors 342

Predicting after observing neighbors 343

Choosing your k parameter wisely 344

**Chapter 18: Performing Cross-Validation, Selection, and Optimization** 347

Pondering the Problem of Fitting a Model 348

Understanding bias and variance 349

Defining a strategy for picking models 350

Dividing between training and test sets 354

Cross-Validating 356

Using cross-validation on k folds 357

Sampling stratifications for complex data 358

Selecting Variables Like a Pro 360

Selecting by univariate measures 360

Using a greedy search 362

Pumping Up Your Hyperparameters 363

Implementing a grid search 364

Trying a randomized search 368

**Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks** 371

Using Nonlinear Transformations 372

Doing variable transformations 372

Creating interactions between variables 375

Regularizing Linear Models 379

Relying on Ridge regression (L2) 380

Using the Lasso (L1) 381

Leveraging regularization 382

Combining L1 & L2: Elasticnet 382

Fighting with Big Data Chunk by Chunk 383

Determining when there is too much data 383

Implementing Stochastic Gradient Descent 383

Understanding Support Vector Machines 387

Relying on a computational method 387

Fixing many new parameters 390

Classifying with SVC 392

Going nonlinear is easy 398

Performing regression with SVR 399

Creating a stochastic solution with SVM 401

Playing with Neural Networks 406

Understanding neural networks 407

Classifying and regressing with neurons 408

**Chapter 20: Understanding the Power of the Many** 411

Starting with a Plain Decision Tree 412

Understanding a decision tree 412

Creating trees for different purposes 415

Making Machine Learning Accessible 418

Working with a Random Forest classifier 420

Working with a Random Forest regressor 421

Optimizing a Random Forest 422

Boosting Predictions 424

Knowing that many weak predictors win 424

Setting a gradient boosting classifier 425

Running a gradient boosting regressor 426

Using GBM hyperparameters 427

**Part 6: The Part of Tens** 429

**Chapter 21: Ten Essential Data Resources** 431

Discovering the News with Subreddit 432

Getting a Good Start with KDnuggets 432

Locating Free Learning Resources with Quora 432

Gaining Insights with Oracle’s Data Science Blog 433

Accessing the Huge List of Resources on Data Science Central 433

Learning New Tricks from the Aspirational Data Scientist 434

Obtaining the Most Authoritative Sources at Udacity 435

Receiving Help with Advanced Topics at Conductrics 435

Obtaining the Facts of Open Source Data Science from Masters 436

Zeroing In on Developer Resources with Jonathan Bower 436

**Chapter 22: Ten Data Challenges You Should Take** 437

Meeting the Data Science London + Scikit-learn Challenge 438

Predicting Survival on the Titanic 438

Finding a Kaggle Competition that Suits Your Needs 439

Honing Your Overfit Strategies 440

Trudging Through the MovieLens Dataset 440

Getting Rid of Spam E-mails 441

Working with Handwritten Information 442

Working with Pictures 443

Analyzing Amazon.com Reviews 444

Interacting with a Huge Graph 444

Index 447