Computational Thinking in Data Science
Computational thinking is about mastering the logical reasoning and flow of a project, irrespective of the programming language used.
I. Introduction
There are several platforms and programming languages for data science and machine learning project implementation (see Figure 1 below).
Even though Python and R are considered the top two programming languages for data science and machine learning, the fundamental skill in data science is mastering the logical reasoning and flow of the process, not the programming language used.
Computational thinking involves breaking a project down into smaller steps and understanding how these steps are interrelated. Thus, computational thinking has nothing to do with the programming language that is finally selected for project implementation. The premise is that once you understand the logical reasoning and flow of the process, you can implement your project using any programming language of your choice.
In this article, we consider the workflow for two data science projects that demonstrate the logical reasoning and flow of the process, irrespective of the programming language employed for implementation.
II. Data Visualization Workflow
A data visualization workflow (see Figure 2) typically consists of the following components. This workflow can be implemented using any programming language, such as R, Python, MATLAB, or C++.
Data Component: An important first step in deciding how to visualize data is to know what type of data it is, e.g. categorical data, discrete data, continuous data, time-series data, etc.
Geometric Component: Here is where you decide what kind of visualization is suitable for your data, e.g. scatter plot, line graphs, bar plots, histograms, Q-Q plots, smooth densities, boxplots, pair plots, heatmaps, etc.
Mapping Component: Here you need to decide what variable to use as your x-variable (independent or predictor variable) and what to use as your y-variable (dependent or target variable). This is important especially when your dataset is multi-dimensional with several features.
Scale Component: Here you decide what kind of scales to use, e.g. linear scale, log scale, etc.
Labels Component: This includes things like axes labels, titles, legends, font size to use, etc.
Ethical Component: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization and ensure you aren’t using your visualization to mislead or manipulate your audience.
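The components above can be sketched in Python with matplotlib. This is a minimal illustration using a synthetic dataset (the data, variable names, and relationship are assumptions for demonstration only), with each workflow component marked in a comment:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Data component: two continuous variables (synthetic, for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 50)
y = 2.0 * x ** 1.5 + rng.uniform(0, 5, 50)

# Geometric component: a scatter plot suits two continuous variables
fig, ax = plt.subplots()

# Mapping component: x is the predictor, y is the target
ax.scatter(x, y)

# Scale component: log-log scales reveal the power-law relationship
ax.set_xscale("log")
ax.set_yscale("log")

# Labels component: axis labels and a title
ax.set_xlabel("Predictor x")
ax.set_ylabel("Target y")
ax.set_title("Scatter plot on log-log scales")

fig.savefig("scatter.png")
```

The same sequence of decisions (data type, geometry, mapping, scale, labels) would carry over unchanged to an implementation in R with ggplot2 or in any other plotting library.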
An example of a data visualization workflow with Python and R implementations can be found here:
III. Machine Learning Workflow
A machine learning workflow (see Figure 3) consists of the following steps, which are independent of the programming language used for implementation.
Problem Framing: This is where you decide what kind of problem you are trying to solve, e.g. a model to classify emails as spam or not spam, a model to classify tumor cells as malignant or benign, a model to improve customer experience by routing calls into different categories so that they can be answered by personnel with the right expertise, a model to predict whether a loan will charge off over its duration, a model to predict the price of a house based on different features or predictors, and so on.
Data Analysis: This is where you handle the data available for building the model. It includes data visualization of features, handling missing data, handling categorical data, encoding class labels, normalization and standardization of features, feature engineering, dimensionality reduction, data partitioning into training, validation and testing sets, etc.
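A few of these data analysis steps can be sketched with pandas. The tiny dataset and column names below are hypothetical, chosen only to show missing-value imputation, one-hot encoding, and standardization in sequence:

```python
import numpy as np
import pandas as pd

# A small hypothetical dataset with a missing value and a categorical feature
df = pd.DataFrame({
    "income": [50000.0, 62000.0, np.nan, 48000.0],
    "grade": ["A", "B", "B", "C"],
})

# Handle missing data: impute the gap with the column median
df["income"] = df["income"].fillna(df["income"].median())

# Handle categorical data: one-hot encode the 'grade' feature
df = pd.get_dummies(df, columns=["grade"])

# Standardize the numeric feature (zero mean, unit variance)
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()
```

In a real project these transformations are typically fit on the training partition only and then applied to the validation and test partitions, to avoid leaking information across the split.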
Model Building: This is where you select the model you would like to use, e.g. linear regression, logistic regression, KNN, SVM, k-means, Monte Carlo simulation, time series analysis, etc. The dataset is divided into training, validation, and test sets. Hyperparameter tuning is used to fine-tune the model and prevent overfitting, and cross-validation is performed to ensure the model performs well on the validation set. After fine-tuning the model's parameters, the model is applied to the test set; its performance there approximates what would be expected when the model makes predictions on unseen data.
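The model building step can be sketched with scikit-learn, using the tumor classification problem framed earlier. This is one possible implementation, not the only one: logistic regression is chosen from the model list above, the built-in breast cancer dataset stands in for real data, and cross-validation on the training set plays the role of the validation split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the data and hold out a test set for the final evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model building: standardize features, then fit logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning with 5-fold cross-validation on the training set
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Performance on the held-out test set approximates performance on unseen data
test_accuracy = grid.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same workflow (split, tune, cross-validate, evaluate once on held-out data) applies regardless of which model family or library is chosen.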
Application: In this stage, the final machine learning model is put into production to start improving the customer experience, increasing productivity, deciding whether a bank should approve credit to a borrower, etc. The model is evaluated in a production setting in order to assess its performance. This can be done by comparing the performance of the machine learning solution against a baseline or control solution using methods such as A/B testing. Any discrepancy between the model's experimental performance and its actual performance in production has to be analyzed; the findings can then be used to fine-tune the original model.
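An A/B comparison of this kind is often summarized with a two-proportion z-test. The counts below are hypothetical, standing in for success rates observed under a control solution and under the deployed model:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B test counts: baseline (control) vs. deployed ML model
control_success, control_total = 120, 1000
model_success, model_total = 150, 1000

# Two-proportion z-test for a difference in success rates
p1 = control_success / control_total
p2 = model_success / model_total
p_pool = (control_success + model_success) / (control_total + model_total)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / model_total))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the model's lift over the baseline is unlikely to be due to chance, which supports keeping the model in production; a large one suggests reverting to or refining the original model.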
An example of a machine learning workflow with Python implementation can be found here:
IV. Summary and Conclusion
In summary, we've examined two case studies of data science projects, highlighting the logical steps and processes involved, irrespective of the programming language used for implementation. The fundamental skill in data science is mastering the logical reasoning and flow of the process, not the programming language used.