Getting Started With Machine Learning!

An introduction to the ‘Data’ world and Machine Learning

Kavita Anant
Towards Data Science

--

In recent years there has been a lot of buzz about the terms ‘Machine Learning’, ‘Data Analytics’, ‘Artificial Intelligence’, ‘Deep Learning’, ‘Big Data’ etc., especially since Data Science was named the ‘Sexiest Job of the 21st Century’. These terms are often used interchangeably. However, although related, each one means something different, and each opens up a plethora of opportunities in the various fields it can be applied to.

Breaking Down the ‘Data’ buzzwords

Data buzzwords

The meaning of each term can be broken down with the help of the following example-

Consider a robot specially designed to clean the floor. This robot is used in a house next to a mill. Because of the mill, a lot of dust enters the house during the day. At night the mill is closed, so there is comparatively less dust, and the mill is also closed on weekends, so there is less dust on weekends too. Now the robot must understand how, when and where the floor should be cleaned.

Robot for cleaning the floor

The robot has to learn that there is more dust during the daytime on weekdays and less at night and on weekends. This process of the robot learning from a given set of inputs and outputs is Machine Learning. [7]

Now that the robot has learnt when there is more dust, it has to make an appropriate decision: clean more during the day and less at night and on weekends. This is Artificial Intelligence.

With the help of its sensors, the robot collects data about the areas where dust accumulates the most. The sensors capture data through voice inputs from the members of the house (in the form of speech), through photos of various parts of the house (in the form of images), etc. Processing this structured and unstructured data in order to derive meaningful information is Big Data. [6]

Curating meaningful insights from the data captured by the robot is Data Analytics. An eye for details such as different battery consumption patterns at different times of the day, or which corners of the house collect more dust, forms the basis of Data Analytics, which is important from a business perspective.

Say, through Data Analytics, we understand the robot’s different battery consumption cycles. Modelling the business around this information is important. This data pattern forms the basis for crucial decisions such as introducing a battery saver that triggers automatically at night, which gives the robot a longer battery life and can be promoted in the market as a USP. Thus, making business decisions based on the curated insights so as to get maximum benefit is called Data Science. [4]

Diving into machine learning-

What is Machine Learning?

Typically, machine learning is defined as the technology that allows a machine (computer) to learn and act without being explicitly programmed. [11]

Consider the following information given to a human to read-

In India it rains from June to September. Rainfall is heavy in the coastal areas whereas it is moderate in the interiors. Most of the heavy rainfall is in the month of July. In the years 2013–2017, heavy rainfall has been observed in the first 2 weeks of July.

Now if I ask you, as a human, to reply to the following question –

What is the probability that it will rain today, i.e. 4th July 2018 in the city of Mumbai?

The obvious answer is that the probability of rainfall is high. Now we want the machine to understand and learn this too. Having been fed a certain set of past data (inputs as well as outputs), the machine should be able to ‘think’ about what the output for a new input might be.

Typically, in a machine learning problem we have a set of training data. This training data contains both the inputs and the corresponding outputs. A machine learning algorithm is then applied to learn from this data set, based on which the machine forms a hypothesis. Using this hypothesis, the machine should be able to predict the output for an unseen input.

Basic Block diagram of a machine learning algorithm [8]

Steps to solve a machine learning problem

Machine Learning is not a one-step procedure. The following steps are generally followed to solve a machine learning problem [14]. I have used the simple Iris data set to explain them. [1]

I have modified some values in the data set to explain the following steps. Check out this Iris Data set HERE

1. Problem Definition-

Before learning from the past data set, it is important to first understand what exactly we want to learn. The problem can be a classification or a regression problem. The entire machine learning modelling depends on the aim, or problem definition. In our data set we have the petal and sepal lengths and widths. We have to first understand what we are expected to solve for this data set: do we have to build a hypothesis to predict the class of a new flower whose petal length, sepal length, sepal width and petal width are known, or do we only have to analyse whether or not classification is possible?

2. Frame

The next step is to understand what our features or attributes are (usually represented as X) and what our target or output is (usually represented by y) for a given training data set. Consider the Iris data set-

Features X=sepal length (slen), sepal width (swid), petal length (plen) and petal width (pwid)

Target y=class

Iris Data Set [1]

3. Import or Acquire the data set

We must understand how we can import/acquire the data into our program. The data can be structured or unstructured and can be a Comma Separated Values (.csv) file, an Excel (.xlsx) sheet, etc. For a .csv file, the data set can be imported using the following command-

import pandas
dataSet = pandas.read_csv('DataFrame_name.csv')

Acquiring Data
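
As a minimal sketch (assuming the modified data has been saved locally as ‘iris.csv’ with the columns slen, swid, plen, pwid and class used in this article), the acquired data can be previewed as follows-

import pandas as pd

# load the data set; the file name 'iris.csv' is an assumption for illustration
dataSet = pd.read_csv('iris.csv')

print(dataSet.shape)    # number of rows and columns
print(dataSet.head())   # first five rows of the data set
dataSet.info()          # column names, data types and non-null counts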

4. Clean and Refine

The data set may have missing values or garbage values, due to which the data cannot be processed. Hence, pre-processing of the data is required. We must check for the following in an acquired data set-

  • Check for missing values- some values in the given data set are not available
  • Check for garbage values- some values in the given data set are logically incorrect (e.g. if we are talking about the sepal length and the value is ‘purple’)
  • Check for outliers- some values in the given data set lie far outside the range of the other values (e.g. if the sepal length values are {1.5, 2.3, 3.2, 2.2, 1.75, 15.2, 3.2}, then 15.2 is an outlier)
Null values in the data set
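
The checks above can be sketched with pandas as follows (the file name ‘iris.csv’ and the numeric column ‘slen’ are assumptions for illustration)-

import pandas as pd

dataSet = pd.read_csv('iris.csv')   # hypothetical file name for the modified data set

# count of missing values per column
print(dataSet.isnull().sum())

# garbage values often show up as an unexpected dtype (e.g. object instead of float)
print(dataSet.dtypes)

# a simple outlier check: values more than 3 standard deviations away from the mean
slen = pd.to_numeric(dataSet['slen'], errors='coerce')
print(slen[(slen - slen.mean()).abs() > 3 * slen.std()])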

Treatment for missing/garbage values:

  • Remove missing/garbage values
  • Convert interval values to categorical data
  • Use modelling techniques
  • Replace the data points (imputation)
Replacing the null values using statistical methods
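
A minimal sketch of the treatments above, again assuming the hypothetical ‘iris.csv’ file: rows with missing values can be dropped, or the missing values can be imputed-

import pandas as pd

dataSet = pd.read_csv('iris.csv')   # hypothetical file name

# Option 1: remove the rows that contain missing values
cleaned = dataSet.dropna()

# Option 2: impute - replace missing numeric values with the column mean and
# missing categorical values with the most frequent value (the mode)
dataSet['slen'] = dataSet['slen'].fillna(dataSet['slen'].mean())
dataSet['class'] = dataSet['class'].fillna(dataSet['class'].mode()[0])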

Standardization- While refining the data, standardization can also be performed. For example, a list of city names may contain ‘New York City’, ‘NY City’ and ‘NYC’. Since all three names refer to the same city, we can use one common representation for all of them. Similarly, dates may be written in DD/MM/YYYY, MM/DD/YYYY or DD/MM/YY format. As long as the values refer to the same date, they must be represented identically.
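
A small illustrative sketch of standardization, using hypothetical ‘city’ and ‘date’ columns-

import pandas as pd

# hypothetical columns 'city' and 'date' to illustrate standardization
df = pd.DataFrame({'city': ['New York City', 'NY City', 'NYC'],
                   'date': ['04/07/2018', '05/07/2018', '06/07/2018']})

# map every variant of the same city name to one common representation
df['city'] = df['city'].replace({'NY City': 'New York City', 'NYC': 'New York City'})

# parse the DD/MM/YYYY strings into one consistent datetime representation
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
print(df)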

5. Exploring the data

After acquiring and processing the data, it’s time to explore the attributes of the data set. This process is called Exploratory Data Analysis (EDA). It is basically the process of summarizing the data set that we are dealing with. [2]

We must understand what kind of data we are dealing with, as this is useful for further analysis. Data can be classified as follows-

Classification of data in a data set-

Numerical- Any quantitative value in a data set, such as the height of a person or the number of items, is a numerical value.

Categorical- Data points that can be classified into, say, class A and class B are called categorical data points.

a) Ordinal- If the classes can be ranked, e.g. class A having a higher rank than class B, the data points are ordinal.

b) Nominal- If the classes cannot be ranked, or if all classes are considered to be of the same rank, the data points are nominal.

Balanced and Unbalanced data set-

A data set can be balanced or unbalanced. Suppose the data set can be classified into 2 classes, class 1 and class 2. If there is an almost equal number of data points belonging to class 1 and class 2, the data set is called a balanced data set.

Data sets with a class ratio of up to 70-30 can be considered balanced.

Example of a balanced data set-

If the class ratio is more skewed than 70-30, say 90-10, the data set is considered an unbalanced data set.

Example of an unbalanced data set-

While solving a machine learning problem, it is extremely important to make sure that our data set is balanced. Otherwise, no matter which algorithm we use, we will not get an appropriate result.

Solutions for an unbalanced data set-

· Delete a few samples of the class with the greater number of data points

· Add a few samples of the class with fewer data points, by duplicating them or using statistical methods

· Perform machine learning as a batch process, taking repeated samples of the class with fewer data points

The Iris data set is a balanced data set, with all 3 classes having almost the same number of data points-
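
A quick way to check whether a data set is balanced (assuming the hypothetical ‘iris.csv’ file) is to count the data points per class-

import pandas as pd

dataSet = pd.read_csv('iris.csv')   # hypothetical file name

# number (and share) of data points per class; roughly equal counts mean the
# data set is balanced
print(dataSet['class'].value_counts())
print(dataSet['class'].value_counts(normalize=True))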

Summarizing the data-

Data can be summarized using statistical methods and visual methods.

Statistical Methods- These include finding the mean, median, mode, standard deviation, covariance etc. of the data points. Statistical methods help us understand the range, central value etc. of the data set.

Visual Methods- These include plotting histograms, box plots, scatter plots, the Cumulative Distribution Function (CDF), the Probability Density Function (PDF) etc. Visual methods clearly show the way a data set is distributed, its outliers, where the data is concentrated, etc.

Plotting PDFs
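
A minimal EDA sketch combining both approaches (statistical and visual), assuming the hypothetical ‘iris.csv’ file-

import pandas as pd
import matplotlib.pyplot as plt

dataSet = pd.read_csv('iris.csv')   # hypothetical file name

# statistical summary: count, mean, standard deviation, min/max and quartiles
print(dataSet.describe())

# visual summary: histogram of one feature and box plots of all numeric features
dataSet['slen'].plot(kind='hist', title='Distribution of sepal length')
plt.show()

dataSet.plot(kind='box')
plt.show()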

6. Transform-

Once we are through exploring the data set, we can transform it so that it can be used effectively. We can perform feature selection, create new features, drop redundant features, etc.

Transformation of the data table includes-

A)Creating new features

Sometimes, new features may have to be created. This can include generating a new feature derived from insights in the existing features, or using statistical methods such as the mean or median of existing features. This step is important when we have to reduce the number of features or columns in our data set. If we have two columns ‘a’ and ‘b’ whose data can be better represented by their mean ‘c’, then ‘a’ and ‘b’ can be dropped and the newly generated ‘c’ can be added to the data set, thus reducing the dimension of the data (a small sketch of this follows the list below). New features can be created using the following techniques-

a) Feature tools- Automated feature engineering tools like H2O, TPOT, Auto-sklearn etc. can be used [13]

b) Manual creation- New features can be generated through grouping, summarizing, deriving, etc.
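
A minimal sketch of the ‘a’, ‘b’, ‘c’ example described above, using a small hypothetical data frame-

import pandas as pd

# hypothetical columns 'a' and 'b' as in the example above
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [3.0, 4.0, 5.0]})

# create the new feature 'c' as the mean of 'a' and 'b', then drop 'a' and 'b'
df['c'] = df[['a', 'b']].mean(axis=1)
df = df.drop(columns=['a', 'b'])
print(df)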

B) Encoding (Categorical Data)

If we have a data set where the values of a column are strings (categories), we have to encode that column to convert the values to numbers for further calculations. This can be done using either of the following encoding techniques-

a) One-hot encoding- Expands the existing data

Example of One hot encoding

Since one-hot encoding expands the table, it is not preferred when the number of categories (and hence new columns) is large.
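
A minimal one-hot encoding sketch with pandas’ get_dummies, assuming the hypothetical ‘iris.csv’ file-

import pandas as pd

dataSet = pd.read_csv('iris.csv')   # hypothetical file name

# one-hot encode the 'class' column: it is replaced by one 0/1 column per class
encoded = pd.get_dummies(dataSet, columns=['class'])
print(encoded.head())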

b) Label encoding

The following is an example of label encoding. I have used scikit-learn’s LabelEncoder class for this.

Label Encoding
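
A minimal label encoding sketch using scikit-learn’s LabelEncoder, again assuming the hypothetical ‘iris.csv’ file-

import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataSet = pd.read_csv('iris.csv')   # hypothetical file name

# LabelEncoder maps each class name to an integer (0, 1, 2, ...)
le = LabelEncoder()
dataSet['class'] = le.fit_transform(dataSet['class'])
print(le.classes_)                  # original class names, in encoded order
print(dataSet['class'].unique())    # the integer labels now stored in the column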

7. Modelling the Data-

This step involves selecting the right algorithm to train on our data. An appropriate training algorithm must be selected based on whether the problem is a regression or classification problem, the accuracy requirements, etc.

Since ours is a classification problem, I have chosen the Decision Tree Algorithm.

While building the model, the data set must be split so that the model can be trained, validated and tested on separate subsets of the same data set.

Code snippet for splitting the data into training and testing data sets
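
A minimal sketch of such a split using scikit-learn’s train_test_split; the file name is an assumption, and the 70/20/10 proportions follow the hold-out scheme described below-

import pandas as pd
from sklearn.model_selection import train_test_split

dataSet = pd.read_csv('iris.csv')   # hypothetical file name
X = dataSet[['slen', 'swid', 'plen', 'pwid']]
y = dataSet['class']

# first set aside 10% as the test set, then split the remaining 90% into
# roughly 70% training and 20% validation data
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2/9, random_state=42)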

Hold-out validation- In this method the data is split so that 70% of the data is used to train the model, 20% to validate the model and 10% to test the model.

Hold-out validation

K-fold cross validation- Cross validation, or k-fold validation, is a popular validation method used in machine learning. In this method, training and validation are performed in iterations, where the number of iterations is defined by k. The data (excluding the test set, which is kept separate) is partitioned into k complementary subsets, or folds. In every iteration a different fold is used for validation while the remaining folds are used for training, so each fold is used for both training and validation. This reduces variability. [5]

Cross validation
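
A minimal cross-validation sketch using scikit-learn’s cross_val_score with a decision tree, assuming the hypothetical ‘iris.csv’ file-

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

dataSet = pd.read_csv('iris.csv')   # hypothetical file name
X = dataSet[['slen', 'swid', 'plen', 'pwid']]
y = dataSet['class']

# 5-fold cross validation: the model is trained and validated 5 times, each
# time holding out a different fold for validation
model = DecisionTreeClassifier(max_depth=5, random_state=42)
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one validation accuracy per fold
print(scores.mean())   # average validation accuracy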

8. Data validation

The model that has been trained on the training data set is tested against the validation data using validation techniques such as hold-out validation and cross validation, as mentioned in the previous step. The accuracy on the training and validation data sets can then be calculated. The training and validation accuracy should be high and close to each other.

Accuracy when Max depth=5
Accuracy when Max depth=10
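
As a sketch (reusing the X_train/X_val split from the splitting sketch in step 7, with max_depth=5 as an assumed value), training and validation accuracy can be compared like this-

from sklearn.tree import DecisionTreeClassifier

# fit a decision tree on the training split and compare training accuracy
# with validation accuracy
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
print('training accuracy  :', model.score(X_train, y_train))
print('validation accuracy:', model.score(X_val, y_val))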

9. Hyper parameter tuning

After validating the model, we can tune the hyperparameters in order to optimize it. The model is optimized until it performs satisfactorily on both the training and the validation data sets.

In our data set, we get a decent training accuracy (0.998) and validation accuracy (0.923) for max_depth=5. If we do not get good accuracy, we may change some hyperparameters, such as the value of max_depth in the decision tree or the value of k (the number of folds or subsets). With max_depth=10 we get a perfect training accuracy but a comparatively lower validation accuracy. So we can plot a graph to find out which max_depth gives the best accuracy and set the hyperparameter max_depth to that value. [10]

Training and testing accuracy for different max_depths

Observing the graph, it can be concluded that the testing accuracy is highest for max_depth=3 or 4. Hence we tune our hyperparameter max_depth accordingly.
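
A sketch of this tuning loop (reusing the X_train/X_val split from the earlier sketch)-

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# try a range of max_depth values and record training and validation accuracy
depths = range(1, 11)
train_acc, val_acc = [], []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    model.fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    val_acc.append(model.score(X_val, y_val))

plt.plot(depths, train_acc, label='training accuracy')
plt.plot(depths, val_acc, label='validation accuracy')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()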

10. Testing

Once the model is optimized, it is ready for testing. The test data set is kept aside from the training and validation data sets and is not involved in the process of optimization. This is the final step, to test whether the model will work on general, unseen data.

Our model gives an accuracy of 0.866 on the test data.

Test data output
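
A final sketch (reusing the earlier split, with max_depth=4 assumed to be the value chosen during tuning) evaluates the model once on the untouched test data-

from sklearn.tree import DecisionTreeClassifier

# retrain with the max_depth chosen during tuning and evaluate once on the
# held-out test split (X_test, y_test) from the earlier splitting sketch
final_model = DecisionTreeClassifier(max_depth=4, random_state=42)
final_model.fit(X_train, y_train)
print('test accuracy:', final_model.score(X_test, y_test))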

Check out the complete code HERE

To summarize, the following is the workflow of Machine Learning-

References-

[1] Iris data set-https://archive.ics.uci.edu/ml/datasets/iris

[2] www.appliedaicourse.com

[3] www.dummies.com

[4] https://youtu.be/w-8MTXT_N6A

[5] https://en.wikipedia.org/wiki/Cross-validation_(statistics)

[6] https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article

[7] https://www.machinecurve.com/index.php/2017/09/30/the-differences-between-artificial-intelligence-machine-learning-more/#now-what-is-a-data-scientist

[8] http://communicationpolice.blogspot.com/2016/06/machine-learning-supervisedunsupervised.html

[9] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

[10] https://matplotlib.org/users/pyplot_tutorial.html

[11] https://www.coursera.org/lecture/machine-learning/what-is-machine-learning-Ujm7v

[12] https://sebastianraschka.com/Articles/2014_intro_supervised_learning.html

[13] https://www.featuretools.com/

[14] Applied Machine Learning workshop organized during the three-day DataHack Summit by Analytics Vidhya. Date: 24th Nov 2018. Speaker: Amit Kapoor
