Machine Learning with Python

Created: Introduction to Machine Learning with Python

Based on this Cognitive Class Course

Labs

The labs for the course are located in the Labs folder; they are from CognitiveClass and are licensed under MIT

Intro to ML

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed

Some popular techniques are:

  • Regression for predicting continuous values
  • Classification for predicting a class/category
  • Clustering for finding structure in data and summarization
  • Associations for finding items/events that co-occur
  • Anomaly detection is used for finding abnormal/unusual cases
  • Sequence mining is for predicting next values
  • Dimension reduction for reducing the size of data
  • Recommendation systems

We have a few different buzzwords

  • AI
    • Computer Vision
    • Language processing
    • Creativity
  • Machine learning
    • Field of AI
    • Experience based
    • Classification
    • Clustering
    • Neural Networks
  • Deep Learning
    • Specialized case of ML
    • More automation than most ML

Python for Machine Learning

Python has many different libraries for machine learning such as

  • NumPy
  • SciPy
  • Matplotlib
  • Pandas
  • Scikit Learn

Supervised vs Unsupervised

Supervised learning involves us supervising a machine learning model. We do this by teaching the model with a labelled dataset

There are two types of supervised learning, namely Classification and Regression

Unsupervised learning is when the model works on its own to discover information about data using techniques such as Dimension Reduction, Density Estimation, Market Basket Analysis, and Clustering

  • Supervised
    • Classification
    • Regression
    • More evaluation methods
    • Controlled environment
  • Unsupervised
    • Clustering
    • Fewer evaluation methods
    • Less controlled environment

Regression

Regression makes use of two types of variables

  • Dependent - the target y that we are trying to predict
  • Independent - the predictors x

With regression the target y must be continuous, while the predictors x can be continuous, discrete, or categorical

There are two types of regression:

  • Simple Regression
    • Simple Linear Regression
    • Simple Non-Linear Regression
    • Single predictor x
  • Multiple Regression
    • Multiple Linear Regression
    • Multiple Non-Linear Regression
    • Multiple predictors x

Regression is used when the target is continuous and is well suited to predicting continuous values

There are many regression algorithms such as

  • Ordinal regression
  • Poisson regression
  • Fast forest quantile regression
  • Linear, polynomial, lasso, stepwise, and ridge regression
  • Bayesian linear regression
  • Neural network regression
  • Decision forest regression
  • Boosted decision tree regression
  • K nearest neighbors (KNN)

Each of these is better suited to some circumstances than to others

Simple Linear Regression

In SLR we have two variables, one dependent and one independent. The predictor (x) can be either continuous or categorical, but the target (y) must be continuous

To get a better idea of whether SLR is appropriate we can simply plot x vs y and find the line which best fits the data

The line is represented by the following equation

y=\theta_0+\theta_1x_1

The aim of SLR is to adjust the θ values to minimize the residual error of the fit and find the best fit, for example as measured by the mean squared error (MSE)

MSE=\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2

Estimating Parameters

We have two options to estimate our parameters for an SLR problem: a direct mathematical approach, or an optimization approach such as gradient descent

Using the mathematical approach we estimate θ₀ and θ₁ with the following equations

\theta_1 = \frac{\Sigma_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\Sigma_{i=1}^n(x_i-\bar x)^2}

\theta_0=\bar y-\theta_1\bar x

We can use these values to make predictions with the equation

\hat y=\theta_0+\theta_1x_1
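
As a rough added illustration (not part of the lab), the closed-form estimates above can be computed directly with NumPy; the x and y arrays below are made-up example values

import numpy as np

# made-up example data
x = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.2])
y = np.array([150.0, 196.0, 221.0, 240.0, 255.0, 290.0])

# closed-form estimates for simple linear regression
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()

# predictions and the resulting mean squared error
y_hat = theta_0 + theta_1 * x
mse = np.mean((y - y_hat) ** 2)
print(theta_0, theta_1, mse)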

Pros

  • Fast
  • Easy to Understand
  • No tuning needed

Lab

Import Necessary Libraries
%reset -f
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import numpy as np
%matplotlib inline
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv')
df.head()
MODELYEAR MAKE MODEL VEHICLECLASS ENGINESIZE CYLINDERS TRANSMISSION FUELTYPE FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
0 2014 ACURA ILX COMPACT 2.0 4 AS5 Z 9.9 6.7 8.5 33 196
1 2014 ACURA ILX COMPACT 2.4 4 M6 Z 11.2 7.7 9.6 29 221
2 2014 ACURA ILX HYBRID COMPACT 1.5 4 AV7 Z 6.0 5.8 5.9 48 136
3 2014 ACURA MDX 4WD SUV - SMALL 3.5 6 AS6 Z 12.7 9.1 11.1 25 255
4 2014 ACURA RDX AWD SUV - SMALL 3.5 6 AS6 Z 12.1 8.7 10.6 27 244
Data Exploration
df.describe()
MODELYEAR ENGINESIZE CYLINDERS FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
count 1067.0 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000 1067.000000
mean 2014.0 3.346298 5.794752 13.296532 9.474602 11.580881 26.441425 256.228679
std 0.0 1.415895 1.797447 4.101253 2.794510 3.485595 7.468702 63.372304
min 2014.0 1.000000 3.000000 4.600000 4.900000 4.700000 11.000000 108.000000
25% 2014.0 2.000000 4.000000 10.250000 7.500000 9.000000 21.000000 207.000000
50% 2014.0 3.400000 6.000000 12.600000 8.800000 10.900000 26.000000 251.000000
75% 2014.0 4.300000 8.000000 15.550000 10.850000 13.350000 31.000000 294.000000
max 2014.0 8.400000 12.000000 30.200000 20.500000 25.800000 60.000000 488.000000
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(10)
ENGINESIZE CYLINDERS FUELCONSUMPTION_COMB CO2EMISSIONS
0 2.0 4 8.5 196
1 2.4 4 9.6 221
2 1.5 4 5.9 136
3 3.5 6 11.1 255
4 3.5 6 10.6 244
5 3.5 6 10.0 230
6 3.5 6 10.1 232
7 3.7 6 11.1 255
8 3.7 6 11.6 267
9 2.4 4 9.2 212
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
plt.title('CO2 Emission vs Fuel Consumption')
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Engine Size')
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
plt.title('CO2 Emission vs Cylinders')
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()
Test-Train Split

We need to split our data into a test set and a train set

tt_mask = np.random.rand(len(df)) < 0.8
train = cdf[tt_mask].reset_index()
test = cdf[~tt_mask].reset_index()
Simple Regression Model

We can look at the distribution of the Engine Size in our training and test set respectively as follows

plt.title('CO2 Emissions vs Engine Size for Test and Train Data')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.scatter(test.ENGINESIZE, test.CO2EMISSIONS,color='red',label='test')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Modeling
from sklearn import linear_model
lin_reg = linear_model.LinearRegression()
train_x = train[['ENGINESIZE']]
train_y = train[['CO2EMISSIONS']]

test_x = test[['ENGINESIZE']]
test_y = test[['CO2EMISSIONS']]
lin_reg.fit(train_x, train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
'Coefficients: ' + str(lin_reg.coef_) + ' Intercept: ' + str(lin_reg.intercept_)
'Coefficients: [[ 39.30964622]] Intercept: [ 124.8710344]'

We can plot the line on our data to see the fit

plt.title('CO2 Emissions vs Engine Size, Training and Fit')
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,color='blue',label='train')
plt.plot(train_x, lin_reg.coef_[0,0]*train_x + lin_reg.intercept_[0],color='red',label='regression')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.legend()
plt.show()
Model Evaluation
Import Packages
from sklearn.metrics import r2_score
Predict the CO2 Emissions
predicted_y = lin_reg.predict(test_x)
Display Results
results = pd.DataFrame()

results[['ENGINESIZE']] = test_x
results[['ACTUALCO2']] = test_y
results[['PREDICTEDCO2']] = pd.DataFrame(predicted_y)
results[['ERROR']] = pd.DataFrame(np.abs(predicted_y - test_y))
results[['SQUAREDERROR']] = pd.DataFrame((predicted_y - test_y)**2)

results.head()
ENGINESIZE ACTUALCO2 PREDICTEDCO2 ERROR SQUAREDERROR
0 5.9 359 356.797947 2.202053 4.849037
1 2.0 230 203.490327 26.509673 702.762771
2 2.0 230 203.490327 26.509673 702.762771
3 2.0 214 203.490327 10.509673 110.453230
4 5.2 409 329.281195 79.718805 6355.087912
Model Evaluation
MAE = np.mean(results[['ERROR']])
MSE = np.mean(results[['SQUAREDERROR']])
R2  = r2_score(test_y, predicted_y)

print("Mean absolute error: %.2f" % MAE)
print("Residual sum of squares (MSE): %.2f" % MSE)
print("R2-score: %.2f" % R2)

Multiple Linear Regression

In reality multiple independent variables will define a specific target. MLR is simply an extension of the SLR model

MLR is useful for solving problems such as

  • Identifying the strength of the effect that the independent variables have on the prediction
  • Predicting the impact of a change in a specific variable

MLR makes use of multiple predictors to predict the target value, and is generally of the form

\hat y=\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n

\hat y=\theta^TX

θ is the vector of coefficients which are multiplied by x; these are called the parameters or weight vector, and x is the feature set. The idea with MLR is to find the best-fit hyperplane for our data
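
As a small added sketch of what this looks like in practice (my own addition rather than the lab's code), we can fit an MLR model with scikit-learn by passing multiple predictor columns; this assumes the train frame from the SLR lab above is still defined

from sklearn import linear_model

# assumes the train split from the SLR lab above is still in scope
mlr_train_x = train[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]
mlr_train_y = train[['CO2EMISSIONS']]

mlr = linear_model.LinearRegression()
mlr.fit(mlr_train_x, mlr_train_y)

# one coefficient per predictor, i.e. the fitted hyperplane's theta values
print('Coefficients:', mlr.coef_, 'Intercept:', mlr.intercept_)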

Estimating Parameters

We have a few ways to estimate the best parameters, such as

  • Ordinary Least Squares
    • Linear algebra
    • Not suited to large datasets
  • Gradient Descent
    • Good for large datasets
  • Other methods are available to do this as well

How Many Variables?

Making use of more variables will generally increase the accuracy of the model, however using too many variables without good justification can lead to us overfitting the model

We can make use of categorical variables if we convert them to numeric values

MLR assumes that we have a linear relationship between the dependent and independent variables

Model Evaluation

We have to perform regression evaluation when building a model

Train/Test Joint

We train the model on the entire dataset and then compare its predicted values to the actual values from that same data

The error of the model is the average difference between the actual and predicted values

This approach has a high training accuracy, but a lower out-of-sample accuracy

Aiming for a very high training accuracy can lead to overfitting to the training data, resulting in poor out-of-sample accuracy

Train/Test Split

We split our data into a portion for testing and a portion for training, these two sets are mutually exclusive and allow us to get a good idea of what our out-of-sample accuracy will be

Generally we would retrain the model on the testing data as well afterwards (i.e. on the whole dataset) in order to increase its accuracy

K-Fold Cross-Validation

This involves splitting the dataset into K folds and using each fold as the test set in turn, then averaging the results to get a more reliable measure of out-of-sample accuracy
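
A minimal added sketch of K-fold cross-validation using scikit-learn's cross_val_score, here applied to the engine-size regression from the earlier lab (it assumes the cdf frame from above); this is not part of the course labs

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 5 folds: each fold is used once as the test set, the rest as training data
scores = cross_val_score(LinearRegression(),
                         cdf[['ENGINESIZE']], cdf['CO2EMISSIONS'],
                         cv=5, scoring='r2')
print(scores, scores.mean())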

Evaluation Metrics

Evaluation metrics are used to evaluate the performance of a model; they provide insight into areas of the model that require attention

Errors

In the context of regression, error is the difference between the data points and the value determined by the model

Some of the main error equations are defined below

MAE=\frac{1}{n}\Sigma_{i=1}^n|y_i-\hat y_i|

MSE=\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2

RMSE=\sqrt{\frac{1}{n}\Sigma_{i=1}^n(y_i-\hat y_i)^2}

RAE=\frac{\Sigma_{i=1}^n|y_i-\hat y_i|}{\Sigma_{i=1}^n|y_i-\bar y|}

RSE=\frac{\Sigma_{i=1}^n(y_i-\hat y_i)^2}{\Sigma_{i=1}^n(y_i-\bar y)^2}

Fit

R² helps us see how closely our data is represented by a specific regression line, and is defined as

R^2=1-RSE

Or

R^2=1-\frac{\Sigma_{i=1}^n(y_i-\hat y_i)^2}{\Sigma_{i=1}^n(y_i-\bar y)^2}

A higher R² represents a better fit
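
The error measures and R² above can be sketched directly with NumPy (an added example, not from the lab); the y and y_hat arrays here are small made-up values purely for illustration

import numpy as np

y = np.array([200.0, 230.0, 250.0, 300.0])      # actual values (made up)
y_hat = np.array([210.0, 225.0, 260.0, 290.0])  # predicted values (made up)

mae = np.mean(np.abs(y - y_hat))
mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
rae = np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y.mean()))
rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2 = 1 - rse
print(mae, mse, rmse, rae, rse, r2)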

Non-Linear Regression

Not all data can be predicted using a straight regression line; we have many different forms of regression to fit more complex data

Polynomial Regression

Polynomial Regression is a method with which we can fit a polynomial to our data. It is still possible for us to solve a polynomial regression by transforming it into a multi-variable linear regression problem as follows

Given the polynomial

\hat y=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3

We can create new variables which represent the different powers of our initial variable

x_1=x

x_2=x^2

x_3=x^3

Therefore resulting in the following linear equation

\hat y=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3
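
As an added sketch (not from the lab), scikit-learn's PolynomialFeatures can perform exactly this transformation before fitting an ordinary linear model; the x and y values below are made up

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up predictor
y = np.array([2.1, 9.8, 28.5, 65.2, 126.1])        # roughly cubic target

# expand x into [1, x, x^2, x^3] and fit a linear model on the new columns
x_poly = PolynomialFeatures(degree=3).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.coef_, model.intercept_)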

Other Non-Linear Regression

Non-Linear Regression can be of many forms as well, including any other mathematical relationships that we can define

For more complex NLR problems it can be difficult to evaluate the parameters for the equation

Lab

There are many different model types and equations shown in the Lab Notebook aside from what I have here

Import the Data

Using China's GDP data

df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv')
df.head()
Year Value
0 1960 5.918412e+10
1 1961 4.955705e+10
2 1962 4.668518e+10
3 1963 5.009730e+10
4 1964 5.906225e+10
x_data, y_data = (df[['Year']], df[['Value']])
Plotting the Data
plt.title('China\'s GDP by Year')
plt.plot(x_data, y_data, 'o')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Defining a Fit

Next we can try to approximate a curve that we think will fit the data we have; in this case we can use a sigmoid, as defined below

\hat{Y} = \frac{1}{1+e^{-\beta_1(X-\beta_2)}}

β₁: controls the curve's steepness

β₂: slides the curve along the x-axis

def sigmoid(x, b_1, b_2):
     y = 1 / (1 + np.exp(-b_1*(x-b_2)))
     return y

The shape of this function can be seen by plotting it

X = np.arange(-5.0, 5.0, 0.1)
Y = sigmoid(X, 1, 1)

plt.title('Sigmoid')
plt.plot(X,Y) 
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

Next let's try to fit this to the data with some example values

b_1 = 0.10
b_2 = 1990.0

#logistic function
y_pred = sigmoid(x_data, b_1 , b_2)

#plot initial prediction against datapoints
plt.title('Approximating NLR with Sigmoid')
plt.plot(x_data, y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
plt.show()
Data Normalization

Let's normalize our data so that we don't need to multiply by crazy numbers as before

# for some reason this seems to be the only way the conversion
# from a dataframe works as desired
# the normalization from the labs are as such:
# xdata =x_data/max(x_data)
# ydata =y_data/max(y_data)
x_norm = (np.array(x_data)/max(np.array(x_data))).transpose()[0]
y_norm = (np.array(y_data)/max(np.array(y_data))).transpose()[0]
Finding the Best Fit

Next we can import curve_fit to help us fit the curve to our data

from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, x_norm, y_norm)
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
print(popt)
print(pcov)

And we can plot the result as follows

x = np.linspace(1960, 2015, 55)
x = x/max(x)
y = sigmoid(x, *popt)

plt.title('Sigmoid Fit of Data')
plt.plot(x_norm, y_norm, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend()
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
Model Accuracy
from sklearn.metrics import r2_score

# split data into train/test
mask = np.random.rand(len(df)) < 0.8
train_x = x_norm[mask]
test_x = x_norm[~mask]
train_y = y_norm[mask]
test_y = y_norm[~mask]

# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)

# predict using test set
y_hat = sigmoid(test_x, *popt)

# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, y_hat))

Classification

Classification is a supervised learning approach which is a means of splitting data into discrete classes

The target attribute is a categorical value with discrete values

Classification will determine the class label for a specific test case

Binary as well as multi-class classification methods are available

Learning Algorithms

Many learning algorithms are available for classification such as

  • Decision trees
  • Naive Bayes
  • KNN
  • Logistic Regression
  • Neural Networks
  • SVM

Evaluation Metrics

We have a few different evaluation metrics for classification

Jaccard Index

We simply measure the fraction of our predicted values ŷ that intersect with the actual values y

J(y,\hat y)=\frac{|y\cap\hat y|}{|y\cup\hat y|}=\frac{|y\cap\hat y|}{|y|+|\hat y|-|y\cap\hat y|}
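
As a quick added sketch with made-up labels, treating |y ∩ ŷ| as the number of positions where the predicted and actual labels agree (as the course formula does):

import numpy as np

y = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])      # made-up actual labels
y_hat = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 1])  # made-up predictions

matches = np.sum(y == y_hat)                       # |y ∩ ŷ|
jaccard = matches / (len(y) + len(y_hat) - matches)
print(jaccard)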

F1 Score

This is a measure which makes use of a confusion matrix and compares the predictions vs actual values for each class

In the case of binary classification this will give us our True Positives, False Positives, True Negatives and False Negatives

We can define some metrics for each class with the following

Precision=\frac{TP}{TP+FP}

Recall=\frac{TP}{TP+FN}

F1=2\frac{Precision\times Recall}{Precision+Recall}

F1 varies between 0 and 1, with 1 being the best

The F1 score for the classifier as a whole is the average of the F1 scores for each of its classes
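
A small added sketch, with made-up binary labels, of how the confusion matrix counts feed into these formulas; sklearn's precision_score, recall_score and f1_score would give the same results

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])      # made-up actual labels
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # made-up predictions

tp = np.sum((y == 1) & (y_hat == 1))
fp = np.sum((y == 0) & (y_hat == 1))
fn = np.sum((y == 1) & (y_hat == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)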

Log Loss

Log loss measures the performance of a classifier where the predicted output is a probability between 0 and 1

LogLoss=-\frac{1}{n}\Sigma_{i=1}^n\left(y_i\cdot log(\hat y_i)+(1-y_i)\cdot log(1-\hat y_i)\right)

Better classifiers have a log loss closer to zero
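
A minimal added sketch of the averaged log loss with made-up labels and probabilities; sklearn.metrics.log_loss does the same thing (with some clipping for numerical safety)

import numpy as np

y = np.array([1, 0, 1, 1, 0])            # made-up actual labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # made-up predicted probabilities P(y=1)

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)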

K-Nearest Neighbor

KNN is a method of determining the class of a test datapoint based on the training datapoints that sit near it, on the basis that datapoints close to one another are more important than those further away when predicting a value

Algorithm

  1. Pick a value for K
  2. Calculate the distance of the unknown case from the known cases
  3. Select the K nearest observations
  4. Predict the value based on the most common observation value

We can make use of euclidean distance to calculate the distance between our continuous values, and a voting system for discrete data

Using a low K value can lead to overfitting, and using a very high value can lead to us underfitting

In order to find the optimal K value we do multiple tests by continuously increasing our K value and measuring the accuracy for that K value

Furthermore KNN can also be used to predict continuous values (regression) by making the target variable continuous, for example by averaging the values of the nearest neighbours
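
A rough added sketch of the algorithm itself on made-up 2D points, using Euclidean distance and a majority vote (the lab below uses sklearn's KNeighborsClassifier instead)

import numpy as np
from collections import Counter

# made-up labelled training points and a single unknown point
X_known = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_known = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([5, 5])
k = 3

# euclidean distance from the unknown point to every known point
distances = np.sqrt(((X_known - x_new) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:k]

# majority vote among the k nearest neighbours
prediction = Counter(y_known[nearest]).most_common(1)[0][0]
print(prediction)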

Lab

Import Libraries
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
Import Data

The dataset being used is one in which demographic data is used to predict a customer service group, the groups being as follows

Value Category
1 Basic Service
2 E-Service
3 Plus Service
4 Total Service
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv')
df.head()
region tenure age marital address income ed employ retire gender reside custcat
0 2 13 44 1 9 64.0 4 5 0.0 0 2 1
1 3 11 33 1 7 136.0 5 5 0.0 0 6 4
2 3 68 52 1 24 116.0 1 29 0.0 1 2 3
3 2 33 33 0 12 33.0 2 0 0.0 1 1 1
4 2 23 30 1 9 30.0 1 2 0.0 0 4 3
Data Visualization and Analysis

We can look at the number of customers in each class

df.custcat.value_counts()
3    281
1    266
4    236
2    217
Name: custcat, dtype: int64
df.hist()
plt.show()

We can take a closer look at income with

df.income.hist(bins=50)
plt.title('Income of Customers')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
Features

To use sklearn we need to convert our data into an array as follows

df.columns
Index(['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
       'employ', 'retire', 'gender', 'reside', 'custcat'],
      dtype='object')
# X = df.loc[:, 'region':'reside'].values
# Y = df.loc[:,'custcat'].values
X = df.loc[:, 'region':'reside']
Y = df.loc[:,'custcat']
X.head()
region tenure age marital address income ed employ retire gender reside
0 2 13 44 1 9 64.0 4 5 0.0 0 2
1 3 11 33 1 7 136.0 5 5 0.0 0 6
2 3 68 52 1 24 116.0 1 29 0.0 1 2
3 2 33 33 0 12 33.0 2 0 0.0 1 1
4 2 23 30 1 9 30.0 1 2 0.0 0 4
Y.head()
0    1
1    4
2    3
3    1
4    3
Name: custcat, dtype: int64
Normalize Data

For algorithms like KNN which are distance based it is useful to normalize the data to have a zero mean and unit variance; we can do this using the sklearn.preprocessing package

X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
print(X[0:5])
Test/Train Split

Next we can split our model into a test and train set using sklearn.model_selection.train_test_split()

from sklearn.model_selection import train_test_split
ran = 4
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=ran)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Classification

We can then make use of the KNN classifier on our data

from sklearn.neighbors import KNeighborsClassifier as knn_classifier

We will use an initial value of 4 for k, but will later evaluate different k values

k = 4
knn = knn_classifier(n_neighbors=k)
knn.fit(X_train, Y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='uniform')
Y_hat = knn.predict(X_test)
print(Y_hat[0:5])
Model Evaluation
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, knn.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_hat))
Other K Values

We can do this for additional K values to look at how the accuracy is affected

k_max = 100
mean_acc = np.zeros((k_max))
std_acc = np.zeros((k_max))
ConfustionMx = [];
for n in range(1,k_max + 1):
    
    #Train Model and Predict  
    knn = knn_classifier(n_neighbors = n).fit(X_train,Y_train)
    Y_hat = knn.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(Y_test, Y_hat)

    std_acc[n-1] = np.std(Y_hat == Y_test)/np.sqrt(Y_hat.shape[0])
    
print(mean_acc)
plt.title('Accuracy vs K')
plt.plot(range(1,k_max + 1),mean_acc,'g')
plt.fill_between(range(1,k_max + 1),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

The maximum accuracy can be found to be

print('Max Accuracy: {}, K={}'.format(max(mean_acc),mean_acc.argmax() + 1))
Test Sample

It can be noted that the accuracy and the optimal K value vary based on the random_state parameter in the train_test_split function used when doing the test/train split

Retrain with All Data

We can retrain the model to use all the data at the determined optimal value and look at the in-sample accuracy

k = mean_acc.argmax() + 1
knn = knn_classifier(n_neighbors=k)
knn.fit(X, Y)
print("In-Sample Accuracy: ", metrics.accuracy_score(Y, knn.predict(X)))

Decision Trees

Decision Trees allow us to make use of discrete and continuous predictors to find a discrete target

Decision trees test a condition and branch off based on the result, eventually leading to a specific outcome/decision

Algorithm

  1. Choose a dataset
  2. Calculate the significance of an attribute in splitting the data
  3. Split the data based on the value of the attribute
  4. Go to 1

We aim to have resulting nodes that are high in purity. A higher purity increases predictiveness/significance

Recursive partitioning is used to decrease the impurity/entropy in the resulting nodes

Entropy is a measurement of randomness

If the samples are equally mixed the entropy is 1; if the samples are pure the entropy is 0

Entropy=-\Sigma_v P(v)\cdot log_2(P(v))

The best tree is the one that results in the most information gain after the split

Gain(S,A) = Entropy(S)-\Sigma_v\frac{|S_v|}{|S|}Entropy(S_v)
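
A small added sketch of entropy and information gain for a made-up binary split, just to show how the formulas above are applied

import numpy as np

def entropy(labels):
    # -sum(p * log2(p)) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# made-up parent node and a candidate split into two child nodes
parent = np.array(['A'] * 9 + ['B'] * 5)
left = np.array(['A'] * 8 + ['B'] * 2)
right = np.array(['A'] * 1 + ['B'] * 3)

gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(entropy(parent), gain)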

Lab

Import Libraries
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv')
print(df.shape)
df.head()
Age Sex BP Cholesterol Na_to_K Drug
0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
Split X and Y Values
X_headers = ['Age','Sex','BP','Cholesterol','Na_to_K']
X = df[X_headers]
X.head()
Age Sex BP Cholesterol Na_to_K
0 23 F HIGH HIGH 25.355
1 47 M LOW HIGH 13.093
2 47 M LOW HIGH 10.114
3 28 F NORMAL HIGH 7.798
4 61 F LOW HIGH 18.043
Y = df[['Drug']]
Y.head()
Drug
0 drugY
1 drugC
2 drugC
3 drugX
4 drugY
Create Numeric Variables

We need to get numeric variables for X as sklearn does not support string categorization (according to the guy in the course anyway)

from sklearn import preprocessing
X_arr = np.array(X)

encoder = preprocessing.LabelEncoder()
encoder.fit(['F','M'])
X_arr[:,1] = encoder.transform(X_arr[:,1])

encoder.fit(['LOW','NORMAL','HIGH'])
X_arr[:,2] = encoder.transform(X_arr[:,2])

encoder.fit(['NORMAL','HIGH'])
X_arr[:,3] = encoder.transform(X_arr[:,3])

print(X_arr[0:5])
X_encoded = pd.DataFrame(data=X_arr, columns=X_headers)
X_encoded.head()
Age Sex BP Cholesterol Na_to_K
0 23 0 0 0 25.355
1 47 1 1 0 13.093
2 47 1 1 0 10.114
3 28 0 2 0 7.798
4 61 0 1 0 18.043
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y, test_size = 0.3)
print('Training: X : {}, Y : {}'.format(X_train.shape,Y_train.shape))
print('Testing: X : {}, Y : {}'.format(X_test.shape,Y_test.shape))
Decision Tree
drug_tree = DecisionTreeClassifier(criterion='entropy', max_depth = 4)
drug_tree
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
drug_tree.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
Prediction
Y_predicted = drug_tree.predict(X_test)
print(Y_predicted[0:5])
print(Y_test[0:5])
Evaluation
from sklearn import metrics
print('Decision Tree Accuracy: ', metrics.accuracy_score(Y_test, Y_predicted))
Visualization
!pip install pydotplus
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
dot_data = StringIO()
filename = 'drug_decision_tree.png'
feature_names = X_headers
target_names = df['Drug'].unique().tolist()
out = tree.export_graphviz(drug_tree, 
                           feature_names=feature_names, 
                           out_file=dot_data, 
                           class_names=target_names, 
                           filled=True, 
                           special_characters=True, 
                           rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
True
img = mpimg.imread(filename)
plt.figure(figsize=(100, 100))
plt.imshow(img, interpolation='nearest')
plt.show()

Logistic Regression

Logistic regression is a classification algorithm for categorical targets, based on a linear division between the classes

Logistic regression can be used for binary and multi class classification and predicts the probability of a class which is then mapped to a discrete value

Logistic regression is best suited to

  • Binary Classification
  • If you need probabilistic results
  • Linear decision boundary

\theta_0+\theta_1x_1+\theta_2x_2>0

  • If you need to understand the impact of a feature

A logistic regression can calculate

\hat y=P(y=1|x)

Logistic vs Linear Regression

We can use linear regression with a dividing line to say whether or not a specific case belongs to a specific class, where we define a threshold value which acts as the boundary for the target class

The problem with this method is that we only get a specific binary outcome, and not any information as to what the probability of that outcome is. Logistic regression helps us with this by making use of a sigmoid to smooth out the classification boundary; the sigmoid function can be seen below

\sigma(\theta^TX)=\frac{1}{1+e^{-\theta^Tx}}

import numpy as np
from math import exp
import matplotlib.pyplot as plt

x = np.array(range(-100,102,2))/10
sigmoid = 1/(1+np.exp(-1*x))
step = []
for i in range(len(x)):
    step.append(1 if x[i] >= 0 else 0)

plt.plot(x,step, label='Step' )
plt.plot(x,sigmoid, label='Sigmoid')
# plt.xlim(-10,10)
plt.ylim(-0.1,1.1)
plt.xlabel('$x$')
plt.ylabel('$\sigma(x)$')
plt.legend()
plt.show()

Based on the above we can see that depending on the value of x the output tends towards 0 or 1, but never reaches either exactly

Algorithm

  1. Initialize θ
  2. Calculate ŷ = σ(θ^T X) for a given X
  3. Compare Y and Ŷ and record the error, defined by a cost function J(θ)
  4. Change θ to reduce the cost
  5. Go to 2

We can use different ways to change θ, such as gradient descent
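
A minimal added sketch of steps 2 to 4 using plain NumPy gradient descent on made-up data; sklearn's LogisticRegression (used in the lab below) handles all of this internally

import numpy as np

# made-up data: a bias column plus one feature, with binary labels
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)  # step 1: initialize theta
learning_rate = 0.1

for _ in range(1000):
    y_hat = 1 / (1 + np.exp(-X @ theta))   # step 2: sigmoid(theta^T x)
    gradient = X.T @ (y_hat - y) / len(y)  # steps 3-4: gradient of the cost
    theta -= learning_rate * gradient      # move theta to reduce the cost
print(theta)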

Lab

Import Libraries
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
import matplotlib.pyplot as plt
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv')
df.head()
tenure age address income ed employ equip callcard wireless longmon ... pager internet callwait confer ebill loglong logtoll lninc custcat churn
0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 4.40 ... 1.0 0.0 1.0 1.0 0.0 1.482 3.033 4.913 4.0 1.0
1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 9.45 ... 0.0 0.0 0.0 0.0 0.0 2.246 3.240 3.497 1.0 1.0
2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 6.30 ... 0.0 0.0 0.0 1.0 0.0 1.841 3.240 3.401 3.0 0.0
3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 6.05 ... 1.0 1.0 1.0 1.0 1.0 1.800 3.807 4.331 4.0 0.0
4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 7.10 ... 0.0 0.0 1.0 1.0 0.0 1.960 3.091 4.382 3.0 0.0

5 rows × 28 columns

Preprocessing
df = df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',   'callcard', 'wireless','churn']]
df[['churn']] = df[['churn']].astype('int')
df.head()
tenure age address income ed employ equip callcard wireless churn
0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 1
1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 1
2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 0
3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 0
4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 0
Define X and Y
X = np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
X[0:5]
array([[  11.,   33.,    7.,  136.,    5.,    5.,    0.],
       [  33.,   33.,   12.,   33.,    2.,    0.,    0.],
       [  23.,   30.,    9.,   30.,    1.,    2.,    0.],
       [  38.,   35.,    5.,   76.,    2.,   10.,    1.],
       [   7.,   35.,   14.,   80.,    2.,   15.,    0.]])
Y = np.asarray(df['churn'])
Y[0:5]
array([1, 1, 0, 0, 0])
Normalize Data
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-1.13518441, -0.62595491, -0.4588971 ,  0.4751423 ,  1.6961288 ,
        -0.58477841, -0.85972695],
       [-0.11604313, -0.62595491,  0.03454064, -0.32886061, -0.6433592 ,
        -1.14437497, -0.85972695],
       [-0.57928917, -0.85594447, -0.261522  , -0.35227817, -1.42318853,
        -0.92053635, -0.85972695],
       [ 0.11557989, -0.47262854, -0.65627219,  0.00679109, -0.6433592 ,
        -0.02518185,  1.16316   ],
       [-1.32048283, -0.47262854,  0.23191574,  0.03801451, -0.6433592 ,
         0.53441472, -0.85972695]])
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
print('Train: ', X_train.shape, Y_train.shape)
print('Test: ', X_test.shape, Y_test.shape)
Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, Y_train)
lr
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Predict
Y_hat = lr.predict(X_test)
Y_hat_prob = lr.predict_proba(X_test)
Y_hat_prob[0:5]
array([[ 0.54132919,  0.45867081],
       [ 0.60593357,  0.39406643],
       [ 0.56277713,  0.43722287],
       [ 0.63432489,  0.36567511],
       [ 0.56431839,  0.43568161]])
Evaluation
Jaccard Index
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(Y_test, Y_hat)
0.75
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(Y_test, Y_hat, labels=[1,0]))
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False,  title='Confusion matrix')
print(classification_report(Y_test, Y_hat))
from sklearn.metrics import log_loss
log_loss(Y_test, Y_hat_prob)
0.54903192026736869
Using Different Model Parameters
lr2 = LogisticRegression(C=10, solver='sag').fit(X_train,Y_train)
Y_hat_prob2 = lr2.predict_proba(X_test)
print ("LogLoss: : %.2f" % log_loss(Y_test, Y_hat_prob2))

Support Vector Machine

SVM is a supervised algorithm that classifies data by finding a separator

  1. Map data to higher-dimensional Feature Space
  2. Find a separating hyperplane in higher dimensional space

Data Transformation

Mapping data into a higher dimensional space is known as kernelling and can be done with different kernel functions such as

  • Linear
  • Polynomial
  • RBF
  • Sigmoid

The best hyperplane is the one that results in the largest margin possible between the hyperplane and our closest samples; the samples closest to the hyperplane are known as support vectors

Advantages and Disadvantages

  • Advantages
    • Accurate in high dimensional spaces
    • Memory efficient
  • Disadvantages
    • Prone to overfitting
    • No probability estimation
    • Not suited to very large datasets

Applications

  • Image Recognition
  • Text mining/categorization
    • Spam detection
    • Sentiment analysis
  • Regression
  • Outlier detection
  • Clustering

Lab

Import Packages
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Import Data

The data is from the UCI Machine Learning Archive, the fields are as follows

Field name Description
ID Sample identifier
Clump Clump thickness
UnifSize Uniformity of cell size
UnifShape Uniformity of cell shape
MargAdh Marginal adhesion
SingEpiSize Single epithelial cell size
BareNuc Bare nuclei
BlandChrom Bland chromatin
NormNucl Normal nucleoli
Mit Mitoses
Class Benign or malignant
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv')
df.head()
ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc BlandChrom NormNucl Mit Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
Visualization

The Class field contains the diagnosis where 2 means benign, and 4 means malignant

ax = df[df['Class'] == 4][0:50].plot(kind='scatter', 
                                     x='Clump', 
                                     y='UnifSize', 
                                     color='DarkBlue', 
                                     label='malignant');
df[df['Class'] == 2][0:50].plot(kind='scatter', 
                                x='Clump', 
                                y='UnifSize', 
                                color='Yellow', 
                                label='benign', 
                                ax=ax);
plt.show()
Preprocessing Data
print(df.dtypes)
df = df[pd.to_numeric(df['BareNuc'].apply(lambda x: x.isnumeric()))]
df['BareNuc'] = df['BareNuc'].astype('int')
df.dtypes
ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int64
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
Break into X and Y
X = np.asarray(df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']])
X[0:5]
array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]])
Y = np.asarray(df['Class'])
Y[0:5]
array([2, 2, 2, 2, 2])
Train/Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.2, 
                                                    random_state=4)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Test set:', X_test.shape,  Y_test.shape)
Modeling
from sklearn import svm
clf = svm.SVC(gamma='auto', kernel='rbf')
clf.fit(X_train, Y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Y_hat = clf.predict(X_test)
Y_hat[0:5]
array([2, 4, 2, 4, 2])
Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test, Y_hat, labels=[2,4])
np.set_printoptions(precision=2)

print (classification_report(Y_test, Y_hat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],
                      normalize= False,  title='Confusion matrix')
from sklearn.metrics import f1_score
print('F1 Score: ', f1_score(Y_test, Y_hat, average='weighted'))
from sklearn.metrics import jaccard_similarity_score
print('Jaccard Index: ', jaccard_similarity_score(Y_test, Y_hat))
Using an Alternative Kernel
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, Y_train) 
Y_hat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(Y_test, Y_hat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(Y_test, Y_hat2))

Clustering

Clustering is an unsupervised grouping of data in which similar datapoints are grouped together

The difference between clustering and classification is that clustering does not specify what the groupings should be

Uses of Clustering

  • Exploration of data
  • Summary Generation
  • Outlier Detection
  • Finding Duplicates
  • Data Pre-Processing

Clustering Algorithms

  • Partitioned Based
    • Efficient
  • Hierarchical
    • Produces trees of clusters
  • Density based
    • Produces arbitrary shaped clusters

K-Means

  • Partitioning Clustering
  • Divides data into K non-overlapping subsets

K-Means tries to minimize intra-cluster distances, and maximize inter-cluster distances

Distance

We can define the distance simply as the euclidean distance, typically normalizing the values so that our distances are not affected more by one value than another

Other distance formulas can be used depending on our understanding of the data as appropriate

Algorithm

  1. Determine K and initialize centroids randomly
  2. Measure distance from centroids to each datapoint
  3. Assign each point to closest centroid
  4. New centroids are at the mean of the points in its cluster
  5. Go to 2 if not converged

K-Means may not converge to a global optimum, but only a local one, and is somewhat dependent on the initial choice of centroids in 1

Accuracy

Average distance between datapoints within a cluster is a measure of error

Choice of K

We can use the elbow method in which we look at the distance of the datapoints to their centroid versus the K value, and select the one at which we notice a sharp change in the distance gradient
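
A minimal added sketch of the elbow method using KMeans' inertia_ attribute (the within-cluster sum of squared distances) on made-up blob data; the lab below applies KMeans to the customer data instead

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# made-up data with an obvious cluster structure
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the sum of squared distances of points to their closest centroid
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(1, 11)]

plt.plot(range(1, 11), inertias, 'o-')
plt.xlabel('K')
plt.ylabel('Within-cluster sum of squared distances')
plt.show()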

Lab

Import Packages
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
# from sklearn.datasets.samples_generator import make_blobs 
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv')
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted Address DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 NBA001 6.3
1 2 47 1 26 100 4.582 8.218 0.0 NBA021 12.8
2 3 33 2 10 57 6.111 5.802 1.0 NBA013 20.9
3 4 29 2 4 19 0.681 0.516 0.0 NBA009 6.3
4 5 47 1 31 253 9.308 8.908 0.0 NBA008 7.2
df = df.drop('Address', axis=1)
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 6.3
1 2 47 1 26 100 4.582 8.218 0.0 12.8
2 3 33 2 10 57 6.111 5.802 1.0 20.9
3 4 29 2 4 19 0.681 0.516 0.0 6.3
4 5 47 1 31 253 9.308 8.908 0.0 7.2
Normalize the Data
from sklearn.preprocessing import StandardScaler
X = np.asarray(df.values[:,1:])
X = np.nan_to_num(X)
X
array([[ 41.  ,   2.  ,   6.  , ...,   1.07,   0.  ,   6.3 ],
       [ 47.  ,   1.  ,  26.  , ...,   8.22,   0.  ,  12.8 ],
       [ 33.  ,   2.  ,  10.  , ...,   5.8 ,   1.  ,  20.9 ],
       ..., 
       [ 25.  ,   4.  ,   0.  , ...,   3.21,   1.  ,  33.4 ],
       [ 32.  ,   1.  ,  12.  , ...,   0.7 ,   0.  ,   2.9 ],
       [ 52.  ,   1.  ,  16.  , ...,   3.64,   0.  ,   8.6 ]])
X_norm = StandardScaler().fit_transform(X)
X_norm
array([[ 0.74,  0.31, -0.38, ..., -0.59, -0.52, -0.58],
       [ 1.49, -0.77,  2.57, ...,  1.51, -0.52,  0.39],
       [-0.25,  0.31,  0.21, ...,  0.8 ,  1.91,  1.6 ],
       ..., 
       [-1.25,  2.47, -1.26, ...,  0.04,  1.91,  3.46],
       [-0.38, -0.77,  0.51, ..., -0.7 , -0.52, -1.08],
       [ 2.11, -0.77,  1.1 , ...,  0.16, -0.52, -0.23]])

Modeling

k = 3
k_means = KMeans(init='k-means++',
                n_clusters=k,
                n_init=12)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=12, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
labels = k_means.labels_
print(labels[:20], labels.shape)
df['Cluster'] = labels
df.head()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio Cluster
0 1 41 2 6 19 0.124 1.073 0.0 6.3 1
1 2 47 1 26 100 4.582 8.218 0.0 12.8 0
2 3 33 2 10 57 6.111 5.802 1.0 20.9 1
3 4 29 2 4 19 0.681 0.516 0.0 6.3 1
4 5 47 1 31 253 9.308 8.908 0.0 7.2 2
df.groupby('Cluster').mean()
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
Cluster
0 402.295082 41.333333 1.956284 15.256831 83.928962 3.103639 5.765279 0.171233 10.724590
1 432.468413 32.964561 1.614792 6.374422 31.164869 1.032541 2.104133 0.285185 10.094761
2 410.166667 45.388889 2.666667 19.555556 227.166667 5.678444 10.907167 0.285714 7.322222
Visualization
%matplotlib inline
area = np.pi*(X[:,1])**2
plt.figure()
plt.title('Income vs Age')
plt.scatter(X[:,0], X[:,3], s=area, c=labels, alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
from mpl_toolkits.mplot3d import Axes3D
plt.clf()
ax = Axes3D(plt.figure(figsize=(8,6)), rect=[0,0,0.95,1], elev=48, azim=134)
plt.cla()
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')

ax.scatter(X[:,1], X[:,0], X[:,3], c=labels)

plt.figure()
plt.show()

Hierarchical Clustering

Two types

  • Divisive - Top Down
  • Agglomerative - Bottom Up

Agglomerative clustering works by combining clusters based on the distance between them; this is the most popular method for HC

Agglomerative Algorithm

  1. Create n clusters, one for each datapoint
  2. Compute the proximity matrix
  3. Repeat Until a single cluster remains
    1. Merge the two closest clusters
    2. Update the proximity matrix

We can use any distance function we want to; there are multiple algorithms for measuring the distance between clusters, such as

  • Single linkage clustering
  • Complete linkage clustering
  • Average linkage clustering
  • Centroid linkage clustering

Advantages and Disadvantages

  • Advantages
    • Number of clusters does not need to be specified
    • Easy to implement
    • Dendrogram can be easily understood
  • Disadvantages
    • Long runtimes
    • Cannot undo previous steps
    • Difficult to identify the number of clusters on the dendrogram

Lab

Import Packages
import numpy as np 
import pandas as pd
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets.samples_generator import make_blobs
Import Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv')
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition
0 Acura Integra 16.919 16.360 0.000 21.500 1.800 140.000 101.200 67.300 172.400 2.639 13.200 28.000 2.828 0.0
1 Acura TL 39.384 19.875 0.000 28.400 3.200 225.000 108.100 70.300 192.900 3.517 17.200 25.000 3.673 0.0
2 Acura CL 14.114 18.225 0.000 $null$ 3.200 225.000 106.900 70.600 192.000 3.470 17.200 26.000 2.647 0.0
3 Acura RL 8.588 29.725 0.000 42.000 3.500 210.000 114.600 71.400 196.600 3.850 18.000 22.000 2.150 0.0
4 Audi A4 20.397 22.255 0.000 23.990 1.800 150.000 102.600 68.200 178.000 2.998 16.400 27.000 3.015 0.0
Clean Data
print ("Shape of dataset before cleaning: ", df.size)

df[[ 'sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']] = df[['sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
df = df.reset_index(drop=True)

print ("Shape of dataset after cleaning: ", df.size)
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition
0 Acura Integra 16.919 16.360 0.0 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 2.828 0.0
1 Acura TL 39.384 19.875 0.0 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 3.673 0.0
2 Acura RL 8.588 29.725 0.0 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 2.150 0.0
3 Audi A4 20.397 22.255 0.0 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 3.015 0.0
4 Audi A6 18.780 23.555 0.0 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 2.933 0.0
Selecting Features
X = df[['engine_s','horsepow', 'wheelbas', 
        'width', 'length', 'curb_wgt', 
        'fuel_cap', 'mpg']].values
print(X[:5])
Normalization
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm[:5])
Clustering with Scipy
import scipy as sp

entries = X_norm.shape[0]
D = np.zeros([entries, entries])
for i in range(entries):
    for j in range(entries):
        D[i,j] = sp.spatial.distance.euclidean(X_norm[i], X_norm[j])
print(D)

We have different distance formulas such as

  • single
  • complete
  • average
  • weighted
  • centroid
import pylab
import scipy.cluster.hierarchy
Z = hierarchy.linkage(D, 'complete')
print(Z[:5])
from scipy.cluster.hierarchy import fcluster
max_d =  3
clusters = fcluster(Z, max_d, criterion='distance')
print(clusters)
max_d = 5
clusters = fcluster(Z, max_d, criterion='maxclust')
print(clusters)
fig = pylab.figure(figsize=(18,50))
def llf(id):
    return '[%s %s %s]' % (df['manufact'][id], 
                           df['model'][id], 
                           int(float(df['type'][id])))

dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, 
                             leaf_rotation=0, 
                             leaf_font_size=12, 
                             orientation='right')
Clustering with SciKit Learn
D = distance_matrix(X, X)
print(D)
agglom = AgglomerativeClustering(n_clusters=6, linkage='complete')
agglom.fit(X)
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='complete', memory=None,
            n_clusters=6, pooling_func=<function mean at 0x7f2ad42c3730>)
df['cluster_'] = agglom.labels_
df.head()
manufact model sales resale type price engine_s horsepow wheelbas width length curb_wgt fuel_cap mpg lnsales partition cluster_
0 Acura Integra 16.919 16.360 0.0 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 2.828 0.0 2
1 Acura TL 39.384 19.875 0.0 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 3.673 0.0 0
2 Acura RL 8.588 29.725 0.0 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 2.150 0.0 0
3 Audi A4 20.397 22.255 0.0 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 3.015 0.0 3
4 Audi A6 18.780 23.555 0.0 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 2.933 0.0 0
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0,1,n_clusters))
cluster_labels = list(range(0,n_clusters))

plt.figure(figsize=(16,14))

for color, label in zip(colors, cluster_labels):
    subset = df[df.cluster_ == label]
    for i in subset.index:
            plt.text(subset.horsepow[i], 
                     subset.mpg[i],
                     str(subset['model'][i]), 
                     rotation=25) 
            
    plt.scatter(subset.horsepow, subset.mpg, 
                s= subset.price*10, c=color, 
                label='cluster'+str(label),alpha=0.5)

plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
plt.show()
df.groupby(['cluster_','type'])['cluster_'].count()
cluster_  type
0         0.0     29
          1.0     14
1         0.0     10
2         0.0     26
          1.0      4
3         0.0     21
          1.0     11
4         0.0      1
5         0.0      1
Name: cluster_, dtype: int64
df_mean = df.groupby(['cluster_','type'])['horsepow','engine_s','mpg','price'].mean()
df_mean
horsepow engine_s mpg price
cluster_ type
0 0.0 210.551724 3.420690 23.648276 30.449310
1.0 206.428571 4.064286 18.500000 28.727714
1 0.0 294.700000 4.380000 21.600000 57.864000
2 0.0 121.230769 1.934615 29.115385 14.720385
1.0 133.750000 2.225000 22.750000 15.856500
3 0.0 160.857143 2.680952 24.857143 19.822048
1.0 154.272727 2.936364 20.909091 21.199364
4 0.0 55.000000 1.000000 45.000000 9.235000
5 0.0 450.000000 8.000000 16.000000 69.725000
plt.figure(figsize=(16,10))
for color, label in zip(colors, cluster_labels):
    subset = df_mean.loc[(label,),]
    for i in subset.index:
        plt.text(subset.loc[i][0]+5, subset.loc[i][2], 'type='+str(int(i)) + ', price='+str(int(subset.loc[i][3]))+'k')
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, c=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')

DBSCAN

Density-based clustering locates regions of high density and separates outliers; it is able to find arbitrarily shaped clusters while ignoring noise

  • Density Based Spatial Clustering of Applications with Noise
    • Common clustering algorithm
    • Based on object density
  • Radius of neighborhood
  • Min number of neighbors

Different types of points

  • Core
    • Has at least M neighbors within R
  • Border
    • Has Core point within R, less than M in R
  • Outlier
    • Not Core, or within R of Core

DBSCAN visits each point and identifies its type, and then groups points based on this
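
A small added sketch on made-up blob data showing how sklearn's DBSCAN exposes these point types: core points via core_sample_indices_ and outliers via the label -1

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# made-up data: three dense blobs with some spread between them
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.9, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

db = DBSCAN(eps=0.3, min_samples=10).fit(X_demo)  # R = eps, M = min_samples

core_mask = np.zeros_like(db.labels_, dtype=bool)
core_mask[db.core_sample_indices_] = True

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print('clusters:', n_clusters)
print('core points:', core_mask.sum(), 'noise points:', np.sum(db.labels_ == -1))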

Lab

Import Packages
import numpy as np 
from sklearn.cluster import DBSCAN 
from sklearn.datasets.samples_generator import make_blobs 
from sklearn.preprocessing import StandardScaler 
import matplotlib.pyplot as plt
import pandas as pd
About the Data
Environment Canada

Monthly Values for July - 2015

Name in the table Meaning
Stn_Name Station Name
Lat Latitude (North+, degrees)
Long Longitude (West - , degrees)
Prov Province
Tm Mean Temperature (°C)
DwTm Days without Valid Mean Temperature
D Mean Temperature difference from Normal (1981-2010) (°C)
Tx Highest Monthly Maximum Temperature (°C)
DwTx Days without Valid Maximum Temperature
Tn Lowest Monthly Minimum Temperature (°C)
DwTn Days without Valid Minimum Temperature
S Snowfall (cm)
DwS Days without Valid Snowfall
S%N Percent of Normal (1981-2010) Snowfall
P Total Precipitation (mm)
DwP Days without Valid Precipitation
P%N Percent of Normal (1981-2010) Precipitation
S_G Snow on the ground at the end of the month (cm)
Pd Number of days with Precipitation 1.0 mm or more
BS Bright Sunshine (hours)
DwBS Days without Valid Bright Sunshine
BS% Percent of Normal (1981-2010) Bright Sunshine
HDD Degree Days below 18 °C
CDD Degree Days above 18 °C
Stn_No Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).
NA Not Available
Import the Data
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/weather-stations20140101-20141231.csv')
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
3 DISCOVERY ISLAND 48.425 -123.226 BC NaN NaN NaN 12.5 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1012475
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573

5 rows × 25 columns

Clean Data
df = df[pd.notnull(df['Tm'])]
df.reset_index(drop=True)  # note: the result is not assigned back, so the original index (with gaps) is kept, as seen below
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573
5 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... 8.0 NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710

5 rows × 25 columns

# ! pip install --user git+https://github.com/matplotlib/basemap.git
# from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = (14,10)

llon=-140
ulon=-50
llat=40
ulat=65

df = df[(df['Long'] > llon) & (df['Long'] < ulon) & (df['Lat'] > llat) &(df['Lat'] < ulat)]

plt.title('Location of Sensors')
plt.scatter(list(df['Long']),list(df['Lat']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Compute DBSCAN
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
<mtrand.RandomState at 0x7f29c41733a8>
X = np.nan_to_num(df[['Lat','Long']])
X = StandardScaler().fit_transform(X)

X
array([[-0.3 , -1.17],
       [-0.33, -1.19],
       [-0.33, -1.18],
       ..., 
       [ 1.84,  1.47],
       [ 1.01,  1.65],
       [ 0.6 ,  1.28]])
db = DBSCAN(eps=0.15, min_samples=10).fit(X)
db
DBSCAN(algorithm='auto', eps=0.15, leaf_size=30, metric='euclidean',
    metric_params=None, min_samples=10, n_jobs=1, p=None)
# boolean mask marking which points DBSCAN identified as Core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

core_samples_mask
array([ True,  True,  True, ..., False, False, False], dtype=bool)
df['Clus_db'] = db.labels_
df.head()
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... P%N S_G Pd BS DwBS BS% HDD CDD Stn_No Clus_db
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500 0
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040 0
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055 0
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573 0
5 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710 0

5 rows × 26 columns

df[['Stn_Name','Tx','Tm','Clus_db']][1000:1500:45]
Stn_Name Tx Tm Clus_db
1138 HEATH POINT -1.0 -13.3 -1
1185 LA GRANDE RIVIERE A -11.6 -28.4 -1
1234 BRIER ISLAND 4.4 -6.3 3
1286 BRANCH 8.0 -3.4 4
1332 GOOSE A -4.2 -22.0 -1
Cluster Visualization
print(df['Clus_db'].max(), df['Clus_db'].min())
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
colours = ['#D3D3D3','blue','red','green','purple','yellow','deepskyblue']
le.fit(colours)
le.classes_
array(['#D3D3D3', 'blue', 'deepskyblue', 'green', 'purple', 'red', 'yellow'],
      dtype='<U11')
# le.classes_ are sorted alphabetically, so shifting the labels by +1 maps noise (-1) to '#D3D3D3' (grey)
# and clusters 0-5 to the remaining colours in alphabetical order
# le.inverse_transform([0,1,2,3,4,5,6])
df['Colours'] = le.inverse_transform(db.labels_ + 1)
df[['Stn_Name','Tx','Tm','Clus_db', 'Colours']][1000:1500:45]
Stn_Name Tx Tm Clus_db Colours
1138 HEATH POINT -1.0 -13.3 -1 #D3D3D3
1185 LA GRANDE RIVIERE A -11.6 -28.4 -1 #D3D3D3
1234 BRIER ISLAND 4.4 -6.3 3 purple
1286 BRANCH 8.0 -3.4 4 red
1332 GOOSE A -4.2 -22.0 -1 #D3D3D3
plt.title('Clusters')
plt.scatter(list(df['Long']),
            list(df['Lat']),
            c=list(df['Colours']))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

Recommender Systems

Recommender systems try to capture people's behaviour in order to predict what people may like

There are two main types

  • Content based
    • Recommends more content similar to what the user already likes
  • Collaborative filtering
    • Assumes a user may be interested in what other, similar users like

There are two types of implementations

  • Memory based
    • Uses the entire user-item dataset to generate a recommendation
  • Model based
    • Develops a model of users in an attempt to learn their preferences

Content Based

Content based systems build a profile of the user from the content they interact with, and recommend new content that is similar to that profile

Lab

Download the Data

The dataset being used is a movie dataset from GroupLens

#only run once

# !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
# print('unziping ...')
# !unzip -o -j moviedataset.zip 
Import Packages
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
Import Data
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
ratings_df.head()
userId movieId rating timestamp
0 1 169 2.5 1204927694
1 1 2471 3.0 1204927438
2 1 48516 5.0 1204927435
3 2 2571 3.5 1436165433
4 2 109487 4.0 1436165496
Preprocessing
# pull the four-digit year out of the title (e.g. 'Toy Story (1995)'), then remove it from the title itself
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))',
                                                expand=False)
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)',
                                               expand=False)

movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

movies_df.head()
movieId title genres year
0 1 Toy Story Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men Comedy|Romance 1995
3 4 Waiting to Exhale Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II Comedy 1995
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()
movieId title genres year
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995
1 2 Jumanji [Adventure, Children, Fantasy] 1995
2 3 Grumpier Old Men [Comedy, Romance] 1995
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995
4 5 Father of the Bride Part II [Comedy] 1995
genres_df = movies_df.copy()

# one-hot encode the genres: one column per genre, 1 if the movie has it, 0 otherwise
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        genres_df.at[index, genre] = 1

genres_df = genres_df.fillna(0)
genres_df.head()
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Jumanji [Adventure, Children, Fantasy] 1995 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Grumpier Old Men [Comedy, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Father of the Bride Part II [Comedy] 1995 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns

ratings_df.head()
userId movieId rating timestamp
0 1 169 2.5 1204927694
1 1 2471 3.0 1204927438
2 1 48516 5.0 1204927435
3 2 2571 3.5 1436165433
4 2 109487 4.0 1436165496
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()
userId movieId rating
0 1 169 2.5
1 1 2471 3.0
2 1 48516 5.0
3 2 2571 3.5
4 2 109487 4.0
User Interests
user_movies = pd.DataFrame([
                            {'title':'Breakfast Club, The', 'rating':5},
                            {'title':'Toy Story', 'rating':3.5},
                            {'title':'Jumanji', 'rating':2},
                            {'title':"Pulp Fiction", 'rating':5},
                            {'title':'Akira', 'rating':4.5}
                           ])
user_movies
rating title
0 5.0 Breakfast Club, The
1 3.5 Toy Story
2 2.0 Jumanji
3 5.0 Pulp Fiction
4 4.5 Akira
movie_ids = genres_df[genres_df['title'].isin(user_movies['title'].tolist())]
user_movies = pd.merge(movie_ids, user_movies)
user_genres = user_movies.drop('genres', axis=1).drop('year', axis=1)
user_genres
movieId title Adventure Animation Children Comedy Fantasy Romance Drama Action ... Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed) rating
0 1 Toy Story 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.5
1 2 Jumanji 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
2 296 Pulp Fiction 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0
3 1274 Akira 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.5
4 1968 Breakfast Club, The 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0

5 rows × 23 columns

Since we only need the genre columns, we drop the title, movieId, and rating columns

user_genres.drop('title', axis=1, inplace=True)
user_genres.drop('movieId', axis=1, inplace=True)
user_genres.drop('rating', axis=1, inplace=True)
user_genres
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Next we take the dot product of the transposed genre table with the ratings column, giving a weighted genre profile for the user

user_profile = user_genres.transpose().dot(user_movies['rating'])
user_profile
Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64

We can then compare this profile against the genre table of all our movies and build a recommendation score for each one
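
In symbols (a restatement of the computation done below, using my own notation): if G_{m,g} is 1 when movie m is tagged with genre g (the all_genres table) and p_g is the user's weighted genre profile from above, the predicted score for movie m is the normalized weighted average

score(m)=\frac{\Sigma_g G_{m,g}p_g}{\Sigma_g p_g}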

all_genres = genres_df.set_index(genres_df['movieId'])
all_genres.head()
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
movieId
1 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 Jumanji [Adventure, Children, Fantasy] 1995 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 Grumpier Old Men [Comedy, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4 Waiting to Exhale [Comedy, Drama, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 5 Father of the Bride Part II [Comedy] 1995 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns

all_genres.drop(['movieId','title','genres','year'], axis=1, inplace=True)
all_genres.head()
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
movieId
1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
user_recommendation = all_genres.dot(user_profile)/user_profile.sum()
user_recommendation.head()
movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
user_recommendation.sort_values(ascending=False, inplace=True)
user_recommendation.head(10)
movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
117646    0.678322
64645     0.671329
81132     0.671329
122787    0.671329
2987      0.664336
dtype: float64
Top Recommendations for User
movies_df.loc[movies_df['movieId'].isin(user_recommendation.head().keys())]
movieId title genres year
4923 5018 Motorama [Adventure, Comedy, Crime, Drama, Fantasy, Mys... 1991
6793 6902 Interstate 60 [Adventure, Comedy, Drama, Fantasy, Mystery, S... 2002
8605 26093 Wonderful World of the Brothers Grimm, The [Adventure, Animation, Children, Comedy, Drama... 1962
9296 27344 Revolutionary Girl Utena: Adolescence of Utena... [Action, Adventure, Animation, Comedy, Drama, ... 1999
33509 148775 Wizards of Waverly Place: The Movie [Adventure, Children, Comedy, Drama, Fantasy, ... 2009

Collaborative Filtering

Collaborative filtering works by recommending content based on other similar users/items

There are two types

  • User based
    • Recommends items liked by users with similar rating behaviour (the user's neighborhood)
  • Item based
    • Recommends items similar to those the user has already rated highly

Lab

Note that this lab uses the same movie data as before. It uses the Pearson Correlation Coefficient to identify users who rate movies similarly, based on the ratings table, and the full notebook can be found in 5-2-Collaborative-Filtering
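
As a rough sketch of that idea (my own minimal illustration, not the course's notebook; it assumes the ratings_df loaded earlier, an arbitrary cut-off of 200 users to keep it light, and user 1 as a hypothetical target), the Pearson correlation between two users is computed over the movies they have both rated:

from math import sqrt

def pearson(u_ratings, v_ratings):
    # Pearson correlation between two {movieId: rating} dicts, over the movies both users rated
    common = set(u_ratings) & set(v_ratings)
    n = len(common)
    if n == 0:
        return 0
    u = [u_ratings[m] for m in common]
    v = [v_ratings[m] for m in common]
    num = n * sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v)
    den = sqrt(n * sum(a * a for a in u) - sum(u) ** 2) * sqrt(n * sum(b * b for b in v) - sum(v) ** 2)
    return num / den if den != 0 else 0

# keep a small slice of users so the example stays light (arbitrary cut-off)
sample = ratings_df[ratings_df['userId'] <= 200]
by_user = {uid: dict(zip(g['movieId'], g['rating'])) for uid, g in sample.groupby('userId')}

# score every other user against user 1 (a hypothetical target) and keep the most similar ones;
# their ratings, weighted by similarity, can then be used to predict scores for unseen movies
target = by_user[1]
similarities = {uid: pearson(target, ratings) for uid, ratings in by_user.items() if uid != 1}
most_similar = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(most_similar)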