Data Science Methodology

Updated: 03 September 2023

Based on this Cognitive Class Course

The Methodology

The data science methodology described here is as outlined by John Rollins of IBM

The Methodology can be seen in the following steps

Data Science Methodology (Cognitive Class)

The Questions

The Data Science methodology aims to answer ten main questions

From Problem to Approach

  1. What is the problem you are trying to solve?
  2. How can you use data to answer the question?

Working with the Data

  1. What data do you need to answer the question?
  2. Where is the data coming from and how will you get it?
  3. Is the data that you collected representative of the problem to be solved?
  4. What additional work is required to manipulate and work with the data?

Deriving the Answer

  1. In what way can the data be visualized to get to the answer that is required?
  2. Does the model used really answer the initial question or does it need to be adjusted?
  3. Can you put the model into practice?
  4. Can you get constructive feedback into answering the question?

Problem and Approach

Business Understanding

What is the problem you are trying to solve?

The first step in the methodology involves seeking any needed clarification in order to identify the problem we are trying to solve, as this drives the data we use and the analytical approach we will apply

It is important to seek clarification early on; otherwise we can waste time and resources moving in the wrong direction

In order to understand a question, it is important to understand the goal of the person asking the question

Based on this we will break down objectives and prioritize them

Analytic Approach

How can you use data to answer the question?

The second step in the methodology is selecting the correct approach, which depends on the specific problem being addressed. This points back to the purpose of business understanding and helps us identify which methods we should use to address the problem

Approach to be Used

When we have a strong understanding of the problem, we can pick an analytical approach to be used

  • Descriptive
    • Current Status
  • Diagnostic (Statistical Analysis)
    • What happened?
    • Why is this happening?
  • Predictive (Forecasting)
    • What if these trends continue?
    • What will happen next?
  • Prescriptive
    • How do we solve it?

Question Types

We have a few different types of questions that can direct our modelling

  • Question is to determine probabilities of an action
    • Predictive Model
  • Question is to show relationships
    • Descriptive Model
  • Question requires a binary answer
    • Classification Model

Machine Learning

Machine learning allows us to identify relationships and trends that cannot otherwise be established

Decision Trees

Decision trees are a machine learning algorithm that allows us to classify data while also giving us some insight into how the classification is made

It makes use of a tree structure with recursive partitioning to classify data; predictiveness is based on the decrease in entropy at each split, i.e. the gain in information or reduction in impurity

A decision tree for classifying data can result in leaf nodes of varying purity, as seen below, which will provide us with different amounts of information

Pure Decision Tree (Cognitive Class)

Impure Decision Tree (Cognitive Class)
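
As a rough illustration of the splitting criterion, the sketch below (on a made-up toy split, not the course data) computes the entropy of a node and the information gain of a candidate split; the tree chooses the split that maximises this gain

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# toy example: 10 recipes labelled japanese/other, candidate split on "contains rice"
parent = ["japanese"] * 5 + ["other"] * 5
left   = ["japanese"] * 4 + ["other"] * 1   # rice == yes
right  = ["japanese"] * 1 + ["other"] * 4   # rice == no

# weighted entropy of the children after the split
weighted_child_entropy = (
    len(left) / len(parent) * entropy(left)
    + len(right) / len(parent) * entropy(right)
)

# information gain = decrease in entropy caused by the split
information_gain = entropy(parent) - weighted_child_entropy
print(round(information_gain, 3))
```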

Some of the characteristics of decision trees are summarized below

| Pros | Cons |
| --- | --- |
| Easy to interpret | Easy to over or underfit the model |
| Can handle numeric or categorical features | Cannot model feature interaction |
| Can handle missing data | Large trees can be difficult to interpret |
| Uses only the most important features | |
| Can be used on very large or small datasets | |

Labs

The Lab notebooks have been added in the labs folder, and are released under the MIT License

The Lab for this section is 1-From-Problem-to-Approach.ipynb

Requirements and Collection

Data Requirements

What data do you need to answer the question?

We need to understand what data is required, how to collect it, and how to transform it to address the problem at hand

It is necessary to identify the data requirements for the initial data collection

We typically make use of the following steps

  1. Define and select the set of data needed
  2. Define the content, format, and representation of the data
  3. Look ahead and transform the data into the form that will be most suitable to work with

Data Collection

Where is the data coming from and how will you get it?

After the initial data collection has been performed, we look at the data and verify that we have all the data that we need; the data requirements are then revisited in order to define what has not been met or needs to be changed

We then make use of descriptive statistics and visualizations to assess the quality and other aspects of the data, and identify how we can fill in any gaps

Collecting data requires that we know the data source and where to find the required data

Labs

The lab documents for this section are in both Python and R, and can be found in the labs folder as 2-Requirements-to-Collection-py.ipynb and 2-Requirements-to-Collection-R.ipynb

These labs will simply read in a dataset from a remote source as a CSV and display it

Python

In Python we will use Pandas to read data as DataFrames

We can use Pandas to read the data into a DataFrame

```python
import pandas as pd # download library to read data into dataframe
pd.set_option('display.max_columns', None)

recipes = pd.read_csv("https://ibm.box.com/shared/static/5wah9atr5o1akuuavl2z9tkjzdinr1lv.csv")

print("Data read into dataframe!") # takes about 30 seconds
```

Thereafter we can view the dataframe by looking at the first few rows, as well as the dimensions with

```python
recipes.head()
recipes.shape
```

R

We do the same as the above in R as follows

First we download the file from the remote resource

```r
# click here and press Shift + Enter
download.file("https://ibm.box.com/shared/static/5wah9atr5o1akuuavl2z9tkjzdinr1lv.csv",
              destfile = "/resources/data/recipes.csv", quiet = TRUE)

print("Done!") # takes about 30 seconds
```

Thereafter we can read this into a variable with

```r
recipes <- read.csv("/resources/data/recipes.csv") # takes 10 sec
```

We can then see the first few rows of data as well as the dimensions with

```r
head(recipes)
nrow(recipes)
ncol(recipes)
```

Understanding and Preparation

Data Understanding

Is the data that you collected representative of the problem to be solved?

We make use of descriptive statistics to understand the data

We run statistical analyses to learn about the data, using measures and techniques such as

  • Univariate
  • Pairwise
  • Histogram
  • Mean
  • Median
  • Min
  • Max
  • etc.

We also make use of these to understand data quality and to find problem values such as missing values and invalid or misleading values
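
As a minimal sketch of these checks (using a tiny made-up DataFrame standing in for the collected data; the columns and values are placeholders):

```python
import pandas as pd

# made-up data standing in for the collected dataset
df = pd.DataFrame({
    "cuisine": ["italian", "italian", "japanese", None, "japanese"],
    "prep_time": [30, 45, 25, 25, 1000],   # 1000 looks like an invalid value
    "servings": [4, 4, 2, 2, 2],
})

print(df.describe())                  # univariate summaries: mean, min, max, quartiles (incl. median)
print(df["cuisine"].value_counts())   # frequency table of a categorical column
print(df.isnull().sum())              # missing values per column
df.hist(figsize=(8, 4))               # histograms of the numeric columns
```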

Data Preparation

What additional work is required to manipulate and work with the data?

Data preparation involves cleansing the data by removing unwanted elements and imperfections; this can take between 70% and 90% of the project time

Transforming data in this phase is the process of turning data into something that is easier to work with

Some examples of what we need to look out for are

  • Invalid values
  • Missing data
  • Duplicates
  • Formatting

Another part of data preparation is feature engineering, which is when we use domain knowledge to create features for our predictive models

The data preparation will support the remainder of the project
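
A minimal sketch of the kinds of clean-up listed above (the column names, threshold, and engineered feature are illustrative placeholders, not from the course data):

```python
import pandas as pd

# made-up raw data exhibiting the problems listed above
df = pd.DataFrame({
    "cuisine": ["italian", "italian", " JAPANESE", "japanese", "japanese"],
    "prep_time": [30, 30, 25, None, -5],
})

df = df.drop_duplicates()                                            # duplicates
df["cuisine"] = df["cuisine"].str.strip().str.lower()                # formatting
df = df[df["prep_time"].isna() | (df["prep_time"] >= 0)]             # invalid values
df["prep_time"] = df["prep_time"].fillna(df["prep_time"].median())   # missing data

# feature engineering: derive a new column using domain knowledge
df["is_quick"] = (df["prep_time"] <= 30).astype(int)
```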

Labs

The lab documents for this section are in both Python and R, and can be found in the labs folder as 3-Understanding-to-Preparation-py.ipynb and 3-Understanding-to-Preparation-R.ipynb

These labs will continue to analyze the data that was imported from the previous lab

Python

First, we check if the ingredients exist in our dataframe

```python
import re # regular expressions, used to search the column names

ingredients = list(recipes.columns.values)

print([match.group(0) for ingredient in ingredients for match in [(re.compile(".*(rice).*")).search(ingredient)] if match])
print([match.group(0) for ingredient in ingredients for match in [(re.compile(".*(wasabi).*")).search(ingredient)] if match])
print([match.group(0) for ingredient in ingredients for match in [(re.compile(".*(soy).*")).search(ingredient)] if match])
```

Thereafter we can look at our data in order to see if there are any changes that need to be made

```python
recipes["country"].value_counts() # frequency table

# American              40150
# Mexico                 1754
# Italian                1715
# Italy                  1461
# Asian                  1176
# French                  996
# east_asian              951
# Canada                  774
# korean                  767
# Mexican                 622
# western                 450
# Southern_SoulFood       346
# India                   324
# Jewish                  320
# Spanish_Portuguese      291
```

From here the following can be seen

  1. Cuisine is labelled as country
  2. Cuisine names are not consistent, uppercase, lowercase, etc.
  3. Some cuisines are duplicates of the country name
  4. Some cuisines have very few recipes

We can take a few steps to solve these problems

First we fix the Country title to be Cuisine

```python
column_names = recipes.columns.values
column_names[0] = "cuisine"
recipes.columns = column_names

recipes
```

Then we can make all the names lowercase

```python
recipes["cuisine"] = recipes["cuisine"].str.lower()
```

Next we correct the mislabelled cuisine names

```python
recipes.loc[recipes["cuisine"] == "austria", "cuisine"] = "austrian"
recipes.loc[recipes["cuisine"] == "belgium", "cuisine"] = "belgian"
recipes.loc[recipes["cuisine"] == "china", "cuisine"] = "chinese"
recipes.loc[recipes["cuisine"] == "canada", "cuisine"] = "canadian"
recipes.loc[recipes["cuisine"] == "netherlands", "cuisine"] = "dutch"
recipes.loc[recipes["cuisine"] == "france", "cuisine"] = "french"
recipes.loc[recipes["cuisine"] == "germany", "cuisine"] = "german"
recipes.loc[recipes["cuisine"] == "india", "cuisine"] = "indian"
recipes.loc[recipes["cuisine"] == "indonesia", "cuisine"] = "indonesian"
recipes.loc[recipes["cuisine"] == "iran", "cuisine"] = "iranian"
recipes.loc[recipes["cuisine"] == "italy", "cuisine"] = "italian"
recipes.loc[recipes["cuisine"] == "japan", "cuisine"] = "japanese"
recipes.loc[recipes["cuisine"] == "israel", "cuisine"] = "jewish"
recipes.loc[recipes["cuisine"] == "korea", "cuisine"] = "korean"
recipes.loc[recipes["cuisine"] == "lebanon", "cuisine"] = "lebanese"
recipes.loc[recipes["cuisine"] == "malaysia", "cuisine"] = "malaysian"
recipes.loc[recipes["cuisine"] == "mexico", "cuisine"] = "mexican"
recipes.loc[recipes["cuisine"] == "pakistan", "cuisine"] = "pakistani"
recipes.loc[recipes["cuisine"] == "philippines", "cuisine"] = "philippine"
recipes.loc[recipes["cuisine"] == "scandinavia", "cuisine"] = "scandinavian"
recipes.loc[recipes["cuisine"] == "spain", "cuisine"] = "spanish_portuguese"
recipes.loc[recipes["cuisine"] == "portugal", "cuisine"] = "spanish_portuguese"
recipes.loc[recipes["cuisine"] == "switzerland", "cuisine"] = "swiss"
recipes.loc[recipes["cuisine"] == "thailand", "cuisine"] = "thai"
recipes.loc[recipes["cuisine"] == "turkey", "cuisine"] = "turkish"
recipes.loc[recipes["cuisine"] == "vietnam", "cuisine"] = "vietnamese"
recipes.loc[recipes["cuisine"] == "uk-and-ireland", "cuisine"] = "uk-and-irish"
recipes.loc[recipes["cuisine"] == "irish", "cuisine"] = "uk-and-irish"

recipes
```

After that we can remove the cuisines with fewer than 50 recipes

```python
import numpy as np # used to filter the cuisine index by a boolean mask

# get list of cuisines to keep
recipes_counts = recipes["cuisine"].value_counts()
cuisines_indices = recipes_counts > 50

cuisines_to_keep = list(np.array(recipes_counts.index.values)[np.array(cuisines_indices)])
```

And then view the number of rows we kept/removed

```python
rows_before = recipes.shape[0] # number of rows of original dataframe
print("Number of rows of original dataframe is {}.".format(rows_before))

recipes = recipes.loc[recipes['cuisine'].isin(cuisines_to_keep)]

rows_after = recipes.shape[0] # number of rows of processed dataframe
print("Number of rows of processed dataframe is {}.".format(rows_after))

print("{} rows removed!".format(rows_before - rows_after))
```

Next we can convert the yes/no fields to be binary

```python
recipes = recipes.replace(to_replace="Yes", value=1)
recipes = recipes.replace(to_replace="No", value=0)
```

And lastly view our data with

```python
recipes.head()
```

Next we can look for recipes that contain rice, soy sauce, wasabi, and seaweed

```python
check_recipes = recipes.loc[
    (recipes["rice"] == 1) &
    (recipes["soy_sauce"] == 1) &
    (recipes["wasabi"] == 1) &
    (recipes["seaweed"] == 1)
]

check_recipes
```

Based on this we can see that not all recipes with those ingredients are Japanese

Now we can look at the frequency of different ingredients in these recipes

```python
# sum each column
ing = recipes.iloc[:, 1:].sum(axis=0)

# define each column as a pandas series
ingredient = pd.Series(ing.index.values, index = np.arange(len(ing)))
count = pd.Series(list(ing), index = np.arange(len(ing)))

# create the dataframe
ing_df = pd.DataFrame(dict(ingredient = ingredient, count = count))
ing_df = ing_df[["ingredient", "count"]]
print(ing_df.to_string())
```

We can then sort the dataframe of ingredients in descending order

```python
# sort the ingredients in descending order of count
ing_df = ing_df.sort_values(["count"], ascending=False).reset_index(drop=True)

print(ing_df.to_string())
```

From this we can see that the most common ingredients are egg, wheat, and butter. However, we have far more American recipes than any other cuisine, so our data is skewed towards American ingredients

We can now create a profile for each cuisine in order to see a more representative recipe distribution with

```python
cuisines = recipes.groupby("cuisine").mean()
cuisines.head()
```

We can then print out the top 4 ingredients of every cuisine with the following

```python
num_ingredients = 4 # define number of top ingredients to print

# define a function that prints the top ingredients for each cuisine
def print_top_ingredients(row):
    print(row.name.upper())
    row_sorted = row.sort_values(ascending=False)*100
    top_ingredients = list(row_sorted.index.values)[0:num_ingredients]
    row_sorted = list(row_sorted)[0:num_ingredients]

    for ind, ingredient in enumerate(top_ingredients):
        print("%s (%d%%)" % (ingredient, row_sorted[ind]), end=' ')
    print("\n")

# apply function to cuisines dataframe
create_cuisines_profiles = cuisines.apply(print_top_ingredients, axis=1)

# AFRICAN
# onion (53%) olive_oil (52%) garlic (49%) cumin (42%)

# AMERICAN
# butter (41%) egg (40%) wheat (39%) onion (29%)

# ASIAN
# soy_sauce (49%) ginger (48%) garlic (47%) rice (41%)

# CAJUN_CREOLE
# onion (69%) cayenne (56%) garlic (48%) butter (36%)

# CANADIAN
# wheat (39%) butter (38%) egg (35%) onion (34%)
```

R

First, we check if the ingredients exist in our dataframe

```r
grep("rice", names(recipes), value = TRUE) # yes as rice
grep("wasabi", names(recipes), value = TRUE) # yes
grep("soy", names(recipes), value = TRUE) # yes as soy_sauce
```

Thereafter we can look at our data in order to see if there are any changes that need to be made

```r
base::table(recipes$country) # frequency table

# American              40150
# Mexico                 1754
# Italian                1715
# Italy                  1461
# Asian                  1176
# French                  996
# east_asian              951
# Canada                  774
# korean                  767
# Mexican                 622
# western                 450
# Southern_SoulFood       346
# India                   324
# Jewish                  320
# Spanish_Portuguese      291
```

From here the following can be seen

  1. Cuisine is labelled as country
  2. Cuisine names are not consistent, uppercase, lowercase, etc.
  3. Some cuisines are duplicates of the country name
  4. Some cuisines have very few recipes

We can take a few steps to solve these problems

First we fix the Country title to be Cuisine

```r
colnames(recipes)[1] = "cuisine"
```

Then we can make all the names lowercase

```r
recipes$cuisine <- tolower(as.character(recipes$cuisine))

recipes
```

Next we correct the mislabelled cuisine names

```r
recipes$cuisine[recipes$cuisine == "austria"] <- "austrian"
recipes$cuisine[recipes$cuisine == "belgium"] <- "belgian"
recipes$cuisine[recipes$cuisine == "china"] <- "chinese"
recipes$cuisine[recipes$cuisine == "canada"] <- "canadian"
recipes$cuisine[recipes$cuisine == "netherlands"] <- "dutch"
recipes$cuisine[recipes$cuisine == "france"] <- "french"
recipes$cuisine[recipes$cuisine == "germany"] <- "german"
recipes$cuisine[recipes$cuisine == "india"] <- "indian"
recipes$cuisine[recipes$cuisine == "indonesia"] <- "indonesian"
recipes$cuisine[recipes$cuisine == "iran"] <- "iranian"
recipes$cuisine[recipes$cuisine == "israel"] <- "jewish"
recipes$cuisine[recipes$cuisine == "italy"] <- "italian"
recipes$cuisine[recipes$cuisine == "japan"] <- "japanese"
recipes$cuisine[recipes$cuisine == "korea"] <- "korean"
recipes$cuisine[recipes$cuisine == "lebanon"] <- "lebanese"
recipes$cuisine[recipes$cuisine == "malaysia"] <- "malaysian"
recipes$cuisine[recipes$cuisine == "mexico"] <- "mexican"
recipes$cuisine[recipes$cuisine == "pakistan"] <- "pakistani"
recipes$cuisine[recipes$cuisine == "philippines"] <- "philippine"
recipes$cuisine[recipes$cuisine == "scandinavia"] <- "scandinavian"
recipes$cuisine[recipes$cuisine == "spain"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "portugal"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "switzerland"] <- "swiss"
recipes$cuisine[recipes$cuisine == "thailand"] <- "thai"
recipes$cuisine[recipes$cuisine == "turkey"] <- "turkish"
recipes$cuisine[recipes$cuisine == "irish"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "uk-and-ireland"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "vietnam"] <- "vietnamese"

recipes
```

After that we can remove the cuisines with fewer than 50 recipes

First we build a frequency table of cuisines, sorted in descending order

```r
# sort the table of cuisines by descending order
t <- sort(base::table(recipes$cuisine), decreasing = T)

t
```

And then keep only the cuisines with at least 50 recipes

```r
# get cuisines with >= 50 recipes
filter_list <- names(t[t >= 50])

filter_list

# keep only the recipes belonging to the cuisines in the filter list
recipes <- recipes[recipes$cuisine %in% filter_list, ]
```

Next we convert all the columns into factors for classification later

```r
recipes[,names(recipes)] <- lapply(recipes[,names(recipes)], as.factor)

recipes
```

We can look at the structure of our dataframe as

```r
str(recipes)
```

Now we can look at which recipes contain rice and soy_sauce and wasabi and seaweed

```r
check_recipes <- recipes[
    recipes$rice == "Yes" &
    recipes$soy_sauce == "Yes" &
    recipes$wasabi == "Yes" &
    recipes$seaweed == "Yes",
]

check_recipes
```

We can count the ingredients across all recipes with

```r
# sum the row count when the value of the row in a column is equal to "Yes" (value of 2)
ingred <- unlist(
    lapply(recipes[, names(recipes)], function(x) sum(as.integer(x) == 2))
)

# transpose the dataframe so that each row is an ingredient
ingred <- as.data.frame(t(as.data.frame(ingred)))

ing_df <- data.frame("ingredient" = names(ingred),
                     "count" = as.numeric(ingred[1,])
                     )[-1,]

ing_df
```

We can next count the total ingredients and sort that in descending order

```r
ing_df_sort <- ing_df[order(ing_df$count, decreasing = TRUE),]
rownames(ing_df_sort) <- 1:nrow(ing_df_sort)

ing_df_sort
```

We can then create a profile for each cuisine as we did previously

```r
# create a dataframe of the counts of ingredients by cuisine, normalized by the number of
# recipes pertaining to that cuisine
by_cuisine_norm <- aggregate(recipes,
                             by = list(recipes$cuisine),
                             FUN = function(x) round(sum(as.integer(x) == 2)/
                                                     length(as.integer(x)), 4))

# remove the unnecessary column "cuisine"
by_cuisine_norm <- by_cuisine_norm[,-2]

# rename the first column into "cuisine"
names(by_cuisine_norm)[1] <- "cuisine"

head(by_cuisine_norm)
```

We can then print out the top 4 ingredients for each cuisine with

```r
for(nation in by_cuisine_norm$cuisine){
    x <- sort(by_cuisine_norm[by_cuisine_norm$cuisine == nation,][-1], decreasing = TRUE)
    cat(c(toupper(nation)))
    cat("\n")
    cat(paste0(names(x)[1:4], " (", round(x[1:4]*100, 0), "%) "))
    cat("\n")
    cat("\n")
}
```

Modeling and Evaluation

Modeling

In what way can the data be visualized to get to the answer that is required?

Modeling is the stage in which the data scientist uses the prepared data to develop models that answer the question

Data modeling either tries to get to a predictive or descriptive model

Data scientists use a training set for predictive modeling; this is historical data that acts as a way to test that the data we are using is suitable for the problem we are trying to solve
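
As a rough sketch of this idea (using a stand-in dataset from scikit-learn rather than project data), the historical data is split so the model is trained on one part and checked on the part that was held back

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# a stand-in dataset; in the methodology this would be the prepared project data
X, y = load_iris(return_X_y=True)

# hold out 30% of the historical data to evaluate the model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)                  # train on the training set only
print(model.score(X_test, y_test))           # accuracy on the held-out data
```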

Evaluation

Does the model used really answer the initial question or does it need to be adjusted?

Model evaluation goes hand in hand with model building; the two are done iteratively

This is done before the model is deployed in order to verify that the model answers our questions and the quality meets our standard

Two phases are considered when evaluating a model

  • Diagnostic Measures
    • Predictive
    • Descriptive
  • Statistical Significance

We can make use of the ROC (receiver operating characteristic) curve to evaluate models and determine the optimal one for a binary classification problem by plotting the true-positive rate against the false-positive rate for each model
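
As a minimal sketch of how such a curve can be produced with scikit-learn (using a stand-in binary classification dataset rather than the recipe data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# stand-in binary classification data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

fpr, tpr, _ = roc_curve(y_test, scores)      # false-positive vs true-positive rates
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")     # chance line for reference
plt.xlabel("False-positive rate")
plt.ylabel("True-positive rate")
plt.legend()
plt.show()
```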

Labs

The lab documents for this section are in both Python and R, and can be found in the labs folder as 4-Modeling-to-Evaluation-py.ipynb and 4-Modeling-to-Evaluation-R.ipynb

These labs will continue from where the last lab left off and build a decision tree model for the recipe data

Python

First we will need to import some libraries for modelling

```python
# import decision trees scikit-learn libraries
%matplotlib inline
from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix

import numpy as np # used later for class names and the confusion matrix
import matplotlib.pyplot as plt

!conda install python-graphviz --yes
import graphviz

from sklearn.tree import export_graphviz

import itertools
```

We will make use of a decision tree called bamboo_tree which will be used to classify between Korean, Japanese, Chinese, Thai, and Indian Food

The following code will create our decision tree

```python
# select subset of cuisines
asian_indian_recipes = recipes[recipes.cuisine.isin(["korean", "japanese", "chinese", "thai", "indian"])]
cuisines = asian_indian_recipes["cuisine"]
ingredients = asian_indian_recipes.iloc[:,1:]

bamboo_tree = tree.DecisionTreeClassifier(max_depth=3)
bamboo_tree.fit(ingredients, cuisines)

print("Decision tree model saved to bamboo_tree!")
```

Thereafter we can plot the decision tree with

```python
export_graphviz(bamboo_tree,
                feature_names=list(ingredients.columns.values),
                out_file="bamboo_tree.dot",
                class_names=np.unique(cuisines),
                filled=True,
                node_ids=True,
                special_characters=True,
                impurity=False,
                label="all",
                leaves_parallel=False)

with open("bamboo_tree.dot") as bamboo_tree_image:
    bamboo_tree_graph = bamboo_tree_image.read()
graphviz.Source(bamboo_tree_graph)
```

Decision Tree (Cognitive Class)

Now we can go back and rebuild our model, this time holding back some data so that we can evaluate the model

```python
bamboo = recipes[recipes.cuisine.isin(["korean", "japanese", "chinese", "thai", "indian"])]
bamboo["cuisine"].value_counts()
```

We can use 30 recipes from each cuisine as our test sample

```python
import random # used to seed the sampling

# set sample size
sample_n = 30

# take 30 recipes from each cuisine
random.seed(1234) # set random seed
bamboo_test = bamboo.groupby("cuisine", group_keys=False).apply(lambda x: x.sample(sample_n))

bamboo_test_ingredients = bamboo_test.iloc[:,1:] # ingredients
bamboo_test_cuisines = bamboo_test["cuisine"] # corresponding cuisines or labels
```

We can verify that we have 30 recipes from each cuisine

```python
# check that we have 30 recipes from each cuisine
bamboo_test["cuisine"].value_counts()
```

We can now separate our data into a test and training set

```python
bamboo_test_index = bamboo.index.isin(bamboo_test.index)
bamboo_train = bamboo[~bamboo_test_index]

bamboo_train_ingredients = bamboo_train.iloc[:,1:] # ingredients
bamboo_train_cuisines = bamboo_train["cuisine"] # corresponding cuisines or labels

bamboo_train["cuisine"].value_counts()
```

And then train our model again

```python
bamboo_train_tree = tree.DecisionTreeClassifier(max_depth=15)
bamboo_train_tree.fit(bamboo_train_ingredients, bamboo_train_cuisines)

print("Decision tree model saved to bamboo_train_tree!")
```

We can then view our tree as before

```python
export_graphviz(bamboo_train_tree,
                feature_names=list(bamboo_train_ingredients.columns.values),
                out_file="bamboo_train_tree.dot",
                class_names=np.unique(bamboo_train_cuisines),
                filled=True,
                node_ids=True,
                special_characters=True,
                impurity=False,
                label="all",
                leaves_parallel=False)

with open("bamboo_train_tree.dot") as bamboo_train_tree_image:
    bamboo_train_tree_graph = bamboo_train_tree_image.read()
graphviz.Source(bamboo_train_tree_graph)
```

If you run this you will see that the new tree is more complex than the last one due to it having fewer data points to work with (I did not put it here because it renders very big in the plot)

Next we can test our model based on the Test Data

```python
bamboo_pred_cuisines = bamboo_train_tree.predict(bamboo_test_ingredients)
```

We can then create a confusion matrix to see how well the tree does

```python
test_cuisines = np.unique(bamboo_test_cuisines)
bamboo_confusion_matrix = confusion_matrix(bamboo_test_cuisines, bamboo_pred_cuisines, labels=test_cuisines)
title = 'Bamboo Confusion Matrix'
cmap = plt.cm.Blues

plt.figure(figsize=(8, 6))
bamboo_confusion_matrix = (
    bamboo_confusion_matrix.astype('float') / bamboo_confusion_matrix.sum(axis=1)[:, np.newaxis]
) * 100

plt.imshow(bamboo_confusion_matrix, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(test_cuisines))
plt.xticks(tick_marks, test_cuisines)
plt.yticks(tick_marks, test_cuisines)

fmt = '.2f'
thresh = bamboo_confusion_matrix.max() / 2.
for i, j in itertools.product(range(bamboo_confusion_matrix.shape[0]), range(bamboo_confusion_matrix.shape[1])):
    plt.text(j, i, format(bamboo_confusion_matrix[i, j], fmt),
             horizontalalignment="center",
             color="white" if bamboo_confusion_matrix[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.show()
```

The rows of a confusion matrix represent the actual values, and the columns represent the predicted values

The resulting confusion matrix can be seen below

Confusion Matrix (Cognitive Class)

The squares along the top-left to bottom-right diagonal are those that the model correctly classified

R

We can follow a similar process as above using R

First we import the libraries we will need to build our decision trees as follows

```r
# load libraries
library(rpart)

if("rpart.plot" %in% rownames(installed.packages()) == FALSE) {install.packages("rpart.plot",
    repo = "http://mirror.las.iastate.edu/CRAN/")}
library(rpart.plot)

print("Libraries loaded!")
```

Thereafter we can train our model using our data with

```r
# select subset of cuisines
cuisines_to_keep = c("korean", "japanese", "chinese", "thai", "indian")
cuisines_data <- recipes[recipes$cuisine %in% cuisines_to_keep, ]
cuisines_data$cuisine <- as.factor(as.character(cuisines_data$cuisine))

bamboo_tree <- rpart(formula=cuisine ~ ., data=cuisines_data, method="class")

print("Decision tree model saved to bamboo_tree!")
```

And view it with the following

```r
# plot bamboo_tree
rpart.plot(bamboo_tree, type=3, extra=2, under=TRUE, cex=0.75, varlen=0, faclen=0, Margin=0.03)
```

Decision Tree (Cognitive Class)

Now we can redefine our dataframe to only include the Asian and Indian cuisine

```r
bamboo <- recipes[recipes$cuisine %in% c("korean", "japanese", "chinese", "thai", "indian"),]
```

And take a sample of 30 for our test set from each cuisine

```r
# set sample size
sample_n <- 30

# take 30 recipes from each cuisine
set.seed(4) # set random seed
korean <- bamboo[base::sample(which(bamboo$cuisine == "korean"), sample_n), ]
japanese <- bamboo[base::sample(which(bamboo$cuisine == "japanese"), sample_n), ]
chinese <- bamboo[base::sample(which(bamboo$cuisine == "chinese"), sample_n), ]
thai <- bamboo[base::sample(which(bamboo$cuisine == "thai"), sample_n), ]
indian <- bamboo[base::sample(which(bamboo$cuisine == "indian"), sample_n), ]

# create the dataframe
bamboo_test <- rbind(korean, japanese, chinese, thai, indian)
```

Thereafter we can create our training set with

```r
bamboo_train <- bamboo[!(rownames(bamboo) %in% rownames(bamboo_test)),]
bamboo_train$cuisine <- as.factor(as.character(bamboo_train$cuisine))
```

And verify that we have correctly removed the 30 recipes from each cuisine

```r
base::table(bamboo_train$cuisine)
```

Next we can train our tree and plot it

```r
bamboo_train_tree <- rpart(formula=cuisine ~ ., data=bamboo_train, method="class")
rpart.plot(bamboo_train_tree, type=3, extra=0, under=TRUE, cex=0.75, varlen=0, faclen=0, Margin=0.03)
```

It can be seen that by removing elements we get a more complex decision tree; this is the same as in the Python case

Decision Tree (Cognitive Class)

We can then predict the cuisines of the test set and view the confusion matrix as follows

```r
# predict the cuisine of each recipe in the test set
bamboo_pred_cuisines <- predict(bamboo_train_tree, bamboo_test, type="class")

bamboo_confusion_matrix <- base::table(
    paste(as.character(bamboo_test$cuisine), "_true", sep=""),
    paste(as.character(bamboo_pred_cuisines), "_pred", sep="")
)

round(prop.table(bamboo_confusion_matrix, 1)*100, 1)
```

Which will result in

```
                chinese_pred  indian_pred  japanese_pred  korean_pred  thai_pred
chinese_true            60.0          0.0            3.3         36.7        0.0
indian_true              0.0         90.0            0.0         10.0        0.0
japanese_true           20.0          3.3           33.3         40.0        3.3
korean_true              6.7          0.0           16.7         76.7        0.0
thai_true                3.3         20.0            0.0         33.3       43.3
```

Deployment and Feedback

Deployment

Can you put the model into practice?

The key to making your model relevant is making the stakeholders familiar with the solution developed

When the model has been evaluated and we are confident in it, we deploy it, typically first to a small set of users to put it through practical tests

Deployment also consists of developing a suitable method to enable our users to interact with and use the model, as well as looking at ways to improve the model with a feedback system
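
As one hedged, hypothetical illustration of this, the decision tree trained in the earlier lab could be serialised and wrapped in a small helper that a downstream application calls; the file name and helper function are assumptions, not part of the course material

```python
import joblib

# assumes `bamboo_train_tree` is the decision tree fitted in the modeling lab
joblib.dump(bamboo_train_tree, "cuisine_model.joblib")  # persist the trained model

def predict_cuisine(ingredient_flags):
    """Classify a single recipe given its row of 0/1 ingredient flags."""
    model = joblib.load("cuisine_model.joblib")  # load the deployed model
    return model.predict([ingredient_flags])[0]
```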

Feedback

Can you get constructive feedback into answering the question?

User feedback helps us to refine and assess the model’s performance and impact; based on this feedback we make changes to improve the model

Once the model is deployed we can make use of feedback and experience with the model to refine it or to incorporate different data that we had not initially considered