Data Analysis

Updated: 03 September 2023

Data Analysis With Python

Based on this Cognitive Class Course

Labs

The labs for the course are located in the Labs folder; they are from CognitiveClass and are licensed under the MIT License

Introduction

The data being analysed is a used-car dataset; the goal is to estimate the price of a used car based on its characteristics

  • Dataset to be analyzed in Python
  • Overview of Python Packages
  • Importing and Exporting Data
  • Basic Insights from the Data

Understanding the Data

The data being used is the Autos dataset from the UCI Machine Learning Repository at archive.ics.uci.edu

The first few lines of the file are as follows

imports-85.data

3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920

The description of the data can be found in the accompanying imports-85.names file; the attributes are as follows

Column / Attribute / Attribute Range

1. symboling: -3, -2, -1, 0, 1, 2, 3
2. normalized-losses: continuous from 65 to 256
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
4. fuel-type: diesel, gas
5. aspiration: std, turbo
6. num-of-doors: four, two
7. body-style: hardtop, wagon, sedan, hatchback, convertible
8. drive-wheels: 4wd, fwd, rwd
9. engine-location: front, rear
10. wheel-base: continuous from 86.6 to 120.9
11. length: continuous from 141.1 to 208.1
12. width: continuous from 60.3 to 72.3
13. height: continuous from 47.8 to 59.8
14. curb-weight: continuous from 1488 to 4066
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor
16. num-of-cylinders: eight, five, four, six, three, twelve, two
17. engine-size: continuous from 61 to 326
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi
19. bore: continuous from 2.54 to 3.94
20. stroke: continuous from 2.07 to 4.17
21. compression-ratio: continuous from 7 to 23
22. horsepower: continuous from 48 to 288
23. peak-rpm: continuous from 4150 to 6600
24. city-mpg: continuous from 13 to 49
25. highway-mpg: continuous from 16 to 54
26. price: continuous from 5118 to 45400

Missing Attribute Values: (denoted by "?")

The symboling is an indicator of the vehicle's insurance risk level; the lower the value, the lower the risk (from an insurance perspective)

The normalized-losses value is the relative average insurance loss payment per insured vehicle year, normalized for the vehicle's size class

The price is the value that we would like to predict given the other features

Note that this dataset is from 1985 and therefore the prices may seem a little low

Python Libraries

Scientific Computing Libraries

Pandas

Pandas is a library for working with Data Structures, primarily DataFrames

NumPy

NumPy is a library for working with Arrays and Matrices

SciPy

SciPy includes functions for mathematical and statistical analysis, such as optimization, integration, and statistical tests
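
A minimal sketch of these libraries working together (the values below are made up purely for illustration):

import numpy as np
import pandas as pd
from scipy import stats

# a small, made-up sample of car prices
prices = np.array([13495, 16500, 16500, 13950, 17450])

# wrap the array in a DataFrame and summarize it
sample_df = pd.DataFrame({'price': prices})
print(sample_df['price'].mean())   # pandas column mean
print(stats.zscore(prices))        # SciPy z-scores of the sample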

Visualization

Matplotlib

Matplotlib is the most commonly used Python library for data visualization

Seaborn

Seaborn is based on Matplotlib and provides functionality such as heat maps, time series, and violin plots

Algorithmic Libraries

Scikit-learn

Machine learning: regression, classification, etc.

Statsmodels

Explore data, estimate statistical models, and perform statistical tests

Import and Export Data

Importing

When importing data we need to take a few things into consideration, such as

  • Format
  • File source/path

In our case the data is in CSV format (with a .data extension) and is located at a remote source

We can import our data as follows

import pandas as pd
import numpy as np

# read the online file from the URL provided above, and assign it to the variable "df"
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(path, header=None)

print("import done")

import done

df.head(2)
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500

2 rows × 26 columns

We can see that our data comes in without headers; we can assign the headers to our data based on the information in the imports-85.names file

headers = ['symboling','normalized-losses','make','fuel-type','aspiration', 'num-of-doors','body-style',
           'drive-wheels','engine-location','wheel-base', 'length','width','height','curb-weight','engine-type',
           'num-of-cylinders', 'engine-size','fuel-system','bore','stroke','compression-ratio','horsepower',
           'peak-rpm','city-mpg','highway-mpg','price']
df.columns = headers

df.head(10)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

The missing values in the price column will be removed later, in the preprocessing section

Analyzing Data

Pandas has a few different methods to understand the data

We need to do some basic checks such as

  • Data Types
    • Pandas automatically assigns data types which may not necessarily be correct
    • This will further allow us to understand what functions we can apply to what columns
df.dtypes

symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

We can also view the statistical summary as follows

df.describe()
symboling wheel-base length width height curb-weight engine-size compression-ratio city-mpg highway-mpg
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 10.142537 25.219512 30.751220
std 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 3.972040 6.542142 6.886443
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 9.400000 30.000000 34.000000
max 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 23.000000 49.000000 54.000000

However, if we want to view a summary including fields that are of type object, we can do the following

df.describe(include = "all")
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 205.000000 205 205 205 205 205 205 205 205 205.000000 ... 205.000000 205 205 205 205.000000 205 205 205.000000 205.000000 205
unique NaN 52 22 2 2 3 5 3 2 NaN ... NaN 8 39 37 NaN 60 24 NaN NaN 187
top NaN ? toyota gas std four sedan fwd front NaN ... NaN mpfi 3.62 3.40 NaN 68 5500 NaN NaN ?
freq NaN 41 32 185 168 114 96 120 202 NaN ... NaN 94 23 20 NaN 19 37 NaN NaN 4
mean 0.834146 NaN NaN NaN NaN NaN NaN NaN NaN 98.756585 ... 126.907317 NaN NaN NaN 10.142537 NaN NaN 25.219512 30.751220 NaN
std 1.245307 NaN NaN NaN NaN NaN NaN NaN NaN 6.021776 ... 41.642693 NaN NaN NaN 3.972040 NaN NaN 6.542142 6.886443 NaN
min -2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 86.600000 ... 61.000000 NaN NaN NaN 7.000000 NaN NaN 13.000000 16.000000 NaN
25% 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 94.500000 ... 97.000000 NaN NaN NaN 8.600000 NaN NaN 19.000000 25.000000 NaN
50% 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 97.000000 ... 120.000000 NaN NaN NaN 9.000000 NaN NaN 24.000000 30.000000 NaN
75% 2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 102.400000 ... 141.000000 NaN NaN NaN 9.400000 NaN NaN 30.000000 34.000000 NaN
max 3.000000 NaN NaN NaN NaN NaN NaN NaN NaN 120.900000 ... 326.000000 NaN NaN NaN 23.000000 NaN NaN 49.000000 54.000000 NaN

11 rows × 26 columns

We can also use df.info() to get a concise summary of the dataframe, including the column data types and the number of non-null entries
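
For example, the call itself is just the following (its output is omitted here):

df.info()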

If we would like to get a more complete report of the dataset, however, we can make use of the pandas_profiling library, which will output a full overview report of the data

import pandas as pd
import pandas_profiling

pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv').profile_report()

Preprocessing Data

Data preprocessing is the process of cleaning the data in order to get data from its raw form to a more usable format

This can include different stages such as

  • Missing value handling
  • Data formatting
  • Data normalization
  • Data binning
  • Turning categorical values into numerical values

Missing Values

A missing value occurs when no data value is stored for a specific feature in an observation

Missing data can be represented in many different ways such as ?, N/A, 0 or blank among other ways

In our dataset missing values are represented as "?"

We can deal with missing values in a variety of ways

  • Check if the correct data can be found
  • Drop the feature
  • Drop the data entry
  • Replace with average or similar datapoints
  • Replace with most common value
  • Replace based on knowledge about data
  • Leave the data as missing data

Dropping missing data can be done with the df.dropna() function; since we have missing values in the price column, those rows will need to be dropped

We also need to make use of the inplace=True parameter to modify the actual dataframe

The df.replace() function allows us to replace missing values in the data

df.replace("?", np.nan, inplace = True)
df.head(10)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 NaN audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 NaN audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 NaN audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 NaN

10 rows × 26 columns

Next, we can count the missing values in each column

missing_data = df.isnull()
missing_data.head(5)
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 26 columns

for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print('')
1
symboling
2
False 205
3
Name: symboling, dtype: int64
4
5
normalized-losses
6
False 164
7
True 41
8
Name: normalized-losses, dtype: int64
9
10
make
11
False 205
12
Name: make, dtype: int64
13
14
fuel-type
15
False 205
16
Name: fuel-type, dtype: int64
17
18
aspiration
19
False 205
20
Name: aspiration, dtype: int64
21
22
num-of-doors
23
False 203
24
True 2
25
Name: num-of-doors, dtype: int64
26
27
body-style
28
False 205
29
Name: body-style, dtype: int64
30
31
drive-wheels
32
False 205
33
Name: drive-wheels, dtype: int64
34
35
engine-location
36
False 205
37
Name: engine-location, dtype: int64
38
39
wheel-base
40
False 205
41
Name: wheel-base, dtype: int64
42
43
length
44
False 205
45
Name: length, dtype: int64
46
47
width
48
False 205
49
Name: width, dtype: int64
50
51
height
52
False 205
53
Name: height, dtype: int64
54
55
curb-weight
56
False 205
57
Name: curb-weight, dtype: int64
58
59
engine-type
60
False 205
61
Name: engine-type, dtype: int64
62
63
num-of-cylinders
64
False 205
65
Name: num-of-cylinders, dtype: int64
66
67
engine-size
68
False 205
69
Name: engine-size, dtype: int64
70
71
fuel-system
72
False 205
73
Name: fuel-system, dtype: int64
74
75
bore
76
False 201
77
True 4
78
Name: bore, dtype: int64
79
80
stroke
81
False 201
82
True 4
83
Name: stroke, dtype: int64
84
85
compression-ratio
86
False 205
87
Name: compression-ratio, dtype: int64
88
89
horsepower
90
False 203
91
True 2
92
Name: horsepower, dtype: int64
93
94
peak-rpm
95
False 203
96
True 2
97
Name: peak-rpm, dtype: int64
98
99
city-mpg
100
False 205
101
Name: city-mpg, dtype: int64
102
103
highway-mpg
104
False 205
105
Name: highway-mpg, dtype: int64
106
107
price
108
False 201
109
True 4
110
Name: price, dtype: int64

We’ll replace the missing data as follows

  • Replace by Mean
    • “normalized-losses”: 41 missing data
    • “stroke”: 4 missing data
    • “bore”: 4 missing data
    • “horsepower”: 2 missing data
    • “peak-rpm”: 2 missing data
  • Replace by Frequency
    • “num-of-doors”: 2 missing data
      • Because most cars have four doors
  • Drop the Row
    • “price”: 4 missing data
      • Because price is what we want to predict, it does not help if it is not there

Normalized Losses

avg_1 = df['normalized-losses'].astype('float').mean(axis = 0)
df['normalized-losses'].replace(np.nan, avg_1, inplace = True)

Bore

avg_2=df['bore'].astype('float').mean(axis=0)
df['bore'].replace(np.nan, avg_2, inplace= True)

Stroke

avg_3 = df["stroke"].astype("float").mean(axis = 0)
df["stroke"].replace(np.nan, avg_3, inplace = True)

Horsepower

avg_4=df['horsepower'].astype('float').mean(axis=0)
df['horsepower'].replace(np.nan, avg_4, inplace= True)

Peak RPM

avg_5=df['peak-rpm'].astype('float').mean(axis=0)
df['peak-rpm'].replace(np.nan, avg_5, inplace= True)

Number of Doors

First we need to check which door count is the most common

df['num-of-doors'].value_counts()

four 114
two 89
Name: num-of-doors, dtype: int64

And then replace invalid values with that

df["num-of-doors"].replace(np.nan, "four", inplace = True)

Price

And then lastly drop the rows with a missing price

df.dropna(subset=["price"], axis=0, inplace = True)

# reset index, because we dropped rows
df.reset_index(drop = True, inplace = True)

Clean Data

Then we can view our dataset, which has no missing values

df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

Formatting Data

Now we need to ensure that our data is correctly formatted

  • Data is usually collected in different places and stored in different formats
  • Bringing data into a common standard allows for more meaningful comparison

In order to format data, Pandas comes with some tools for us to use

We can also apply mathematical operations to our data, as well as type conversions and column renames

Sometimes an incorrect data type is assigned by default, so it may be necessary for us to convert our data to the correct type for analysis. We can convert datatypes using the df.astype() function, and check types with df.dtypes

Data Types

df.dtypes

symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

Convert types

We can easily convert our data to the correct types with

df[['bore', 'stroke']] = df[['bore', 'stroke']].astype('float')
df[['normalized-losses']] = df[['normalized-losses']].astype('int')
df[['price']] = df[['price']].astype('float')
df[['peak-rpm']] = df[['peak-rpm']].astype('float')

And view the corrected types

df.dtypes

symboling int64
normalized-losses int64
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object

Standardization

Standardization is the process of taking data from one format to another that may be more meaningful for us to use, such as converting our fuel consumption from mpg to L/100km

# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df['city-mpg']
df[['city-mpg','city-L/100km']].head()
city-mpg city-L/100km
0 21 11.190476
1 21 11.190476
2 19 12.368421
3 24 9.791667
4 18 13.055556
df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 13495.0 11.190476
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421
3 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667
4 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556

5 rows × 27 columns

df['highway-L/100km'] = 235/df['highway-mpg']
df[['highway-mpg','highway-L/100km']].head()
highway-mpg highway-L/100km
0 27 8.703704
1 27 8.703704
2 26 9.038462
3 30 7.833333
4 22 10.681818
df.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 3.47 2.68 9.0 111 5000.0 21 27 13495.0 11.190476 8.703704
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476 8.703704
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421 9.038462
3 2 164 audi gas std four sedan fwd front 99.8 ... 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667 7.833333
4 2 164 audi gas std four sedan 4wd front 99.4 ... 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556 10.681818

5 rows × 28 columns

Renaming Columns

If there is a need for us to rename columns we can do so with df.rename(), for example

df.rename(columns={'highway-mpg':'highway-consumption'}, inplace=True)

Data Normalization

This is an important part of data preprocessing

We may need to normalize our variables such that the range of our data is more consistent, allowing us to manage the way that different values will impact our analysis

There are several different ways to normalize data

Simple Feature Scaling

x_{new} = \frac{x_{old}}{x_{max}}

df['length'] = df['length']/df['length'].max()

Min-Max

x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}

df['length'] = (df['length']-df['length'].min())/(df['length'].max()-df['length'].min())

Z-Score

The resulting values hover around zero, and are in terms of their standard deviation from the mean

x_{new} = \frac{x_{old} - \mu}{\sigma}

df['length'] = (df['length']-df['length'].mean())/df['length'].std()

Normalization

Next we will normalize our values using the Simple Feature Scaling Method

We will apply this to the following features

  • ‘length’
  • ‘width’
  • ‘height’
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
df[["length","width","height"]].head()
length width height
0 0.811148 0.890278 0.816054
1 0.811148 0.890278 0.816054
2 0.822681 0.909722 0.876254
3 0.848630 0.919444 0.908027
4 0.848630 0.922222 0.908027

Binning

Binning is grouping values together into bins; this can sometimes improve the accuracy of predictive models and help us to better understand the data distribution

We’ll arrange our ‘horsepower’ column into bins such that we can label cars as having a ‘Low’, ‘Medium’, or ‘High’ horsepower as follows

df["horsepower"]=df["horsepower"].astype(float, copy=True)
binwidth = (max(df["horsepower"])-min(df["horsepower"]))/4
bins = np.arange(min(df["horsepower"]), max(df["horsepower"]), binwidth)
bins

array([ 48. , 101.5, 155. , 208.5])

We can then use pd.cut() to determine what bin each value belongs in

group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
df[['horsepower','horsepower-binned']].head(20)
horsepower horsepower-binned
0 111.0 Medium
1 111.0 Medium
2 154.0 Medium
3 102.0 Medium
4 115.0 Medium
5 110.0 Medium
6 110.0 Medium
7 110.0 Medium
8 140.0 Medium
9 101.0 Low
10 101.0 Low
11 121.0 Medium
12 121.0 Medium
13 121.0 Medium
14 182.0 High
15 182.0 High
16 182.0 High
17 48.0 Low
18 70.0 Low
19 70.0 Low

Visualization

We can use matplotlib to visualize the number of vehicles in each of our bins with the following

%matplotlib inline
from matplotlib import pyplot as plt

# draw a histogram of the attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("Horsepower")
plt.ylabel("Count")
plt.title("Horsepower Bins")
plt.show()

png

Categorical to Quantitative

Most statistical models cannot take in strings as inputs; to handle categorical string features we create dummy (indicator) variables that encode whether or not an entry takes on a certain value. We can do this in pandas with pd.get_dummies()

This method will automatically generate the dummy columns for the provided column, and is used as follows

dummy_value = pd.get_dummies(df['value'])

Fuel

We will do this for the ‘fuel-type’ column in our data

dummy_fuel = pd.get_dummies(df["fuel-type"])
dummy_fuel.head()
diesel gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Then we will rename our columns for clarity

dummy_fuel.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_fuel.head()
fuel-type-diesel fuel-type-gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Then we will add these columns to our dataset

# merge data frame "df" and "dummy_fuel"
df = pd.concat([df, dummy_fuel], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

df.head()
symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... horsepower peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas
0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 111.0 5000.0 21 27 13495.0 11.190476 8.703704 Medium 0 1
1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 111.0 5000.0 21 27 16500.0 11.190476 8.703704 Medium 0 1
2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 154.0 5000.0 19 26 16500.0 12.368421 9.038462 Medium 0 1
3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 102.0 5500.0 24 30 13950.0 9.791667 7.833333 Medium 0 1
4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 115.0 5500.0 18 22 17450.0 13.055556 10.681818 Medium 0 1

5 rows × 30 columns

Aspiration

We will do the same as above with our aspiration variable

# get indicator variables for aspiration and assign them to the data frame "dummy_aspiration"
dummy_aspiration = pd.get_dummies(df['aspiration'])

# change column names for clarity
dummy_aspiration.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# show first 5 instances of data frame "dummy_aspiration"
dummy_aspiration.head()
aspiration-std aspiration-turbo
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
# merge data frame "df" and "dummy_aspiration"
df = pd.concat([df, dummy_aspiration], axis=1)

# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)

df.head()
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 13495.0 11.190476 8.703704 Medium 0 1 1 0
1 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 16500.0 11.190476 8.703704 Medium 0 1 1 0
2 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 19 26 16500.0 12.368421 9.038462 Medium 0 1 1 0
3 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 24 30 13950.0 9.791667 7.833333 Medium 0 1 1 0
4 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 18 22 17450.0 13.055556 10.681818 Medium 0 1 1 0

5 rows × 31 columns

Exporting

Then we can export our clean data to a CSV as follows

df.to_csv('clean_df.csv')
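
By default to_csv also writes the dataframe index out as an extra column; if that is not wanted, a common pattern is to pass index=False

df.to_csv('clean_df.csv', index=False)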

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to analyze data in order to

  • Summarize main characteristics of data
  • Gain better understanding of a dataset
  • Uncover relationships between variables
  • Identify important variables that impact our problem

Descriptive Statistics

When analyzing data it is important to describe the data giving a few short summaries. We have a few different ways to do this, such as

  • df.describe() to get general statistical information about numeric data
  • value_counts() to get a count of the different values of a categorical column
  • sns.boxplot() to generate box plots
  • plt.scatter() to show the relationship between the predictor and target variables (see the sketch below)
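
As a minimal sketch of the scatter plot option, assuming df has already been cleaned as above:

import matplotlib.pyplot as plt

# scatter plot of engine size (predictor) against price (target)
plt.scatter(df['engine-size'], df['price'])
plt.xlabel('engine-size')
plt.ylabel('price')
plt.show()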

Correlation

Correlation is a measure of the extent to which different variables are interdependent

Correlation does not imply causation

Pearson Correlation

We can make use of the Pearson correlation to measure the strength of the correlation between two features; this gives us two values

  • Correlation Coefficient
    • Close to 1: Positive relationship
    • Close to -1: Negative relationship
    • Close to 0: No relationship
  • P-Value
    • P < 0.001: Strong certainty in the result
    • P < 0.05 : Moderate certainty in result
    • P < 0.1 : Weak certainty in result
    • P > 0.1 : No certainty in result

We would define a strong correlation when the Correlation Coefficient is around 1 or -1, and the P-Value is less than 0.001

We can find our correlation values for two variables with

from scipy import stats

Wheel Base and Price

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.584641822266 with a P-value of P = 8.07648827073e-20

Horsepower and Price

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.809574567004 with a P-value of P = 6.36905742826e-48

Positive Correlation

We can visualize our data using some of the visualizations we described above such as by comparing engine size to price

import seaborn as sns

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()

png

We can also view the correlation matrix for these two variables as follows

df[["engine-size", "price"]].corr()
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000

From this we can see that engine size is a fairly good predictor of price

Negative Correlation

We can also look at the correlation between highway mileage and price

sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()
df[['highway-mpg', 'price']].corr()

png

highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000

Where we can note a Negative Linear Relationship

Weak Correlation

We can then compare Peak RPM and price

sns.regplot(x="peak-rpm", y="price", data=df)
plt.show()
df[['peak-rpm','price']].corr()

png

peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000

Where we can see a weak linear relationship

We can also observe the relationships between stroke and price

sns.regplot(x="stroke", y="price", data=df)
plt.show()
df[["stroke","price"]].corr()

png

stroke price
stroke 1.000000 0.082269
price 0.082269 1.000000

Next we can look at boxplots of our different categorical values in order to look at their distribution

sns.boxplot(x="body-style", y="price", data=df)
plt.show()

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/seaborn/categorical.py:462: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  box_data = remove_na(group_data)

png

sns.boxplot(x="engine-location", y="price", data=df)
plt.show()

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/seaborn/categorical.py:462: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  box_data = remove_na(group_data)

png

sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/seaborn/categorical.py:462: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  box_data = remove_na(group_data)

png

Descriptive Statistics

Next we can find some descriptive statistics about our data by the following

df.describe()
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke ... peak-rpm city-mpg highway-mpg price city-L/100km highway-L/100km fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
count 201.000000 201.00000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 ... 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
mean 0.840796 122.00000 98.797015 0.837102 0.915126 0.899108 2555.666667 126.875622 3.330692 3.256874 ... 5117.665368 25.179104 30.686567 13207.129353 9.944145 8.044957 0.099502 0.900498 0.820896 0.179104
std 1.254802 31.99625 6.066366 0.059213 0.029187 0.040933 517.296727 41.546834 0.268072 0.316048 ... 478.113805 6.423220 6.815150 7947.066342 2.534599 1.840739 0.300083 0.300083 0.384397 0.384397
min -2.000000 65.00000 86.600000 0.678039 0.837500 0.799331 1488.000000 61.000000 2.540000 2.070000 ... 4150.000000 13.000000 16.000000 5118.000000 4.795918 4.351852 0.000000 0.000000 0.000000 0.000000
25% 0.000000 101.00000 94.500000 0.801538 0.890278 0.869565 2169.000000 98.000000 3.150000 3.110000 ... 4800.000000 19.000000 25.000000 7775.000000 7.833333 6.911765 0.000000 1.000000 1.000000 0.000000
50% 1.000000 122.00000 97.000000 0.832292 0.909722 0.904682 2414.000000 120.000000 3.310000 3.290000 ... 5125.369458 24.000000 30.000000 10295.000000 9.791667 7.833333 0.000000 1.000000 1.000000 0.000000
75% 2.000000 137.00000 102.400000 0.881788 0.925000 0.928094 2926.000000 141.000000 3.580000 3.410000 ... 5500.000000 30.000000 34.000000 16500.000000 12.368421 9.400000 0.000000 1.000000 1.000000 0.000000
max 3.000000 256.00000 120.900000 1.000000 1.000000 1.000000 4066.000000 326.000000 3.940000 4.170000 ... 6600.000000 49.000000 54.000000 45400.000000 18.076923 14.687500 1.000000 1.000000 1.000000 1.000000

8 rows × 22 columns

df.describe(include=['object'])
make num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count 201 201 201 201 201 201 201 201
unique 22 2 5 3 2 6 7 8
top toyota four sedan fwd front ohc four mpfi
freq 32 115 94 118 198 145 157 92
Value Counts

We can get an indication as to the frequencies of our categorical values with the following

# drive-wheel counts as a variable
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
value_counts
engine-location
front 198
rear 3

Grouping Data

It can be helpful to group data in order to see if there is a correlation between a certain categorical variable and some other variable

We can get an array of all unique 'drive-wheels' categories with

df['drive-wheels'].unique()

array(['rwd', 'fwd', '4wd'], dtype=object)

In Pandas we can use df.groupby() to do this, as follows

df_wheels = df[['drive-wheels','body-style','price']]
df_wheels = df_wheels.groupby(['drive-wheels'],as_index= False).mean()
df_wheels
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333

Next we can look at the wheels compared to the body style

df_external = df[['drive-wheels', 'body-style', 'price']]
df_group = df_external.groupby(['drive-wheels','body-style'], as_index=False).mean().round()
df_group
drive-wheels body-style price
0 4wd hatchback 7603.0
1 4wd sedan 12647.0
2 4wd wagon 9096.0
3 fwd convertible 11595.0
4 fwd hardtop 8249.0
5 fwd hatchback 8396.0
6 fwd sedan 9812.0
7 fwd wagon 9997.0
8 rwd convertible 23950.0
9 rwd hardtop 24203.0
10 rwd hatchback 14338.0
11 rwd sedan 21712.0
12 rwd wagon 16994.0

We can view this data a bit more conveniently as a pivot table; we can do this with df.pivot() so as to display the data as below

grouped_pivot = df_group.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd NaN NaN 7603.0 12647.0 9096.0
fwd 11595.0 8249.0 8396.0 9812.0 9997.0
rwd 23950.0 24203.0 14338.0 21712.0 16994.0

Next we can use a heatmap to visualize the above relationship; the heatmap plots the target variable as the colour with respect to the drive-wheels and body-style variables

#use the grouped results
fig, ax = plt.subplots()
im=ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels=grouped_pivot.columns.levels[1]
col_labels=grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1])+0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0])+0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

png

Analysis of Variance (ANOVA)

If we want to analyze a categorical variable and see the correlation between different categories

  • Statistical comparison of groups

ANOVA is a statistical test that helps find the correlation between different groups of a categorical variable; the ANOVA test returns two values

  • F-Test score: the variation between sample group means divided by the variation within the sample groups
  • P-Value: a measure of confidence that shows whether the obtained result is statistically significant
Drive Wheels
grouped_test2=df[['drive-wheels','price']].groupby(['drive-wheels'])
grouped_test2.head(2)
drive-wheels price
0 rwd 13495.0
1 rwd 16500.0
3 fwd 13950.0
4 4wd 17450.0
5 fwd 15250.0
136 4wd 7603.0
grouped_test2.get_group('4wd')['price']

4 17450.0
136 7603.0
140 9233.0
141 11259.0
144 8013.0
145 11694.0
150 7898.0
151 8778.0
Name: price, dtype: float64

To obtain the ANOVA values, we use the stats.f_oneway function from scipy for the correlation between drive wheels and price

Drive Wheels and Price
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.9540650078 , P = 3.39454435772e-23

This shows a large F and a small P, meaning that there is a strong correlation between the three drive types and the price; however, does this apply for each individual comparison of drive types?

FWD and RWD
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val )

ANOVA results: F= 130.553316096 , P = 2.23553063557e-23
4WD and RWD
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 8.58068136892 , P = 0.00441149221123

4WD and FWD

f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])

print("ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 0.665465750252 , P = 0.416201166978

Model Development

A model can be thought of as a mathematical equation used to predict a value given a specific input

More relevant data will allow us to more accurately predict an outcome

Linear Regression

Simple Linear Regression

SLR helps us to identify the relationship between a single independent (predictor) variable and a dependent (target) variable, in the form

y = b_0 + b_1 x

We make use of training datapoints to help us find the b_0 and b_1 values

Our measured datapoints will have some noise, causing our data to deviate from the model

We make use of our training data to fit a model to our data; we then use the model to make predictions. We denote our predicted values as an estimate with

\hat{y} = b_0 + b_1 x

If the linear model is correct, we can assume that the offsets are noise, otherwise it may be other factors that we have not taken into consideration

We can create a linear regression model by importing it from sklearn and creating a new LinearRegression object as follows

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

Then we define the predictor and target variables and use lm.fit() to find the parameters b_0 and b_1

X = df[['predictor']]
Y = df[['target']]

lm.fit(X,Y)

We can then obtain a prediction using lm.predict()

yhat = lm.predict(X)

The intercept and slope are attributes of the LinearRegression object and can be found with lm.intercept_ and lm.coef_ respectively

We can train an SLR model for our data as follows

df.head()
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... city-mpg highway-mpg price city-L/100km highway-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 13495.0 11.190476 8.703704 Medium 0 1 1 0
1 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 21 27 16500.0 11.190476 8.703704 Medium 0 1 1 0
2 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 19 26 16500.0 12.368421 9.038462 Medium 0 1 1 0
3 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 24 30 13950.0 9.791667 7.833333 Medium 0 1 1 0
4 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 18 22 17450.0 13.055556 10.681818 Medium 0 1 1 0

5 rows × 31 columns

from sklearn.linear_model import LinearRegression

X = df[['highway-mpg']]
Y = df['price']

lm = LinearRegression()
lm.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Yhat=lm.predict(X)
Yhat[0:5]

array([ 16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,
        20345.17153508])

lm_eq = 'yhat = ' + str(lm.intercept_) + ' + ' + str(lm.coef_[0]) + 'x'
print(lm_eq)

yhat = 38423.3058582 + -821.733378322x

Multiple Linear Regression

MLR is used to explain the relationship between one continuous target variable and two or more predictor variables

The model will then be defined by the function

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n

The resulting equation represents a surface in n-dimensional space that defines y given each component of x

We can train this model just as before as follows

X = df[['predictor1','predictor2','predictor3',...,'predictorN']]
Y = df[['target']]

lm.fit(X,Y)

And then predict as before with

yhat = lm.predict(X)

Where X is in the form of the training data

The intercept and coefficients of the model can once again be found with lm.intercept_ and lm.coef_ respectively

We can develop a MLR model as follows

Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
lm.fit(Z, df['price'])
print('intercept: ' + str(lm.intercept_), '\ncoefficients: ' + str(lm.coef_))

intercept: -15806.6246263
coefficients: [ 53.49574423 4.70770099 81.53026382 36.05748882]

Model Evaluation using Visualization

Regression Plot

Regression plots give us a good estimate of:

  • The relationship between two variables
  • The strength of correlation
  • The direction of the relationship

A regression plot is a combination of a scatter plot, and a linear regression line

We create a regression plot using the sns.regplot() function in seaborn

import seaborn as sns

sns.regplot(x='predictor', y='target', data=df)
plt.ylim(0,)
plt.show()

Simple Linear Regression

We can visualize SLR as follows

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
plt.show()

png

plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)

(0, 47422.919330307624)

png

Residual Plot

A residual plot represents the error between the actual and predicted values; the residuals should have a mean of 0 if the linear assumption is applicable

If the residual plot is not equally distributed around the mean of 0 throughout, we know that the linear model is not correct for the data we have

We can create a residual plot with sns.residplot()

sns.residplot(df['predictor'], df['target'])

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(df['highway-mpg'], df['price'])
plt.show()

png

Distribution Plot

We use this to look at the distribution of the actual vs predicted values for our model

A distribution plot can be made with sns.distplot()

ax1 = sns.distplot(df['target'], hist=False, label='Actual Value')
sns.distplot(yhat, hist=False, label='Fitted Values', ax=ax1)
plt.show()
Multiple Linear Regression

Using a distribution plot is more valuable when looking at MLR model performance

Y_hat = lm.predict(Z)

plt.figure(figsize=(width, height))

ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

png

Polynomial Regression and Pipelines

Polynomial Regression is a form of regression used to describe curvilinear relationships, in which the relationship between the target and the predictor variable is not linear

  • Quadratic: \hat{y} = b_0 + b_1 x_1 + b_2 x_1^2
  • Cubic: \hat{y} = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3
  • Higher Order: \hat{y} = b_0 + b_1 x_1 + b_2 x_1^2 + ... + b_n x_1^n

Picking the correct order can make our model more accurate

We can fit a polynomial to our data using np.polyfit(), and can print out the resulting model with np.poly1d()

For an n-th order polynomial we can create and view a model as follows

f = np.polyfit(x, y, n)
p = np.poly1d(f)
print(p)

Visualization

Visualizing Polynomial regression plots is a bit more work but can be done with the function below

def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

Next we can do the polynomial fit for our data

x = df['highway-mpg']
y = df['price']

f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

        3         2
-1.557 x + 204.8 x - 8965 x + 1.379e+05

PlotPolly(p, x, y, 'highway-mpg')

png

Multi Dimension Polynomial Regression

Polynomial regression can also be done in multiple dimensions, for example a second order approximation would look like the following

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + b_4 x_1^2 + b_5 x_2^2

If we want to do multi-dimensional polynomial fits, we will need to use PolynomialFeatures from sklearn.preprocessing to preprocess our features as follows

X = df[['predictor1','predictor2','predictor3',...,'predictorN']]
Y = df[['target']]

pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(X)

from sklearn.preprocessing import PolynomialFeatures

pr = PolynomialFeatures(degree=2)
pr

PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)

Z_pr = pr.fit_transform(Z)

Before the transformation we have 4 features

Z.shape

(201, 4)

After the transformation we have 15 features

Z_pr.shape

(201, 15)

Pre-Processing

sklearn has some preprocessing functionality such as

Normalization

We can train a scaler and normalize our data based on that as follows

from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
scale.fit(X[['feature1','feature2',..., 'featureN']])

X_scaled = scale.transform(X[['feature1','feature2',..., 'featureN']])

There are other normalization functions available with which we can preprocess our data
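
As a minimal sketch of one such alternative, MinMaxScaler from sklearn.preprocessing rescales each feature to the [0, 1] range (the feature names here are placeholders):

from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the chosen features and rescale them to the [0, 1] range
mm_scale = MinMaxScaler()
X_minmax = mm_scale.fit_transform(X[['feature1', 'feature2']])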

Pipelines

There are many steps to getting a prediction, such as Normalization, Polynomial Transformation, and training a Linear Regression Model

We can use a Pipeline library to help simplify the process

First we import all the libraries we will need as well as the pipeline library

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

Then we construct our pipeline using a list of tuples defined as ('name of estimator model', ModelConstructor()) and create our pipeline object

input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=n)),
         ...,
         ('model', LinearRegression())]
pipe = Pipeline(input)

We can then train our pipeline on the data by using the pipe.fit() method, and make predictions with pipe.predict()

pipe.fit(X[['feature1','feature2',..., 'featureN']], Y)
yhat = pipe.predict(X[['feature1','feature2',..., 'featureN']])

The above method will normalize the data, then perform a polynomial transform and output a prediction

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Input=[('scale',StandardScaler()),
       ('polynomial', PolynomialFeatures(include_bias=False)),
       ('model',LinearRegression())]

pipe=Pipeline(Input)
pipe

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

pipe.fit(Z,y)

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

ypipe=pipe.predict(Z)
ypipe[0:4]

array([ 13102.74784201, 13102.74784201, 18225.54572197, 10390.29636555])

In-Sample Evaluation

Two important measures that determine how well a model fits a specific dataset are

  • Mean Squared Error (MSE)
  • R-Squared (R^2)

Mean Squared Error

The MSE is the average of the squared differences between the predicted values and the actual values

To get the MSE in Python we can do the following

from sklearn.metrics import mean_squared_error

mean_squared_error(df['target'], Y_predicted_sample)

R-Squared

R^2 is the coefficient of determination

  • Measure of how close the data is to the fitted regression line
  • The percentage of variation in Y that is explained by the model
  • It compares how well the model approximates the data relative to simply using the mean of the data

R^2 is usually between 0 and 1

R^2 = 1 - \frac{\text{MSE of regression line}}{\text{MSE of } \bar{y}}

We can get the R^2 value with lm.score()

A negative R^2 can be a sign of overfitting

In-Sample Evaluation of Models

We can evaluate our models with the following

#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
lm.score(X, Y)

0.49659118843391747

Yhat = lm.predict(X)
Yhat[0:4]

array([ 16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ])

from sklearn.metrics import mean_squared_error

#mean_squared_error(Y_true, Y_predict)
mean_squared_error(df['price'], Yhat)

31635042.944639895

# fit the model
lm.fit(Z, df['price'])
# Find the R^2
lm.score(Z, df['price'])

0.80935628065774567

Y_predict_multifit = lm.predict(Z)

mean_squared_error(df['price'], Y_predict_multifit)

11980366.87072649

Now we can check what the R^2 value for the polynomial model is

from sklearn.metrics import r2_score

r_squared = r2_score(y, p(x))
r_squared

0.67419466639065173

mean_squared_error(df['price'], p(x))

20474146.426361222

Prediction and Decision Making

We should use visualization, numerical evaluation, and model comparison in order to see if the model values make sense

To compare our model to the data we can simply plot the output of our model over the range of our data

new_input=np.arange(1,100,1).reshape(-1,1)

lm.fit(X, Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

yhat=lm.predict(new_input)
yhat[0:5]

array([ 37601.57247984, 36779.83910151, 35958.10572319, 35136.37234487,
        34314.63896655])

plt.plot(new_input,yhat)
plt.show()

png

Conclusion

From the results above (yes, they’re a mess but it’s all pretty much just from the CC Lab file) we can note the R^2 and MSE values are as follows

Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.

  • R-squared: 0.49659118843391759
  • MSE: 3.16 x10^7

Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.

  • R-squared: 0.80896354913783497
  • MSE: 1.2 x10^7

Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.

  • R-squared: 0.6741946663906514
  • MSE: 2.05 x 10^7

SLR vs MLR

Usually having more variables helps the prediction; however, if you do not have enough data you can run into trouble, or many of the variables may just be noise and not very useful

The MSE for MLR is smaller than for SLR, and the R^2 for MLR is higher as well; MSE and R^2 both seem to indicate that MLR is a better fit than SLR

SLR vs Polynomial

The MSE for the Polynomial Fit is less than for the SLR and the R^2 is higher, meaning that the Polynomial Fit is a better predictor based on ‘highway-mpg’ than the SLR

MLR vs Polynomial

The MSE for the MLR is smaller than for the Polynomial Fit, and the MLR also has a higher R^2, therefore the MLR is a better fit than the Polynomial in this case

Overall

The MLR has the highest R^2 and the lowest MSE, meaning that it is the best fit of the three models that have been evaluated

Model Evaluation

Model evaluation tells us how our model works in the real world. In-sample evaluation does not give us an indication as to how our model performs under real-life circumstances

Training and Test Sets

We typically split our data into a training and a testing set, and use them to build and evaluate our model respectively

  • Split into
    • 70% Training
    • 30% Testing
  • Build and Train with Training Set
  • Use Testing Set to evaluate model performance

sklearn gives us a function to split our data into a train and test set

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(x_data,y_data,test_size=0.3,random_state=0)

Generalization Performance

Generalization Error is a measure of how well our data does at predicting previously unseen data

The error we obtain using our testing data is an approximation of this error

Cross Validation

Cross validation involves splitting the data into folds, using one fold for testing and the remainder for training, and repeating this for each combination of training and evaluation folds

We can use cross_val_score() to perform this out-of-sample evaluation

from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr,X,Y,cv=n)

Where X is our predictor matrix, Y is our target, and n is our number of folds
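
The result is an array with one score per fold (by default R^2 for a regressor); a minimal sketch of summarizing it, assuming lr, X, and Y from above and 4 folds:

scores = cross_val_score(lr, X, Y, cv=4)
print("mean R^2:", scores.mean(), "std:", scores.std())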

If we want to get the actual predicted values from our model we can do the following

from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lr,X,Y,cv=n)

Overfitting, Underfitting, and Model Selection

The goal of model selection is to select the best function to fit our model: if our model is too simple, it will not appropriately fit our data, whereas if our model is too complex, it will fit the training data almost perfectly but will not be good at approximating new data

We need to be sure to fit the data, not the noise

It is also possible that the data we are trying to approximate cannot be fitted by a polynomial at all, for example in the case of cyclic data

We can make use of the following code to look at the effect of polynomial order on the R^2 value for a model

rsq_test = []
order = [1, 2, 3, 4]

for n in order:
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['feature']])
    x_test_pr = pr.fit_transform(x_test[['feature']])
    lr.fit(x_train_pr, y_train)
    rsq_test.append(lr.score(x_test_pr, y_test))

Ridge Regression

Ridge Regression helps prevent overfitting

Ridge regression controls the magnitude of the higher-order coefficients in our model by using a factor α; increasing the α value helps us avoid overfitting, however increasing it too far can lead to underfitting the data

To use ridge regression we can do the following

from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X,Y)
yhat = ridge_model.predict(X)

We will usually start off with a small value of α such as 0.0001 and increase our value in orders of magnitude

Furthermore, we can use Cross Validation to identify an optimal α, as sketched below
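
A minimal sketch of that idea, assuming the polynomial features Z_pr and the price column from above, 4-fold cross-validation, and an arbitrary list of candidate α values:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

best_alpha, best_score = None, float('-inf')
for alpha in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]:
    # average cross-validated R^2 for this value of alpha
    score = cross_val_score(Ridge(alpha=alpha), Z_pr, df['price'], cv=4).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print("best alpha:", best_alpha, "cross-validated R^2:", best_score)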

Grid Search allows us to scan through multiple free parameters

Parameters such as α are not part of the training or fitting process. These parameters are called Hyperparameters

sklearn has a means of automatically iterating over these parameters called Grid Search

This allows us to use different hyperparameters to train our model, and select the model that provides the lowest MSE

The values for a grid search are simply given as a list which contains a dictionary of the parameters we want to vary

params = [{'param name':[value1, value2, value3, ...]}]

We can use Grid Search as follows

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params = [{'alpha':[0.0001, 0.01, 1, 10, 100]}, {'normalize':[True,False]}]

rr = Ridge()

grid = GridSearchCV(rr, params, cv=4)
grid.fit(X_train[['feature1','feature2',...]], Y_train)

grid.best_estimator_

scores = grid.cv_results_

The scores dictionary will store the results for each hyperparameter combination given our inputs

Conclusion

For more specific examples of the different sections, look at the appropriate Lab