Introduction
This blog post applies a LASSO regression model to the gapminder dataset. My interest is in determining not only the association between per-person electricity consumption and a range of quantitative and categorical predictors, but also the strength of that association.
Code
import pandas as pd
import os
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
# bug fix for display format to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
##############################################################################
# DATA MANAGEMENT
##############################################################################
# define a helper to load the data of interest
def load_data(data_dir, csv_file):
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    return pd.read_csv(data_file, low_memory=False)
# loading data
gapminder_df = load_data('data', 'gapminder.csv')
#upper-case all DataFrame column names
gapminder_df.columns = map(str.upper, gapminder_df.columns)
gapminder_df = gapminder_df.dropna()
features_df = gapminder_df[['INCOMEPERPERSON', 'ALCCONSUMPTION',
'ARMEDFORCESRATE', 'BREASTCANCERPER100TH',
'CO2EMISSIONS', 'FEMALEEMPLOYRATE', 'HIVRATE',
'INTERNETUSERATE', 'LIFEEXPECTANCY', 'OILPERPERSON',
'POLITYSCORE', 'RELECTRICPERPERSON',
'SUICIDEPER100TH', 'EMPLOYRATE', 'URBANRATE']]
predictors = features_df.copy().dropna()
# coerce every selected column to numeric; unparseable values become NaN
for col in predictors.columns:
    predictors[col] = pd.to_numeric(predictors[col], errors='coerce')
def map_target(row):
    if row['POLITYSCORE'] > 6:
        return 0  # Democracy
    return 1      # Dictatorship
predictors['DEMO'] = predictors.apply(map_target, axis=1)
predictors.drop('POLITYSCORE', axis=1, inplace=True)
predictors = predictors.fillna(0)
# standardize all columns (including the binary DEMO flag) to mean=0, sd=1
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))
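# note: the response variable is standardized along with the predictors
# above, so the MSE values reported later are on the standardized scale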
target = predictors.RELECTRICPERPERSON
predictors.drop('RELECTRICPERPERSON', axis=1, inplace=True)
##############################################################################
# END DATA MANAGEMENT
##############################################################################
###############################################################################
# LASSO REGRESSION ANALYSIS
###############################################################################
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test =\
train_test_split(predictors, target, test_size=.3, random_state=123)
# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
# print variable names and regression coefficients
print(dict(zip(predictors.columns.values, model.coef_)))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean squared error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
plt.show()
# MSE from training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Output
{'ALCCONSUMPTION': 0.0,
'ARMEDFORCESRATE': 0.053193977848769676,
'BREASTCANCERPER100TH': 0.16975696253753386,
'CO2EMISSIONS': 0.0,
'DEMO': 0.0,
'EMPLOYRATE': 0.0,
'FEMALEEMPLOYRATE': 0.0,
'HIVRATE': 0.0,
'INCOMEPERPERSON': 0.060397820371933759,
'INTERNETUSERATE': 0.10625445353846048,
'LIFEEXPECTANCY': 0.0,
'OILPERPERSON': 0.67246298567315888,
'SUICIDEPER100TH': 0.0,
'URBANRATE': 0.0}
MSE for Training Data Set
0.326120543847
MSE for Testing Data Set
1.17094785325
R-square Training Data Set
0.64113155605
R-square Test Data Set
0.0341466029722
Summary
A LASSO regression analysis was conducted to identify the subset of a pool of 14 quantitative and categorical predictor variables that best predicted a quantitative response variable. The hypothesis of interest was to determine the extent of the association between per-person electricity consumption and these predictors.
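For reference, LASSO augments the ordinary least-squares objective with an L1 penalty that shrinks some coefficients exactly to zero. In scikit-learn's convention (alpha is the regularization strength, chosen here by cross-validation):

$$\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$$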
The categorical predictor was a country's political regime, Democracy or Dictatorship, derived from the polity score. Quantitative predictor variables included per capita income, alcohol consumption, armed forces rate, breast cancer cases per 100 people, carbon dioxide emissions, employment rate, female employment rate, HIV rate, internet use rate, life expectancy, oil use per person, suicide per 100 people, and urbanization rate.
All predictor variables were standardized to have a mean of zero and a standard deviation of one.
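One caveat worth noting: in the code above, scaling is applied to the full dataset before the train/test split. A minimal sketch of the more conservative alternative, fitting the scaler on the training split only so that no information from the test set leaks into the transform (it reuses the predictors and target frames defined above), would be:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split first, then fit the scaler on the training data only
pred_train, pred_test, tar_train, tar_test = \
    train_test_split(predictors, target, test_size=.3, random_state=123)
scaler = StandardScaler().fit(pred_train)
pred_train_std = scaler.transform(pred_train)
pred_test_std = scaler.transform(pred_test)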
Data were randomly split into a training set comprising 70% of the observations (N=149) and a test set comprising the remaining 30% (N=64). The least angle regression algorithm with k=10-fold cross-validation was used to estimate the LASSO regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
Per Figure 1, the MSE decreases initially and then levels off, suggesting that beyond that point the MSE will not decrease further by adding more predictors to the model.
Of the 14 predictor variables, 5 were retained in the selected model.
During the estimation process, oil use per person, breast cancer cases per 100 people, internet use rate, per capita income, and armed forces rate were most strongly and positively associated with per-person electricity consumption.
The MSE of the selected model was 0.326 for the training set and 1.17 for the test set, roughly 3.6 times higher on the test set. So although the model performed relatively well on the training data, it generalized poorly, and prediction accuracy was uneven across the two data sets. The goal is consistent prediction across both.
The R-squared value was 0.641 for the training set and 0.034 for the test set. This means the selected model explains approximately 64% of the variance in per-person electricity consumption on the training data but only about 3% on the test data, a further sign of overfitting.
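For context, the R-squared reported by scikit-learn's score method is the coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$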
These 5 variables accounted for 64% of the variance in the per-person electricity consumption response variable.
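Given the gap between training and test performance, a more stable estimate of generalization could be obtained by scoring the same model specification across repeated cross-validation folds of the full data. This is a sketch, not part of the original analysis; it reuses the predictors and target frames from the code above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LassoLarsCV

# nested cross-validation: LassoLarsCV tunes alpha inside each of the
# 10 outer folds, and we report the spread of R-square across folds
scores = cross_val_score(LassoLarsCV(cv=10, precompute=False),
                         predictors, target, cv=10, scoring='r2')
print('mean R-square across folds:', scores.mean())
print('R-square standard deviation:', scores.std())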

