Introduction
This is a continuation of previous assignments to explain the association between per person electricity consumption and income per person. I will additional variables - "urbanrate" and "internetuserate". The data set is culled from gapminder.com.
The research question of interest is:
There are not association between per person electric consumption and income per person after controlling for urbanization and and internet use.
Code
import pandas
import numpy
import os
import statsmodels.formula.api as smf
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
##############################################################################
# DATA MANAGEMENT
##############################################################################
# define method to load data of interest
def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data
# loading data
data = load_data('data', 'gapminder.csv')
print(data.head(10))
# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']]
# setting variables of interest to numeric format
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
pandas.to_numeric(data['internetuserate'], errors='coerce')
# listwise deletion of missing values
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']].dropna()
'''
relectricperperson
count 22.000000
mean 149.889484
std 222.403251
min 0.000000
25% 38.325833
50% 57.961848
75% 150.778896
max 920.137600
1 - No increase in per person electricity consumption
0 - Yes increase in per person electricity consumption
'''
def percapitakWh(row):
if row['relectricperperson'] <= 149.889484:
return 1
if row['relectricperperson'] > 149.889484:
return 0
data['HighLowKWh'] = data.apply(lambda row: percapitakWh(row), axis=1)
print(data.head(10))
kWh = data['HighLowKWh'].value_counts(sort=False, dropna=False)
print(kWh)
##############################################################################
# END DATA MANAGEMENT
##############################################################################
###############################################################################
# Center explanatory variables for regression analysis
data['incomeperperson_c'] =\
(data['incomeperperson'] - numpy.mean(data['incomeperperson']))
data['urbanrate_c'] =\
(data['urbanrate'] - numpy.mean(data['urbanrate']))
data['internetuserate_c'] =\
(data['internetuserate'] - numpy.mean(data['internetuserate']))
data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()
# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c + urbanrate_c + \
internetuserate_c', data=data).fit()
print(ols_model.summary())
# POLYNOMIAL LINEAR REGRESSION
################################################################################ Center explanatory variables for regression analysis
data['incomeperperson_c'] =\
(data['incomeperperson'] - numpy.mean(data['incomeperperson']))
data['urbanrate_c'] =\
(data['urbanrate'] - numpy.mean(data['urbanrate']))
data['internetuserate_c'] =\
(data['internetuserate'] - numpy.mean(data['internetuserate']))
data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()
# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c + urbanrate_c + \
internetuserate_c', data=data).fit()
print(ols_model.summary())
OLS Regression Results
========================================================================
Dep. Variable: relectricperperson R-squared: 0.463
Model: OLS Adj. R-squared: 0.450
Method: Least Squares F-statistic: 35.92
Date: Sat, 30 Jan 2016 Prob (F-statistic): 8.21e-17
Time: 08:44:14 Log-Likelihood: -1093.2
No. Observations: 129 AIC: 2194.
Df Residuals: 125 BIC: 2206.
Df Model: 3
Covariance Type: nonrobust
========================================================================
coef std err t P>|t| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept 1157.1252 103.692 11.159 0.000 951.906 1362.344
incomeperperson_c 0.0514 0.016 3.285 0.001 0.020 0.082
urbanrate_c 3.0352 6.579 0.461 0.645 -9.986 16.056
internetuserate_c 18.1816 6.764 2.688 0.008 4.794 31.569
========================================================================
Omnibus: 150.498 Durbin-Watson: 2.128
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4118.791
Skew: 4.193 Prob(JB): 0.00
Kurtosis: 29.381 Cond. No. 1.14e+04
========================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.14e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Summary
OLS analysis shows that both "incomeperperson"(beta = 0.0514, p-value = 0.001) and "internetuserate"(beta = 18.1816, p-value = 0.008) are statistically significant and therefore, have associations with per person electricity consumption.
##############################################################################
# LOGISTIC REGRESSION
##############################################################################
# logistic regression with per capita income
econlreg = smf.logit(formula='HighLowKWh ~ incomeperperson_c ',
data=data).fit()
print(econlreg.summary())
# odds ratios
print("Odds Ratios")
print(numpy.exp(econlreg.params))
# odd ratios with 95% confidence intervals
params = econlreg.params
conf = econlreg.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with per capita income and internetuserate
econlreg2 = smf.logit(formula='HighLowKWh ~ incomeperperson_c + \
internetuserate_c', data=data).fit()
print(econlreg2.summary())
# odd ratios with 95% confidence intervals
params = econlreg2.params
conf = econlreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with incomeperperson, internetuserate and urbanrate
econlreg3 = smf.logit(formula='HighLowKWH ~ incomeperperson_c + urbanrate_c',
data=data).fit()
print(econlreg3.summary())
# odd ratios with 95% confidence intervals
print("Odds Ratios")
params = econlreg3.params
conf = econlreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# LOGISTIC REGRESSION
##############################################################################
# logistic regression with per capita income
econlreg = smf.logit(formula='HighLowKWh ~ incomeperperson_c ',
data=data).fit()
print(econlreg.summary())
# odds ratios
print("Odds Ratios")
print(numpy.exp(econlreg.params))
# odd ratios with 95% confidence intervals
params = econlreg.params
conf = econlreg.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with per capita income and internetuserate
econlreg2 = smf.logit(formula='HighLowKWh ~ incomeperperson_c + \
internetuserate_c', data=data).fit()
print(econlreg2.summary())
# odd ratios with 95% confidence intervals
params = econlreg2.params
conf = econlreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
# logistic regression with incomeperperson, internetuserate and urbanrate
econlreg3 = smf.logit(formula='HighLowKWH ~ incomeperperson_c + urbanrate_c',
data=data).fit()
print(econlreg3.summary())
# odd ratios with 95% confidence intervals
print("Odds Ratios")
params = econlreg3.params
conf = econlreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
# logistic regression with incomeperperson, internetuserate and urbanization
econlreg4 = smf.logit(formula='HighLowKWh ~ incomeperperson_c + \
+ internetuserate_c + urbanrate_c', data=data).fit()
print(econlreg4.summary())
# odd ratios with 95% confidence intervals
print("Odds Ratios")
params = econlreg4.params
conf = econlreg4.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))
Logistic Regression Models(I - IV)
Model I: Logit Regression Results (income per person)
========================================================================
Dep. Variable: HighLowKWh No. Observations: 129
Model: Logit Df Residuals: 127
Method: MLE Df Model: 1
Date: Sat, 30 Jan 2016 Pseudo R-squ.: 0.5427
Time: 10:10:37 Log-Likelihood: -27.655
converged: True LL-Null: -60.475
LLR p-value: 5.411e-16
========================================================================
coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept -19.9229 5.312 -3.751 0.000 -30.334 -9.512
incomeperperson_c -0.0025 0.001 -3.742 0.000 -0.004 -0.001
========================================================================
Odds Ratios
Lower CI Upper CI OR
========================================================================
Intercept 0.000000 0.000074 0.000000
incomeperperson_c 0.996246 0.998825 0.997535
========================================================================
Model II: Logit Regression Results ( income per person and internet use rate) =============================================================
Dep. Variable: HighLowKWh No. Observations: 129
Model: Logit Df Residuals: 126
Method: MLE Df Model: 2
Date: Sat, 30 Jan 2016 Pseudo R-squ.: 0.6606
Time: 10:10:37 Log-Likelihood: -20.525
converged: True LL-Null: -60.475
LLR p-value: 4.468e-18
========================================================================
coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept -19.3096 5.688 -3.395 0.001 -30.458 -8.161
incomeperperson_c -0.0019 0.001 -2.758 0.006 -0.003 -0.001
internetuserate_c -0.1593 0.057 -2.804 0.005 -0.271 -0.048
========================================================================
Odds Ratios
Lower CI Upper CI OR
========================================================================
Intercept 0.000000 0.000286 0.000000
incomeperperson_c 0.996807 0.999459 0.998132
internetuserate_c 0.762852 0.953159 0.852713
========================================================================
Model III: Logit Regression Results(income per person and urban rate) ============================================================
Dep. Variable: HighLowKWh No. Observations: 129
Model: Logit Df Residuals: 126
Method: MLE Df Model: 2
Date: Sat, 30 Jan 2016 Pseudo R-squ.: 0.5446
Time: 10:16:35 Log-Likelihood: -27.542
converged: True LL-Null: -60.475
LLR p-value: 4.984e-15
========================================================================
coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept -18.6334 5.771 -3.229 0.001 -29.944 -7.323
incomeperperson_c -0.0023 0.001 -3.056 0.002 -0.004 -0.001
urbanrate_c -0.0124 0.026 -0.475 0.635 -0.064 0.039
========================================================================
Odds Ratios
========================================================================
Lower CI Upper CI OR
Intercept 0.000000 0.000660 0.000000
incomeperperson_c 0.996266 0.999183 0.997723
urbanrate_c 0.938213 1.039656 0.987633
Model IV: Logit Regression Results(income per person, internet use rate, urban rate)
========================================================================
Dep. Variable: HighLowKWh No. Observations: 129
Model: Logit Df Residuals: 125
Method: MLE Df Model: 3
Date: Sat, 30 Jan 2016 Pseudo R-squ.: 0.6620
Time: 09:20:07 Log-Likelihood: -20.439
converged: True LL-Null: -60.475
LLR p-value: 2.963e-17
========================================================================
coef std err z P>|z| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept -18.5345 5.894 -3.145 0.002 -30.087 -6.982
incomeperperson_c -0.0017 0.001 -2.369 0.018 -0.003 -0.000
internetuserate_c -0.1610 0.058 -2.799 0.005 -0.274 -0.048
urbanrate_c -0.0129 0.031 -0.417 0.677 -0.073 0.048
========================================================================
Odds Ratios
Lower CI Upper CI OR
=======================================================================
Intercept 0.000000 0.000928 0.000000
incomeperperson_c 0.996832 0.999700 0.998265
internetuserate_c 0.760588 0.952895 0.851329
urbanrate_c 0.929178 1.048858 0.987206
Logistic Regression Summary
i. Globally, after controlling for rate of internet use, income per person is 0.99 less likely to lead to higher per person electricity consumption(OR = 0.998265, 95%CI = 0.996832 - 0.999700, p = 0.018). Also globally, income per person is anywhere between 0.997 to 0.999 less likely to result in higher per person electricity consumption.
ii. Globally, after controlling for per capita income, internet use rate is 0.85 less likely to lead to higher per person electricity consumption (OR = 0.851329, 95%CI = 0.760588 - 0.952895, p = 0.005). Also globally, internet use rate is anywhere between 0.76 to 0.95 less likely to result in higher per person electricity consumption.
iii. Because confidence interval of our odd ratios overlap, we cannot say per person electricity consumption is more strongly associated with income per person than internet use rate.
Conclusion
Logistic regression analysis revealed that there is an association between per person electricity consumption and income per person after controlling for internet use rate. Thus, confirming my hypothesis. None of the explanatory variables confound the results.







