Introduction
In this assignment we will use gapminder.com data to model relationship between per person electricity consumption and per person income. Additionally, we will add other exogenous variables that can influence this relationship. In particular, we will explore the impact of urbanization and internet usage on this relationship.
Summary
Model
relectricperperson=Intercept+incomeperperson_c+ incomeperperson_c**2 + urbanrate_c + internetuserate_c + error
OLS Regression Results
========================================================================
Dep. Variable: relectricperperson R-squared: 0.497
Model: OLS Adj. R-squared: 0.481
Method: Least Squares F-statistic: 30.65
Date: Sat, 23 Jan 2016 Prob (F-statistic): 9.82e-18
Time: 10:42:00 Log-Likelihood: -1089.0
No. Observations: 129 AIC: 2188.
Df Residuals: 124 BIC: 2202.
Df Model: 4
Covariance Type: nonrobust
========================================================================
coef std err t P>|t| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept 1489.5849 152.536 9.765 0.000 1187.674 1791.495
incomeperperson_c 0.1204 0.028 4.267 0.000 0.065 0.176
I(incomeperperson_c ** 2) -2.559e-06 8.82e-07 -2.903 0.004 -4.3e-06 -8.14e-07
urbanrate_c -2.6368 6.684 -0.395 0.694 -15.866 10.592
internetuserate_c 9.6653 7.197 1.343 0.182 -4.580 23.910
========================================================================
Omnibus: 150.608 Durbin-Watson: 2.129
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3902.984
Skew: 4.231 Prob(JB): 0.00
Kurtosis: 28.584 Cond. No. 4.23e+08
========================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.23e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
relectricperperson(est) = 1489.5849 + 0.1204*incomeperperson - 2.559e-06* incomeperperson_c ** 2 - 2.6368*urbanrate_c + 9.6653*internetuserate_c
Upon running the multiple regression model, this is what I found:
- per capita income has a positive association(beta = 0.1204) with per person electricity consumption and is statistically significant(p < 0.005) after accounting for other exogenous factors.
- second order transformation of per per capita income has a negative association(beta = -2.559e-06) with per person electricity consumption after accounting for other explanatory variables.
- A positive linear coefficient(beta = 0.1204) and negative(beta = -2.559e-06) quadratic coefficient indicates that the relationship between per per person electricity consumption is curvilinear and is concave in nature.
- Approximately, 50% the variation in per person electricity consumption is explained by all the regressors - per person income, urbanization, and internet usage.
Hypothesis
The hypothesis that I have been exploiting over the course of this specialization is - there is association between global per person electricity consumption and per capita income.
From my analysis, it appears there is a positive association between global per person electricity consumption and per capita income. Thus, my results support hypothesis of interest.
Confounding
Model
relectricperperson = intercept + beta*incomeperperson_c + error
coef std err t P>|t| [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------
Intercept 1157.1252 106.646 10.850 0.000 946.092 1368.158
incomeperperson_c 0.0903 0.009 9.646 0.000 0.072 0.109
--------------------------------------------------------------------------------------------------------
Model
relectricperperson = intercept + beta*incomeperperson_c + beta2*incomeperperson**2 + error
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept 1545.4660 138.810 11.134 0.000 1270.766 1820.166
incomeperperson_c 0.1443 0.016 9.038 0.000 0.113 0.176
I(incomeperperson_c ** 2) -2.99e-06 7.36e-07 -4.064 0.000 -4.45e-06 -1.53e-06
-----------------------------------------------------------------------------------------------------------------
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept 1552.2880 145.680 10.655 0.000 1263.969 1840.607
incomeperperson_c 0.1464 0.021 7.103 0.000 0.106 0.187
I(incomeperperson_c ** 2) -3.042e-06 8.08e-07 -3.766 0.000 -4.64e-06 -1.44e-06
urbanrate_c -1.0601 6.601 -0.161 0.873 -14.124 12.004
coef std err t P>|t| [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------------------------
Intercept 1489.5849 152.536 9.765 0.000 1187.674 1791.495
incomeperperson_c 0.1204 0.028 4.267 0.000 0.065 0.176
I(incomeperperson_c ** 2) -2.559e-06 8.82e-07 -2.903 0.004 -4.3e-06 -8.14e-07
urbanrate_c -2.6368 6.684 -0.395 0.694 -15.866 10.592
internetuserate_c 9.6653 7.197 1.343 0.182 -4.580 23.910
Based on the summary result above, association between real per person electricity consumption and per capita income is still statistically significant after controlling for urbanization and internet use rate. Therefore, no confounders were found by my analysis,
Regression Diagnostic Plots
Scatter Plot
The below figure confirms that the model is curvilinear and is concave.
Per the figure below, it appears that the model may be misspecified since we observer a departure of the plotted residual values from the norm(straight line)
i. the diagram below shows a positive linear relationship between per person electricity consumption and per capita income.
ii. there is no association between per person electricity consumption and internet use rate conditional on other explanatory variables.
ii. there is no association between per person electricity consumption and urban rate conditional on other explanatory variables.
The 95% of the observations seems to fall within +/- 2 standard deviations of the mean. However, we have about three data points that qualify as outliers.
From the below figure, we can observe the following:
- point 111 has large leverage and low residuals
- points 201, 13 and 144 have high residuals and low leverage
- points 194 and 94 both have low leverage and low residuals
In conclusion, we observe few outliers but not have outsize influence on our model.
Code
# -*- coding: utf-8 -*-
"""
Created on Saturday January 23 10:08:53 2016
@author: Ernest.Tanson
"""
import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# define method to load data of interest
def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data
# loading data
data = load_data('data', 'gapminder.csv')
print(data)
##############################################################################
# DATA MANAGEMENT
##############################################################################
# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']]
print(data)
# setting variables of interest to numeric format
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
pandas.to_numeric(data['internetuserate'], errors='coerce')
# listwise deletion of missing values
my_data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']].dropna()
###############################################################################
# POLYNOMIAL REGRESSION
###############################################################################
# first order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson
scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')
# second order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson
scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, order=2, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')
# Center explanatory variables for regression analysis
my_data['incomeperperson_c'] =\
(my_data['incomeperperson'] - numpy.mean(my_data['incomeperperson']))
my_data['urbanrate_c'] =\
(my_data['urbanrate'] - numpy.mean(my_data['urbanrate']))
my_data['internetuserate_c'] =\
(my_data['internetuserate'] - numpy.mean(my_data['internetuserate']))
my_data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()
# Calculating mean of centered explanatory variables
print('Centered Mean incomeperperson: {0:5.2f}'.format(numpy.mean(my_data['incomeperperson_c'])))
print('Centered Mean urbanrate: {0:5.2f}'.format(numpy.mean(my_data['urbanrate_c'])))
print('Centered Mean internetuserate: {0:5.2f}'.format(numpy.mean(my_data['internetuserate_c'])))
# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c', data=my_data).fit()
print(ols_model.summary())
# quadratic (polynomial) regression analysis
print("Quadratic regression analysis for the association between income per \
person and real electric consumption per person")
del I
quad_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2)',
data=my_data).fit()
print(quad_model.summary())
###############################################################################
# EVALUATING MODEL FIT
###############################################################################
X_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2) \
+ urbanrate_c + internetuserate_c', data=my_data).fit()
print(X_model.summary())
# qq-plot - to evaluate the assumption that the residuals from my regression
# model are normally distributed
res = X_model.resid
fig = sm.qqplot(res, line='r')
plt.show()
# simple plot of residuals
stdres = pandas.DataFrame(X_model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()
###############################################################################
# This plots four graphs in a 2 by 2 figure: ‘endog versus exog’,
# ‘residuals versus exog’, ‘fitted versus exog’
# and ‘fitted plus residual versus exog’
# additional regression diagnostic plots
###############################################################################
# relectricperperson vs. incomeperperson_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'incomeperperson_c',
fig=fig_exo)
# relectricperperson vs. urbanrate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'urbanrate_c',
fig=fig_exo)
# relectricperperson vs. internetuserate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'internetuserate_c',
fig=fig_exo)
# leverage plot
lev_fig = sm.graphics.influence_plot(X_model, size=8)
plt.show()
"""
Created on Saturday January 23 10:08:53 2016
@author: Ernest.Tanson
"""
import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# define method to load data of interest
def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data
# loading data
data = load_data('data', 'gapminder.csv')
print(data)
##############################################################################
# DATA MANAGEMENT
##############################################################################
# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']]
print(data)
# setting variables of interest to numeric format
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
pandas.to_numeric(data['internetuserate'], errors='coerce')
# listwise deletion of missing values
my_data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']].dropna()
###############################################################################
# POLYNOMIAL REGRESSION
###############################################################################
# first order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson
scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')
# second order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson
scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, order=2, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')
# Center explanatory variables for regression analysis
my_data['incomeperperson_c'] =\
(my_data['incomeperperson'] - numpy.mean(my_data['incomeperperson']))
my_data['urbanrate_c'] =\
(my_data['urbanrate'] - numpy.mean(my_data['urbanrate']))
my_data['internetuserate_c'] =\
(my_data['internetuserate'] - numpy.mean(my_data['internetuserate']))
my_data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()
# Calculating mean of centered explanatory variables
print('Centered Mean incomeperperson: {0:5.2f}'.format(numpy.mean(my_data['incomeperperson_c'])))
print('Centered Mean urbanrate: {0:5.2f}'.format(numpy.mean(my_data['urbanrate_c'])))
print('Centered Mean internetuserate: {0:5.2f}'.format(numpy.mean(my_data['internetuserate_c'])))
# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c', data=my_data).fit()
print(ols_model.summary())
# quadratic (polynomial) regression analysis
print("Quadratic regression analysis for the association between income per \
person and real electric consumption per person")
del I
quad_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2)',
data=my_data).fit()
print(quad_model.summary())
###############################################################################
# EVALUATING MODEL FIT
###############################################################################
X_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2) \
+ urbanrate_c + internetuserate_c', data=my_data).fit()
print(X_model.summary())
# qq-plot - to evaluate the assumption that the residuals from my regression
# model are normally distributed
res = X_model.resid
fig = sm.qqplot(res, line='r')
plt.show()
# simple plot of residuals
stdres = pandas.DataFrame(X_model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()
###############################################################################
# This plots four graphs in a 2 by 2 figure: ‘endog versus exog’,
# ‘residuals versus exog’, ‘fitted versus exog’
# and ‘fitted plus residual versus exog’
# additional regression diagnostic plots
###############################################################################
# relectricperperson vs. incomeperperson_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'incomeperperson_c',
fig=fig_exo)
# relectricperperson vs. urbanrate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'urbanrate_c',
fig=fig_exo)
# relectricperperson vs. internetuserate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'internetuserate_c',
fig=fig_exo)
# leverage plot
lev_fig = sm.graphics.influence_plot(X_model, size=8)
plt.show()







No comments:
Post a Comment