Saturday, January 23, 2016

Polynomial Regression Modeling

Introduction

In this assignment we will use gapminder.com data to model relationship between per person electricity consumption and per person income. Additionally, we will add other exogenous variables that can influence this relationship. In particular, we will explore the impact of urbanization and internet usage on this relationship.

Summary

Model

relectricperperson=Intercept+incomeperperson_c+ incomeperperson_c**2 + urbanrate_c +                                                 internetuserate_c + error


                            OLS Regression Results                            
========================================================================
Dep. Variable:     relectricperperson                                                  R-squared:                       0.497
Model:                            OLS                                                         Adj. R-squared:                  0.481
Method:                 Least Squares                                                       F-statistic:                     30.65
Date:                Sat, 23 Jan 2016                                                         Prob (F-statistic):           9.82e-18
Time:                        10:42:00                                                  Log-Likelihood:                -1089.0
No. Observations:                 129                                                     AIC:                             2188.
Df Residuals:                     124                                                         BIC:                             2202.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
========================================================================
                                                      coef            std err          t             P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                                     1489.5849    152.536      9.765      0.000      1187.674  1791.495
incomeperperson_c                         0.1204      0.028         4.267      0.000         0.065     0.176
I(incomeperperson_c ** 2)         -2.559e-06   8.82e-07     -2.903    0.004      -4.3e-06 -8.14e-07
urbanrate_c                                  -2.6368        6.684          -0.395    0.694       -15.866    10.592
internetuserate_c                           9.6653        7.197           1.343    0.182        -4.580    23.910
========================================================================
Omnibus:                      150.608   Durbin-Watson:                   2.129
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3902.984
Skew:                           4.231   Prob(JB):                         0.00
Kurtosis:                      28.584   Cond. No.                     4.23e+08
========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.23e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

relectricperperson(est) = 1489.5849 + 0.1204*incomeperperson - 2.559e-06* incomeperperson_c ** 2 - 2.6368*urbanrate_c + 9.6653*internetuserate_c


Upon running the multiple regression model, this is what I found:
  • per capita income has a positive association(beta = 0.1204) with per person electricity consumption and is statistically significant(p < 0.005) after accounting for other exogenous factors.
  • second order transformation of per per capita income  has a negative association(beta = -2.559e-06)  with per person electricity consumption after accounting for other explanatory variables. 
  • A positive linear coefficient(beta = 0.1204) and negative(beta = -2.559e-06) quadratic coefficient indicates that the relationship between per per person electricity consumption is curvilinear and  is concave in nature.
  • Approximately, 50%  the variation in per person electricity consumption is explained by all the regressors - per person income, urbanization, and internet usage.

Hypothesis

The hypothesis that I have been exploiting over the course of this specialization is - there is association between global per person electricity consumption and per capita income.

From my analysis, it appears there is a positive association between global per person electricity consumption and per capita income. Thus, my results support hypothesis of interest.

Confounding

Model
relectricperperson = intercept + beta*incomeperperson_c + error
                                       coef            std err          t              P>|t|         [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------
Intercept                      1157.1252    106.646     10.850      0.000       946.092  1368.158
incomeperperson_c     0.0903          0.009           9.646      0.000         0.072     0.109
--------------------------------------------------------------------------------------------------------

Model
relectricperperson = intercept + beta*incomeperperson_c + beta2*incomeperperson**2 + error

                                              coef            std err          t              P>|t|         [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept                              1545.4660    138.810     11.134      0.000      1270.766  1820.166
incomeperperson_c                   0.1443      0.016         9.038      0.000         0.113     0.176
I(incomeperperson_c ** 2)  -2.99e-06       7.36e-07  -4.064      0.000     -4.45e-06 -1.53e-06
-----------------------------------------------------------------------------------------------------------------

                                             coef             std err          t               P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept                             1552.2880    145.680     10.655      0.000      1263.969  1840.607
incomeperperson_c                  0.1464        0.021       7.103      0.000         0.106     0.187
I(incomeperperson_c ** 2)     -3.042e-06  8.08e-07 -3.766      0.000     -4.64e-06 -1.44e-06
urbanrate_c                              -1.0601      6.601       -0.161      0.873       -14.124    12.004

                                            coef             std err                t           P>|t|           [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------------------------
Intercept                        1489.5849       152.536          9.765      0.000      1187.674  1791.495
incomeperperson_c             0.1204           0.028          4.267      0.000            0.065     0.176
I(incomeperperson_c ** 2) -2.559e-06    8.82e-07     -2.903     0.004      -4.3e-06 -8.14e-07
urbanrate_c                          -2.6368         6.684          -0.395     0.694       -15.866    10.592
internetuserate_c                   9.6653         7.197           1.343     0.182        -4.580    23.910

Based on the summary result above, association between real per person electricity consumption and per capita income is still statistically significant after controlling for urbanization and internet use rate. Therefore, no confounders were found by my analysis, 

Regression Diagnostic Plots

Scatter Plot

The below figure confirms that the model is curvilinear and is concave.

Per the figure below, it appears that the model may be misspecified since we observer a departure of the plotted residual values from the norm(straight line)

i. the diagram below shows a positive linear relationship between per person electricity consumption and per capita income. 



ii. there is no association between per person electricity consumption and internet use rate conditional on other explanatory variables.


ii. there is no association between per person electricity consumption and urban rate conditional on other explanatory variables.

The 95% of the observations seems to fall within  +/- 2 standard deviations of the mean. However, we have about three data points that qualify as outliers. 


From the below figure, we can observe the following:
  • point 111 has large leverage and low residuals
  • points  201, 13 and 144 have high residuals and low leverage
  • points 194 and 94 both have low leverage and low residuals
In conclusion, we observe few outliers but not have outsize influence on our model.





Code

# -*- coding: utf-8 -*-
"""
Created on Saturday January 23 10:08:53 2016

@author: Ernest.Tanson
"""

import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# define method to load data of interest


def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data

# loading data
data = load_data('data', 'gapminder.csv')
print(data)

##############################################################################
# DATA MANAGEMENT
##############################################################################


# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']]

print(data)


# setting variables of interest to numeric format
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
pandas.to_numeric(data['internetuserate'], errors='coerce')

# listwise deletion of missing values
my_data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']].dropna()

###############################################################################
# POLYNOMIAL REGRESSION
###############################################################################

# first order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')

# second order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, order=2, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')

# Center explanatory variables for regression analysis
my_data['incomeperperson_c'] =\
(my_data['incomeperperson'] - numpy.mean(my_data['incomeperperson']))

my_data['urbanrate_c'] =\
(my_data['urbanrate'] - numpy.mean(my_data['urbanrate']))

my_data['internetuserate_c'] =\
(my_data['internetuserate'] - numpy.mean(my_data['internetuserate']))

my_data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()

# Calculating mean of centered explanatory variables
print('Centered Mean incomeperperson: {0:5.2f}'.format(numpy.mean(my_data['incomeperperson_c'])))
print('Centered Mean urbanrate: {0:5.2f}'.format(numpy.mean(my_data['urbanrate_c'])))
print('Centered Mean internetuserate: {0:5.2f}'.format(numpy.mean(my_data['internetuserate_c'])))

# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c', data=my_data).fit()
print(ols_model.summary())

# quadratic (polynomial) regression analysis
print("Quadratic regression analysis for the association between income per \
person and real electric consumption per person")
del I
quad_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2)',
data=my_data).fit()
print(quad_model.summary())

###############################################################################
# EVALUATING MODEL FIT
###############################################################################

X_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2) \
+ urbanrate_c + internetuserate_c', data=my_data).fit()
print(X_model.summary())

# qq-plot - to evaluate the assumption that the residuals from my regression
# model are normally distributed
res = X_model.resid
fig = sm.qqplot(res, line='r')
plt.show()


# simple plot of residuals
stdres = pandas.DataFrame(X_model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()

###############################################################################
# This plots four graphs in a 2 by 2 figure: ‘endog versus exog’,
# ‘residuals versus exog’, ‘fitted versus exog’
# and ‘fitted plus residual versus exog’
# additional regression diagnostic plots
###############################################################################
# relectricperperson vs. incomeperperson_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'incomeperperson_c',
fig=fig_exo)

# relectricperperson vs. urbanrate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'urbanrate_c',
fig=fig_exo)

# relectricperperson vs. internetuserate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'internetuserate_c',
fig=fig_exo)

# leverage plot
lev_fig = sm.graphics.influence_plot(X_model, size=8)
plt.show()



No comments:

Post a Comment