Sunday, January 31, 2016

Logistic Regression Model

Introduction

This is a continuation of previous assignments to explain the association between per person electricity consumption and income per person. I will additional variables - "urbanrate" and "internetuserate". The data set is culled from gapminder.com.

The research question of interest is: 
There are not association between per person electric consumption and income per person after controlling for urbanization and and internet use.

Code

import pandas
import numpy
import os
import statsmodels.formula.api as smf


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

##############################################################################

# DATA MANAGEMENT

##############################################################################

# define method to load data of interest


def load_data(data_dir, csv_file):
    if __name__ == "__main__":
        DATA_PATH = os.path.join(os.getcwd(), data_dir)
        DATA_FILE = os.path.join(DATA_PATH, csv_file)
        data = pandas.read_csv(DATA_FILE, low_memory=False)
        return data

# loading data
data = load_data('data', 'gapminder.csv')
print(data.head(10))


# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
             'internetuserate']]

# setting variables of interest to numeric format
data['incomeperperson'] = \
    pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
    pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
    pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
    pandas.to_numeric(data['internetuserate'], errors='coerce')

# listwise deletion of missing values
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
             'internetuserate']].dropna()

'''
       relectricperperson
count      22.000000
mean       149.889484
std        222.403251
min        0.000000
25%        38.325833
50%        57.961848
75%        150.778896
max        920.137600

1 - No  increase in per person electricity consumption
0 - Yes increase in per person electricity consumption

'''


def percapitakWh(row):
    if row['relectricperperson'] <= 149.889484:
        return 1
    if row['relectricperperson'] > 149.889484:
        return 0

data['HighLowKWh'] = data.apply(lambda row: percapitakWh(row), axis=1)
print(data.head(10))

kWh = data['HighLowKWh'].value_counts(sort=False, dropna=False)
print(kWh)

##############################################################################

# END DATA MANAGEMENT

##############################################################################


###############################################################################

# POLYNOMIAL LINEAR REGRESSION

###############################################################################

# Center explanatory variables for regression analysis
data['incomeperperson_c'] =\
(data['incomeperperson'] - numpy.mean(data['incomeperperson']))

data['urbanrate_c'] =\
(data['urbanrate'] - numpy.mean(data['urbanrate']))

data['internetuserate_c'] =\
(data['internetuserate'] - numpy.mean(data['internetuserate']))

data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()


# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c + urbanrate_c + \
internetuserate_c', data=data).fit()
print(ols_model.summary())

                            OLS Regression Results                            

========================================================================
Dep. Variable:     relectricperperson                                    R-squared:                       0.463
Model:                            OLS                                           Adj. R-squared:                  0.450
Method:                 Least Squares                                         F-statistic:                     35.92
Date:                Sat, 30 Jan 2016                                            Prob (F-statistic):           8.21e-17
Time:                        08:44:14                                     Log-Likelihood:                -1093.2
No. Observations:                 129                                        AIC:                             2194.
Df Residuals:                     125                                           BIC:                             2206.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
========================================================================
                                              coef                std err          t              P>|t|        [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                             1157.1252         103.692     11.159      0.000       951.906   1362.344
incomeperperson_c                  0.0514             0.016      3.285       0.001           0.020         0.082
urbanrate_c                              3.0352             6.579      0.461        0.645         -9.986       16.056
internetuserate_c                    18.1816             6.764      2.688        0.008          4.794       31.569
========================================================================
Omnibus:                      150.498   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4118.791
Skew:                           4.193   Prob(JB):                         0.00
Kurtosis:                      29.381   Cond. No.                     1.14e+04
========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.14e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

OLS Summary

OLS analysis shows that both "incomeperperson"(beta = 0.0514, p-value = 0.001) and "internetuserate"(beta = 18.1816, p-value = 0.008) are statistically significant and therefore,  have associations with per person electricity consumption.

##############################################################################
# LOGISTIC REGRESSION
##############################################################################
# logistic regression with per capita income
econlreg = smf.logit(formula='HighLowKWh ~ incomeperperson_c ',
data=data).fit()

print(econlreg.summary())

# odds ratios
print("Odds Ratios")
print(numpy.exp(econlreg.params))

# odd ratios with 95% confidence intervals
params = econlreg.params
conf = econlreg.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))


# logistic regression with per capita income and internetuserate
econlreg2 = smf.logit(formula='HighLowKWh ~ incomeperperson_c + \
internetuserate_c', data=data).fit()

print(econlreg2.summary())

# odd ratios with 95% confidence intervals
params = econlreg2.params
conf = econlreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with incomeperperson, internetuserate and urbanrate
econlreg3 = smf.logit(formula='HighLowKWH ~ incomeperperson_c + urbanrate_c',
data=data).fit()
print(econlreg3.summary())

# odd ratios with 95% confidence intervals
print("Odds Ratios")
params = econlreg3.params
conf = econlreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))

# logistic regression with incomeperperson, internetuserate and urbanization
econlreg4 = smf.logit(formula='HighLowKWh ~ incomeperperson_c + \
+ internetuserate_c + urbanrate_c', data=data).fit()
print(econlreg4.summary())

# odd ratios with 95% confidence intervals
print("Odds Ratios")
params = econlreg4.params
conf = econlreg4.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(numpy.exp(conf))


Logistic Regression Models(I - IV)

                           Model I: Logit Regression Results (income per person)                          

========================================================================
Dep. Variable:             HighLowKWh                       No. Observations:            129
Model:                          Logit                                   Df Residuals:                      127
Method:                           MLE                               Df Model:                            1
Date:                Sat, 30 Jan 2016                                Pseudo R-squ.:                  0.5427
Time:                        10:10:37                                Log-Likelihood:                -27.655
converged:                       True                                     LL-Null:                       -60.475
                                                                                    LLR p-value:                 5.411e-16
========================================================================
                                      coef          std err          z           P>|z|          [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                     -19.9229      5.312     -3.751       0.000            -30.334    -9.512
incomeperperson_c    -0.0025        0.001     -3.742       0.000              -0.004    -0.001
========================================================================


Odds Ratios

                                                    Lower CI         Upper CI       OR
========================================================================
Intercept                                      0.000000        0.000074       0.000000
incomeperperson_c                     0.996246        0.998825       0.997535
========================================================================

          Model II: Logit Regression Results ( income per person and internet use rate)                     =============================================================

Dep. Variable:             HighLowKWh                       No. Observations:                  129
Model:                          Logit                                         Df Residuals:                      126
Method:                           MLE                                     Df Model:                            2
Date:                Sat, 30 Jan 2016                                     Pseudo R-squ.:                  0.6606
Time:                        10:10:37                                       Log-Likelihood:                -20.525
converged:                       True                                            LL-Null:                       -60.475
                                                                                          LLR p-value:                 4.468e-18
========================================================================
                                        coef           std err          z           P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                       -19.3096      5.688       -3.395      0.001       -30.458    -8.161
incomeperperson_c        -0.0019      0.001       -2.758      0.006        -0.003    -0.001
internetuserate_c            -0.1593      0.057       -2.804      0.005        -0.271    -0.048
========================================================================

Odds Ratios

                                                     Lower CI         Upper CI       OR
========================================================================
Intercept                                        0.000000        0.000286   0.000000
incomeperperson_c                       0.996807        0.999459   0.998132
internetuserate_c                           0.762852        0.953159   0.852713
========================================================================

  

               Model III: Logit Regression Results(income per person and urban rate)                          ============================================================

Dep. Variable:             HighLowKWh                            No. Observations:                  129
Model:                          Logit                                              Df Residuals:                      126
Method:                           MLE                                           Df Model:                            2
Date:                Sat, 30 Jan 2016                                            Pseudo R-squ.:                  0.5446
Time:                        10:16:35                                            Log-Likelihood:                -27.542
converged:                       True                                                LL-Null:                       -60.475
                                                                                                LLR p-value:                 4.984e-15
========================================================================
                                       coef           std err          z          P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                       -18.6334      5.771     -3.229      0.001       -29.944    -7.323
incomeperperson_c        -0.0023      0.001     -3.056      0.002        -0.004     -0.001
urbanrate_c                    -0.0124       0.026    -0.475       0.635        -0.064      0.039
========================================================================

Odds Ratios

========================================================================
                                                        Lower CI            Upper CI       OR
Intercept                                          0.000000            0.000660    0.000000
incomeperperson_c                         0.996266            0.999183    0.997723
urbanrate_c                                      0.938213            1.039656    0.987633


         Model IV: Logit Regression Results(income per person, internet use rate, urban rate)                           

========================================================================
Dep. Variable:             HighLowKWh                            No. Observations:                  129
Model:                          Logit                                              Df Residuals:                      125
Method:                           MLE                                           Df Model:                            3
Date:                Sat, 30 Jan 2016                                            Pseudo R-squ.:                  0.6620
Time:                        09:20:07                                            Log-Likelihood:                -20.439
converged:                       True                                                 LL-Null:                       -60.475
                                                                                               LLR p-value:                 2.963e-17
========================================================================
                                        coef          std err          z           P>|z|                       [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                       -18.5345      5.894       -3.145      0.002                         -30.087    -6.982
incomeperperson_c        -0.0017      0.001       -2.369      0.018                           -0.003    -0.000
internetuserate_c            -0.1610      0.058       -2.799      0.005                           -0.274    -0.048
urbanrate_c                     -0.0129      0.031       -0.417      0.677                           -0.073     0.048
========================================================================

Odds Ratios

                                                   Lower CI          Upper CI       OR
 =======================================================================
Intercept                                    0.000000           0.000928     0.000000
incomeperperson_c                   0.996832           0.999700     0.998265
internetuserate_c                       0.760588           0.952895     0.851329
urbanrate_c                               0.929178           1.048858      0.987206

Logistic Regression Summary

i. Globally, after controlling for rate of internet use, income per person is 0.99 less likely to lead to higher per person electricity consumption(OR = 0.998265,  95%CI = 0.996832 - 0.999700, p = 0.018). Also globally, income per person is anywhere between 0.997 to 0.999 less likely to result in higher per person electricity consumption.

ii. Globally, after controlling for per capita income, internet use rate is 0.85 less likely to lead to higher per person electricity consumption (OR = 0.851329,  95%CI = 0.760588 - 0.952895, p = 0.005). Also globally, internet use rate  is anywhere between 0.76 to 0.95 less likely to result in higher per person electricity consumption.

iii. Because confidence interval of our odd ratios overlap, we cannot say per person electricity consumption is more strongly associated with income per person  than internet use rate. 

Conclusion

Logistic regression analysis revealed that there is an association between per person electricity consumption and income per person after controlling for internet use rate. Thus, confirming my hypothesis. None of the explanatory variables confound the results.

Saturday, January 23, 2016

Polynomial Regression Modeling

Introduction

In this assignment we will use gapminder.com data to model relationship between per person electricity consumption and per person income. Additionally, we will add other exogenous variables that can influence this relationship. In particular, we will explore the impact of urbanization and internet usage on this relationship.

Summary

Model

relectricperperson=Intercept+incomeperperson_c+ incomeperperson_c**2 + urbanrate_c +                                                 internetuserate_c + error


                            OLS Regression Results                            
========================================================================
Dep. Variable:     relectricperperson                                                  R-squared:                       0.497
Model:                            OLS                                                         Adj. R-squared:                  0.481
Method:                 Least Squares                                                       F-statistic:                     30.65
Date:                Sat, 23 Jan 2016                                                         Prob (F-statistic):           9.82e-18
Time:                        10:42:00                                                  Log-Likelihood:                -1089.0
No. Observations:                 129                                                     AIC:                             2188.
Df Residuals:                     124                                                         BIC:                             2202.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
========================================================================
                                                      coef            std err          t             P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------------
Intercept                                     1489.5849    152.536      9.765      0.000      1187.674  1791.495
incomeperperson_c                         0.1204      0.028         4.267      0.000         0.065     0.176
I(incomeperperson_c ** 2)         -2.559e-06   8.82e-07     -2.903    0.004      -4.3e-06 -8.14e-07
urbanrate_c                                  -2.6368        6.684          -0.395    0.694       -15.866    10.592
internetuserate_c                           9.6653        7.197           1.343    0.182        -4.580    23.910
========================================================================
Omnibus:                      150.608   Durbin-Watson:                   2.129
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3902.984
Skew:                           4.231   Prob(JB):                         0.00
Kurtosis:                      28.584   Cond. No.                     4.23e+08
========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.23e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

relectricperperson(est) = 1489.5849 + 0.1204*incomeperperson - 2.559e-06* incomeperperson_c ** 2 - 2.6368*urbanrate_c + 9.6653*internetuserate_c


Upon running the multiple regression model, this is what I found:
  • per capita income has a positive association(beta = 0.1204) with per person electricity consumption and is statistically significant(p < 0.005) after accounting for other exogenous factors.
  • second order transformation of per per capita income  has a negative association(beta = -2.559e-06)  with per person electricity consumption after accounting for other explanatory variables. 
  • A positive linear coefficient(beta = 0.1204) and negative(beta = -2.559e-06) quadratic coefficient indicates that the relationship between per per person electricity consumption is curvilinear and  is concave in nature.
  • Approximately, 50%  the variation in per person electricity consumption is explained by all the regressors - per person income, urbanization, and internet usage.

Hypothesis

The hypothesis that I have been exploiting over the course of this specialization is - there is association between global per person electricity consumption and per capita income.

From my analysis, it appears there is a positive association between global per person electricity consumption and per capita income. Thus, my results support hypothesis of interest.

Confounding

Model
relectricperperson = intercept + beta*incomeperperson_c + error
                                       coef            std err          t              P>|t|         [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------
Intercept                      1157.1252    106.646     10.850      0.000       946.092  1368.158
incomeperperson_c     0.0903          0.009           9.646      0.000         0.072     0.109
--------------------------------------------------------------------------------------------------------

Model
relectricperperson = intercept + beta*incomeperperson_c + beta2*incomeperperson**2 + error

                                              coef            std err          t              P>|t|         [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept                              1545.4660    138.810     11.134      0.000      1270.766  1820.166
incomeperperson_c                   0.1443      0.016         9.038      0.000         0.113     0.176
I(incomeperperson_c ** 2)  -2.99e-06       7.36e-07  -4.064      0.000     -4.45e-06 -1.53e-06
-----------------------------------------------------------------------------------------------------------------

                                             coef             std err          t               P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------------------------
Intercept                             1552.2880    145.680     10.655      0.000      1263.969  1840.607
incomeperperson_c                  0.1464        0.021       7.103      0.000         0.106     0.187
I(incomeperperson_c ** 2)     -3.042e-06  8.08e-07 -3.766      0.000     -4.64e-06 -1.44e-06
urbanrate_c                              -1.0601      6.601       -0.161      0.873       -14.124    12.004

                                            coef             std err                t           P>|t|           [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------------------------
Intercept                        1489.5849       152.536          9.765      0.000      1187.674  1791.495
incomeperperson_c             0.1204           0.028          4.267      0.000            0.065     0.176
I(incomeperperson_c ** 2) -2.559e-06    8.82e-07     -2.903     0.004      -4.3e-06 -8.14e-07
urbanrate_c                          -2.6368         6.684          -0.395     0.694       -15.866    10.592
internetuserate_c                   9.6653         7.197           1.343     0.182        -4.580    23.910

Based on the summary result above, association between real per person electricity consumption and per capita income is still statistically significant after controlling for urbanization and internet use rate. Therefore, no confounders were found by my analysis, 

Regression Diagnostic Plots

Scatter Plot

The below figure confirms that the model is curvilinear and is concave.

Per the figure below, it appears that the model may be misspecified since we observer a departure of the plotted residual values from the norm(straight line)

i. the diagram below shows a positive linear relationship between per person electricity consumption and per capita income. 



ii. there is no association between per person electricity consumption and internet use rate conditional on other explanatory variables.


ii. there is no association between per person electricity consumption and urban rate conditional on other explanatory variables.

The 95% of the observations seems to fall within  +/- 2 standard deviations of the mean. However, we have about three data points that qualify as outliers. 


From the below figure, we can observe the following:
  • point 111 has large leverage and low residuals
  • points  201, 13 and 144 have high residuals and low leverage
  • points 194 and 94 both have low leverage and low residuals
In conclusion, we observe few outliers but not have outsize influence on our model.





Code

# -*- coding: utf-8 -*-
"""
Created on Saturday January 23 10:08:53 2016

@author: Ernest.Tanson
"""

import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# define method to load data of interest


def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data

# loading data
data = load_data('data', 'gapminder.csv')
print(data)

##############################################################################
# DATA MANAGEMENT
##############################################################################


# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']]

print(data)


# setting variables of interest to numeric format
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
data['internetuserate'] = \
pandas.to_numeric(data['internetuserate'], errors='coerce')

# listwise deletion of missing values
my_data = data[['incomeperperson', 'relectricperperson', 'urbanrate',
'internetuserate']].dropna()

###############################################################################
# POLYNOMIAL REGRESSION
###############################################################################

# first order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')

# second order (linear) scatterplot
# 1. incomeperperson vs. relectricperperson

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
scatter=True, order=2, data=my_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
# plt.title('Scatterplot for the Association Between income and \
# electricity consumption globally')

# Center explanatory variables for regression analysis
my_data['incomeperperson_c'] =\
(my_data['incomeperperson'] - numpy.mean(my_data['incomeperperson']))

my_data['urbanrate_c'] =\
(my_data['urbanrate'] - numpy.mean(my_data['urbanrate']))

my_data['internetuserate_c'] =\
(my_data['internetuserate'] - numpy.mean(my_data['internetuserate']))

my_data[['incomeperperson_c', 'urbanrate_c', 'internetuserate_c']].describe()

# Calculating mean of centered explanatory variables
print('Centered Mean incomeperperson: {0:5.2f}'.format(numpy.mean(my_data['incomeperperson_c'])))
print('Centered Mean urbanrate: {0:5.2f}'.format(numpy.mean(my_data['urbanrate_c'])))
print('Centered Mean internetuserate: {0:5.2f}'.format(numpy.mean(my_data['internetuserate_c'])))

# linear regression analysis
print("OLS regression model for the association between income per person and \
real electric consumption per person")
ols_model =\
smf.ols('relectricperperson ~ incomeperperson_c', data=my_data).fit()
print(ols_model.summary())

# quadratic (polynomial) regression analysis
print("Quadratic regression analysis for the association between income per \
person and real electric consumption per person")
del I
quad_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2)',
data=my_data).fit()
print(quad_model.summary())

###############################################################################
# EVALUATING MODEL FIT
###############################################################################

X_model =\
smf.ols('relectricperperson ~ incomeperperson_c + I(incomeperperson_c**2) \
+ urbanrate_c + internetuserate_c', data=my_data).fit()
print(X_model.summary())

# qq-plot - to evaluate the assumption that the residuals from my regression
# model are normally distributed
res = X_model.resid
fig = sm.qqplot(res, line='r')
plt.show()


# simple plot of residuals
stdres = pandas.DataFrame(X_model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()

###############################################################################
# This plots four graphs in a 2 by 2 figure: ‘endog versus exog’,
# ‘residuals versus exog’, ‘fitted versus exog’
# and ‘fitted plus residual versus exog’
# additional regression diagnostic plots
###############################################################################
# relectricperperson vs. incomeperperson_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'incomeperperson_c',
fig=fig_exo)

# relectricperperson vs. urbanrate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'urbanrate_c',
fig=fig_exo)

# relectricperperson vs. internetuserate_c
fig_exo = plt.figure(figsize=(12, 8))
fig_exo = sm.graphics.plot_regress_exog(X_model, 'internetuserate_c',
fig=fig_exo)

# leverage plot
lev_fig = sm.graphics.influence_plot(X_model, size=8)
plt.show()



Saturday, January 16, 2016

Test of a Basic Linear Regression Model

Introduction

My interest is to test the association between per person electric consumption and income per capita via basic linear regression. The source of my data set is gapminder(gapminder.org)

Data Preparation

The variables of interest are "relectricperperson" and "incomeperperson". 
I will center the explanatory variable - "incomeperperson" prior to performing the regression analysis.

Code

import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# define method to load data of interest


def load_data(data_dir, csv_file):
   if __name__ == "__main__":
   DATA_PATH = os.path.join(os.getcwd(), data_dir)
   DATA_FILE = os.path.join(DATA_PATH, csv_file)
   data = pandas.read_csv(DATA_FILE, low_memory=False)
   return data


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)


# loading data
data = load_data('data', 'gapminder.csv')
print(data)

#Making a copy of data frame
reg_data = data.copy()

# Extracting data pertinent to variables of interest
reg_data = \
reg_data[['incomeperperson', 'relectricperperson']]


print(reg_data)

# setting variables of interest to numeric
reg_data['incomeperperson'] = \
pandas.to_numeric(reg_data['incomeperperson'], errors='coerce')
reg_data['relectricperperson'] = \

pandas.to_numeric(reg_data['relectricperperson'], errors='coerce')


Center explanatory variable "incomeperperson"



mean_percapita_income = numpy.mean(reg_data['incomeperperson'])
print(mean_percapita_income)

8740.96607617579

reg_data['incomeperperson_centered'] =\
(reg_data['incomeperperson'] - mean_percapita_income)


print(reg_data)



Calculating mean of centered "incomeperperson" variable


mean_percapitaincome_centered =\
numpy.mean(reg_data['incomeperperson_centered'])

print(mean_percapitaincome_centered)

-1.1488354127658041e-13

Basic Regression


Testing the association between incomeperperson(centered) and  relectricperperson


scat1 = seaborn.regplot(x='incomeperperson_centered', y='relectricperperson',
fit_reg=True, data=reg_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
plt.title('Scatterplot for the Association Between income and\
electricity consumption globally')

print("OLS regression model for the association between income per person and \
real electric consumption per person")
reg_model = smf.ols('relectricperperson ~ incomeperperson_centered',
data=reg_data).fit()

print(reg_model.summary())

Program Output

OLS regression model for the association between income per person and real electric consumption per person
                            OLS Regression Results                          
========================================================================
Dep. Variable:     relectricperperson                                    R-squared:                       0.425
Model:                            OLS                                           Adj. R-squared:                  0.420
Method:                 Least Squares                                         F-statistic:                     94.47
Date:                Sun, 17 Jan 2016                                          Prob (F-statistic):           4.63e-17
Time:                        12:48:43                                    Log-Likelihood:                -1105.9
No. Observations:                 130                                      AIC:                             2216.
Df Residuals:                     128                                         BIC:                             2222.
Df Model:                           1                                      
Covariance Type:            nonrobust                                      
========================================================================
                                                       coef       std err          t           P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------------------------------------
Intercept                               1144.6759    105.855     10.814      0.000       935.223  1354.129
incomeperperson_centered     0.0904           0.009       9.719      0.000         0.072     0.109
========================================================================
Omnibus:                      148.000                                             Durbin-Watson:                   2.123
Prob(Omnibus):                  0.000                                         Jarque-Bera (JB):             4079.319
Skew:                           4.030                                                      Prob(JB):                         0.00
Kurtosis:                      29.232                                                       Cond. No.                     1.14e+04
========================================================================



Result Summary

The results of my linear regression test demonstrates that,  per per person electric consumption(beta = 0.0904, p < 0.005, alpha = 1144.6759, p < 0.005 ) was significant and positively correlated with income per person. That is a one unit increase in per per person electric consumption results in 0.0904 increase in income per person.

Thursday, January 7, 2016

My Data

Data, Procedures and Measures

I have been using gapminder(gapminder.org) data for my analysis so far.
Elements of interest in the datasets are the following:

Electricity consumption, per capita(kWh)

This is a measure of per capita consumption of electricity in a given year and it is measured in kilowatt-hours(kWh). Electricity power measures the combined output of power and heat plants without transmission and transformation losses and own use by heat and power plants. The data analytic sample is made up of 214 countries and span the period 1981 - 2015.

Data Sources:
World Bank(http://data.worldbank.org/indicator/EG.USE.ELEC.KH.PC/countries).

This is the response variable -"relectricperperson" of my study. It measures residential electricity consumption per person during a given year across 214 countries. 

Gross Domestic Product per capita by Purchasing Power Parities(in US dollars at fixed 2011 prices)

These are aggregated data for countries and territories from myriad of sources.
The goal is to include as many countries and territories as possible and to measure their economic outlook. Some of the data are rough estimates for countries and territories for which no reliable data were found.

The income per person component of the dataset in question is based on GDP per capita and adjusted for  Purchasing Power Parity reflecting 2011 round of International Comparison Program(ICP).
The ICP estimates  based on regression analysis uses the GDP per capita of ICP as the dependent variable. The independent variables were Gross National Income(GNI) per capita by exchange rates and Gross enrollment in secondary school.

The predicted values from this model were used for countries lacking official ICP data, but which had observations for the two independent variables. For countries lacking GNI per capita, an alternate model using GDP per capita, by exchange rates was used to generate predicted values.

The data analytic sample consists of 201 countries and territories and span the period 1820 - 2015.

This is one of my explanatory variables, which is  measure of  per capita income adjusted for inflation using 2011 prices in US dollars.

========================================================================
Income group                        Definition           Estimated Avg. Income
========================================================================
Low Income                      $875 or less          $580
Lower middle income              $876 to $3465          $1918
Upper middle income              $3466 to 10725          $5625
High Income                      $10726 or more          $35131
========================================================================

Using IMF classification above, I categorize this data into relevant "Income Group".

Data Sources:
World Bank, UNSTAT(http://unstats.un.org/)
Maddison on-line(http://www.ggdc.net/maddison/maddison-project/home.htm)
CIA World Fact Book(https://www.cia.gov/library/publications/the-world-factbook/)
IMF(http://www.imf.org/external/datamapper/index.php)

Urban Population

This refers to people living in urban areas as defined by national statistical offices.
This aggregate date is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects(http://data.worldbank.org/indicator/SP.URB.TOTL).

The data analytic sample consists of 214 countries covering time period 1981- 2015.

As a lurking variable, "urbanrate" measures the percentage of the population living in urban areas of countries in our sample data. I created an additional variable- "ruralrate" and computed its value as 100 - "urbanrate". This variable refers to percentage of the population living in rural areas as defined by national statistical offices.

The original goal of this study is to measure the economic development outcomes of these countries.