Introduction
I want to test the association between per person electric consumption and income per capita using basic linear regression. The source of my data set is Gapminder (gapminder.org).
Data Preparation
The variables of interest are "relectricperperson" and "incomeperperson".
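I will center the explanatory variable, "incomeperperson", by subtracting its mean prior to performing the regression analysis, so that the intercept can be read as the predicted consumption at the mean income.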
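As a rough illustration of what centering does in a simple regression, here is a minimal sketch. It uses made-up numbers rather than the Gapminder data, and numpy.polyfit stands in for the statsmodels fit used later in this post. Under those assumptions, the slope stays the same after centering while the intercept moves to the mean of the response.

```python
import numpy

# Hypothetical toy data (not from the Gapminder file): y = 2 + 0.5 * x exactly.
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 0.5 * x

# Center x by subtracting its mean, so the centered variable averages zero.
x_centered = x - x.mean()

# numpy.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
slope_raw, intercept_raw = numpy.polyfit(x, y, 1)
slope_centered, intercept_centered = numpy.polyfit(x_centered, y, 1)

print(slope_raw, slope_centered)          # the two slopes agree
print(intercept_raw, intercept_centered)  # the centered intercept equals the mean of y
```

Only the reading of the intercept changes: after centering it describes a country at the average income instead of a country with zero income.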
Code
import pandas
import numpy
import seaborn
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# define a function to load the data of interest
def load_data(data_dir, csv_file):
    # build the file path relative to the current working directory
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    return pandas.read_csv(data_file, low_memory=False)
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# loading data
data = load_data('data', 'gapminder.csv')
print(data)
# Making a copy of the columns pertinent to the variables of interest
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
reg_data = data[['incomeperperson', 'relectricperperson']].copy()
print(reg_data)
# setting variables of interest to numeric
reg_data['incomeperperson'] = \
    pandas.to_numeric(reg_data['incomeperperson'], errors='coerce')
reg_data['relectricperperson'] = \
    pandas.to_numeric(reg_data['relectricperperson'], errors='coerce')
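The errors='coerce' argument matters because a CSV column can contain entries that are not numbers. Here is a small sketch of that behavior on a made-up Series (the values are hypothetical, not taken from gapminder.csv):

```python
import pandas

# Hypothetical raw column: numbers stored as text, plus a blank and a stray label.
raw = pandas.Series(['1500.25', ' ', 'unknown'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# instead of raising an exception.
clean = pandas.to_numeric(raw, errors='coerce')

print(clean)
print(clean.isna().sum())  # number of entries that became NaN
```

Rows holding NaN in either variable are then left out of the fit by the formula interface, which is presumably why the regression output reports fewer observations than there are rows in the file.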
Center explanatory variable "incomeperperson"
mean_percapita_income = numpy.mean(reg_data['incomeperperson'])
print(mean_percapita_income)
8740.96607617579
reg_data['incomeperperson_centered'] = \
    (reg_data['incomeperperson'] - mean_percapita_income)
print(reg_data)
Calculating mean of centered "incomeperperson" variable
mean_percapitaincome_centered = \
    numpy.mean(reg_data['incomeperperson_centered'])
print(mean_percapitaincome_centered)
-1.1488354127658041e-13
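The mean printed above is not exactly zero because of floating-point rounding; a value on the order of 1e-13 is zero for practical purposes. One way to check this in code, sketched here with made-up income values, is to compare against zero with a tolerance instead of testing for exact equality:

```python
import numpy

# Hypothetical income values, used only for illustration.
income = numpy.array([320.5, 1875.2, 9425.8, 27110.7, 52301.6])
centered = income - numpy.mean(income)

centered_mean = numpy.mean(centered)
print(centered_mean)

# Compare against zero with an absolute tolerance rather than using ==.
print(numpy.isclose(centered_mean, 0.0, atol=1e-9))
```

The tolerance of 1e-9 is an arbitrary choice for this sketch; any value far below the scale of the data would serve.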
Basic Regression
Testing the association between incomeperperson(centered) and relectricperperson
scat1 = seaborn.regplot(x='incomeperperson_centered', y='relectricperperson',
                        fit_reg=True, data=reg_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh)')
# adjacent string literals are joined, which keeps the space between the words
plt.title('Scatterplot for the Association Between income and '
          'electricity consumption globally')
# display the figure when running as a script
plt.show()
print("OLS regression model for the association between income per person and "
      "real electric consumption per person")
reg_model = smf.ols('relectricperperson ~ incomeperperson_centered',
                    data=reg_data).fit()
print(reg_model.summary())
Program Output
OLS regression model for the association between income per person and real electric consumption per person
                            OLS Regression Results
==============================================================================
Dep. Variable:    relectricperperson   R-squared:                        0.425
Model:                           OLS   Adj. R-squared:                   0.420
Method:                Least Squares   F-statistic:                      94.47
Date:               Sun, 17 Jan 2016   Prob (F-statistic):            4.63e-17
Time:                       12:48:43   Log-Likelihood:                 -1105.9
No. Observations:                130   AIC:                              2216.
Df Residuals:                    128   BIC:                              2222.
Df Model:                          1
Covariance Type:           nonrobust
===========================================================================================
                                coef    std err          t      P>|t|    [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------
Intercept                  1144.6759    105.855     10.814      0.000     935.223  1354.129
incomeperperson_centered      0.0904      0.009      9.719      0.000       0.072     0.109
===========================================================================================
Omnibus:                     148.000   Durbin-Watson:                    2.123
Prob(Omnibus):                 0.000   Jarque-Bera (JB):              4079.319
Skew:                          4.030   Prob(JB):                          0.00
Kurtosis:                     29.232   Cond. No.                      1.14e+04
==============================================================================
Result Summary
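The results of my linear regression show that income per person was significantly and positively associated with per person electric consumption (beta = 0.0904, p < 0.001; intercept = 1144.6759, p < 0.001). That is, a one-unit (1 USD) increase in income per person is associated with an increase of 0.0904 kWh in per person electric consumption. Because income was centered, the intercept is the predicted consumption (about 1144.68 kWh) for a country at the mean income. The model explains about 42.5% of the variance in electric consumption (R-squared = 0.425).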
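To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the program output; the helper name predicted_consumption and the 1000 USD example are my own illustrations and not part of the original analysis.

```python
# Coefficients copied from the regression output in this post.
intercept = 1144.6759   # predicted kWh per person at the mean income
slope = 0.0904          # change in kWh per person for each 1 USD of income

def predicted_consumption(income_above_mean):
    """Predicted electric consumption for an income this far from the mean."""
    return intercept + slope * income_above_mean

print(predicted_consumption(0))      # a country at the mean income
print(predicted_consumption(1000))   # a country 1000 USD above the mean
```

Under this model, a country 1000 USD above the mean income is predicted to use about 90.4 kWh more per person than a country at the mean.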
