Saturday, December 26, 2015

Testing a potential Moderator

Introduction

In this assignment, I test whether population density, measured here by "urbanrate" (the share of a country's population living in urban areas), moderates the relationship between "incomeperperson" and "relectricperperson" in the gapminder data. My earlier analysis revealed a strong relationship between per capita income and per person electricity consumption globally.

Variables

Explanatory: "incomeperperson"
Response: "relectricperperson"
Moderator: "urbanrate"


Data Analysis

import pandas
import seaborn
import scipy.stats
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# define method to load data of interest
def load_data(data_dir, csv_file):
    # build the path relative to the current working directory
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    return pandas.read_csv(data_file, low_memory=False)


# loading data
data = load_data('data', 'gapminder.csv')

# Extracting data pertinent to variables of interest
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data = data[['incomeperperson', 'relectricperperson', 'urbanrate']].copy()

# setting variables of interest to numeric
data['incomeperperson'] = \
    pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
    pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
    pandas.to_numeric(data['urbanrate'], errors='coerce')
print(data.head(20))

data = data.dropna()

# Splitting per capita income into 3 tranches
# (quantiles are computed once instead of once per row)
income_q25 = data['incomeperperson'].quantile(.25)
income_q75 = data['incomeperperson'].quantile(.75)


def percapitagdp(row):
    if row['incomeperperson'] <= income_q25:
        return 0
    if row['incomeperperson'] <= income_q75:
        return 1
    return 2


data['percapitagdp'] = data.apply(percapitagdp, axis=1)
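
The same three-way split can also be done in one vectorized step. The sketch below is an alternative, not the code that produced the results in this post. It assumes pandas.cut with right-closed bins at the 25th and 75th percentiles; the helper name income_tranche and the toy series are made up for illustration.

```python
import pandas


def income_tranche(income):
    """Label each value 0 (bottom quartile), 1 (middle half) or 2 (top quartile)."""
    q25 = income.quantile(.25)
    q75 = income.quantile(.75)
    # Right-closed bins: (-inf, q25], (q25, q75], (q75, inf)
    bins = [float('-inf'), q25, q75, float('inf')]
    return pandas.cut(income, bins=bins, labels=[0, 1, 2]).astype(int)


# Hypothetical incomes, only to show the labelling
toy = pandas.Series([100.0, 200.0, 300.0, 400.0, 500.0,
                     600.0, 700.0, 800.0, 900.0])
print(income_tranche(toy).tolist())
```

With real data this would be called as income_tranche(data['incomeperperson']). Note that pandas.cut raises an error when the two quantiles coincide, so the sketch assumes they differ.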

# Creating dataframe for each income group: 0 : Low Income, 1 : Middle Income
# 2: High Income
low_gdp = data[(data['percapitagdp'] == 0)]
print(low_gdp)

medium_gdp = data[(data['percapitagdp'] == 1)]
print(medium_gdp)

high_gdp = data[(data['percapitagdp'] == 2)]

print(high_gdp)

# Splitting household electricity consumption into 2 tranches - 0: low per
# person electricity consumption, 1: high per person electricity consumption
kwh_median = data['relectricperperson'].quantile(.5)


def percapitakwh(row):
    if row['relectricperperson'] <= kwh_median:
        return 0
    return 1


data['percapitakWh'] = data.apply(percapitakwh, axis=1)

# Count items per electricity consumption group
electricgrp = data['percapitakWh'].value_counts(sort=False, dropna=False)
print(electricgrp)

# Creating dataframe for each per person electric consumption group
low_kWh = data[(data['percapitakWh'] == 0)]
print(low_kWh)

high_kWh = data[(data['percapitakWh'] == 1)]

print(high_kWh)

# Splitting urbanrate into 2 tranches - 0: low population density area
# 1: high population density area
urban_median = data['urbanrate'].quantile(.5)


def areatype(row):
    if row['urbanrate'] <= urban_median:
        return 0
    return 1


data['areatype'] = data.apply(areatype, axis=1)

areagrp = data['areatype'].value_counts(sort=False, dropna=True)

print(areagrp)

Chi Square Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"


# contingency table of observed counts
gdp_kWh = pandas.crosstab(data['percapitagdp'], data['percapitakWh'])
print(gdp_kWh)

# column percentages
colsum = gdp_kWh.sum(axis=0)
colpct = gdp_kWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_gdpkWh = scipy.stats.chi2_contingency(gdp_kWh)

print(cs_gdpkWh)

Test Result

chi-square value =  62.183712121212125
p value = 3.1403530858162606e-14

From the Chi Square test, we can say that there is a statistically significant association between the explanatory variable "incomeperperson" and the response variable "relectricperperson", as evidenced by p-value < 0.05.
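
To make the test less of a black box, here is a minimal sketch of how the statistic is built from observed and expected counts. It is plain Python rather than the scipy call used above, and the toy table is hypothetical. Because the tables in this post have 3 income rows and 2 electricity columns, there are 2 degrees of freedom, and for exactly 2 degrees of freedom the p value has the closed form exp(-statistic / 2). scipy applies its continuity correction only when there is 1 degree of freedom, so it does not affect these tables.

```python
import math


def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Upper-tail p value of a chi-square statistic; valid for 2 degrees of freedom only."""
    return math.exp(-statistic / 2)


# Hypothetical 3 x 2 table (income group x electricity group)
toy_table = [[20, 5], [15, 15], [5, 20]]
stat, dof = chi_square_statistic(toy_table)
print(stat, dof)

# Plugging in the chi-square value reported above recovers the reported p value
print(p_value_two_dof(62.183712121212125))
```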


# set variable types
data['percapitakWh'] = data['percapitakWh'].astype('category')
data['percapitagdp'] =\
pandas.to_numeric(data['percapitagdp'], errors='coerce')

# bivariate bar graph
# (seaborn renamed factorplot to catplot in version 0.9)
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=data, kind="bar",
                ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')




# Creating dataframe for low and high population density groups
low_pop = data[(data['areatype'] == 0)]
high_pop = data[(data['areatype'] == 1)]

print('association between per capita income and per person electric '
      'consumption for residents in low population density areas')
# contingency table of observed counts
lowpop_gdpkWh = pandas.crosstab(low_pop['percapitagdp'],
                                low_pop['percapitakWh'])
print(lowpop_gdpkWh)

# column percentages
colsum = lowpop_gdpkWh.sum(axis=0)
colpct = lowpop_gdpkWh/colsum

print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_lowpop_gdpkWh = scipy.stats.chi2_contingency(lowpop_gdpkWh)

print(cs_lowpop_gdpkWh)

Result I - impact of Low population density as a moderator on income and electricity consumption

chi-square value = 28.577586206896555
p value =  6.2295403346060193e-07

The result is statistically significant, as evidenced by p-value < 0.05. That is, the association between "incomeperperson" and "relectricperperson" holds in low population density areas.


print('association between per capita income and per person electric '
      'consumption for residents in high population density areas')
# contingency table of observed counts
ct_highpop_gdpkWh = pandas.crosstab(high_pop['percapitagdp'],
                                    high_pop['percapitakWh'])
print(ct_highpop_gdpkWh)

# column percentages
colsum = ct_highpop_gdpkWh.sum(axis=0)
colpct = ct_highpop_gdpkWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_highpop_gdpkWh = scipy.stats.chi2_contingency(ct_highpop_gdpkWh)
print(cs_highpop_gdpkWh)

Result II - High population density as a moderator on income and electricity consumption

chi-square value = 17.439163165266105
p value =  0.0001633555274138804

The result is statistically significant, as evidenced by p-value < 0.05. That is, the association between "incomeperperson" and "relectricperperson" also holds in high population density areas, although the chi-square value is smaller than in low population density areas.

# Line Chart
# (seaborn renamed factorplot to catplot in version 0.9)
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=low_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of LOW population density areas.')

seaborn.catplot(x="percapitakWh", y="percapitagdp", data=high_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of HIGH population density areas.')

Both graphs show a positive trend: high per person electricity consumption is associated with higher income. However, the graphs suggest that people living in high population density areas tend to spend relatively more of their income on electricity compared to those in low population density areas.

ANOVA Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"


# using ols function for calculating the F-statistic and associated p value
gdpkWh_model = smf.ols(formula='relectricperperson ~ C(percapitagdp)',
                       data=data)
gdpkWh_res = gdpkWh_model.fit()

print(gdpkWh_res.summary())

print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels")
# group the full data set (the original referenced an undefined name here)
st1 = data.groupby('percapitagdp')['relectricperperson'].std()

print(st1)

ANOVA Test Result

When examining per person electricity consumption across income levels, ANOVA revealed that, globally, the Low Income group (Mean=144.655970, s.d=153.330045) consumes less electricity than both the Medium Income (Mean=729.941778, s.d=568.913501) and High Income (Mean=2964.549001, s.d=2162.289249) groups.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2, as determined by one-way ANOVA: F(2,130) = 57.01, p = 2.15e-18.
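
For readers who want to see where the F value comes from, the following is a small sketch of the one-way ANOVA arithmetic in plain Python. It is not the statsmodels code used above; the function name one_way_anova_f and the three toy groups are made up. F is the between-group mean square divided by the within-group mean square.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # variation of the group means around the grand mean
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # variation of each observation around its own group mean
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical consumption values for three income groups
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print('F(%d,%d) = %.2f' % (df_b, df_w, f_stat))
```

The two degrees of freedom are the numbers reported in parentheses after F in the results above.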

# Bar plot
# (seaborn renamed factorplot to catplot in version 0.9)
seaborn.catplot(x='percapitagdp', y='relectricperperson', data=data,
                kind='bar', ci=None)
plt.xlabel('Levels of Income Per Person')
plt.ylabel('Mean Per Person Electric Consumption')



print('association between per person electricity consumption and per capita '
      'income for those living in low population density areas')
gdpkWh_lowpop_model = \
    smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=low_pop)
gdpkWh_lowpop_res = gdpkWh_lowpop_model.fit()
print(gdpkWh_lowpop_res.summary())

print("means for Per Person Electric Consumption by per capita income "
      "0 vs. 1 vs. 2 for low population density")
# select the response column so the categorical column is not aggregated
mean_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].mean()
print(mean_lowpop)

print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels for Low population density areas")
std_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].std()

print(std_lowpop)

Result I - impact of Low population density as a moderator on income and electricity consumption

When examining the interaction between low urban rate, per person electricity consumption and income levels, ANOVA revealed that, globally, the Low Income group (Mean=132.260069, s.d=122.047634) consumes less electricity than both the Medium Income (Mean=606.876711, s.d=481.566696) and High Income (Mean=2124.808392, s.d=1105.691412) groups. That is, urbanization appears to have a moderating impact on electricity consumption as it relates to income levels.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in LOW population density areas, as determined by one-way ANOVA: F(2,65) = 46.44, p = 4.72e-13.

Result II - impact of High population density as a moderator on income and electricity consumption

When examining the interaction between high urban rate, per person electricity consumption and income levels, ANOVA revealed that, globally, the Low Income group (Mean=336.792431, s.d=476.29642) consumes less electricity than both the Medium Income (Mean=831.909977, s.d=620.584218) and High Income (Mean=3114.502681, s.d=2281.732505) groups. That is, "urbanrate" can be considered a "lurking" factor in explaining the relationship between electricity consumption and income levels.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in HIGH population density areas, as determined by one-way ANOVA: F(2,65) = 17.22, p = 1.13e-06.

Pearson Correlation Coefficient Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"

print('association between incomeperperson and relectricperperson for LOW '
      'population density countries')
(r, p) = scipy.stats.pearsonr(low_pop['incomeperperson'],
                              low_pop['relectricperperson'])

print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
                        fit_reg=False, data=low_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and '
          'Per Capita Income for LOW population density countries')

print(scat1)

Result I - impact of Low population density as a moderator on income and electricity consumption

Correlation Coefficient 

(r = 0.873841, p-value = 2.12383414237e-21)
r2 = 0.763597

The correlation coefficient is positive and very strong (r = 0.87), and it is statistically significant (p-value < 0.05). That is, a relationship of this magnitude between income and electricity consumption among low urban rate countries is highly unlikely to be due to chance alone.

With an r-squared of 0.76, income per person accounts for 76% of the variability we observe in per person electricity consumption among low urban rate countries.
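
As a reminder of what r and r-squared measure, here is a minimal plain-Python sketch of the Pearson formula (covariation divided by the product of the two spreads). It is not the scipy call used above, and the toy x and y values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical income (x) and electricity consumption (y) values
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
toy_r = pearson_r(toy_x, toy_y)
print('(r = %f)' % toy_r)
print('r2 = %f' % toy_r ** 2)
```

Squaring r gives the share of the variability in y that a straight-line relationship with x accounts for, which is how the percentage above is read.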

print('association between incomeperperson and relectricperperson for HIGH '
      'population density countries')
(r, p) = scipy.stats.pearsonr(high_pop['incomeperperson'],
                              high_pop['relectricperperson'])
print(' ')
print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat2 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
                        fit_reg=False, data=high_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and '
          'Per Capita Income for High population density countries')
print(scat2)

Result II - impact of High population density as a moderator on income and electricity consumption

Correlation Coefficient 

(r = 0.520095, p-value = 8.98036750462e-06)
r2 = 0.270499

The correlation coefficient is positive and moderately strong (r = 0.52), and it is statistically significant (p-value < 0.05). That is, a relationship of this magnitude between income and electricity consumption among high urban rate countries is highly unlikely to be due to chance alone.

With an r-squared of 0.27, income per person accounts for 27% of the variability we observe in per person electricity consumption among high urban rate countries.

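Thus, because the correlation is markedly stronger among low population density countries (r = 0.87) than among high population density countries (r = 0.52), we can hypothesize that "urbanrate" is a potential moderator of the relationship between "incomeperperson" and "relectricperperson".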
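
One way to check whether the two correlations really differ, rather than just eyeballing them, is Fisher's z test for two independent correlations. This was not part of the original analysis; it is a rough sketch in plain Python. The group sizes of 68 are an assumption inferred from the F(2,65) degrees of freedom reported in the ANOVA section (65 + 3 groups), and the function name compare_correlations is made up.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent Pearson correlations."""
    # Fisher transformation makes r approximately normal
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# r values from the results above; n = 68 per group is an assumption
z, p = compare_correlations(0.873841, 68, 0.520095, 68)
print('(z = %f, p-value = %s)' % (z, p))
```

A p value below 0.05 here would be consistent with the hypothesis that the strength of the income and electricity relationship depends on the level of "urbanrate".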
