Introduction
In this assignment, I attempt to identify the impact of population density (measured by "urbanrate") on the relationship between "incomeperperson" and "relectricperperson" in the gapminder data. My earlier analysis revealed a strong relationship between per capita income and per person electricity consumption globally.
Variables
Explanatory: "incomeperperson"
Response: "relectricperperson"
Moderator: "urbanrate"
Data Analysis
import pandas
import seaborn
import scipy.stats
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# define method to load data of interest
def load_data(data_dir, csv_file):
    # build the path relative to the current working directory
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    return pandas.read_csv(data_file, low_memory=False)
# loading data
data = load_data('data', 'gapminder.csv')
# Extracting data pertinent to variables of interest
# (.copy() avoids SettingWithCopyWarning on the assignments below)
data = data[['incomeperperson', 'relectricperperson', 'urbanrate']].copy()
# setting variables of interest to numeric
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
print(data.head(20))
data = data.dropna()
# Splitting per capita income in 3 tranches
def percapitagdp(row):
    if row['incomeperperson'] <= data['incomeperperson'].quantile(.25):
        return 0
    if row['incomeperperson'] > data['incomeperperson'].quantile(.25) and\
            row['incomeperperson'] <= data['incomeperperson'].quantile(.75):
        return 1
    if row['incomeperperson'] > data['incomeperperson'].quantile(.75):
        return 2


data['percapitagdp'] = data.apply(lambda row: percapitagdp(row), axis=1)
# Creating dataframe for each income group: 0 : Low Income, 1 : Middle Income
# 2: High Income
low_gdp = data[(data['percapitagdp'] == 0)]
print(low_gdp)
medium_gdp = data[(data['percapitagdp'] == 1)]
print(medium_gdp)
high_gdp = data[(data['percapitagdp'] == 2)]
print(high_gdp)
# Splitting household electricity consumption into 2 tranches - 0: low per
# person electricity consumption, 1: high per person electricity consumption
def percapitakwh(row):
    if row['relectricperperson'] <= data['relectricperperson'].quantile(.5):
        return 0
    if row['relectricperperson'] > data['relectricperperson'].quantile(.5):
        return 1


data['percapitakWh'] = data.apply(lambda row: percapitakwh(row), axis=1)
# Count items per electricity consumption group
electricgrp = data['percapitakWh'].value_counts(sort=False, dropna=False)
print(electricgrp)
# Creating dataframe for each per person electric consumption group
low_kWh = data[(data['percapitakWh'] == 0)]
print(low_kWh)
high_kWh = data[(data['percapitakWh'] == 1)]
print(high_kWh)
# Splitting urbanrate into 2 tranches - 0: low population density area
# 1: high population density area
def areatype(row):
    if row['urbanrate'] <= data['urbanrate'].quantile(.5):
        return 0
    if row['urbanrate'] > data['urbanrate'].quantile(.5):
        return 1


data['areatype'] = data.apply(lambda row: areatype(row), axis=1)
areagrp = data['areatype'].value_counts(sort=False, dropna=True)
print(areagrp)
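The row-wise functions above recompute the quantiles for every row. As a sketch of an alternative, assuming the same cut points (25th and 75th percentiles, upper bounds inclusive), pandas.cut can assign all three income tranches in one vectorized call. The helper name income_tranche and the toy income values are illustrative only and are not part of the gapminder data.

```python
import pandas


def income_tranche(income):
    """Label each value 0 (low), 1 (middle) or 2 (high) by quartile cut-offs."""
    q25 = income.quantile(.25)
    q75 = income.quantile(.75)
    # right-inclusive bins: (-inf, q25], (q25, q75], (q75, inf)
    bins = [float('-inf'), q25, q75, float('inf')]
    return pandas.cut(income, bins=bins, labels=[0, 1, 2]).astype(int)


# toy values, for illustration only
toy_income = pandas.Series([100, 200, 300, 400, 500, 600, 700, 800])
toy_groups = income_tranche(toy_income)
print(toy_groups.tolist())
```

The same pattern with a single cut point at the median would cover the two-level splits for electricity consumption and urban rate.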
Chi Square Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"
# contingency table of observed counts
gdp_kWh = pandas.crosstab(data['percapitagdp'], data['percapitakWh'])
print(gdp_kWh)
# column percentages
colsum = gdp_kWh.sum(axis=0)
colpct = gdp_kWh/colsum
print(colpct)
# chi-square
print('chi-square value, p value, expected counts')
cs_gdpkWh = scipy.stats.chi2_contingency(gdp_kWh)
print(cs_gdpkWh)
Test Result
chi-square value = 62.183712121212125
p value = 3.1403530858162606e-14
From the chi-square test, we can conclude that there is a statistically significant association between the "incomeperperson" explanatory variable and the "relectricperperson" response variable (p-value < 0.05).
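To make the reported numbers easier to check, here is a minimal sketch of what the test computes. The expected count for each cell is (row total x column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. A 3 x 2 table has (3 - 1)(2 - 1) = 2 degrees of freedom, and with exactly 2 degrees of freedom the p-value reduces to exp(-chi2 / 2). The function name and the toy table are illustrative only; the last lines plug in the chi-square value reported above.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# toy 3 x 2 table (income group x electricity group), not the gapminder counts
toy_chi2, toy_dof = chi_square_statistic([[20, 5], [15, 15], [5, 20]])
print(toy_chi2, toy_dof)

# with 2 degrees of freedom the upper-tail probability is exp(-chi2 / 2)
reported_chi2 = 62.183712121212125
p_value = math.exp(-reported_chi2 / 2)
print(p_value)  # should agree with the p value reported above
```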
# set variable types
data['percapitakWh'] = data['percapitakWh'].astype('category')
data['percapitagdp'] =\
pandas.to_numeric(data['percapitagdp'], errors='coerce')
# bivariate bar graph
# factorplot was renamed to catplot in seaborn 0.9
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=data, kind="bar",
                ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
# Creating dataframe for low and high population density groups
low_pop = data[(data['areatype'] == 0)]
high_pop = data[(data['areatype'] == 1)]
print('association between per capita income and per person electric '
      'consumption for residents in low population density areas')
# contingency table of observed counts
lowpop_gdpkWh = pandas.crosstab(low_pop['percapitagdp'],
low_pop['percapitakWh'])
print(lowpop_gdpkWh)
# column percentages
colsum = lowpop_gdpkWh.sum(axis=0)
colpct = lowpop_gdpkWh/colsum
print(colpct)
# chi-square
print('chi-square value, p value, expected counts')
cs_lowpop_gdpkWh = scipy.stats.chi2_contingency(lowpop_gdpkWh)
print(cs_lowpop_gdpkWh)
Result I - impact of Low population density as a moderator on income and electricity consumption
chi-square value = 28.577586206896555
p value = 6.2295403346060193e-07
The result is statistically significant (p-value < 0.05). That is, low population density has a moderating influence on the relationship between "incomeperperson" and "relectricperperson".
print('association between per capita income and per person electric '
      'consumption for residents in high population density areas')
# contingency table of observed counts
ct_highpop_gdpkWh = pandas.crosstab(high_pop['percapitagdp'],
high_pop['percapitakWh'])
print(ct_highpop_gdpkWh)
# column percentages
colsum = ct_highpop_gdpkWh.sum(axis=0)
colpct = ct_highpop_gdpkWh/colsum
print(colpct)
# chi-square
print('chi-square value, p value, expected counts')
cs_highpop_gdpkWh = scipy.stats.chi2_contingency(ct_highpop_gdpkWh)
print(cs_highpop_gdpkWh)
Result II - High population density as a moderator on income and electricity consumption
chi-square value = 17.439163165266105
p value = 0.0001633555274138804
The result is statistically significant (p-value < 0.05). That is, high population density has a moderating influence on the relationship between "incomeperperson" and "relectricperperson".
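The two subgroup tests above repeat the same steps for each level of the moderator. A compact way to express that, sketched here under the assumption that the grouping columns are named as in this post, is to loop over the moderator levels with groupby. The helper name chi_square_by_level and the toy data frame are hypothetical, included only so the snippet runs on its own.

```python
import pandas
import scipy.stats


def chi_square_by_level(frame, moderator, explanatory, response):
    """Run a chi-square test of independence within each moderator level."""
    results = {}
    for level, subset in frame.groupby(moderator):
        observed = pandas.crosstab(subset[explanatory], subset[response])
        chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
        results[level] = {'chi2': chi2, 'p': p, 'dof': dof}
    return results


# toy data: perfect association in level 0, none in level 1
toy = pandas.DataFrame({
    'areatype':     [0] * 20 + [1] * 20,
    'percapitagdp': [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10,
    'percapitakWh': [0] * 10 + [1] * 10 + [0, 1] * 10,
})
toy_results = chi_square_by_level(toy, 'areatype', 'percapitagdp',
                                  'percapitakWh')
for level, stats in toy_results.items():
    print(level, stats)
```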
# Line Chart (factorplot was renamed to catplot in seaborn 0.9)
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=low_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of LOW population density areas.')
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=high_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of HIGH population density areas.')
Both graphs show a positive trend: higher per person electricity consumption is associated with higher income. However, people living in high population density areas tend to spend relatively more of their income on electricity compared to those in low population density areas.
ANOVA Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"
# using ols function for calculating the F-statistic and associated p value
gdpkWh_model = smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=data)
gdpkWh_res = gdpkWh_model.fit()
print(gdpkWh_res.summary())
print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels")
# group the full data set by income level (gdpkWh was never defined)
st1 = data.groupby('percapitagdp')['relectricperperson'].std()
print(st1)
# Bar plot
# factorplot was renamed to catplot in seaborn 0.9
seaborn.catplot(x='percapitagdp', y='relectricperperson', data=data,
                kind='bar', ci=None)
plt.xlabel('Levels Income Per Person')
plt.ylabel('Mean Per Person Electric Consumption')
print('association between per person electricity consumption and per capita \
income for those living in low population density areas')
gdpkWh_lowpop_model =\
smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=low_pop)
gdpkWh_lowpop_res = gdpkWh_lowpop_model.fit()
print(gdpkWh_lowpop_res.summary())
print("means for Per Person Electric Consumption by per capita income "
      "0 vs. 1 vs. 2 for low population density")
# select the numeric column so the categorical column is not aggregated
mean_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].mean()
print(mean_lowpop)
print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels for Low population density areas")
std_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].std()
print(std_lowpop)
ANOVA Test Result
When examining per person electricity consumption across income levels, ANOVA revealed that, globally, Low Income countries (Mean = 144.655970, s.d. = 153.330045) consume less electricity than both Medium Income (Mean = 729.941778, s.d. = 568.913501) and High Income (Mean = 2964.549001, s.d. = 2162.289249) countries.
There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2, as determined by one-way ANOVA: F(2,130) = 57.01, p = 2.15e-18.
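As a rough illustration of where the F statistic and its degrees of freedom come from, the sketch below computes a one-way ANOVA by hand: the between-group mean square divided by the within-group mean square, with k - 1 and n - k degrees of freedom for k groups and n observations. With three income levels, F(2,130) therefore corresponds to 133 observations. The function name and the toy groups are illustrative, not gapminder values.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # spread of the group means around the grand mean
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        # spread of the observations around their own group mean
        ss_within += sum((value - group_mean) ** 2 for value in group)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# toy consumption values for three income levels (illustration only)
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_between, df_within = one_way_anova(toy_groups)
print('F(%d,%d) = %.2f' % (df_between, df_within, f_stat))
```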
Result I - impact of Low population density as a moderator on income and electricity consumption
When examining the interaction between low urban rate, per person electricity consumption and income levels, ANOVA revealed that, globally, Low Income countries (Mean = 132.260069, s.d. = 122.047634) consume less electricity than both Medium Income (Mean = 606.876711, s.d. = 481.566696) and High Income (Mean = 2124.808392, s.d. = 1105.691412) countries. That is, urbanization has a moderating impact on electricity consumption as it relates to income levels.
There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in LOW population density areas, as determined by one-way ANOVA: F(2,65) = 46.44, p = 4.72e-13.
Result II - impact of High population density as a moderator on income and electricity consumption
When examining the interaction between high urban rate, per person electricity consumption and income levels, ANOVA revealed that, globally, Low Income countries (Mean = 336.792431, s.d. = 476.29642) consume less electricity than both Medium Income (Mean = 831.909977, s.d. = 620.584218) and High Income (Mean = 3114.502681, s.d. = 2281.732505) countries. That is, "urbanrate" can be considered a "lurking" factor in explaining the relationship between electricity consumption and income levels.
There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in HIGH population density areas, as determined by one-way ANOVA: F(2,65) = 17.22, p = 1.13e-06.
Pearson Correlation Coefficient Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"
print('association between incomeperperson and relectricperperson for LOW \
population density countries')
(r, p) =\
scipy.stats.pearsonr(low_pop['incomeperperson'],
low_pop['relectricperperson'])
print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)
scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=low_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for LOW population density countries')
print(scat1)
Result I - impact of Low population density as a moderator on income and electricity consumption
Correlation Coefficient
(r = 0.873841, p-value = 2.12383414237e-21)
r2 = 0.763597
The correlation between income and electricity consumption is positive and very strong (r = 0.87) and statistically significant (p-value < 0.05). It is therefore highly unlikely that a relationship of this magnitude between income and electricity consumption among low urban rate countries is due to chance alone.
With an r-squared of 0.76, knowing the income per person lets us predict 76% of the variability we observe in per person electricity consumption among countries with a low urban rate.
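For reference, the correlation coefficient and r-squared printed above can be reproduced from first principles. The sketch below computes Pearson's r as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations; the function name and the toy numbers are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# toy values, not gapminder data
toy_income = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_kwh = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(toy_income, toy_kwh)
print('r = %f' % r)
print('r2 = %f' % r ** 2)  # share of variability explained
```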
print('association between incomeperperson and relectricperperson for HIGH \
population density countries')
(r, p) =\
scipy.stats.pearsonr(high_pop['incomeperperson'],
high_pop['relectricperperson'])
print(' ')
print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)
scat2 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=high_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for High population density countries')
print(scat2)
Result II - impact of High population density as a moderator on income and electricity consumption
Correlation Coefficient
(r = 0.520095, p-value = 8.98036750462e-06)
r2 = 0.270499
The correlation between income and electricity consumption is positive and moderately strong (r = 0.52) and statistically significant (p-value < 0.05). It is therefore highly unlikely that a relationship of this magnitude between income and electricity consumption among high urban rate countries is due to chance alone.
With an r-squared of 0.27, knowing the income per person lets us predict 27% of the variability we observe in per person electricity consumption among countries with a high urban rate.
Thus, we can hypothesize that "urbanrate" is a potential moderator of the relationship between "relectricperperson" and "incomeperperson".









