Introduction
In this week's assignment, the task is to test for independence (no relationship) between income and electricity consumption in Sub-Saharan Africa. I will be using the gapminder data.
Ho: The proportion of countries with high per person electricity consumption is the same regardless of income level in Sub-Saharan Africa; that is, the two variables are independent.
Ha: The proportion of countries with high per person electricity consumption differs across income levels in Sub-Saharan Africa; that is, the two variables are dependent.
Code
import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt
import scipy.stats
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# Reading data into DataFrame
if __name__ == "__main__":
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)
# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')
# Extract only countries in Sub-Saharan Africa.
# For this, I added a new column, continent, to the gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']
sub_sahara_africa2 = sub_sahara_africa.copy()
print(sub_sahara_africa2.shape)
# Extracting data pertinent to variables of interest
income_electric_filter = ['incomeperperson', 'relectricperperson']
sub_sahara_africa2 = sub_sahara_africa2[income_electric_filter]
print(sub_sahara_africa2)
# Categorizing numeric data
def perpersonkwh(row):
    # Split electricity use at the median; rows with missing values return None
    if row['relectricperperson'] <= \
            sub_sahara_africa2['relectricperperson'].quantile(.5):
        return 'LowkWh'
    if row['relectricperperson'] > \
            sub_sahara_africa2['relectricperperson'].quantile(.5):
        return 'HighkWh'


sub_sahara_africa2['perpersonkWh'] = \
    sub_sahara_africa2.apply(lambda row: perpersonkwh(row), axis=1)


def percapitagdp(row):
    # Bottom quartile = low, middle half = medium, top quartile = high
    if row['incomeperperson'] <= \
            sub_sahara_africa2['incomeperperson'].quantile(.25):
        return 'LowGDP'
    if row['incomeperperson'] > \
            sub_sahara_africa2['incomeperperson'].quantile(.25) and \
            row['incomeperperson'] <= \
            sub_sahara_africa2['incomeperperson'].quantile(.75):
        return 'MediumGDP'
    if row['incomeperperson'] > \
            sub_sahara_africa2['incomeperperson'].quantile(.75):
        return 'HighGDP'


sub_sahara_africa2['percapitaGDP'] = \
    sub_sahara_africa2.apply(lambda row: percapitagdp(row), axis=1)
print(sub_sahara_africa2['perpersonkWh'].head(10))
print(sub_sahara_africa2['percapitaGDP'].head(10))
# recoding values for perpersonkWh into a new variable, PERPRSKWH
kWh_recode = {'LowkWh': 0, 'HighkWh': 1}
sub_sahara_africa2['PERPRSKWH'] =\
sub_sahara_africa2['perpersonkWh'].map(kWh_recode)
# recoding values for percapitaGDP into a new variable, PERCAPGDP
gdp_recode = {'LowGDP': 1, 'MediumGDP': 2, 'HighGDP': 3}
sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['percapitaGDP'].map(gdp_recode)
print(sub_sahara_africa2['PERPRSKWH'].head(10))
print(sub_sahara_africa2['PERCAPGDP'].head(10))
# contingency/cross classification table of observed counts
ssa_cont_tbl = pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['PERCAPGDP'])
print(ssa_cont_tbl)
# column percentages
colsum = ssa_cont_tbl.sum(axis=0)
print(colsum)
colpct = ssa_cont_tbl/colsum
print(colpct)
# chi-square
print('chi-square value, p value, expected counts')
ssa_chisquare = scipy.stats.chi2_contingency(ssa_cont_tbl)
print(ssa_chisquare)
# set variable types
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['PERPRSKWH'].head(10))
sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['PERCAPGDP'].astype('category')
sub_sahara_africa2['PERPRSKWH'] =\
pandas.to_numeric(sub_sahara_africa2['PERPRSKWH'], errors='coerce')
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['PERPRSKWH'].head(10))
# Categorical bar plot (seaborn.factorplot was renamed to catplot in seaborn 0.9)
seaborn.catplot(x='PERCAPGDP', y='PERPRSKWH', data=sub_sahara_africa2,
                kind='bar', ci=None)
plt.xlabel('Per Capita GDP')
plt.ylabel('Proportion of Power Consumption')
Graphically, there seems to be some relationship between the proportion of countries with high per person electricity consumption and per capita income. Specifically, the higher the income level, the larger the share of high-consumption countries: LowGDP (0%) < MediumGDP (54%) < HighGDP (67%).
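The row-by-row categorization in the code above recomputes the quantiles for every row. As a hedged alternative, the same quantile bins can be sketched with pandas.qcut, which bins a whole column at once. The `toy` DataFrame and its values below are hypothetical stand-ins for the gapminder columns, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the gapminder columns (assumption)
toy = pd.DataFrame({
    'incomeperperson': [100.0, 200.0, 300.0, 400.0, 500.0,
                        600.0, 700.0, 800.0, np.nan],
    'relectricperperson': [10.0, 20.0, 30.0, 40.0, 50.0,
                           60.0, 70.0, 80.0, np.nan],
})

# Median split: values at or below the median are 'LowkWh'
toy['perpersonkWh'] = pd.qcut(toy['relectricperperson'],
                              q=[0, .5, 1],
                              labels=['LowkWh', 'HighkWh'])

# Bottom quartile, middle half, top quartile
toy['percapitaGDP'] = pd.qcut(toy['incomeperperson'],
                              q=[0, .25, .75, 1],
                              labels=['LowGDP', 'MediumGDP', 'HighGDP'])

# Missing inputs stay missing instead of being forced into a category
print(toy[['perpersonkWh', 'percapitaGDP']])
```

Like the original functions, qcut uses right-closed bins, so a value equal to a quantile falls in the lower group. One difference to keep in mind: qcut raises an error when two quantiles coincide unless duplicates='drop' is passed.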
Observed counts (rows: PERPRSKWH, columns: PERCAPGDP):
PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000          3         6         2
1.000000          0         7         4
Column proportions (each PERCAPGDP column sums to 1):
PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000   1.000000  0.461538  0.333333
1.000000   0.000000  0.538462  0.666667
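The column proportions can also be obtained in one step with the normalize argument of pandas.crosstab. The sketch below rebuilds one row per country from the observed counts reported above, since the modified gapminder file is not available here; the names `observed`, `df` and `col_props` are illustrative only.

```python
import pandas as pd

# Observed counts keyed by (PERPRSKWH, PERCAPGDP), copied from the table above
observed = {(0, 1): 3, (0, 2): 6, (0, 3): 2,
            (1, 1): 0, (1, 2): 7, (1, 3): 4}

# Expand the counts back into one row per country
rows = [
    {'PERPRSKWH': kwh, 'PERCAPGDP': gdp}
    for (kwh, gdp), n in observed.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)

# normalize='columns' divides each cell by its column total
col_props = pd.crosstab(df['PERPRSKWH'], df['PERCAPGDP'],
                        normalize='columns')
print(col_props.round(6))
```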
Chi Square Result
The Chi Square test result is not statistically significant at alpha = 0.05 (X2 = 3.7436, p = 0.1538, d.f. = 2), so we fail to reject the null hypothesis. That is, the data do not provide evidence that the proportion of countries with high per person electricity consumption depends on per capita income level across Sub-Saharan Africa.
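As a quick check, the overall result can be reproduced directly from the observed counts. This is a minimal sketch that types the counts in by hand rather than reading the modified gapminder file; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: PERPRSKWH (0 = low, 1 = high); columns: PERCAPGDP (1, 2, 3)
observed = np.array([[3, 6, 2],
                     [0, 7, 4]])

# chi2_contingency returns the statistic, p-value, degrees of freedom
# and the table of expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.4f}, p-value = {p_value:.4f}, d.f. = {dof}")
print(expected)

alpha = 0.05
reject_null = p_value < alpha
print("reject Ho" if reject_null else "fail to reject Ho")
```

The expected counts for the LowGDP and HighGDP columns are below 5, so the chi-square approximation may be rough for a table this small and the p-value is best read with some caution.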
Bonferroni Adjustments
To control the familywise error rate, we will now conduct pairwise comparisons of per person electricity consumption between per capita income groups in Sub-Saharan Africa. With three income groups we need three such tests, and we compare the p-value from each test to the Bonferroni-adjusted significance level 0.05/3 ≈ 0.0167.
# Comparison 1: medium vs. low GDP (other levels map to NaN and drop out)
mediumGDP_lowGDP = {2: 2, 1: 1}
sub_sahara_africa2['MediumLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(mediumGDP_lowGDP )
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['MediumLowGDP'].head(10))
# contingency table of observed counts
MediumGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['MediumLowGDP'])
print(MediumGDPLowGDP)
# column percentages
colsum = MediumGDPLowGDP.sum(axis=0)
print(colsum)
colpct = MediumGDPLowGDP/colsum
print(colpct)
print('chi-square value, p value, expected counts')
medium_lowGDP_cont = scipy.stats.chi2_contingency(MediumGDPLowGDP)
print(medium_lowGDP_cont)
# Comparison 2: high vs. low GDP
highGDP_lowGDP = {3: 3, 1: 1}
sub_sahara_africa2['HighLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_lowGDP )
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighLowGDP'].head(10))
# contingency table of observed counts
HighGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighLowGDP'])
print(HighGDPLowGDP)
# column percentages
colsum = HighGDPLowGDP.sum(axis=0)
print(colsum)
colpct = HighGDPLowGDP/colsum
print(colpct)
print('chi-square value, p value, expected counts')
high_lowGDP_cont = scipy.stats.chi2_contingency(HighGDPLowGDP)
print(high_lowGDP_cont)
# Comparison 3: high vs. medium GDP
highGDP_mediumGDP = {3: 3, 2: 2}
sub_sahara_africa2['HighMediumGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_mediumGDP )
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighMediumGDP'].head(10))
# contingency table of observed counts
HighGDPMediumGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighMediumGDP'])
print(HighGDPMediumGDP)
# column percentages
colsum = HighGDPMediumGDP.sum(axis=0)
print(colsum)
colpct = HighGDPMediumGDP/colsum
print(colpct)
print('chi-square value, p value, expected counts')
high_mediumGDP_cont = scipy.stats.chi2_contingency(HighGDPMediumGDP)
print(high_mediumGDP_cont)
Medium vs. low GDP, observed counts:
MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000             3         6
1.000000             0         7
Medium vs. low GDP, column proportions:
MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000      1.000000  0.461538
1.000000      0.000000  0.538462
(chi-square = 1.1005, p-value = 0.2942, d.f. = 1)
Descriptively, the higher the per capita income, the larger the share of high-consumption countries (54% for MediumGDP vs. 0% for LowGDP), but the difference is not significant at the adjusted level (p = 0.294 > 0.0167).
High vs. low GDP, observed counts:
HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000           3         2
1.000000           0         4
High vs. low GDP, column proportions:
HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000    1.000000  0.333333
1.000000    0.000000  0.666667
(chi-square = 1.4063, p-value = 0.2357, d.f. = 1)
Descriptively, the higher the per capita income, the larger the share of high-consumption countries (67% for HighGDP vs. 0% for LowGDP), but the difference is not significant at the adjusted level (p = 0.236 > 0.0167).
High vs. medium GDP, observed counts:
HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000              6         2
1.000000              7         4
High vs. medium GDP, column proportions:
HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000       0.461538  0.333333
1.000000       0.538462  0.666667
(chi-square = 0.0007, p-value = 0.9790, d.f. = 1)
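Descriptively, the higher the per capita income, the larger the share of high-consumption countries (67% for HighGDP vs. 54% for MediumGDP), but the difference is not significant at the adjusted level (p = 0.979 > 0.0167).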
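The three pairwise tests above repeat the same steps, so they can be sketched more compactly as a loop over income-level pairs. This is only an illustration that starts from the observed counts rather than from the gapminder file; `gdp_levels`, `results` and `adjusted_alpha` are names introduced here, not part of the original script.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts (rows: PERPRSKWH, columns: PERCAPGDP), typed in by hand
observed = pd.DataFrame([[3, 6, 2],
                         [0, 7, 4]],
                        index=[0, 1], columns=[1, 2, 3])

alpha = 0.05
gdp_levels = [1, 2, 3]
pairs = list(combinations(gdp_levels, 2))

# Bonferroni adjustment: divide alpha by the number of comparisons
adjusted_alpha = alpha / len(pairs)

results = {}
for a, b in pairs:
    # 2x2 sub-table; scipy applies Yates' continuity correction by default
    chi2, p_value, dof, _ = chi2_contingency(observed[[a, b]])
    results[(a, b)] = (chi2, p_value, dof)
    verdict = "significant" if p_value < adjusted_alpha else "not significant"
    print(f"GDP {a} vs {b}: chi-square = {chi2:.4f}, "
          f"p-value = {p_value:.4f}, d.f. = {dof} ({verdict})")
```

Each p-value is compared with 0.05/3 rather than 0.05, which is what keeps the chance of at least one false positive across the three tests at or below 5%.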
Conclusion
Based on the pairwise analysis, per capita income does not appear to be associated with the proportion of countries with high per person electricity consumption in Sub-Saharan Africa: none of the three pairwise comparisons is significant at the Bonferroni-adjusted level of 0.0167, so we fail to reject the null hypothesis in every case. Thus, the data provide no evidence that per person electricity consumption depends on per capita income.
