Introduction
This post continues from "Data Management and Visualization". I will keep using the gapminder dataset to examine my research question: is per person electricity consumption a proxy for economic development, as measured by per capita income?
For this particular assignment the goal is to test the following hypotheses:
H0: mean income per person is the same across the three categories of electricity consumers (LowkWh, MediumkWh, HighkWh)
Ha: the mean income per person of at least one category differs from the others
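A one-way ANOVA evaluates hypotheses of this kind by comparing the variation between group means with the variation within groups. The following is a minimal sketch of that F-statistic in plain Python. The function name one_way_anova_f and the toy numbers are my own illustrations, not part of the assignment code or the gapminder data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up illustrative samples, NOT gapminder values
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f"F({df_b},{df_w}) = {f_stat:.3f}")
```

A large F means the group means are spread out relative to the scatter inside each group, which is the evidence against H0.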
Code
import os

import matplotlib.pyplot as plt
import pandas
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Bug fix for display format to avoid run-time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in a DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

if __name__ == "__main__":
    # Read the data into a DataFrame
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)

    # Convert the variables of interest to numeric
    for column in ['incomeperperson', 'relectricperperson', 'urbanrate']:
        econ_data[column] = pandas.to_numeric(econ_data[column],
                                              errors='coerce')

    # Extract only countries in Sub-Saharan Africa.
    # For this, I added a new column, continent, to the gapminder DataFrame.
    sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']
    # Keep only the variables of interest. The copy() avoids a
    # SettingWithCopyWarning when new columns are added below.
    sub_sahara_africa = sub_sahara_africa[
        ['incomeperperson', 'relectricperperson', 'urbanrate']].copy()
    print(sub_sahara_africa)

    print('Split relectricperperson data into 3 categories '
          '- LowkWh, MediumkWh, HighkWh')
    sub_sahara_africa['perpersonkWh'] = pandas.qcut(
        sub_sahara_africa.relectricperperson, 3,
        labels=["LowkWh", "MediumkWh", "HighkWh"])

    # Box plot of electricity consumption within each category
    seaborn.boxplot(x='perpersonkWh', y='relectricperperson',
                    hue='perpersonkWh',
                    data=sub_sahara_africa, saturation=1, orient='v')

    # Use the ols function to calculate the F-statistic and associated p value.
    # To handle this issue: PatsyError: Error evaluating factor: TypeError:
    # 'ClassRegistry' object is not callable incomeperperson ~ C(perpersonkWh),
    # run del c on the command line.
    # I was also getting 'TypeError: data type not understood', so I convert
    # the categorical column to type object before running ols.
    # The builtin object is used because np.object was removed in NumPy 1.24.
    sub_sahara_africa['perpersonkWh_fixed'] = \
        sub_sahara_africa.perpersonkWh.astype(object)
    ssa_model = smf.ols(formula='incomeperperson ~ C(perpersonkWh_fixed)',
                        data=sub_sahara_africa).fit()
    print(ssa_model.summary())

    # Drop rows with missing income or category before the group summaries
    ssa_dropna = \
        sub_sahara_africa[['incomeperperson', 'perpersonkWh']].dropna()
    print(ssa_dropna['perpersonkWh'])
    print(ssa_dropna['incomeperperson'])

    print('means for incomeperperson by per person kWh status')
    ssa_kWh_mean = ssa_dropna.groupby('perpersonkWh').mean()
    print(ssa_kWh_mean)
    # Plot the categorical mean values on a new figure so that they do not
    # overlay the box plot
    plt.figure()
    plt.plot(ssa_kWh_mean, 'bD')

    print('std for incomeperperson by per person kWh status')
    ssa_kWh_std = ssa_dropna.groupby('perpersonkWh').std()
    print(ssa_kWh_std)

    # Tukey HSD post hoc comparison of income across the kWh categories
    ssa_mc = multi.MultiComparison(ssa_dropna['incomeperperson'],
                                   ssa_dropna['perpersonkWh'])
    ssa_res2 = ssa_mc.tukeyhsd()
    print(ssa_res2.summary())

    # Display the figures when running as a script
    plt.show()
Graph
The box plot shows the distribution of per person electricity consumption within each category; because the categories are tertiles of that same variable, the three groups occupy clearly different ranges. The second chart plots the mean income per person for each category of electricity consumers.
ANOVA Results
Among countries in Sub-Saharan Africa, mean income per person was lowest in the LowkWh group (Mean = 601.949389, s.d. = 846.373070), slightly higher in the MediumkWh group (Mean = 639.068846, s.d. = 299.583651), and highest in the HighkWh group (Mean = 2086.976328, s.d. = 1866.259027).
There are statistically significant differences in mean income per person across the three groups of electricity consumers (LowkWh, MediumkWh and HighkWh, with consumption measured in kilowatt-hours, kWh), as determined by one-way ANOVA: F(2,19) = 3.694, p = 0.0441.
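As a sanity check on the reported p value, the upper-tail probability of an F distribution has a simple closed form when the numerator degrees of freedom equal 2, as they do for three groups. The sketch below uses that closed form; the helper name f_sf_two_numerator_df is my own, and for other degrees of freedom a library routine such as scipy.stats.f.sf would be needed instead.

```python
def f_sf_two_numerator_df(f_value, df_within):
    """Upper-tail probability P(F > f_value) for an F(2, df_within) distribution.

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


# F statistic and within-group degrees of freedom reported by the ANOVA
p_value = f_sf_two_numerator_df(3.694, 19)
print(f"p = {p_value:.4f}")
```

The result should land very close to the p value printed in the statsmodels summary, which confirms that the F statistic and its degrees of freedom are reported consistently.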
Given this result, I performed a post hoc test using Tukey's Honestly Significant Difference (HSD) test.
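The Tukey post hoc test (see table below) found no statistically significant pairwise differences between the groups of interest: all three confidence intervals include zero, so none of the pairwise null hypotheses is rejected at a family-wise error rate of 0.05. This does not show that the groups have equal means. With only 22 countries in the analysis the intervals are wide, and the two comparisons involving HighkWh include zero only narrowly.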
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================================
group1   group2     meandiff    lower       upper      reject
--------------------------------------------------------------
HighkWh  LowkWh     -1485.0269  -3035.9813  65.9274    False
HighkWh  MediumkWh  -1447.9075  -3049.7262  153.9113   False
LowkWh   MediumkWh  37.1195     -1513.8349  1588.0738  False
--------------------------------------------------------------
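The table can be read mechanically: a pair is rejected only when its confidence interval excludes zero, and meandiff is the mean of group2 minus the mean of group1. The sketch below re-checks both facts using the rows of the table and the group means reported in the ANOVA results. The variable names are my own, and the numbers are copied from the output above rather than recomputed from the data.

```python
# Rows copied from the Tukey HSD table:
# (group1, group2, meandiff, lower, upper, reject)
tukey_rows = [
    ("HighkWh", "LowkWh", -1485.0269, -3035.9813, 65.9274, False),
    ("HighkWh", "MediumkWh", -1447.9075, -3049.7262, 153.9113, False),
    ("LowkWh", "MediumkWh", 37.1195, -1513.8349, 1588.0738, False),
]

# Group means copied from the ANOVA results
group_means = {
    "LowkWh": 601.949389,
    "MediumkWh": 639.068846,
    "HighkWh": 2086.976328,
}

checks = []
for group1, group2, meandiff, lower, upper, reject in tukey_rows:
    # A difference is significant only if the interval excludes zero
    excludes_zero = lower > 0 or upper < 0
    # meandiff should equal mean(group2) - mean(group1), up to rounding
    diff_from_means = group_means[group2] - group_means[group1]
    diff_matches = abs(diff_from_means - meandiff) < 0.001
    checks.append((excludes_zero == reject, diff_matches))
    print(group1, group2, "interval excludes zero:", excludes_zero,
          "| meandiff matches group means:", diff_matches)
```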
