Thursday, December 3, 2015

Running an analysis of variance

Introduction

This post continues my work from "Data Management and Visualization". I will keep using the gapminder dataset to test my hypothesis: is per person electricity consumption a proxy for economic development, as measured by per capita income?

For this assignment, the goal is to test the following hypotheses:
H0: mean income per person does not differ across categories of per person electricity consumption (LowkWh, MediumkWh, HighkWh)
Ha: at least one category mean differs from the others
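
The categories are terciles of relectricperperson, which the code below builds with pandas.qcut. As a rough illustration of what a three-way quantile split does, here is a minimal sketch that uses only the standard library. The helper name tercile_labels and the sample figures are made up for illustration and are not gapminder data; statistics.quantiles with method="inclusive" should give cut points close to those of pandas.qcut, but I have not checked that the two agree in every edge case.

```python
from statistics import quantiles


def tercile_labels(values, labels=("LowkWh", "MediumkWh", "HighkWh")):
    """Label each value by the tercile it falls in (hypothetical helper)."""
    # Two cut points split the data into three roughly equal-sized groups.
    low_cut, high_cut = quantiles(values, n=3, method="inclusive")
    result = []
    for value in values:
        if value <= low_cut:
            result.append(labels[0])
        elif value <= high_cut:
            result.append(labels[1])
        else:
            result.append(labels[2])
    return result


# Made-up consumption figures, for illustration only (not gapminder data).
sample_kwh = [20.0, 35.0, 50.0, 80.0, 120.0, 200.0, 450.0, 900.0, 1500.0]
sample_labels = tercile_labels(sample_kwh)

for kwh, label in zip(sample_kwh, sample_labels):
    print(kwh, label)
```

With nine values, each label is assigned to three of them, which is the "equal-sized groups" property that a quantile split is meant to give.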

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
    pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
    pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
    pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Saharan Africa.
# For this, I added a new column, continent, to the gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']

# Extracting data pertinent to variables of interest
# (.copy() so that adding columns later does not raise SettingWithCopyWarning)
sub_sahara_africa = \
    sub_sahara_africa[['incomeperperson', 'relectricperperson',
                       'urbanrate']].copy()

print(sub_sahara_africa)

print('Split relectricperperson data into 3 categories '
      '- LowkWh, MediumkWh, HighkWh')
sub_sahara_africa['perpersonkWh'] = \
    pandas.qcut(sub_sahara_africa.relectricperperson, 3,
                labels=["LowkWh", "MediumkWh", "HighkWh"])

# Box plot
seaborn.boxplot(x='perpersonkWh', y='relectricperperson',
                hue='perpersonkWh',
                data=sub_sahara_africa, saturation=1, orient='v')

# using ols function for calculating the F-statistic and associated p value
# To handle this issue: PatsyError: Error evaluating factor: TypeError:
# 'ClassRegistry' object is not callable incomeperperson ~ C(perpersonkWh), run
# del c on the command line.

# Getting 'typeerror data type not understood' so I convert the categorical
# value to type object prior to running ols
# (np.object was removed in NumPy 1.24; the builtin object is equivalent)
sub_sahara_africa['perpersonkWh_fixed'] = \
    sub_sahara_africa.perpersonkWh.astype(object)

ssa_model = \
    smf.ols(formula='incomeperperson ~ C(perpersonkWh_fixed)',
            data=sub_sahara_africa).fit()

print(ssa_model.summary())

ssa_dropna = \
    sub_sahara_africa[['incomeperperson', 'perpersonkWh']].dropna()
print(ssa_dropna['perpersonkWh'])
print(ssa_dropna['incomeperperson'])

print('means for incomeperperson by per person kWh status')
ssa_kWh_mean = ssa_dropna.groupby('perpersonkWh').mean()
print(ssa_kWh_mean)

# To plot line chart of categorical mean values
plt.plot(ssa_kWh_mean, 'bD')

print('std for incomeperperson by per person kWh status')
ssa_kWh_std = ssa_dropna.groupby('perpersonkWh').std()
print(ssa_kWh_std)

ssa_mc = multi.MultiComparison(ssa_dropna['incomeperperson'],
                               ssa_dropna['perpersonkWh'])
ssa_res2 = ssa_mc.tukeyhsd()

print(ssa_res2.summary())

Graph


The boxplot shows that per person electricity consumption differs across the LowkWh, MediumkWh and HighkWh groups, which is expected because the groups are terciles of that same variable. The second chart plots the mean income per person for each category of electric power consumers.


ANOVA Results

When examining the relationship between income per person and category of per person electricity consumption among countries in Sub-Saharan Africa, the LowkWh group (mean = 601.949389, s.d. = 846.373070) had a lower mean income per person than both the MediumkWh (mean = 639.068846, s.d. = 299.583651) and HighkWh (mean = 2086.976328, s.d. = 1866.259027) groups of electricity consumers.

A one-way ANOVA found a statistically significant difference in mean income per person across the three categories of per person electricity consumption (LowkWh, MediumkWh and HighkWh, with consumption measured in kilowatt-hours, kWh): F(2, 19) = 3.694, p = 0.0441.
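
To make the F statistic and its degrees of freedom less of a black box, here is a minimal sketch of the one-way ANOVA calculation using only the standard library. It is not the statsmodels code path used above, the function name one_way_anova is my own, and the toy numbers are made up. The degrees of freedom are (k - 1, N - k) for k groups and N observations, so the F(2, 19) reported above corresponds to 3 groups and 22 countries with non-missing values.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)
    n_total = len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up toy data, for illustration only (not gapminder values).
toy_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_stat, df_b, df_w = one_way_anova(toy_groups)
print(round(f_stat, 3), df_b, df_w)  # 13.0 2 6
```

A large F means the group means are spread out relative to the scatter inside each group; the p value then comes from the F distribution with those two degrees of freedom.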

In view of the above result, I will perform a post hoc test using Tukey's Honestly Significant Difference (HSD) test.

Tukey's post hoc test (see table below) found no statistically significant pairwise differences in mean income per person between the groups of interest in Sub-Saharan Africa at a family-wise error rate of 0.05. This does not show that the groups have the same mean: the overall ANOVA was significant, and the confidence intervals for both HighkWh comparisons only narrowly include zero.

     Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================================
 group1     group2    meandiff       lower      upper  reject
-------------------------------------------------------------
HighkWh     LowkWh  -1485.0269  -3035.9813    65.9274   False
HighkWh  MediumkWh  -1447.9075  -3049.7262   153.9113   False
 LowkWh  MediumkWh     37.1195  -1513.8349  1588.0738   False
-------------------------------------------------------------
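
As a quick sanity check on how to read the table, the sketch below re-derives the reject column from the confidence intervals and compares each meandiff with the group means reported under ANOVA Results. All numbers are copied from this post. The sign convention meandiff = mean(group2) - mean(group1) is my assumption; it appears to match the reported values. The helper name reject_null is my own.

```python
# Rows copied from the Tukey HSD table: group1, group2, meandiff, lower, upper.
tukey_rows = [
    ("HighkWh", "LowkWh", -1485.0269, -3035.9813, 65.9274),
    ("HighkWh", "MediumkWh", -1447.9075, -3049.7262, 153.9113),
    ("LowkWh", "MediumkWh", 37.1195, -1513.8349, 1588.0738),
]

# Group means as reported under ANOVA Results.
group_means = {
    "LowkWh": 601.949389,
    "MediumkWh": 639.068846,
    "HighkWh": 2086.976328,
}


def reject_null(lower, upper):
    """Reject 'no difference' only if the interval excludes zero."""
    return not (lower <= 0.0 <= upper)


checks = []
for group1, group2, meandiff, lower, upper in tukey_rows:
    # Assumed convention: difference is group2 minus group1.
    recomputed = group_means[group2] - group_means[group1]
    checks.append((group1, group2, recomputed, meandiff,
                   reject_null(lower, upper)))

for group1, group2, recomputed, meandiff, reject in checks:
    print(group1, group2, round(recomputed, 4), meandiff, reject)
```

Every interval contains zero, so no pair is rejected, which agrees with the reject column in the table.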
