Saturday, December 26, 2015

Testing a potential Moderator

Introduction

In this assignment, I will attempt to identify the impact of population density on relationship between "incomeperperson" and "relectricperperson" based on the gapminder data. My earlier analysis revealed a strong relationship between per capita income and per person electric consumption globally. 

Variables

Explanatory: "relectricperperson"
Response: "incomeperperson"
Moderator: "urbanrate"


Data Analysis

import pandas
import seaborn
import scipy
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# define method to load data of interest


def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data


# loading data
data = load_data('data', 'gapminder.csv')

# Extracting data pertinent to variables of interest
data = data[['incomeperperson', 'relectricperperson', 'urbanrate']]

# setting variables of interest to numeric
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
print(data.head(20))

data = data.dropna()

# Splitting per capita income in 3 tranches
def percapitagdp(row):
if row['incomeperperson'] <= data['incomeperperson'].quantile(.25):
return 0
if row['incomeperperson'] > data['incomeperperson'].quantile(.25) and\
row['incomeperperson'] <= data['incomeperperson'].quantile(.75):
return 1
if row['incomeperperson'] > data['incomeperperson'].quantile(.75):
return 2


data['percapitagdp'] = data.apply(lambda row: percapitagdp(row), axis=1)

# Creating dataframe for each income group: 0 : Low Income, 1 : Middle Income
# 2: High Income
low_gdp = data[(data['percapitagdp'] == 0)]
print(low_gdp)

medium_gdp = data[(data['percapitagdp'] == 1)]
print(medium_gdp)

high_gdp = data[(data['percapitagdp'] == 2)]

print(high_gdp)

# Splitting household electiricity consumption 2 tranches - 0: low per person
# electricity consumption, 1: high per person electricity consumption
def percapitakwh(row):
if row['relectricperperson'] <= data['relectricperperson'].quantile(.5):
return 0
if row['relectricperperson'] > data['relectricperperson'].quantile(.5):
return 1

data['percapitakWh'] = data.apply(lambda row: percapitakwh(row), axis=1)

# Count items per income group
electricgrp = data['percapitakWh'].value_counts(sort=False, dropna=False)
print(electricgrp)

# Creating dataframe for each per person electric consumption group
low_kWh = data[(data['percapitakWh'] == 0)]
print(low_kWh)

high_kWh = data[(data['percapitakWh'] == 1)]

print(high_kWh)

# Splitting urbanrate into 2 tranches - 0: low population density area
# 1: high population density area


def areatype(row):
if row['urbanrate'] <= data['urbanrate'].quantile(.5):
return 0
if row['urbanrate'] > data['urbanrate'].quantile(.5):
return 1

data['areatype'] = data.apply(lambda row: areatype(row), axis=1)

areagrp = data['areatype'].value_counts(sort=False, dropna=True)

print(areagrp)

Chi Square Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"


# contingency table of observed counts
gdp_kWh = pandas.crosstab(data['percapitagdp'], data['percapitakWh'])
print(gdp_kWh)

# column percentages
colsum = gdp_kWh.sum(axis=0)
colpct = gdp_kWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_gdpkWh = scipy.stats.chi2_contingency(gdp_kWh)

print(cs_gdpkWh)

Test Result

chi-square value =  62.183712121212125
p value = 3.1403530858162606e-14

From the Chi Square test, we can say that there is a statistically significant association between "incomeperperson" explanatory variable  and "relectricperperson" response variable as evidenced by p-value < 0.05


# set variable types
data['percapitakWh'] = data['percapitakWh'].astype('category')
data['percapitagdp'] =\
pandas.to_numeric(data['percapitagdp'], errors='coerce')

# bivariate bar graph
seaborn.factorplot(x="percapitakWh", y="percapitagdp", data=data, kind="bar",
ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')




# Creating dataframe for low and high population density groups
low_pop = data[(data['areatype'] == 0)]
high_pop = data[(data['areatype'] == 1)]

print('association between per capita income and per person electric\
consumption for residents in low population density areas')
# contingency table of observed counts
lowpop_gdpkWh = pandas.crosstab(low_pop['percapitagdp'],
low_pop['percapitakWh'])
print(lowpop_gdpkWh)

# column percentages
colsum = lowpop_gdpkWh.sum(axis=0)
colpct = lowpop_gdpkWh/colsum

print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_lowpop_gdpkWh = scipy.stats.chi2_contingency(lowpop_gdpkWh)

print(cs_lowpop_gdpkWh)

Result I - impact of Low population density as a moderator on income and electricity consumption

chi-square value = 28.577586206896555
p value =  6.2295403346060193e-07

The result is statistically significant as evidenced by p-value < 0.05. That is, low population density has a moderating influence on the relationship between "incomeperperson" and "relectricperperson"


print('association between per capita income and per person electric\
consumption for residents in high population density areas')
# contingency table of observed counts
ct_highpop_gdpkWh = pandas.crosstab(high_pop['percapitagdp'],
high_pop['percapitakWh'])
print(ct_highpop_gdpkWh)

# column percentages
colsum = ct_highpop_gdpkWh.sum(axis=0)
colpct = ct_highpop_gdpkWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_highpop_gdpkWh = scipy.stats.chi2_contingency(ct_highpop_gdpkWh)
print(cs_highpop_gdpkWh)

Result II - High population density as a moderator on income and electricity consumption

chi-square value = 17.439163165266105
p value =  0.0001633555274138804

The result is statistically significant as evidenced by p-value < 0.05. That is, high population density has a moderating influence on the relationship between "incomeperperson" and "relectricperperson"

# Line Chart
seaborn.factorplot(x="percapitakWh", y="percapitagdp", data=low_pop,
kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for \
residents of LOW population density areas.')

seaborn.factorplot(x="percapitakWh", y="percapitagdp", data=high_pop,
kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for \

residents of HIGH population density areas.')
Both graphs show positive trend, that is, high per person electricity consumption is associated with higher income. However, people living in high population areas tend relatively to spend more of their income on electricity compare to those in low population density areas.

ANOVA Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"


# using ols function for calculating the F-statistic and associated p value
gdpkWh_model = smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=data)
gdpkWh_res = gdpkWh_model.fit()

print(gdpkWh_res.summary())

print("standard deviation for mean Per person electric consumption by Low vs.\
Medium vs. High income levels")
st1 = gdpkWh.groupby('percapitagdp').std()

print(st1)

ANOVA Test Result

When examining the relationship between per person electricity consumption across various levels of income, ANOVA revealed that globally,  
Low Income(Mean=144.655970,s.d=153.330045 ) consumes less electricity relative to both  Medium Income(Mean=729.941778, s.d=568.913501) and High Income(2964.549001, s.d=2162.289249) levels.

There is statistically significant differences between means of per person electricity consumption(measured in kilowatt per hour (kWh) across income levels- 0, 1 and 2, determined by one-way ANOVA F(2,130) = 57.01, p = 2.15e-18

# Bar plot 
seaborn.factorplot(x='percapitagdp', y='relectricperperson', data=data,
kind='bar', ci=None)
plt.xlabel('Levels Income Per Person')

plt.ylabel('Mean Per Person Electric Consumption')



print('association between per person electricity consumption and per capita \
income for those living in low population density areas')
gdpkWh_lowpop_model =\
smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=low_pop)
print('association between per person electricity consumption and per capita \
income for those living in low population density areas')
gdpkWh_lowpop_model =\
smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=low_pop)
gdpkWh_lowpop_res = gdpkWh_lowpop_model.fit()
print(gdpkWh_lowpop_res.summary())

print("means for Per Person Electric Consumption by per capita income \
0 vs. 1 vs. 2 for low population density")
mean_lowpop = low_pop.groupby('percapitagdp').mean()
print(mean_lowpop)

print("standard deviation for mean Per person electric consumption by Low vs.\
Medium vs. High income levels for Low population density areas")
std_lowpop = low_pop.groupby('percapitagdp').std()

print(std_lowpop)

Result I - impact of Low population density as a moderator on income and electricity electricity consumption

When examining the interaction between low urban rate,  per person electricity consumption and income levels, ANOVA revealed that globally,  Low Income(Mean=132.260069,s.d=122.047634 ) consumes less electricity relative to both  Medium Income(Mean = 606.876711, s.d = 481.566696) and High Income(2124.808392, s.d =1105.691412) levels. That is, urbanization has a moderating impact on electricity consumption as it relates to income levels.

There is statistically significant differences between means of per person electricity consumption(measured in kilowatt per hour (kWh) across income levels- 0, 1 and 2 and LOW population density areas, determined by one-way ANOVA F(2,65) = 46.44, p = 4.72e-13. 

Result II - impact of High population density as a moderator on income and electricity electricity consumption

When examining the interaction between high urban rate,  per person electricity consumption and income levels, ANOVA revealed that globally,  
Low Income(Mean=336.792431,s.d=476.29642 ) consumes less electricity relative to both  Medium Income(Mean = 831.909977, s.d = 620.584218) and High Income(3114.502681, s.d =2281.732505) levels. That is, "urbanrate" can be considered as a "lurking" factor in explaining the relationship between  electricity consumption and income levels.

There is statistically significant differences between means of per person electricity consumption(measured in kilowatt per hour (kWh) across income levels- 0, 1 and 2 and HIGH population density areas, determined by one-way ANOVA F(2,65) = 17.22p = 1.13e-06

Pearson Correlation Coefficient Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"

print('association between incomeperperson and relectricperperson for LOW \
population density countries')
(r, p) =\
scipy.stats.pearsonr(low_pop['incomeperperson'],
low_pop['relectricperperson'])

print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=low_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for LOW population density countries')

print(scat1)

Result I - impact of Low population density as a moderator on income and electricity electricity consumption

Correlation Coefficient 

(r = 0.873841, p-value = 2.12383414237e-21)
r2 = 0.763597

The generated correlation coefficient is positively correlated and is very strong with a value of 0.87 and is also statistically significant(p-value < 0.05). This goes to show that, it is highly unlikely that the relationship of this magnitude between income and electricity in the presence of low urbanrate globally is due to chance alone.

With an r-squared = 0.76, this suggests that, if we know the income per person, we can predict 76% of the variability we observe in per person electricity consumption globally in the presence of low urban rate. 

print('association between incomeperperson and relectricperperson for HIGH \
population density countries')
(r, p) =\
scipy.stats.pearsonr(high_pop['incomeperperson'],
high_pop['relectricperperson'])
print(' ')
print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat2 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=high_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for High population density countries')
print(scat2)

Result I - impact of High population density as a moderator on income and electricity electricity consumption

Correlation Coefficient 

(r = 0.520095, p-value = 8.98036750462e-06)
r2 = 0.270499

The generated correlation coefficient is positively correlated and moderately strong with a value of 0.52 and is also statistically significant(p-value < 0.05). This goes to show that, it is highly unlikely that the relationship of this magnitude between income and electricity in the presence of high urban rate globally is due to chance alone.

With an r-squared = 0.27, this suggests that, if we know the income per person, we can predict 27% of the variability we observe in per person electricity consumption globally in the presence of low urban rate. 

Thus, we can hypothesize that, "urbanrate" is a potential moderator on relationship between "relectricperperson" and "incomeperperson"

Monday, December 14, 2015

Generating a Correlation Coefficient

Introduction

In this week's assignment, the ask is to use provided datasets and generate  and interpret correlation coefficient using two quantitative variables. I will use the gapminder dataset and calculate correlation coefficient between "incomeperperson" and "relectricperperson".

Code

import pandas
import numpy
import seaborn
import scipy
import os
import matplotlib.pyplot as plt

# define method to load data of interest


def load_data(data_dir, csv_file):
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), data_dir)
DATA_FILE = os.path.join(DATA_PATH, csv_file)
data = pandas.read_csv(DATA_FILE, low_memory=False)
return data


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)


# loading data
data = load_data('data', 'gapminder.csv')
print(data)

reg_data = data.copy()

# Extracting data pertinent to variables of interest
reg_data = \
reg_data[['incomeperperson', 'relectricperperson']]

print(reg_data)


# setting variables of interest to numeric
reg_data['incomeperperson'] = \
pandas.to_numeric(reg_data['incomeperperson'], errors='coerce')
reg_data['relectricperperson'] = \
pandas.to_numeric(reg_data['relectricperperson'], errors='coerce')


reg_data['incomeperperson'] =\
reg_data['incomeperperson'].replace(' ', numpy.nan)

reg_data['relectricperperson'] =\
reg_data['relectricperperson'].replace(' ', numpy.nan)

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=True, data=reg_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh')
plt.title('Scatterplot for the Association Between income and\
electricity consumption globally')

Scatter Plot


# Remove all NAs from data
data_clean = reg_data.dropna()

print('association between income and electricity consumption')
(r, p) =\
scipy.stats.pearsonr(data_clean['incomeperperson'],
data_clean['relectricperperson'])

print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

Correlation Coefficient

(r =  0.651637, p-value =  4.63071717359e-17)
r2 = 0.424631


Conclusion


The generated correlation coefficient is positively correlated and modestly strong with a value of 0.65 and is also statistically significant(p-value < 0.05). This goes to show that, it is highly unlikely that the relationship of this magnitude between income and electricity use globally is due to chance alone.

With an r-squared = 0.42, this suggests that, if we know the income per person, we can predict 42% of the variability we observe in per person electricity consumption globally. 

Wednesday, December 9, 2015

Chi Square Test of Independence

Introduction

In this week's assignment, the ask is to test for independence( no relationship between income and electricity consumption) in Sub Sahara  Africa. I will be using the gapminder data.

Ho: The relative proportion of  per person electricity consumption is the same regardless of income level in Sub Sahara Africa and they are independent.

Ha: The relative proportion of  per person electricity consumption changes with respect to changes in  income levels in Sub Sahara Africa and they are dependent.

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt
import scipy.stats


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Sahara Africa.
# For this, I addeded a new column - continent to gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']
sub_sahara_africa2 = sub_sahara_africa.copy()
print(sub_sahara_africa2.shape)

# Extracting data pertinent to variables of interest
income_electric_filter = ['incomeperperson', 'relectricperperson']
sub_sahara_africa2 = sub_sahara_africa2[income_electric_filter]
print(sub_sahara_africa2)


# Categorizing numeric data

def perpersonkwh(row):
if row['relectricperperson'] <=\
sub_sahara_africa2['relectricperperson'].quantile(.5):
return 'LowkWh'
if row['relectricperperson'] >\
sub_sahara_africa2['relectricperperson'].quantile(.5):
return 'HighkWh'

sub_sahara_africa2['perpersonkWh'] =\
sub_sahara_africa2.apply(lambda row: perpersonkwh(row), axis=1)


def percapitagdp(row):
if row['incomeperperson'] <=\
sub_sahara_africa2['incomeperperson'].quantile(.25):
return 'LowGDP'
if row['incomeperperson'] >\
sub_sahara_africa2['incomeperperson'].quantile(.25) and \
row['incomeperperson'] <=\
sub_sahara_africa2['incomeperperson'].quantile(.75):
return 'MediumGDP'
if row['incomeperperson'] >\
sub_sahara_africa2['incomeperperson'].quantile(.75):

return 'HighGDP'

sub_sahara_africa2['percapitaGDP'] =\
sub_sahara_africa2.apply(lambda row: percapitagdp(row), axis=1)

print(sub_sahara_africa2['perpersonkWh'].head(10))

print(sub_sahara_africa2['percapitaGDP'].head(10))

# recoding values for perpersonkWh into a new variable, PERPRSKWH
kWh_recode = {'LowkWh': 0, 'HighkWh': 1}
sub_sahara_africa2['PERPRSKWH'] =\
sub_sahara_africa2['perpersonkWh'].map(kWh_recode)

# recoding values for percapitaGDP into a new variable, PERCAPGDP
gdp_recode = {'LowGDP': 1, 'MediumGDP': 2, 'HighGDP': 3}
sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['percapitaGDP'].map(gdp_recode)

print(sub_sahara_africa2['PERPRSKWH'].head(10))

print(sub_sahara_africa2['PERCAPGDP'].head(10))

# contingency/cross classification table of observed counts
ssa_cont_tbl = pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['PERCAPGDP'])

print(ssa_cont_tbl)

          PERCAPGDP  1.000000  2.000000  3.000000

PERPRSKWH  

0.000000                3          6        2

1.000000                0          7        4



# column percentages
colsum = ssa_cont_tbl.sum(axis=0)
print(colsum)
colpct = ssa_cont_tbl/colsum
print(colpct)

          PERCAPGDP  1.000000  2.000000  3.000000

PERPRSKWH

0.000000             1.000000  0.461538  0.333333

1.000000             0.000000  0.538462  0.666667


# chi-square
print('chi-square value, p value, expected counts')
ssa_chisquare = scipy.stats.chi2_contingency(ssa_cont_tbl)

print(ssa_chisquare)

# set variable types
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['PERPRSKWH'].head(10))

sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['PERCAPGDP'].astype('category')
sub_sahara_africa2['PERPRSKWH'] =\
pandas.to_numeric(sub_sahara_africa2['PERPRSKWH'], errors='coerce')

print(sub_sahara_africa2['PERCAPGDP'].head(10))

print(sub_sahara_africa2['PERPRSKWH'].head(10))

# Factor plot
seaborn.factorplot(x='PERCAPGDP', y='PERPRSKWH', data=sub_sahara_africa2,
kind='bar', ci=None)
plt.xlabel('Per Capita GDP')

plt.ylabel('Proportion of Power Consumption')

Graphically, there seem to be some relationship between proportion of per person electricity consumption and per capita income. Specifically, the higher the income level, the more electricity consumed(LowGDP(0.0%) < MidiumGDP(54%) < HighGDP(67%)

Chi Square Result

There is no statistically significance as determined by  Chi Square Test result at alpha = 0.05:
(X2 = 3.7435897435897436, p = 0.15384727771283108, d.f  =  2) and we accept the null hypothesis. That is, the proportion of per person electricity consume is independent of per capita income levels across Sub Sahara Africa.

Bonferroni Adjustments

To minimize the familywise p value, we will now conduct pairwise comparison between per person electricity consumption and per capita income in Sub Sahara Africa. In this case we will need to conduct three such tests and compare the p-values from the tests to our Bonferroni Adjusted  p-value 05/3 = 0.016666

mediumGDP_lowGDP = {2: 2, 1: 1}
sub_sahara_africa2['MediumLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(mediumGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['MediumLowGDP'].head(10))
# contingency table of observed counts
MediumGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['MediumLowGDP'])
print(MediumGDPLowGDP)

# column percentages
colsum = MediumGDPLowGDP.sum(axis=0)
print(colsum)
colpct = MediumGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
medium_lowGDP_cont = scipy.stats.chi2_contingency(MediumGDPLowGDP)
print(medium_lowGDP_cont)

             MediumLowGDP  1.000000  2.000000

PERPRSKWH 

0.000000                      3         6

1.000000                      0         7


             MediumLowGDP  1.000000  2.000000

PERPRSKWH

0.000000                   1.000000  0.461538

1.000000                   0.000000  0.538462

X2 = 1.1005291005291005, P = 0.29415001818473019, d.f = 1

Proportionally, the higher the per capita income the more electricity consumed(54% vs. 0.0%)

highGDP_lowGDP = {3: 3, 1: 1}
sub_sahara_africa2['HighLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighLowGDP'].head(10))
# contingency table of observed counts
HighGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighLowGDP'])
print(HighGDPLowGDP)

# column percentages
colsum = HighGDPLowGDP.sum(axis=0)
print(colsum)
colpct = HighGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_lowGDP_cont = scipy.stats.chi2_contingency(HighGDPLowGDP)
print(high_lowGDP_cont)

               HighLowGDP  1.000000  3.000000

PERPRSKWH

0.000000                      3         2

1.000000                      0         4


              HighLowGDP  1.000000  3.000000
PERPRSKWH

0.000000                  1.000000  0.333333

1.000000                  0.000000  0.666667

(chi-square=1.40625,p-value = 0.23567991342903749,d.f = 1)

Proportionally, the higher the per capita income the more electricity consumed(67% vs. 0.0%)

highGDP_mediumGDP = {3: 3, 2: 2}
sub_sahara_africa2['HighMediumGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_mediumGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighMediumGDP'].head(10))
# contingency table of observed counts
HighGDPMediumGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighMediumGDP'])
print(HighGDPMediumGDP)

# column percentages
colsum = HighGDPMediumGDP.sum(axis=0)
print(colsum)
colpct = HighGDPMediumGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_mediumGDP_cont = scipy.stats.chi2_contingency(HighGDPMediumGDP)
print(high_mediumGDP_cont)


                 HighMediumGDP  2.000000  3.000000
PERPRSKWH

0.000000                           6         2

1.000000                           7         4

                 HighMediumGDP  2.000000  3.000000

PERPRSKWH

0.000000                        0.461538  0.333333

1.000000                        0.538462  0.666667


chi-square = 0.00069201631701630963,

p-value = 0.97901310733501912, 

d.f = 1

Proportionally, the higher the per capita income the more electricity consumed(67% vs. 54.0%)

Conclusion

Based on the pairwise analysis, it appears per income capita has no impact on proportion of per person electricity consumption in Sub Sahara Africa, since we accept all pairwise comparison between income and per person electricity consumption at adjusted p-value of  0.016666. Thus, per person electricity consumption is independent of per capita income.

Thursday, December 3, 2015

Running an analysis of variance

Introduction

This marks the continuation of "Data Management and Visualization". I will continue to use the gapminder dataset to answer my hypothesis - Is per per person electricity consumption a proxy for economic development as measured by per capita income.

For this particular assignment the goal is to test this hypothesis;
H0: there is no difference in mean per person electricity consumption across categories of consumers
Ha: there are differences

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt


# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Sahara Africa.
# For this, I addeded a new column - continent to gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']

# Extracting data pertinent to variables of interest
sub_sahara_africa = \
sub_sahara_africa[['incomeperperson', 'relectricperperson',
'urbanrate']]

print(sub_sahara_africa)

print('Split relectricperperson data into 3 categories \
- LowkWh, MediumkWh, HighkWh')
sub_sahara_africa['perpersonkWh'] =\
pandas.qcut(sub_sahara_africa.relectricperperson, 3,
labels=["LowkWh", "MediumkWh", "HighkWh"])

# Box plot
seaborn.boxplot(x='perpersonkWh', y='relectricperperson',
hue='perpersonkWh',
data=sub_sahara_africa, saturation=1, orient='v')

# using ols function for calculating the F-statistic and associated p value
# To handle this issue: PatsyError: Error evaluating factor: TypeError:
# 'ClassRegistry' object is not callable incomeperperson ~ C(perpersonkWh), run
# del c on the command line.

# Getting 'typeerror data type not understood' so I convert the categorical
# value to type object prior to running ols
sub_sahara_africa['perpersonkWh_fixed'] =\
sub_sahara_africa.perpersonkWh.astype(np.object)

ssa_model =\
smf.ols(formula='incomeperperson ~ C(perpersonkWh_fixed)',
data=sub_sahara_africa).fit()

print(ssa_model.summary())

ssa_dropna =\
sub_sahara_africa[['incomeperperson', 'perpersonkWh']].dropna()
print(ssa_dropna['perpersonkWh'])
print(ssa_dropna['incomeperperson'])

print('means for incomeperperson by per person kWh status')
ssa_kWh_mean = ssa_dropna.groupby('perpersonkWh').mean()
print(ssa_kWh_mean)

# To plot line chart of categorical mean values
plt.plot(ssa_kWh_mean, 'bD')

print('std for incomeperperson by per person kWh status')
ssa_kWh_std = ssa_dropna.groupby('perpersonkWh').std()
print(ssa_kWh_std)

ssa_mc = multi.MultiComparison(ssa_dropna['incomeperperson'],
ssa_dropna['perpersonkWh'])
ssa_res2 = ssa_mc.tukeyhsd()

print(ssa_res2.summary())

Graph


This boxplot shows that the mean consumption of electricity across per person electricity consumption is not equal across the groups. The line chart shows the mean per person electricity consumption across categories of electric power consumers.


ANOVA Results

When examining the relationship between per person electricity consumption across groups of interest, ANOVA revealed that among countries in Sub Sahara Africa,  LowkWh(Mean=601.949389,s.d=846.373070 ) groups consumes less electricity compared to both  MediumkWh(Mean=639.068846, s.d=299.583651) and HighkWh(Mean=2086.976328, s.d=1866.259027) groups of electricity consumers.

There is statistically significant differences between means of per person electricity consumption(measured in kilowatt per hour (kWh) across groups - LowkWh, MediumkWh and HighkWh, determined by one-way ANOVA F(2,19) = 3.694,
p = 0.0441. 

In view of the above result, I will perform Post Hoc test using Tukey Highly Significance Difference test. 

A Tukey Post hoc test(see table below) revealed that there are no statistically significance differences in per person electricity consumption in Sub Sahara Africa between  groups of interest. Thus, pairs of groups consume the same quantity of electricity on average.

   Multiple Comparison of Means - Tukey HSD,FWER=0.05
========================================================
 group1            group2                meandiff       lower            upper         reject
--------------------------------------------------------------------------------------------------
HighkWh         LowkWh          -1485.0269     -3035.9813  65.9274      False
HighkWh         MediumkWh    -1447.9075    -3049.7262  153.9113     False
 LowkWh         MediumkWh    37.1195         -1513.8349 1588.0738    False
--------------------------------------------------------------------------------------------------