Data Analysis and Interpretation: Chi Square Test of Independence

Introduction

In this week's assignment, the task is to test for independence, that is, whether there is no relationship between income and electricity consumption in Sub-Saharan Africa. I will be using the gapminder data.

H0: The relative proportion of per-person electricity consumption is the same regardless of income level in Sub-Saharan Africa; the two variables are independent.

Ha: The relative proportion of per-person electricity consumption differs across income levels in Sub-Saharan Africa; the two variables are dependent.

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt
import scipy.stats

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Sahara Africa.
# For this, I added a new column, continent, to the gapminder DataFrame
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']
sub_sahara_africa2 = sub_sahara_africa.copy()
print(sub_sahara_africa2.shape)

# Extracting data pertinent to variables of interest
income_electric_filter = ['incomeperperson', 'relectricperperson']
sub_sahara_africa2 = sub_sahara_africa2[income_electric_filter]
print(sub_sahara_africa2)

# Categorizing numeric data

def perpersonkwh(row):
    # Split at the median (50th percentile) of per-person electricity use;
    # rows with missing values fall through and return None
    median_kwh = sub_sahara_africa2['relectricperperson'].quantile(.5)
    if row['relectricperperson'] <= median_kwh:
        return 'LowkWh'
    if row['relectricperperson'] > median_kwh:
        return 'HighkWh'

sub_sahara_africa2['perpersonkWh'] =\
sub_sahara_africa2.apply(lambda row: perpersonkwh(row), axis=1)

def percapitagdp(row):
    # Low: bottom quartile, Medium: middle 50%, High: top quartile;
    # rows with missing values fall through and return None
    q25 = sub_sahara_africa2['incomeperperson'].quantile(.25)
    q75 = sub_sahara_africa2['incomeperperson'].quantile(.75)
    if row['incomeperperson'] <= q25:
        return 'LowGDP'
    if row['incomeperperson'] > q25 and row['incomeperperson'] <= q75:
        return 'MediumGDP'
    if row['incomeperperson'] > q75:
        return 'HighGDP'

sub_sahara_africa2['percapitaGDP'] =\
sub_sahara_africa2.apply(lambda row: percapitagdp(row), axis=1)

print(sub_sahara_africa2['perpersonkWh'].head(10))

print(sub_sahara_africa2['percapitaGDP'].head(10))
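
As an aside, the same quartile-based grouping can be written more compactly with pandas.qcut. The sketch below is not the code used for the results in this post: the income values are made up for illustration, and the names incomes and gdp_groups are hypothetical.

```python
import numpy as np
import pandas

# Hypothetical income values (not gapminder data); the last one is missing
incomes = pandas.Series([100, 200, 300, 400, 500, 600, 700, 800, np.nan])

# Bin edges at the 25th and 75th percentiles. Bins are right-inclusive,
# which mirrors the <= comparisons in percapitagdp().
gdp_groups = pandas.qcut(incomes, q=[0, .25, .75, 1],
                         labels=['LowGDP', 'MediumGDP', 'HighGDP'])

print(gdp_groups.value_counts())
```

Missing values stay missing in the result, which matches the behavior of the row-wise functions above (they return None for such rows).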

# recoding values for perpersonkWh into a new variable, PERPRSKWH
kWh_recode = {'LowkWh': 0, 'HighkWh': 1}
sub_sahara_africa2['PERPRSKWH'] =\
sub_sahara_africa2['perpersonkWh'].map(kWh_recode)

# recoding values for percapitaGDP into a new variable, PERCAPGDP
gdp_recode = {'LowGDP': 1, 'MediumGDP': 2, 'HighGDP': 3}
sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['percapitaGDP'].map(gdp_recode)

print(sub_sahara_africa2['PERPRSKWH'].head(10))

print(sub_sahara_africa2['PERCAPGDP'].head(10))

# contingency/cross classification table of observed counts
ssa_cont_tbl = pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['PERCAPGDP'])

print(ssa_cont_tbl)

PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000          3         6         2
1.000000          0         7         4

# column percentages
colsum = ssa_cont_tbl.sum(axis=0)
print(colsum)
colpct = ssa_cont_tbl/colsum
print(colpct)

PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000   1.000000  0.461538  0.333333
1.000000   0.000000  0.538462  0.666667
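
The column proportions can also be obtained in one step with the normalize argument of pandas.crosstab. The sketch below rebuilds one row per country from the observed counts reported in this post; the names counts, demo and col_pct are hypothetical and exist only for this illustration.

```python
import pandas

# Rebuild one row per country from the observed counts
# keys are (PERPRSKWH, PERCAPGDP) pairs, values are cell counts
counts = {(0, 1): 3, (0, 2): 6, (0, 3): 2,
          (1, 1): 0, (1, 2): 7, (1, 3): 4}
rows = [(kwh, gdp) for (kwh, gdp), n in counts.items() for _ in range(n)]
demo = pandas.DataFrame(rows, columns=['PERPRSKWH', 'PERCAPGDP'])

# normalize='columns' divides each cell by its column total
col_pct = pandas.crosstab(demo['PERPRSKWH'], demo['PERCAPGDP'],
                          normalize='columns')
print(col_pct.round(3))
```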

# chi-square
print('chi-square value, p value, expected counts')
ssa_chisquare = scipy.stats.chi2_contingency(ssa_cont_tbl)

print(ssa_chisquare)

# set variable types
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['PERPRSKWH'].head(10))

sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['PERCAPGDP'].astype('category')
sub_sahara_africa2['PERPRSKWH'] =\
pandas.to_numeric(sub_sahara_africa2['PERPRSKWH'], errors='coerce')

print(sub_sahara_africa2['PERCAPGDP'].head(10))

print(sub_sahara_africa2['PERPRSKWH'].head(10))

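# Bar plot of the proportion of high-consumption countries per income group
# (seaborn.factorplot was renamed to catplot in seaborn 0.9)
seaborn.catplot(x='PERCAPGDP', y='PERPRSKWH', data=sub_sahara_africa2,
                kind='bar', ci=None)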
plt.xlabel('Per Capita GDP')

plt.ylabel('Proportion of Power Consumption')

Graphically, there seems to be some relationship between the proportion of countries with high per-person electricity consumption and per capita income. Specifically, the higher the income level, the larger the share of high-consumption countries: LowGDP (0%) < MediumGDP (54%) < HighGDP (67%).

Chi Square Result

The chi-square test shows no statistically significant association at alpha = 0.05 (X2 = 3.744, p = 0.154, d.f. = 2), so we fail to reject the null hypothesis. That is, the data provide no evidence that the proportion of countries with high per-person electricity consumption depends on per capita income level across Sub-Saharan Africa.
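
The result can be reproduced directly from the observed counts, which also makes it easy to inspect the expected counts. With only 22 countries in the table, several expected counts fall below 5, a commonly cited rule of thumb for the chi-square approximation, so the p-value should be read with some caution. This is a minimal sketch; the names observed and n_small are hypothetical.

```python
import numpy as np
import scipy.stats

# Observed counts from the contingency table in this post
# rows: LowkWh (0), HighkWh (1); columns: LowGDP, MediumGDP, HighGDP
observed = np.array([[3, 6, 2],
                     [0, 7, 4]])

chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print('chi-square = %.3f, p = %.3f, d.f. = %d' % (chi2, p, dof))
print(expected)

# Count the cells whose expected frequency is below 5
n_small = int((expected < 5).sum())
print('cells with expected count < 5:', n_small)
```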

Bonferroni Adjustments

To control the family-wise error rate, we will now conduct pairwise comparisons of per-person electricity consumption between income groups in Sub-Saharan Africa. In this case we need to conduct three such tests and compare the p-value from each test to our Bonferroni-adjusted significance level, 0.05/3 = 0.0167.
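
For reference, the three comparisons can be run in a single loop over the observed counts. This is a minimal sketch rather than the code used in this post; the names gdp_levels, adjusted_alpha and pairwise_p are hypothetical.

```python
from itertools import combinations

import pandas
import scipy.stats

# Observed counts from the full contingency table
# rows: PERPRSKWH (0 = LowkWh, 1 = HighkWh); columns: PERCAPGDP (1, 2, 3)
gdp_levels = [1, 2, 3]
observed = pandas.DataFrame([[3, 6, 2], [0, 7, 4]],
                            index=[0, 1], columns=gdp_levels)

pairs = list(combinations(gdp_levels, 2))
adjusted_alpha = 0.05 / len(pairs)  # Bonferroni: 0.05 / 3

pairwise_p = {}
for a, b in pairs:
    # Keep only the two income groups being compared (a 2x2 table)
    chi2, p, dof, _ = scipy.stats.chi2_contingency(observed[[a, b]])
    pairwise_p[(a, b)] = p
    print(a, b, round(chi2, 4), round(p, 4), p < adjusted_alpha)
```

Each printed line ends with True only if that pair is significant at the adjusted level.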

mediumGDP_lowGDP = {2: 2, 1: 1}
sub_sahara_africa2['MediumLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(mediumGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['MediumLowGDP'].head(10))
# contingency table of observed counts
MediumGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['MediumLowGDP'])
print(MediumGDPLowGDP)

# column percentages
colsum = MediumGDPLowGDP.sum(axis=0)
print(colsum)
colpct = MediumGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
medium_lowGDP_cont = scipy.stats.chi2_contingency(MediumGDPLowGDP)
print(medium_lowGDP_cont)

MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000             3         6
1.000000             0         7

MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000      1.000000  0.461538
1.000000      0.000000  0.538462

X2 = 1.101, p = 0.294, d.f. = 1

Proportionally, the medium-income group has a larger share of high-consumption countries than the low-income group (54% vs. 0%), but the difference is not significant at the adjusted level (p = 0.294 > 0.0167).
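
One detail worth knowing when reading the pairwise statistics: scipy.stats.chi2_contingency applies Yates' continuity correction by default when the table has one degree of freedom, which is the case for every 2x2 table here. The sketch below compares the corrected and uncorrected statistic for the low vs. medium income table; the variable names are hypothetical.

```python
import scipy.stats

# Observed counts for the LowGDP vs. MediumGDP comparison
# rows: LowkWh (0), HighkWh (1); columns: LowGDP, MediumGDP
observed = [[3, 6],
            [0, 7]]

# Default behavior: Yates' correction is applied to 2x2 tables
chi2_corrected, p_corrected, _, _ = scipy.stats.chi2_contingency(observed)

# Same test without the continuity correction
chi2_plain, p_plain, _, _ = scipy.stats.chi2_contingency(observed,
                                                         correction=False)

print(round(chi2_corrected, 4), round(p_corrected, 4))
print(round(chi2_plain, 4), round(p_plain, 4))
```

The correction shrinks the statistic, so the corrected test is the more conservative of the two. For this table the uncorrected p-value also stays above the adjusted level of 0.0167.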

highGDP_lowGDP = {3: 3, 1: 1}
sub_sahara_africa2['HighLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighLowGDP'].head(10))
# contingency table of observed counts
HighGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighLowGDP'])
print(HighGDPLowGDP)

# column percentages
colsum = HighGDPLowGDP.sum(axis=0)
print(colsum)
colpct = HighGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_lowGDP_cont = scipy.stats.chi2_contingency(HighGDPLowGDP)
print(high_lowGDP_cont)

HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000           3         2
1.000000           0         4

HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000    1.000000  0.333333
1.000000    0.000000  0.666667

X2 = 1.406, p = 0.236, d.f. = 1

Proportionally, the high-income group has a larger share of high-consumption countries than the low-income group (67% vs. 0%), but the difference is not significant at the adjusted level (p = 0.236 > 0.0167).

highGDP_mediumGDP = {3: 3, 2: 2}
sub_sahara_africa2['HighMediumGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_mediumGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighMediumGDP'].head(10))
# contingency table of observed counts
HighGDPMediumGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighMediumGDP'])
print(HighGDPMediumGDP)

# column percentages
colsum = HighGDPMediumGDP.sum(axis=0)
print(colsum)
colpct = HighGDPMediumGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_mediumGDP_cont = scipy.stats.chi2_contingency(HighGDPMediumGDP)
print(high_mediumGDP_cont)

HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000              6         2
1.000000              7         4

HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000       0.461538  0.333333
1.000000       0.538462  0.666667

X2 = 0.0007, p = 0.979, d.f. = 1

Proportionally, the high-income group has a larger share of high-consumption countries than the medium-income group (67% vs. 54%), but the difference is not significant at the adjusted level (p = 0.979 > 0.0167).

Conclusion

Based on the pairwise analysis, per capita income shows no statistically significant association with the proportion of countries with high per-person electricity consumption in Sub-Saharan Africa: we fail to reject the null hypothesis in every pairwise comparison at the adjusted significance level of 0.0167. Thus, the data give no evidence that per-person electricity consumption depends on per capita income.

Wednesday, December 9, 2015
