Data Analysis and Interpretation: 2015

Saturday, December 26, 2015

Testing a potential Moderator

Introduction

In this assignment, I will test whether urbanization ("urbanrate", treated here as a proxy for population density) moderates the relationship between "incomeperperson" and "relectricperperson" in the gapminder data. My earlier analysis revealed a strong relationship between per capita income and per person electricity consumption globally.

Variables

Explanatory: "incomeperperson"
Response: "relectricperperson"
Moderator: "urbanrate"

Data Analysis

import pandas
import seaborn
import scipy.stats
import os
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# define method to load data of interest

def load_data(data_dir, csv_file):
    # Resolve the CSV path relative to the current working directory
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    # Return the DataFrame unconditionally so the function also works on import
    return pandas.read_csv(data_file, low_memory=False)

# loading data
data = load_data('data', 'gapminder.csv')

# Extracting data pertinent to variables of interest
# .copy() so the column assignments below modify an independent DataFrame
data = data[['incomeperperson', 'relectricperperson', 'urbanrate']].copy()

# setting variables of interest to numeric
data['incomeperperson'] = \
pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['relectricperperson'] = \
pandas.to_numeric(data['relectricperperson'], errors='coerce')
data['urbanrate'] = \
pandas.to_numeric(data['urbanrate'], errors='coerce')
print(data.head(20))

data = data.dropna()

# Splitting per capita income into 3 tranches
def percapitagdp(row):
    if row['incomeperperson'] <= data['incomeperperson'].quantile(.25):
        return 0
    if row['incomeperperson'] > data['incomeperperson'].quantile(.25) and\
            row['incomeperperson'] <= data['incomeperperson'].quantile(.75):
        return 1
    if row['incomeperperson'] > data['incomeperperson'].quantile(.75):
        return 2

data['percapitagdp'] = data.apply(lambda row: percapitagdp(row), axis=1)

# Creating dataframe for each income group: 0 : Low Income, 1 : Middle Income
# 2: High Income
low_gdp = data[(data['percapitagdp'] == 0)]
print(low_gdp)

medium_gdp = data[(data['percapitagdp'] == 1)]
print(medium_gdp)

high_gdp = data[(data['percapitagdp'] == 2)]

print(high_gdp)

# Splitting household electricity consumption into 2 tranches - 0: low per
# person electricity consumption, 1: high per person electricity consumption
def percapitakwh(row):
    if row['relectricperperson'] <= data['relectricperperson'].quantile(.5):
        return 0
    if row['relectricperperson'] > data['relectricperperson'].quantile(.5):
        return 1

data['percapitakWh'] = data.apply(lambda row: percapitakwh(row), axis=1)

# Count items per electricity consumption group
electricgrp = data['percapitakWh'].value_counts(sort=False, dropna=False)
print(electricgrp)

# Creating dataframe for each per person electric consumption group
low_kWh = data[(data['percapitakWh'] == 0)]
print(low_kWh)

high_kWh = data[(data['percapitakWh'] == 1)]

print(high_kWh)

# Splitting urbanrate into 2 tranches - 0: low population density area
# 1: high population density area

def areatype(row):
    if row['urbanrate'] <= data['urbanrate'].quantile(.5):
        return 0
    if row['urbanrate'] > data['urbanrate'].quantile(.5):
        return 1

data['areatype'] = data.apply(lambda row: areatype(row), axis=1)

areagrp = data['areatype'].value_counts(sort=False, dropna=True)

print(areagrp)
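
The three row-wise splitting functions above can also be written with vectorized pandas calls. The following is a minimal sketch, not the code used in this post: the toy values are hypothetical stand-ins for a gapminder column, and it assumes pandas.qcut with explicit quantile edges reproduces the same "<=" boundaries.

```python
import pandas

# Hypothetical toy values standing in for a gapminder column.
income = pandas.Series([1, 2, 3, 4, 5, 6, 7, 8], name='incomeperperson')

# Three tranches: bottom 25%, middle 50%, top 25%. qcut bins are
# right-inclusive, which matches the <= comparisons used above.
income_group = pandas.qcut(income, q=[0, .25, .75, 1],
                           labels=[0, 1, 2]).astype(int)

# Two tranches: at or below the median -> 0, above the median -> 1.
above_median = (income > income.median()).astype(int)

print(income_group.tolist())
print(above_median.tolist())
```

This avoids recomputing the quantiles once per row, which is what the apply-based functions do.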

Chi Square Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"

# contingency table of observed counts
gdp_kWh = pandas.crosstab(data['percapitagdp'], data['percapitakWh'])
print(gdp_kWh)

# column percentages
colsum = gdp_kWh.sum(axis=0)
colpct = gdp_kWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_gdpkWh = scipy.stats.chi2_contingency(gdp_kWh)

print(cs_gdpkWh)

Test Result

chi-square value = 62.183712121212125

p value = 3.1403530858162606e-14

From the chi-square test, we can say that there is a statistically significant association between the explanatory variable "incomeperperson" and the response variable "relectricperperson", as evidenced by a p-value < 0.05.

# set variable types
data['percapitakWh'] = data['percapitakWh'].astype('category')
data['percapitagdp'] =\
pandas.to_numeric(data['percapitagdp'], errors='coerce')

# bivariate bar graph
# catplot is the current name of factorplot (renamed in seaborn 0.9)
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=data, kind="bar",
                ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')

# Creating dataframe for low and high population density groups
low_pop = data[(data['areatype'] == 0)]
high_pop = data[(data['areatype'] == 1)]

print('association between per capita income and per person electric '
      'consumption for residents in low population density areas')
# contingency table of observed counts
lowpop_gdpkWh = pandas.crosstab(low_pop['percapitagdp'],
low_pop['percapitakWh'])
print(lowpop_gdpkWh)

# column percentages
colsum = lowpop_gdpkWh.sum(axis=0)
colpct = lowpop_gdpkWh/colsum

print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_lowpop_gdpkWh = scipy.stats.chi2_contingency(lowpop_gdpkWh)

print(cs_lowpop_gdpkWh)

Result I - impact of Low population density as a moderator on income and electricity consumption

chi-square value = 28.577586206896555

p value = 6.2295403346060193e-07

The result is statistically significant, as evidenced by a p-value < 0.05. That is, the association between "incomeperperson" and "relectricperperson" holds among countries with low population density.

print('association between per capita income and per person electric '
      'consumption for residents in high population density areas')
# contingency table of observed counts
ct_highpop_gdpkWh = pandas.crosstab(high_pop['percapitagdp'],
high_pop['percapitakWh'])
print(ct_highpop_gdpkWh)

# column percentages
colsum = ct_highpop_gdpkWh.sum(axis=0)
colpct = ct_highpop_gdpkWh/colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs_highpop_gdpkWh = scipy.stats.chi2_contingency(ct_highpop_gdpkWh)
print(cs_highpop_gdpkWh)

Result II - High population density as a moderator on income and electricity consumption

chi-square value = 17.439163165266105

p value = 0.0001633555274138804

The result is statistically significant, as evidenced by a p-value < 0.05. That is, the association between "incomeperperson" and "relectricperperson" also holds among countries with high population density, although the chi-square value is smaller than in the low-density group.

# Line Chart
# catplot is the current name of factorplot (renamed in seaborn 0.9)
seaborn.catplot(x="percapitakWh", y="percapitagdp", data=low_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of LOW population density areas.')

seaborn.catplot(x="percapitakWh", y="percapitagdp", data=high_pop,
                kind="point", ci=None)
plt.xlabel('Per Person Electric Consumption')
plt.ylabel('Proportion of Income Spent')
plt.title('association between income and electricity consumption for '
          'residents of HIGH population density areas.')

Both graphs show a positive trend; that is, high per person electricity consumption is associated with higher income. However, people living in high population density areas tend to spend relatively more of their income on electricity compared to those in low population density areas.

ANOVA Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"

# using ols function for calculating the F-statistic and associated p value
gdpkWh_model = smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=data)
gdpkWh_res = gdpkWh_model.fit()

print(gdpkWh_res.summary())

print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels")
# group the full DataFrame (not the crosstab) by income level
st1 = data.groupby('percapitagdp')['relectricperperson'].std()

print(st1)

ANOVA Test Result

When examining the relationship between per person electricity consumption and income level, ANOVA revealed that globally, Low Income countries (Mean = 144.655970, s.d. = 153.330045) consume less electricity than both Medium Income (Mean = 729.941778, s.d. = 568.913501) and High Income (Mean = 2964.549001, s.d. = 2162.289249) countries.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2, as determined by one-way ANOVA, F(2, 130) = 57.01, p = 2.15e-18.
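
The F statistic reported above is the ratio of the between-group mean square to the within-group mean square. Here is a minimal sketch on synthetic data (the group values are invented for illustration and are not taken from gapminder), assuming SciPy's f_oneway as a stand-in for the OLS route used in this post:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Hypothetical consumption values for three income groups (not gapminder data).
low = rng.normal(150, 50, size=30)
mid = rng.normal(700, 200, size=30)
high = rng.normal(3000, 800, size=30)
groups = [low, mid, high]

f_stat, p_value = scipy.stats.f_oneway(low, mid, high)

# Same F computed by hand: between-group mean square / within-group mean square.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

print('F(%d, %d) = %.2f, p = %.3g' % (df_between, df_within, f_stat, p_value))
```

The two degrees of freedom printed here play the same role as the pair in F(2, 130) above: number of groups minus one, and number of observations minus number of groups.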

# Bar plot
# catplot is the current name of factorplot (renamed in seaborn 0.9)
seaborn.catplot(x='percapitagdp', y='relectricperperson', data=data,
                kind='bar', ci=None)
plt.xlabel('Levels Income Per Person')

plt.ylabel('Mean Per Person Electric Consumption')

print('association between per person electricity consumption and per capita \
income for those living in low population density areas')
gdpkWh_lowpop_model =\
smf.ols(formula='relectricperperson ~ C(percapitagdp)', data=low_pop)
gdpkWh_lowpop_res = gdpkWh_lowpop_model.fit()
print(gdpkWh_lowpop_res.summary())

print("means for Per Person Electric Consumption by per capita income "
      "0 vs. 1 vs. 2 for low population density")
# select the response column so non-numeric columns are not aggregated
mean_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].mean()
print(mean_lowpop)

print("standard deviation for mean Per person electric consumption by Low vs. "
      "Medium vs. High income levels for Low population density areas")
std_lowpop = low_pop.groupby('percapitagdp')['relectricperperson'].std()

print(std_lowpop)

Result I - impact of Low population density as a moderator on income and electricity consumption

When examining the relationship between per person electricity consumption and income level among countries with a low urban rate, ANOVA revealed that Low Income countries (Mean = 132.260069, s.d. = 122.047634) consume less electricity than both Medium Income (Mean = 606.876711, s.d. = 481.566696) and High Income (Mean = 2124.808392, s.d. = 1105.691412) countries. That is, the association between income level and electricity consumption holds among countries with a low urban rate.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in LOW population density areas, as determined by one-way ANOVA, F(2, 65) = 46.44, p = 4.72e-13.

Result II - impact of High population density as a moderator on income and electricity consumption

When examining the relationship between per person electricity consumption and income level among countries with a high urban rate, ANOVA revealed that Low Income countries (Mean = 336.792431, s.d. = 476.29642) consume less electricity than both Medium Income (Mean = 831.909977, s.d. = 620.584218) and High Income (Mean = 3114.502681, s.d. = 2281.732505) countries. That is, "urbanrate" can be considered a "lurking" factor in explaining the relationship between electricity consumption and income levels.

There are statistically significant differences between the mean per person electricity consumption (measured in kilowatt-hours, kWh) across income levels 0, 1 and 2 in HIGH population density areas, as determined by one-way ANOVA, F(2, 65) = 17.22, p = 1.13e-06.

Pearson Correlation Coefficient Test: Evaluation of "urbanrate" as a potential moderator on the relationship between "incomeperperson" and "relectricperperson"

print('association between incomeperperson and relectricperperson for LOW \
population density countries')
(r, p) =\
scipy.stats.pearsonr(low_pop['incomeperperson'],
low_pop['relectricperperson'])

print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=low_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for LOW population density countries')

print(scat1)

Result I - impact of Low population density as a moderator on income and electricity consumption

Correlation Coefficient

(r = 0.873841, p-value = 2.12383414237e-21)

r2 = 0.763597

The correlation coefficient is positive and very strong (r = 0.87), and it is statistically significant (p-value < 0.05). It is therefore highly unlikely that a relationship of this magnitude between income and electricity consumption among countries with a low urban rate is due to chance alone.

With an r-squared of 0.76, knowing the income per person lets us predict 76% of the variability we observe in per person electricity consumption among countries with a low urban rate.

print('association between incomeperperson and relectricperperson for HIGH \
population density countries')
(r, p) =\
scipy.stats.pearsonr(high_pop['incomeperperson'],
high_pop['relectricperperson'])
print(' ')
print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

scat2 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=False, data=high_pop)
plt.xlabel('Income Per Person')
plt.ylabel('Per Person Electricity Consumption')
plt.title('Scatterplot for the association between Electric Consumption and \
Per Capita Income for High population density countries')
print(scat2)

Result II - impact of High population density as a moderator on income and electricity consumption

Correlation Coefficient

(r = 0.520095, p-value = 8.98036750462e-06)

r2 = 0.270499

The correlation coefficient is positive and moderately strong (r = 0.52), and it is statistically significant (p-value < 0.05). It is therefore highly unlikely that a relationship of this magnitude between income and electricity consumption among countries with a high urban rate is due to chance alone.

With an r-squared of 0.27, knowing the income per person lets us predict 27% of the variability we observe in per person electricity consumption among countries with a high urban rate.

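Thus, we can hypothesize that "urbanrate" is a potential moderator of the relationship between "relectricperperson" and "incomeperperson": the correlation is considerably stronger among countries with a low urban rate (r = 0.87) than among countries with a high urban rate (r = 0.52).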
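
A different way to probe moderation, which is not the subgroup approach used in this post, is to fit a single regression with an interaction term and look at whether the interaction coefficient differs from zero. The sketch below uses synthetic data in which the income slope is deliberately made to depend on the urban group; the column names income, urban_high and electric are hypothetical stand-ins, not gapminder columns.

```python
import numpy as np
import pandas
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-ins for the gapminder columns; the slope of income is
# 3.0 in the low-urban group and 1.0 in the high-urban group by construction.
income = rng.uniform(0, 10, size=n)
urban_high = rng.integers(0, 2, size=n)
electric = 1.0 + (3.0 - 2.0 * urban_high) * income + rng.normal(0, 1, size=n)
toy = pandas.DataFrame({'income': income,
                        'urban_high': urban_high,
                        'electric': electric})

# 'income * urban_high' expands to both main effects plus income:urban_high.
fit = smf.ols('electric ~ income * urban_high', data=toy).fit()
interaction_coef = fit.params['income:urban_high']
interaction_p = fit.pvalues['income:urban_high']
print('interaction = %.3f, p = %.3g' % (interaction_coef, interaction_p))
```

A small p-value for the interaction term indicates that the slope of income differs between the two groups, which is what moderation means in a regression setting.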

Monday, December 14, 2015

Generating a Correlation Coefficient

Introduction

In this week's assignment, the task is to use one of the provided datasets to generate and interpret a correlation coefficient between two quantitative variables. I will use the gapminder dataset and calculate the correlation coefficient between "incomeperperson" and "relectricperperson".

Code

import pandas
import numpy
import seaborn
import scipy.stats
import os
import matplotlib.pyplot as plt

# define method to load data of interest

def load_data(data_dir, csv_file):
    # Resolve the CSV path relative to the current working directory
    data_path = os.path.join(os.getcwd(), data_dir)
    data_file = os.path.join(data_path, csv_file)
    # Return the DataFrame unconditionally so the function also works on import
    return pandas.read_csv(data_file, low_memory=False)

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# loading data
data = load_data('data', 'gapminder.csv')
print(data)

reg_data = data.copy()

# Extracting data pertinent to variables of interest
reg_data = \
reg_data[['incomeperperson', 'relectricperperson']]

print(reg_data)

# setting variables of interest to numeric
reg_data['incomeperperson'] = \
pandas.to_numeric(reg_data['incomeperperson'], errors='coerce')
reg_data['relectricperperson'] = \
pandas.to_numeric(reg_data['relectricperperson'], errors='coerce')

reg_data['incomeperperson'] =\
reg_data['incomeperperson'].replace(' ', numpy.nan)

reg_data['relectricperperson'] =\
reg_data['relectricperperson'].replace(' ', numpy.nan)

scat1 = seaborn.regplot(x='incomeperperson', y='relectricperperson',
fit_reg=True, data=reg_data)
plt.xlabel('Income Per Person (constant 2008 USD)')
plt.ylabel('Per Person Electric Consumption (kWh)')
plt.title('Scatterplot for the Association Between income and '
          'electricity consumption globally')

Scatter Plot

# Remove all NAs from data
data_clean = reg_data.dropna()

print('association between income and electricity consumption')
(r, p) =\
scipy.stats.pearsonr(data_clean['incomeperperson'],
data_clean['relectricperperson'])

print('(r = %f, p-value = %s)' % (r, p))
print('r2 = %f' % r**2)

Correlation Coefficient

(r = 0.651637, p-value = 4.63071717359e-17)

r2 = 0.424631

Conclusion

The correlation coefficient is positive and moderately strong (r = 0.65), and it is statistically significant (p-value < 0.05). It is therefore highly unlikely that a relationship of this magnitude between income and electricity use globally is due to chance alone.

With an r-squared of 0.42, knowing the income per person lets us predict 42% of the variability we observe in per person electricity consumption globally.
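
The reading of r-squared as "share of variability predicted" comes from simple linear regression: the squared Pearson coefficient equals the R-squared of the least-squares line. A minimal sketch on synthetic data (the x and y values are invented, not gapminder values):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
# Hypothetical predictor and noisy linear response.
x = rng.uniform(0, 100, size=50)
y = 2.0 * x + rng.normal(0, 40, size=50)

r, p = scipy.stats.pearsonr(x, y)

# R^2 of the least-squares line y = a*x + b, computed as 1 - SS_res / SS_tot.
a, b = np.polyfit(x, y, 1)
ss_res = ((y - (a * x + b)) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared_fit = 1 - ss_res / ss_tot

print('r = %f, r2 = %f, regression R2 = %f' % (r, r ** 2, r_squared_fit))
```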

Wednesday, December 9, 2015

Chi Square Test of Independence

Introduction

In this week's assignment, the task is to test for independence (no relationship between income and electricity consumption) in Sub-Saharan Africa. I will be using the gapminder data.

H0: The relative proportion of per person electricity consumption is the same regardless of income level in Sub-Saharan Africa; the two variables are independent.

Ha: The relative proportion of per person electricity consumption changes with income level in Sub-Saharan Africa; the two variables are dependent.

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt
import scipy.stats

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Saharan Africa.
# For this, I added a new column - continent - to the gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']
sub_sahara_africa2 = sub_sahara_africa.copy()
print(sub_sahara_africa2.shape)

# Extracting data pertinent to variables of interest
income_electric_filter = ['incomeperperson', 'relectricperperson']
sub_sahara_africa2 = sub_sahara_africa2[income_electric_filter]
print(sub_sahara_africa2)

# Categorizing numeric data

def perpersonkwh(row):
    if row['relectricperperson'] <=\
            sub_sahara_africa2['relectricperperson'].quantile(.5):
        return 'LowkWh'
    if row['relectricperperson'] >\
            sub_sahara_africa2['relectricperperson'].quantile(.5):
        return 'HighkWh'

sub_sahara_africa2['perpersonkWh'] =\
sub_sahara_africa2.apply(lambda row: perpersonkwh(row), axis=1)

def percapitagdp(row):
    if row['incomeperperson'] <=\
            sub_sahara_africa2['incomeperperson'].quantile(.25):
        return 'LowGDP'
    if row['incomeperperson'] >\
            sub_sahara_africa2['incomeperperson'].quantile(.25) and \
            row['incomeperperson'] <=\
            sub_sahara_africa2['incomeperperson'].quantile(.75):
        return 'MediumGDP'
    if row['incomeperperson'] >\
            sub_sahara_africa2['incomeperperson'].quantile(.75):
        return 'HighGDP'

sub_sahara_africa2['percapitaGDP'] =\
sub_sahara_africa2.apply(lambda row: percapitagdp(row), axis=1)

print(sub_sahara_africa2['perpersonkWh'].head(10))

print(sub_sahara_africa2['percapitaGDP'].head(10))

# recoding values for perpersonkWh into a new variable, PERPRSKWH
kWh_recode = {'LowkWh': 0, 'HighkWh': 1}
sub_sahara_africa2['PERPRSKWH'] =\
sub_sahara_africa2['perpersonkWh'].map(kWh_recode)

# recoding values for percapitaGDP into a new variable, PERCAPGDP
gdp_recode = {'LowGDP': 1, 'MediumGDP': 2, 'HighGDP': 3}
sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['percapitaGDP'].map(gdp_recode)

print(sub_sahara_africa2['PERPRSKWH'].head(10))

print(sub_sahara_africa2['PERCAPGDP'].head(10))

# contingency/cross classification table of observed counts
ssa_cont_tbl = pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['PERCAPGDP'])

print(ssa_cont_tbl)

PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000          3         6         2
1.000000          0         7         4

# column percentages
colsum = ssa_cont_tbl.sum(axis=0)
print(colsum)
colpct = ssa_cont_tbl/colsum
print(colpct)

PERCAPGDP  1.000000  2.000000  3.000000
PERPRSKWH
0.000000   1.000000  0.461538  0.333333
1.000000   0.000000  0.538462  0.666667

# chi-square
print('chi-square value, p value, expected counts')
ssa_chisquare = scipy.stats.chi2_contingency(ssa_cont_tbl)

print(ssa_chisquare)

# set variable types
print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['PERPRSKWH'].head(10))

sub_sahara_africa2['PERCAPGDP'] =\
sub_sahara_africa2['PERCAPGDP'].astype('category')
sub_sahara_africa2['PERPRSKWH'] =\
pandas.to_numeric(sub_sahara_africa2['PERPRSKWH'], errors='coerce')

print(sub_sahara_africa2['PERCAPGDP'].head(10))

print(sub_sahara_africa2['PERPRSKWH'].head(10))

# Factor plot
# catplot is the current name of factorplot (renamed in seaborn 0.9)
seaborn.catplot(x='PERCAPGDP', y='PERPRSKWH', data=sub_sahara_africa2,
                kind='bar', ci=None)
plt.xlabel('Per Capita GDP')

plt.ylabel('Proportion of Power Consumption')

Graphically, there seems to be some relationship between the proportion of countries with high per person electricity consumption and per capita income. Specifically, the higher the income level, the larger the proportion of high consumers (LowGDP (0.0%) < MediumGDP (54%) < HighGDP (67%)).

Chi Square Result

There is no statistically significant association, as determined by the chi-square test at alpha = 0.05 (X2 = 3.7435897435897436, p = 0.15384727771283108, d.f. = 2), so we fail to reject the null hypothesis. That is, the data provide no evidence that the proportion of per person electricity consumption depends on per capita income level across Sub-Saharan Africa.
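
For reference, the tuple printed by chi2_contingency holds the statistic, the p-value, the degrees of freedom, and the expected counts, in that order. A small sketch that rebuilds the observed table from the counts printed above and unpacks the result (the variable names are mine):

```python
import numpy as np
import scipy.stats

# Observed counts copied from the contingency table printed above
# (rows: PERPRSKWH 0/1, columns: PERCAPGDP 1/2/3).
observed = np.array([[3, 6, 2],
                     [0, 7, 4]])

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

print('chi-square = %f, p = %f, d.f. = %d' % (chi2, p_value, dof))
print(expected)
# Smallest expected count; the chi-square approximation is commonly
# considered unreliable when expected counts fall below about 5.
smallest_expected = expected.min()
print(smallest_expected)
```

With only 22 countries in the table, several expected counts are small, so the p-value should be read with some caution.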

Bonferroni Adjustments

To control the familywise error rate, we will now conduct pairwise comparisons between per person electricity consumption and per capita income in Sub-Saharan Africa. In this case we need to conduct three such tests and compare the p-value from each test to our Bonferroni-adjusted significance level, 0.05/3 = 0.016667.

mediumGDP_lowGDP = {2: 2, 1: 1}
sub_sahara_africa2['MediumLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(mediumGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['MediumLowGDP'].head(10))
# contingency table of observed counts
MediumGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['MediumLowGDP'])
print(MediumGDPLowGDP)

# column percentages
colsum = MediumGDPLowGDP.sum(axis=0)
print(colsum)
colpct = MediumGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
medium_lowGDP_cont = scipy.stats.chi2_contingency(MediumGDPLowGDP)
print(medium_lowGDP_cont)

MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000             3         6
1.000000             0         7

MediumLowGDP  1.000000  2.000000
PERPRSKWH
0.000000      1.000000  0.461538
1.000000      0.000000  0.538462

X2 = 1.1005291005291005, p = 0.29415001818473019, d.f. = 1

Proportionally, the higher the per capita income the more electricity consumed(54% vs. 0.0%)

highGDP_lowGDP = {3: 3, 1: 1}
sub_sahara_africa2['HighLowGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_lowGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighLowGDP'].head(10))
# contingency table of observed counts
HighGDPLowGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighLowGDP'])
print(HighGDPLowGDP)

# column percentages
colsum = HighGDPLowGDP.sum(axis=0)
print(colsum)
colpct = HighGDPLowGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_lowGDP_cont = scipy.stats.chi2_contingency(HighGDPLowGDP)
print(high_lowGDP_cont)

HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000           3         2
1.000000           0         4

HighLowGDP  1.000000  3.000000
PERPRSKWH
0.000000    1.000000  0.333333
1.000000    0.000000  0.666667

X2 = 1.40625, p = 0.23567991342903749, d.f. = 1

Proportionally, the higher the per capita income the more electricity consumed(67% vs. 0.0%)

highGDP_mediumGDP = {3: 3, 2: 2}
sub_sahara_africa2['HighMediumGDP'] =\
sub_sahara_africa2['PERCAPGDP'].map(highGDP_mediumGDP )

print(sub_sahara_africa2['PERCAPGDP'].head(10))
print(sub_sahara_africa2['HighMediumGDP'].head(10))
# contingency table of observed counts
HighGDPMediumGDP =\
pandas.crosstab(sub_sahara_africa2['PERPRSKWH'],
sub_sahara_africa2['HighMediumGDP'])
print(HighGDPMediumGDP)

# column percentages
colsum = HighGDPMediumGDP.sum(axis=0)
print(colsum)
colpct = HighGDPMediumGDP/colsum
print(colpct)

print('chi-square value, p value, expected counts')
high_mediumGDP_cont = scipy.stats.chi2_contingency(HighGDPMediumGDP)
print(high_mediumGDP_cont)

HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000              6         2
1.000000              7         4

HighMediumGDP  2.000000  3.000000
PERPRSKWH
0.000000       0.461538  0.333333
1.000000       0.538462  0.666667

X2 = 0.00069201631701630963, p = 0.97901310733501912, d.f. = 1

Proportionally, the higher the per capita income the more electricity consumed(67% vs. 54.0%)

Conclusion

Based on the pairwise analysis, per capita income appears to have no detectable association with the proportion of per person electricity consumption in Sub-Saharan Africa: none of the pairwise comparisons is significant at the Bonferroni-adjusted level of 0.016667, so we fail to reject the null hypothesis for every pair. Thus, we find no evidence that per person electricity consumption depends on per capita income.
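
The three pairwise tests above can be run in one loop. This is a compact sketch rather than the code used in this post: it rebuilds the observed table from the printed counts and slices out each pair of income columns, and the variable names are hypothetical.

```python
from itertools import combinations

import pandas
import scipy.stats

# Observed counts from the full table above
# (rows: PERPRSKWH 0/1, columns: PERCAPGDP 1/2/3).
income_levels = [1, 2, 3]
observed = pandas.DataFrame([[3, 6, 2], [0, 7, 4]],
                            index=[0, 1], columns=income_levels)

pairs = list(combinations(income_levels, 2))
# Bonferroni adjustment: divide alpha by the number of comparisons.
adjusted_alpha = 0.05 / len(pairs)

pairwise_p = {}
for first, second in pairs:
    # 2x2 sub-table; SciPy applies Yates' continuity correction by default
    # when there is one degree of freedom.
    chi2, p_value, dof, expected = scipy.stats.chi2_contingency(
        observed[[first, second]])
    pairwise_p[(first, second)] = p_value
    print(first, second, round(chi2, 4), round(p_value, 4),
          p_value < adjusted_alpha)
```

Every p-value is well above the adjusted threshold, which matches the conclusion above.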

Thursday, December 3, 2015

Running an analysis of variance

Introduction

This marks the continuation of "Data Management and Visualization". I will continue to use the gapminder dataset to investigate my research question: is per person electricity consumption a proxy for economic development as measured by per capita income?

For this particular assignment, the goal is to test the following hypotheses:

H0: there is no difference in mean per capita income across categories of electricity consumers

Ha: mean per capita income differs between at least two categories

Code

import pandas
import os
import numpy as np
import seaborn
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import matplotlib.pyplot as plt

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# Reading data into DataFrame
if __name__ == "__main__":
    DATA_PATH = os.path.join(os.getcwd(), "data")
    DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
    econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
    print(econ_data.shape)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extract only countries in Sub-Saharan Africa.
# For this, I added a new column - continent - to the gapminder dataframe
sub_sahara_africa = econ_data[econ_data['continent'] == 'Africa']

# Extracting data pertinent to variables of interest
sub_sahara_africa = \
sub_sahara_africa[['incomeperperson', 'relectricperperson',
'urbanrate']]

print(sub_sahara_africa)

print('Split relectricperperson data into 3 categories \
- LowkWh, MediumkWh, HighkWh')
sub_sahara_africa['perpersonkWh'] =\
pandas.qcut(sub_sahara_africa.relectricperperson, 3,
labels=["LowkWh", "MediumkWh", "HighkWh"])

# Box plot
seaborn.boxplot(x='perpersonkWh', y='relectricperperson',
hue='perpersonkWh',
data=sub_sahara_africa, saturation=1, orient='v')

# using ols function for calculating the F-statistic and associated p value
# To handle this issue: PatsyError: Error evaluating factor: TypeError:
# 'ClassRegistry' object is not callable incomeperperson ~ C(perpersonkWh), run
# del c on the command line.

# Getting 'typeerror data type not understood' so I convert the categorical
# value to type object prior to running ols
# the built-in object is used because the np.object alias was removed in
# NumPy 1.24
sub_sahara_africa['perpersonkWh_fixed'] =\
    sub_sahara_africa.perpersonkWh.astype(object)

ssa_model =\
smf.ols(formula='incomeperperson ~ C(perpersonkWh_fixed)',
data=sub_sahara_africa).fit()

print(ssa_model.summary())

ssa_dropna =\
sub_sahara_africa[['incomeperperson', 'perpersonkWh']].dropna()
print(ssa_dropna['perpersonkWh'])
print(ssa_dropna['incomeperperson'])

print('means for incomeperperson by per person kWh status')
ssa_kWh_mean = ssa_dropna.groupby('perpersonkWh').mean()
print(ssa_kWh_mean)

# To plot line chart of categorical mean values
plt.plot(ssa_kWh_mean, 'bD')

print('std for incomeperperson by per person kWh status')
ssa_kWh_std = ssa_dropna.groupby('perpersonkWh').std()
print(ssa_kWh_std)

ssa_mc = multi.MultiComparison(ssa_dropna['incomeperperson'],
ssa_dropna['perpersonkWh'])
ssa_res2 = ssa_mc.tukeyhsd()

print(ssa_res2.summary())

Graph

The boxplot shows that per person electricity consumption is not equal across the three consumption groups. The line chart shows the mean per capita income for each category of electricity consumer.

ANOVA Results

When examining the relationship between per capita income and per person electricity consumption group, ANOVA revealed that among countries in Sub-Saharan Africa, the LowkWh group (Mean = 601.949389, s.d. = 846.373070) has a lower mean per capita income than both the MediumkWh (Mean = 639.068846, s.d. = 299.583651) and HighkWh (Mean = 2086.976328, s.d. = 1866.259027) groups of electricity consumers.

There are statistically significant differences between the mean per capita income across the groups LowkWh, MediumkWh and HighkWh, as determined by one-way ANOVA, F(2, 19) = 3.694, p = 0.0441.

In view of the above result, I will perform a post hoc test using Tukey's Honestly Significant Difference (HSD) test.

The Tukey post hoc test (see table below) revealed no statistically significant pairwise differences in mean per capita income between the groups of interest in Sub-Saharan Africa: every confidence interval in the table includes zero. Thus, we cannot conclude that any particular pair of groups differs on average.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================================
 group1    group2    meandiff     lower      upper    reject
--------------------------------------------------------------
HighkWh   LowkWh    -1485.0269 -3035.9813    65.9274   False
HighkWh   MediumkWh -1447.9075 -3049.7262   153.9113   False
LowkWh    MediumkWh    37.1195 -1513.8349  1588.0738   False
--------------------------------------------------------------
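
The reject column in a table like the one above is True exactly when the confidence interval for the mean difference excludes zero. A minimal sketch on synthetic data with deliberately well-separated group means (the values are invented; only the group labels mirror this post), assuming statsmodels' pairwise_tukeyhsd convenience function:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
# Hypothetical values with group means of 0, 10 and 20 (not gapminder data).
values = np.concatenate([rng.normal(0, 1, 20),
                         rng.normal(10, 1, 20),
                         rng.normal(20, 1, 20)])
labels = np.repeat(['HighkWh', 'LowkWh', 'MediumkWh'], 20)

tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey.summary())

# One flag and one (lower, upper) interval per pair of groups.
reject_flags = tukey.reject
intervals = tukey.confint
```

Unlike the Sub-Saharan Africa result above, every interval in this synthetic example excludes zero, so every pair is flagged as different.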

Wednesday, November 18, 2015

Week 4: Creating Graphs for Your Data

Introduction

In this week's assignment, the goal is to create various chart types for a dataset of choice. In my case, I am working with the gapminder dataset. My previous analysis focused on countries in the Sub-Saharan Africa region. However, I have decided to use the whole dataset without regard to regional differences, for the simple reason that I need a larger dataset for this analysis.

Code

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 18 16:02:17 2015

"""

import pandas
import os
import seaborn
import matplotlib.pyplot as plt

DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)

# Extracting data pertinent to variables of interest
econ_data = \
econ_data[['country', 'incomeperperson', 'relectricperperson', 'urbanrate']]

# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Univariate histogram for quantitative variable - incomeperperson
plt.figure()  # start a new figure so the histograms do not overlap
seaborn.distplot(econ_data['incomeperperson'].dropna(), kde=False)
plt.xlabel('Per Capita Income')
plt.ylabel('Frequency')
plt.title('Global Income distribution')

# Univariate histogram for quantitative variable - relectricperperson
plt.figure()
seaborn.distplot(econ_data['relectricperperson'].dropna(), kde=False)
plt.xlabel('Per Capita kWh')
plt.ylabel('Frequency')
plt.title('Global Household Electricity Consumption')

# Univariate histogram for quantitative variable - urbanrate
plt.figure()
seaborn.distplot(econ_data['urbanrate'].dropna(), kde=False)
plt.xlabel('Urban Rate')
plt.ylabel('Frequency')
plt.title('Global Urban Rate distribution')

# Splitting per capita income into 3 groups at the 25th and 75th percentiles
def percapitagdp(row):
    if row['incomeperperson'] <= econ_data['incomeperperson'].quantile(.25):
        return 'LowGDP'
    if row['incomeperperson'] > econ_data['incomeperperson'].quantile(.25) and \
            row['incomeperperson'] <= econ_data['incomeperperson'].quantile(.75):
        return 'MediumGDP'
    if row['incomeperperson'] > econ_data['incomeperperson'].quantile(.75):
        return 'HighGDP'
    # rows with missing income match no branch and are left as None

econ_data['percapitagdp'] =\
    econ_data.apply(lambda row: percapitagdp(row), axis=1)

dropnas = econ_data['percapitagdp'].value_counts(sort=True, dropna=True)
print(dropnas)

# bivariate bar chart for Household Electricity Consumption and Income Levels C-> Q
seaborn.factorplot(x='percapitagdp', y='relectricperperson',
                   data=econ_data, kind='bar', ci=None)
plt.xlabel('Income Group')
plt.ylabel('Mean kWh Rate')

# basic scatterplot for incomeperperson vs relectricperperson Q-> Q
plt.figure()  # new figure so the scatterplot is not drawn over the bar chart
seaborn.regplot(x='incomeperperson', y='relectricperperson',
                data=econ_data, fit_reg=False)
plt.xlabel('Income Per Capita')
plt.ylabel('Household Electricity Consumption')
plt.title('Scatterplot to show Association between Income Per Capita and '
          'Electricity Per Capita')

def areatype(row):
    if row['urbanrate'] <= econ_data['urbanrate'].quantile(.25):
        return 'Village'
    if row['urbanrate'] > econ_data['urbanrate'].quantile(.25) and \
            row['urbanrate'] <= econ_data['urbanrate'].quantile(.75):
        return 'Town'
    if row['urbanrate'] > econ_data['urbanrate'].quantile(.75):
        return 'City'

econ_data['areatype'] =\
    econ_data.apply(lambda row: areatype(row), axis=1)

dropnas = econ_data['areatype'].value_counts(sort=False, dropna=True)
print(dropnas)

# bivariate bar chart for areatype(urbanrate) and relectricperperson C-> Q
seaborn.factorplot(x='areatype', y='relectricperperson',
                   data=econ_data, kind='bar', ci=None)
plt.ylabel('Household Electric Consumption')
plt.xlabel('Area Classification')

# basic scatterplot for urbanrate vs relectricperperson Q-> Q
plt.figure()  # new figure so the scatterplot is not drawn over the bar chart
seaborn.regplot(x='urbanrate', y='relectricperperson',
                data=econ_data, fit_reg=False)
plt.ylabel('Household Electric Consumption')
plt.xlabel('Urban Rate')
plt.title('Scatterplot to show Association between Urban Rate and '
          'Electricity Per Capita')

Univariate Graph for Per Capita Income

This graph is unimodal and positively skewed, with a large number of low-income countries and a small number of high-income countries. This is potential evidence of pronounced income disparity globally.

Univariate Graph for Per Household Electricity Use

This graph is unimodal with the highest peak around the 500 kWh level. The chart is also positively skewed, as evidenced by the measures of central tendency falling in this order: mode < median < mean.

From the charts showing income and household electricity consumption, there appears to be some relationship between household electricity consumption and income levels. The extent of the relationship will be explored further.

Univariate Graph for Urban Rate

This graph appears to be bimodal, with peaks near urban rates of 40 and 70.

This suggests that, globally, countries tend to cluster into a moderately urbanized group and a highly urbanized group.

Bivariate Graph - relectricperperson vs. incomeperperson(C ->Q)

From this chart, there is evidence that countries in the high income group consume more household electricity per capita.

Bivariate Graph - relectricperperson vs. incomeperperson(Q ->Q)

This graph shows a positive relationship between income and household electricity consumption.

Bivariate Graph - urbanrate vs. relectricperperson(C ->Q)

From this graph, there is evidence of a positive relationship between urban rate and household electricity consumption. Thus, higher urban rates correlate with higher household electricity consumption.

Bivariate Graph - urbanrate vs. relectricperperson(Q ->Q)

From this graph, one can discern a positive relationship between urbanrate and relectricperperson.

However, the relationship appears to be a weak one.
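
One way to put a number on "weak" is the Pearson correlation coefficient. The sketch below computes it by hand on made-up values; the function name `pearson_r` and the toy data are mine, and in the analysis itself the inputs would be the non-missing urbanrate and relectricperperson pairs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, not gapminder values
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print("r = %.3f, r squared = %.3f" % (r, r * r))
```

Squaring r gives the share of the variance in one variable that a straight-line relationship with the other accounts for.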

Thursday, November 12, 2015

Week 3: Making Data Management Decisions

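In this week's assignment, I chose to bin my variables of interest - "relectricperperson", "incomeperperson" and "urbanrate" - into categories. The split is based on percentiles: values at or below the 25th percentile, values between the 25th and 75th percentiles, and values above the 75th percentile.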
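
As a rough illustration of percentile-based binning, the sketch below computes the 25th and 75th percentile cut points with the standard library instead of hard-coding them. The sample values and the helper name `bin_by_quartiles` are hypothetical. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which I am assuming here so that the cut points line up with pandas' default `quantile` behaviour.

```python
import statistics


def bin_by_quartiles(values, labels=("Low", "Medium", "High")):
    """Label each value by where it falls relative to the 25th/75th percentiles."""
    # quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentile
    q25, _median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    binned = []
    for v in values:
        if v <= q25:
            binned.append(labels[0])
        elif v <= q75:
            binned.append(labels[1])
        else:
            binned.append(labels[2])
    return binned


# Hypothetical sample, not taken from gapminder.csv
sample = [10, 20, 30, 40, 50]
print(bin_by_quartiles(sample))
```

Missing values would need to be dropped before calling the helper, much as the analysis code only bins rows where the variable is present.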

Code

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 12 12:59:21 2015
Week 3 submission
"""

import pandas
import os

DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

# Extracting data pertinent to variables of interest
relevant_data = \
    econ_data[['country', 'continent', 'incomeperperson', 'relectricperperson',
               'urbanrate']]

# Extract only countries in Sub-Sahara Africa.
# For this I added a new column - continent - to the gapminder dataframe
sub_sahara_africa = relevant_data[relevant_data['continent'] == 'Africa']
# print(sub_sahara_africa)

sub_sahara_africa.describe()

# Create a second copy of sub_sahara_africa for further data analysis
sub2_sahara_africa = sub_sahara_africa.copy()
# print(sub2_sahara_africa)

'''
Split income data into 3 strata - LowGDP, MediumGDP, HighGDP
Using percentile values to bin dataset
LowGDP: <= 25th percentile of incomeperperson
MediumGDP: between 25th and 75th percentile of incomeperperson
HighGDP: greater than 75th percentile of incomeperperson

incomeperperson
count 43.000000
mean 1098.598153
std 1705.541951
min 103.775857
25% 272.888584
50% 411.501447
75% 804.478821
max 8654.536845

'''

def percapitagdp(row):
    if row['incomeperperson'] <= 272.888584:
        return 'LowGDP'
    if row['incomeperperson'] > 272.888584 and \
            row['incomeperperson'] <= 804.478821:
        return 'MediumGDP'
    if row['incomeperperson'] > 804.478821:
        return 'HighGDP'

sub2_sahara_africa['percapitagdp'] =\
    sub2_sahara_africa.apply(lambda row: percapitagdp(row), axis=1)

# Calculate descriptive statistics for incomeperperson grouped by percapitagdp
per_capitagdp_stats =\
    sub2_sahara_africa.groupby('percapitagdp').agg({'incomeperperson':
                                                    ['count', 'mean',
                                                     'std', 'max']})

print(per_capitagdp_stats)

'''
Split relectricperperson data into 3 categories
- LowkWh, MediumkWh, HighkWh
LowkWh: <= 25th percentile of relectricperperson
MediumkWh: between 25th and 75th percentile of relectricperperson
HighkWh: greater than 75th percentile of relectricperperson

relectricperperson
count 22.000000
mean 149.889484
std 222.403251
min 0.000000
25% 38.325833
50% 57.961848
75% 150.778896
max 920.137600

'''

def percapitakWh(row):
    if row['relectricperperson'] <= 38.325833:
        return 'LowkWh'
    if row['relectricperperson'] > 38.325833 and \
            row['relectricperperson'] <= 150.778896:
        return 'MediumkWh'
    if row['relectricperperson'] > 150.778896:
        return 'HighkWh'

sub2_sahara_africa['percapitakWh'] =\
    sub2_sahara_africa.apply(lambda row: percapitakWh(row), axis=1)

# Calculate descriptive statistics for relectricperperson grouped by percapitakWh
per_capitakWh_stats =\
    sub2_sahara_africa.groupby('percapitakWh').agg({'relectricperperson':
                                                    ['count', 'mean',
                                                     'std', 'max']})
print(per_capitakWh_stats)

'''
Split urbanrate data into 3 categories - rural, town, urban
rural: <= 25th percentile of urbanrate
town: between 25th and 75th percentile of urbanrate
urban: greater than 75th percentile of urbanrate

urbanrate
count 44.000000
mean 39.774091
std 17.063891
min 12.980000
25% 26.390000
50% 37.550000
75% 49.090000
max 87.300000

'''

def areatype(row):
    if row['urbanrate'] <= 26.390000:
        return 'rural'
    if row['urbanrate'] > 26.390000 and \
            row['urbanrate'] <= 49.090000:
        return 'town'
    if row['urbanrate'] > 49.090000:
        return 'urban'

sub2_sahara_africa['areatype'] =\
    sub2_sahara_africa.apply(lambda row: areatype(row), axis=1)

# Calculate descriptive statistics for areatype across countries
per_capita_areatype_stats =\
    sub2_sahara_africa.groupby('areatype').agg({'urbanrate':
                                                ['count', 'mean',
                                                 'std', 'max']})
print(per_capita_areatype_stats)

Output

             incomeperperson
               count         mean          std          max
percapitagdp
HighGDP           11  3265.406288  2281.132925  8654.536845
LowGDP            11   196.132710    55.934473   269.892881
MediumGDP         21   436.323409   120.507293   713.639303

             relectricperperson
               count         mean          std          max
percapitakWh
HighkWh            6   425.257250   284.256336   920.137600
LowkWh             6    22.610565    13.856819    38.222943
MediumkWh         10    61.036175    17.145977    97.246492

             urbanrate
               count         mean          std          max
areatype
rural             11    20.132727     4.027548    25.520000
town              22    37.951818     5.654132    48.780000
urban             11    63.060000    11.856559    87.300000

Summary

Per Capita GDP

I split the income levels into the following categories - HighGDP, MediumGDP and LowGDP.

Comparing the maximum in each group, the highest per capita income among Sub-Saharan African countries is about 32 times the highest income in the low income group (8654.54 vs. 269.89). The medium income group's maximum is roughly 2.6 times that of the low income group.

Per Capita Electricity Consumption

I also split electricity consumption per capita into the following categories - HighkWh, LowkWh and MediumkWh.

The picture for per capita electricity consumption is similar. Comparing group maxima, the top consumers of electricity use about 24 times as much per capita as the low consumers, and the medium group uses roughly 2.5 times as much as the lowest consumers.
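
The multiples quoted in this summary appear to come from the max column of the output tables rather than from the group means. The sketch below redoes that arithmetic for both variables, with the ratio of means alongside for comparison. The numbers are copied from the output above; the dictionary layout and the function name are only for illustration.

```python
# (mean, max) per group, copied from the Output section
income = {
    "High": (3265.406288, 8654.536845),
    "Medium": (436.323409, 713.639303),
    "Low": (196.132710, 269.892881),
}
electricity = {
    "High": (425.257250, 920.137600),
    "Medium": (61.036175, 97.246492),
    "Low": (22.610565, 38.222943),
}


def ratios_to_low(groups):
    """Return {group: (ratio of means, ratio of maxima)} relative to the Low group."""
    low_mean, low_max = groups["Low"]
    return {
        name: (mean / low_mean, top / low_max)
        for name, (mean, top) in groups.items()
        if name != "Low"
    }


for label, table in (("income", income), ("electricity", electricity)):
    for group, (by_mean, by_max) in ratios_to_low(table).items():
        print("%s %s/Low: means %.1f, maxima %.1f" % (label, group, by_mean, by_max))
```

On group maxima the High/Low multiples come out near 32 for income and 24 for electricity, matching the figures quoted; on group means they are closer to 17 and 19.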

Urbanization

Broadly, I split population centers across 3 groups, viz, rural, town and urban.

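The urban rate varies widely across Sub-Saharan African countries. In the urban group, about 63% of the population lives in urban centers on average, compared with about 20% in the rural group. Across all 44 countries, the mean urban rate is about 40%.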

Sunday, November 8, 2015

Week 2: Preliminary Analysis and Research Question Refinement.

Preliminary data analysis and refinement of research question to focus on Sub-Sahara Africa

After carefully reviewing the GapMinder dataset, I decided to focus my research on countries in Sub-Saharan Africa. To accomplish this, I added a new column, 'continent', to gapminder.csv and created a new dataframe, 'sub_sahara_africa', by subsetting the data to include only countries in that region.

Python Code

import pandas
import numpy  # needed for numpy.size / numpy.mean in the aggregations below
import os

print(os.getcwd())
DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')

print('***** Extracting Gapminder Data for variables of interest*****')
relevant_data = \
    econ_data[['country', 'continent', 'incomeperperson', 'relectricperperson',
               'urbanrate']]

# making a copy of original dataframe
econ2_data = econ_data.copy()

# Extract only countries in Sub-Sahara Africa.
# For this I added a new column - continent to gapminder dataframe
sub_sahara_africa = relevant_data[relevant_data['continent'] == 'Africa']
print(sub_sahara_africa)

print('***** Per Capita Electricity Consumption and relative frequencies*****')
urban_rate = \
sub_sahara_africa['urbanrate'].value_counts(sort=False,
dropna=False,
normalize=False)
print(urban_rate)

rfreq_urban_rate = \
sub_sahara_africa['urbanrate'].value_counts(sort=False, dropna=False,
normalize=True)
print(rfreq_urban_rate)

per_capita_electric = \
sub_sahara_africa['relectricperperson'].value_counts(sort=False,
dropna=False,
normalize=False)
print(per_capita_electric)

rfreq_per_capita_electric = \
sub_sahara_africa['relectricperperson'].value_counts(sort=False,
dropna=False,
normalize=True)
print(rfreq_per_capita_electric)

per_capita_gdp = \
sub_sahara_africa['incomeperperson'].value_counts(sort=False,
dropna=False,
normalize=False)
print(per_capita_gdp)

rfreq_per_capita_gdp = \
sub_sahara_africa['incomeperperson'].value_counts(sort=False,
dropna=False,
normalize=True)
print(rfreq_per_capita_gdp)

# Create a second copy of sub_sahara_africa for further data analysis
sub2_sahara_africa = sub_sahara_africa.copy()

print('Split income data into 3 strata - LowGDP, MediumGDP, HighGDP')
sub2_sahara_africa['percapitagdp'] =\
pandas.qcut(sub2_sahara_africa.incomeperperson, 3,
labels=["LowGDP", "MediumGDP", "HighGDP"])
print(sub2_sahara_africa)

# Frequency description of income strata
per_capita_gdp_bin = \
sub2_sahara_africa['percapitagdp'].value_counts(sort=False,
dropna=True)
print(per_capita_gdp_bin)

rfreq_per_capita_gdp_bin = \
sub2_sahara_africa['percapitagdp'].value_counts(sort=False,
dropna=True,
normalize=True)

print(rfreq_per_capita_gdp_bin)

print('Split relectricperperson data into 3 categories \
- LowkWh, MediumkWh, HighkWh')
sub2_sahara_africa['percapitakWh'] =\
pandas.qcut(sub2_sahara_africa.relectricperperson, 3,
labels=["LowkWh", "MediumkWh", "HighkWh"])
print(sub2_sahara_africa)
per_kWh_use = \
sub2_sahara_africa['percapitakWh'].value_counts(sort=False,
dropna=True)
print(per_kWh_use)

rfreq_per_kWh_use = \
sub2_sahara_africa['percapitakWh'].value_counts(sort=False,
dropna=True,
normalize=True)

print(rfreq_per_kWh_use)

print('Split urbanrate data into 3 categories - rural, town, urban')
sub2_sahara_africa['areatype'] =\
pandas.qcut(sub2_sahara_africa.urbanrate, 3,
labels=["rural", "town", "urban"])
print(sub2_sahara_africa)
areatype_count = \
sub2_sahara_africa['areatype'].value_counts(sort=False,
dropna=True)
print(areatype_count)

rfreq_areatype_count = \
sub2_sahara_africa['areatype'].value_counts(sort=False,
dropna=True,
normalize=True)

print(rfreq_areatype_count)

#Aggregating data across groups
sub2_sahara_africa.groupby('percapitagdp').agg({'incomeperperson': [numpy.size, numpy.mean]})
sub2_sahara_africa.groupby('percapitakWh').agg({'relectricperperson': [numpy.size, numpy.mean]})

sub2_sahara_africa.groupby('areatype').agg({'urbanrate': [numpy.size, numpy.mean]})

# Aggregating per capita income groups across country
by_gdp = sub2_sahara_africa.groupby(['percapitagdp', 'country'])
by_gdp.incomeperperson.mean()

# Aggregating per capita electric consumption groups across country

by_kWh = sub2_sahara_africa.groupby(['percapitakWh', 'country'])
by_kWh.relectricperperson.mean()

# Aggregating urbanrate groups across country

by_areatype = sub2_sahara_africa.groupby(['areatype', 'country'])
by_areatype.urbanrate.mean()

Dataframe Output

***** Extracting Gapminder Data for variables of interest*****

country incomeperperson relectricperperson urbanrate

4 Angola 1381.004268 172.999227 56.70

19 Benin 377.039699 38.222943 41.20

24 Botswana 4189.436587 454.795705 59.58

28 Burkina Faso 276.200413 NaN 19.56

31 Cameroon 713.639303 59.551245 56.76

33 Cape Verde 1959.844472 NaN 59.62

35 Central African Rep. 239.518749 NaN 38.58

36 Chad 275.884287 NaN 26.68

42 Congo, Rep. 1253.292015 56.372450 61.34

45 Cote d'Ivoire 591.067944 70.387444 48.78

51 Djibouti 895.318340 NaN 87.30

57 Equatorial Guinea 8654.536845 NaN 39.38

58 Eritrea 131.796207 20.288131 20.72

60 Ethiopia 220.891248 15.056236 17.00

66 Gabon 4180.765821 537.104738 85.04

67 Gambia 354.599726 NaN 56.42

70 Ghana 358.979540 97.246492 50.02

78 Guinea 411.501447 NaN 34.44

79 Guinea-Bissau 161.317137 NaN 29.84

97 Kenya 468.696044 41.180003 21.60

106 Lesotho 495.734247 NaN 25.46

107 Liberia 155.033231 NaN 60.14

114 Madagascar 242.677534 NaN 29.52

115 Malawi 184.141797 NaN 18.80

118 Mali 269.892881 NaN 32.18

122 Mauritania 609.131206 NaN 41.00

123 Mauritius 5182.143721 NaN 42.48

131 Mozambique 389.763634 31.386838 36.84

133 Namibia 2667.246710 0.000000 36.84

141 Niger 180.083376 NaN 16.54

142 Nigeria 544.599477 74.064241 48.36

160 Rwanda 338.266391 NaN 18.34

166 Sao Tome and Principe NaN NaN 60.56

168 Senegal 561.708585 55.794744 42.38

171 Seychelles 8614.120219 NaN 54.34

172 Sierra Leone 268.331790 NaN 37.76

177 Somalia NaN NaN 36.52

178 South Africa 3745.649852 920.137600 60.74

181 Sudan 523.950151 50.892101 43.44

183 Swaziland 1810.230533 NaN 24.94

189 Tanzania 456.385712 38.634503 25.52

192 Togo 285.224449 66.238522 42.00

199 Uganda 377.421113 NaN 12.98

211 Zambia 432.226337 168.623031 35.42

212 Zimbabwe 320.771890 297.883200 37.34

# Generate Summary statistics for Sub Sahara Africa
sub_sahara_africa.describe()

Summary Statistics

       incomeperperson  relectricperperson  urbanrate
count        43.000000           21.000000  45.000000
mean       1296.513138          155.564733  40.688889
std        2046.959164          226.257267  17.250389
min         131.796207            0.000000  12.980000
25%         276.042350           38.634503  26.680000
50%         432.226337           59.551245  38.580000
75%        1074.305177          168.623031  54.340000
max        8654.536845          920.137600  87.300000

The hypothesis stated earlier was that per capita electricity consumption (relectricperperson) can be used as a proxy for economic growth (incomeperperson) in Sub-Saharan Africa.

A review of the summary statistics suggests that countries with higher per capita electricity consumption also tend to have higher per capita income, and hence better economic outcomes.

On average, about 4 in 10 people in Sub-Saharan Africa live in urban centers.
Per capita electricity consumption across Sub-Saharan Africa averaged 155.56 kWh in 2008, and income per capita averaged $1296.51 in constant 2000 USD.
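
As a quick sanity check, the mean quoted above can be recomputed from the 21 non-missing relectricperperson values in the dataframe listing. The sketch below uses only the standard library; the list is typed in by hand from that listing, and the variable names are mine.

```python
import statistics

# Non-missing relectricperperson values copied from the listing above
kwh = [
    172.999227, 38.222943, 454.795705, 59.551245, 56.372450, 70.387444,
    20.288131, 15.056236, 537.104738, 97.246492, 41.180003, 31.386838,
    0.000000, 74.064241, 55.794744, 920.137600, 50.892101, 38.634503,
    66.238522, 168.623031, 297.883200,
]

mean_kwh = statistics.mean(kwh)
median_kwh = statistics.median(kwh)
print("count=%d mean=%.6f median=%.6f" % (len(kwh), mean_kwh, median_kwh))
```

The mean (about 155.56) sits well above the median (about 59.55), so a handful of high-consumption countries pull the average up.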

Frequency and Relative Frequency Distributions

Income Classification Frequency

LowGDP       15

MediumGDP    14

HighGDP      14

Income Classification Relative Frequency

LowGDP      0.340909

MediumGDP   0.318182

HighGDP     0.318182

Per Capita Electricity Consumption Frequency
LowkWh       8

MediumkWh    7

HighkWh      7

Per Capita Electricity Consumption Relative Frequency
LowkWh      0.181818

MediumkWh   0.159091

HighkWh     0.159091

Locality Classification Frequency
rural    15

town     14

urban    15

Locality Classification Relative Frequency
rural   0.340909

town    0.318182

urban   0.340909
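
The electricity relative frequencies above sum to about 0.5 rather than 1. One reading, which is an assumption on my part, is that every relative frequency was divided by the 44 rows in the subset rather than by the number of non-missing values. The arithmetic is consistent with that:

```python
TOTAL_ROWS = 44  # assumed denominator: all rows in the Sub-Saharan subset

# Group counts copied from the frequency tables above
counts = {
    "LowGDP": 15, "MediumGDP": 14, "HighGDP": 14,
    "LowkWh": 8, "MediumkWh": 7, "HighkWh": 7,
    "rural": 15, "town": 14, "urban": 15,
}

relative = {name: round(n / TOTAL_ROWS, 6) for name, n in counts.items()}
for name, share in relative.items():
    print("%-10s %.6f" % (name, share))
```

Dividing by the count of non-missing values instead (22 for electricity) would make each set of relative frequencies sum to 1.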

Aggregating Per Capita Income over Income groups

                        incomeperperson            

                        size        mean

percapitagdp                            

LowGDP             15.000000  221.036056

MediumGDP          14.000000  435.062293

HighGDP            14.000000 2702.379115

Aggregating Per Capita Electric Consumption over Electric Consumption bands

                          relectricperperson           

                           size       mean

percapitakWh                              

LowkWh                 8.000000  26.934737

MediumkWh              7.000000  61.900107

HighkWh                7.000000 378.398571

Aggregating Urbanization rate over urban type

                 urbanrate          

              size      mean

areatype                    

rural    15.000000 22.645333

town     14.000000 38.118571

urban    15.000000 58.448000

Aggregating per capita income groups  across country 

percapitagdp  country                  mean

LowGDP        Burkina Faso            276.200413

              Central African Rep.    239.518749

              Chad                    275.884287

              Congo, Dem. Rep.        103.775857

              Eritrea                 131.796207

              Ethiopia                220.891248

              Guinea-Bissau           161.317137

              Liberia                 155.033231

              Madagascar              242.677534

              Malawi                  184.141797

              Mali                    269.892881

              Niger                   180.083376

              Sierra Leone            268.331790

              Togo                    285.224449

              Zimbabwe                320.771890

MediumGDP     Benin                   377.039699

              Gambia                  354.599726

              Ghana                   358.979540

              Guinea                  411.501447

              Kenya                   468.696044

              Lesotho                 495.734247

              Mozambique              389.763634

              Nigeria                 544.599477

              Rwanda                  338.266391

              Senegal                 561.708585

              Sudan                   523.950151

              Tanzania                456.385712

              Uganda                  377.421113

              Zambia                  432.226337

HighGDP       Angola                 1381.004268

              Botswana               4189.436587

              Cameroon                713.639303

              Cape Verde             1959.844472

              Congo, Rep.            1253.292015

              Cote d'Ivoire           591.067944

              Djibouti                895.318340

              Equatorial Guinea      8654.536845

              Gabon                  4180.765821

              Mauritania              609.131206

              Mauritius              5182.143721

              Namibia                2667.246710

              South Africa           3745.649852

              Swaziland              1810.230533

Aggregating per capita electric consumption groups across country

percapitakWh country mean

LowkWh Benin 38.222943

Congo, Dem. Rep. 30.709244

Eritrea 20.288131

Ethiopia 15.056236

Kenya 41.180003

Mozambique 31.386838

Namibia 0.000000

Tanzania 38.634503

MediumkWh
Cameroon 59.551245

Congo, Rep. 56.372450

Cote d'Ivoire 70.387444

Nigeria 74.064241

Senegal 55.794744

Sudan 50.892101

Togo 66.238522

HighkWh
Angola 172.999227

Botswana 454.795705

Gabon 537.104738

Ghana 97.246492

South Africa 920.137600

Zambia 168.623030

Zimbabwe 297.883200

Aggregating urbanrate groups across country

areatype country mean

rural Burkina Faso 19.560000

Chad 26.680000

Eritrea 20.720000

Ethiopia 17.000000

Guinea-Bissau 29.840000

Kenya 21.600000

Lesotho 25.460000

Madagascar 29.520000

Malawi 18.800000

Mali 32.180000

Niger 16.540000

Rwanda 18.340000

Swaziland 24.940000

Tanzania 25.520000

Uganda 12.980000

town
Benin 41.200000

Central African Rep. 38.580000

Congo, Dem. Rep. 33.960000

Equatorial Guinea 39.380000

Guinea 34.440000

Mauritania 41.000000

Mozambique 36.840000

Namibia 36.840000

Senegal 42.380000

Sierra Leone 37.760000

Somalia 36.520000

Togo 42.000000

Zambia 35.420000

Zimbabwe 37.340000

urban
Angola 56.700000

Botswana 59.580000

Cameroon 56.760000

Cape Verde 59.620000

Congo, Rep. 61.340000

Cote d'Ivoire 48.780000

Djibouti 87.300000

Gabon 85.040000

Gambia 56.420000

Ghana 50.020000

Liberia 60.140000

Mauritius 42.480000

Nigeria 48.360000

South Africa 60.740000

Sudan 43.440000

Summary of Analysis

After generating these summary statistics, one can deduce the following:

Countries in Sub-Saharan Africa with higher per capita electricity consumption tend to have higher per capita GDP and to be more urbanized.
