Introduction
In this week's assignment, the goal is to create various chart types for a dataset of your choice. In my case, I am working with the gapminder dataset. My previous analysis focused on countries in the Sub-Saharan Africa region. However, I have decided to use the whole dataset, without regard to regional differences, for the simple reason that I need a larger dataset for this analysis.
Code
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 18 16:02:17 2015
"""
import pandas
import os
import seaborn
import matplotlib.pyplot as plt
DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)
# Extracting data pertinent to variables of interest
econ_data = \
econ_data[['country', 'incomeperperson', 'relectricperperson', 'urbanrate']]
# bug fix for display format to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_columns', None)
# setting variables of interest to numeric
econ_data['incomeperperson'] = \
pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
pandas.to_numeric(econ_data['urbanrate'], errors='coerce')
# Univariate histogram for quantitative variable - incomeperperson
seaborn.distplot(econ_data['incomeperperson'].dropna(), kde=False)
plt.xlabel('Per Capita Income')
plt.ylabel('Frequency')
plt.title('Global Income distribution')
# Univariate histogram for quantitative variable - relectricperperson
seaborn.distplot(econ_data['relectricperperson'].dropna(), kde=False)
plt.xlabel('Per Capita kWh')
plt.ylabel('Frequency')
plt.title('Global Household Electricity Consumption')
# Univariate histogram for quantitative variable - urbanrate
seaborn.distplot(econ_data['urbanrate'].dropna(), kde=False)
plt.xlabel('Urban Rate')
plt.ylabel('Frequency')
plt.title('Global Urban Rate distribution')
# Grouping per capita income into three levels using the quartile cutoffs
# (bottom quarter, middle half, top quarter); rows with a missing value
# match no branch and are left unlabelled
def percapitagdp(row):
    if row['incomeperperson'] <= econ_data['incomeperperson'].quantile(.25):
        return 'LowGDP'
    if row['incomeperperson'] > econ_data['incomeperperson'].quantile(.25) and \
            row['incomeperperson'] <= econ_data['incomeperperson'].quantile(.75):
        return 'MediumGDP'
    if row['incomeperperson'] > econ_data['incomeperperson'].quantile(.75):
        return 'HighGDP'
econ_data['percapitagdp'] = \
    econ_data.apply(lambda row: percapitagdp(row), axis=1)
dropnas = econ_data['percapitagdp'].value_counts(sort=True, dropna=True)
print(dropnas)
# bivariate bar chart for Household Electricity Consumption and Income Levels C-> Q
seaborn.factorplot(x='percapitagdp', y='relectricperperson',
                   data=econ_data, kind='bar', ci=None)
plt.xlabel('Income Group')
plt.ylabel('Mean kWh Rate')
# basic scatterplot for incomeperperson vs relectricperperson Q-> Q
seaborn.regplot(x='incomeperperson', y='relectricperperson',
                data=econ_data, fit_reg=False)
plt.xlabel('Income Per Capita')
plt.ylabel('Household Electricity Consumption')
plt.title('Scatterplot to show Association between Income Per Capita and \
Electricity Per Capita')
# Grouping urban rate into three levels using the quartile cutoffs
def areatype(row):
    if row['urbanrate'] <= econ_data['urbanrate'].quantile(.25):
        return 'Village'
    if row['urbanrate'] > econ_data['urbanrate'].quantile(.25) and \
            row['urbanrate'] <= econ_data['urbanrate'].quantile(.75):
        return 'Town'
    if row['urbanrate'] > econ_data['urbanrate'].quantile(.75):
        return 'City'
econ_data['areatype'] = \
    econ_data.apply(lambda row: areatype(row), axis=1)
dropnas = econ_data['areatype'].value_counts(sort=False, dropna=True)
print(dropnas)
# bivariate bar chart for areatype(urbanrate) and relectricperperson C-> Q
seaborn.factorplot(x='areatype', y='relectricperperson',
                   data=econ_data, kind='bar', ci=None)
plt.ylabel('Household Electric Consumption')
plt.xlabel('Area Classification')
# basic scatterplot for urbanrate vs relectricperperson Q-> Q
seaborn.regplot(x='urbanrate', y='relectricperperson',
                data=econ_data, fit_reg=False)
plt.ylabel('Household Electric Consumption')
plt.xlabel('Urban Rate')
plt.title('Scatterplot to show Association between Urban Rate and \
Electricity Per Capita')
Univariate Graph for Per Capita Income
This graph is unimodal and positively skewed, with a large number of low-income countries and a small number of high-income countries. This is potential evidence of pronounced income disparity globally.
Univariate Graph for Per Household Electricity Use
This graph is unimodal, with the highest peak at around the 500 kWh level. The chart is also positively skewed, as evidenced by the fact that the measures of central tendency fall in this order: mode < median < mean.
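The ordering of the mean and median mentioned above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `skew_direction` and the sample values are made up for illustration and are not taken from gapminder.csv. Comparing the mean with the median is a rule of thumb rather than a formal test of skewness.

```python
import statistics


def skew_direction(values):
    """Classify skew with the mean-vs-median rule of thumb (a heuristic)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"   # long tail to the right pulls the mean up
    if mean < median:
        return "negative"   # long tail to the left pulls the mean down
    return "symmetric"


# Hypothetical per capita kWh values, for illustration only
sample_kwh = [100, 200, 300, 400, 500, 500, 800, 1500, 4000, 9000]
print(skew_direction(sample_kwh))
```

With the real data, the same check could be applied to the non-missing values of the relectricperperson column.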
From the charts showing income and household electricity consumption, one can surmise that there is some relationship between household electricity consumption and income levels. The extent of this relationship is explored further below.
Univariate Graph for Urban Rate
This graph appears to be bimodal, as evidenced by two sets of peaks at urban rates of about 40 and 70.
This suggests that, globally, countries tend to fall into two clusters: those with smaller urban populations and those with larger ones.
Bivariate Graph - relectricperperson vs. incomeperperson (C -> Q)
From this chart, there is evidence that countries in the high-income group consume more household electricity per capita.
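The bar chart compares the mean electricity use across income groups defined by the quartile cutoffs. A minimal sketch of that calculation is shown here with the standard library only, assuming no missing values and Python 3.8 or later; the helper names `income_group` and `mean_usage_by_group` and the sample numbers are hypothetical, and the cutoffs from `statistics.quantiles` are not guaranteed to match those pandas reports for the real data.

```python
import statistics


def income_group(income, q1, q3):
    """Label an income value as LowGDP, MediumGDP or HighGDP."""
    if income <= q1:
        return "LowGDP"
    if income <= q3:
        return "MediumGDP"
    return "HighGDP"


def mean_usage_by_group(incomes, usages):
    """Return the mean usage for each income group (assumes no missing values)."""
    # First and third quartile cutoffs of the income values
    q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
    grouped = {}
    for income, usage in zip(incomes, usages):
        grouped.setdefault(income_group(income, q1, q3), []).append(usage)
    return {label: statistics.fmean(vals) for label, vals in grouped.items()}


# Hypothetical values, for illustration only
incomes = [1, 2, 3, 4, 5, 6, 7, 8]
usages = [10, 20, 30, 40, 50, 60, 70, 80]
print(mean_usage_by_group(incomes, usages))
```

Each bar in the chart corresponds to one of these group means.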
Bivariate Graph - relectricperperson vs. incomeperperson (Q -> Q)
This graph shows a positive relationship between income and household electricity consumption.
Bivariate Graph - urbanrate vs. relectricperperson (C -> Q)
From this graph, there is evidence of a positive relationship between urban rate and household electricity consumption: higher urban rates are associated with higher household electricity consumption.
Bivariate Graph - urbanrate vs. relectricperperson (Q -> Q)
From this graph, one can discern a positive relationship between urbanrate and relectricperperson.
However, the relationship appears to be weak.
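One way to put a number on how strong or weak such a relationship is would be the Pearson correlation coefficient, which was not computed in this assignment. The sketch below is a plain-Python illustration under that assumption; the function name `pearson_r` and the sample values are hypothetical. Pairs containing a missing value (NaN) are skipped, similar in spirit to the `dropna()` calls used for the histograms.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, skipping NaN pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical urban rate and kWh values, for illustration only
urban = [20.0, 35.0, float("nan"), 50.0, 65.0, 80.0]
kwh = [300.0, 250.0, 400.0, 900.0, 700.0, 1500.0]
print(round(pearson_r(urban, kwh), 2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak one.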






