Thursday, November 12, 2015

Week 3: Making Data Management Decisions

In this week's assignment, I binned my variables of interest ("relectricperperson", "incomeperperson" and "urbanrate") into three categories each. The cut points for each variable are its 25th and 75th percentiles.
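As a minimal sketch of percentile-based binning, the snippet below computes the 25th and 75th percentiles with `Series.quantile` and passes them to `pandas.cut` as bin edges. The toy values and the labels are made up for illustration and are not taken from the gapminder data.

```python
import numpy
import pandas

# Hypothetical toy values (not gapminder data); the last entry is missing
values = pandas.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0,
                        numpy.nan])

# Cut points: 25th and 75th percentiles (missing values are skipped)
q25 = values.quantile(0.25)
q75 = values.quantile(0.75)

# Bins are right-inclusive by default: (-inf, q25], (q25, q75], (q75, inf)
binned = pandas.cut(values,
                    bins=[-numpy.inf, q25, q75, numpy.inf],
                    labels=['Low', 'Medium', 'High'])

print(binned.value_counts())
```

A row with a missing value stays missing after binning, so it drops out of any later groupby on the binned column.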

Code

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 12 12:59:21 2015
Week 3 submission
"""

import pandas
import os

DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_FILE = os.path.join(DATA_PATH, "gapminder.csv")
econ_data = pandas.read_csv(DATA_FILE, low_memory=False)

pandas.set_option('display.float_format', lambda x: '%f' % x)

# setting variables of interest to numeric
econ_data['incomeperperson'] = \
    pandas.to_numeric(econ_data['incomeperperson'], errors='coerce')
econ_data['relectricperperson'] = \
    pandas.to_numeric(econ_data['relectricperperson'], errors='coerce')
econ_data['urbanrate'] = \
    pandas.to_numeric(econ_data['urbanrate'], errors='coerce')


# Extracting data pertinent to variables of interest
relevant_data = \
    econ_data[['country', 'continent', 'incomeperperson',
               'relectricperperson', 'urbanrate']]

# Extract only countries in Sub-Saharan Africa.
# For this I added a new column, continent, to the gapminder dataframe
sub_sahara_africa = relevant_data[relevant_data['continent'] == 'Africa']
# print(sub_sahara_africa)

sub_sahara_africa.describe()

# Create a second copy of sub_sahara_africa for further data analysis
sub2_sahara_africa = sub_sahara_africa.copy()
# print(sub2_sahara_africa)


'''
Split income data into 3 strata - LowGDP, MediumGDP, HighGDP
Using percentile values to bin dataset
LowGDP: <= 25th percentile of incomeperperson
MediumGDP: between 25th percentile and 75th percentile of incomeperperson
HighGDP: greater than 75th percentile of incomeperperson

incomeperperson
count 43.000000
mean 1098.598153
std 1705.541951
min 103.775857
25% 272.888584
50% 411.501447
75% 804.478821
max 8654.536845

'''


def percapitagdp(row):
    if row['incomeperperson'] <= 272.888584:
        return 'LowGDP'
    if row['incomeperperson'] > 272.888584 and \
            row['incomeperperson'] <= 804.478821:
        return 'MediumGDP'
    if row['incomeperperson'] > 804.478821:
        return 'HighGDP'


sub2_sahara_africa['percapitagdp'] = \
    sub2_sahara_africa.apply(percapitagdp, axis=1)

# Calculate descriptive statistics for incomeperperson grouped by percapitagdp
per_capitagdp_stats = \
    sub2_sahara_africa.groupby('percapitagdp').agg(
        {'incomeperperson': ['count', 'mean', 'std', 'max']})

print(per_capitagdp_stats)


'''
Split relectricperperson data into 3 categories
- LowkWh, MediumkWh, HighkWh
LowkWh: <= 25th percentile of relectricperperson
MediumkWh: between 25th percentile and 75th percentile of relectricperperson
HighkWh: greater than 75th percentile of relectricperperson

relectricperperson
count 22.000000
mean 149.889484
std 222.403251
min 0.000000
25% 38.325833
50% 57.961848
75% 150.778896
max 920.137600

'''


def percapitakWh(row):
    if row['relectricperperson'] <= 38.325833:
        return 'LowkWh'
    if row['relectricperperson'] > 38.325833 and \
            row['relectricperperson'] <= 150.778896:
        return 'MediumkWh'
    if row['relectricperperson'] > 150.778896:
        return 'HighkWh'


sub2_sahara_africa['percapitakWh'] = \
    sub2_sahara_africa.apply(percapitakWh, axis=1)

# Calculate descriptive statistics for relectricperperson grouped by percapitakWh
per_capitakWh_stats = \
    sub2_sahara_africa.groupby('percapitakWh').agg(
        {'relectricperperson': ['count', 'mean', 'std', 'max']})
print(per_capitakWh_stats)


'''
Split urbanrate data into 3 categories - rural, town, urban
rural: <= 25th percentile of urbanrate
town: between 25th percentile and 75th percentile of urbanrate
urban: greater than 75th percentile of urbanrate

urbanrate
count 44.000000
mean 39.774091
std 17.063891
min 12.980000
25% 26.390000
50% 37.550000
75% 49.090000
max 87.300000

'''


def areatype(row):
    if row['urbanrate'] <= 26.390000:
        return 'rural'
    if row['urbanrate'] > 26.390000 and \
            row['urbanrate'] <= 49.090000:
        return 'town'
    if row['urbanrate'] > 49.090000:
        return 'urban'


sub2_sahara_africa['areatype'] = \
    sub2_sahara_africa.apply(areatype, axis=1)

# Calculate descriptive statistics for areatype across countries
per_capita_areatype_stats = \
    sub2_sahara_africa.groupby('areatype').agg(
        {'urbanrate': ['count', 'mean', 'std', 'max']})
print(per_capita_areatype_stats)


Output



             incomeperperson
                       count        mean         std         max
percapitagdp
HighGDP                   11 3265.406288 2281.132925 8654.536845
LowGDP                    11  196.132710   55.934473  269.892881
MediumGDP                 21  436.323409  120.507293  713.639303

             relectricperperson
                          count       mean        std        max
percapitakWh
HighkWh                       6 425.257250 284.256336 920.137600
LowkWh                        6  22.610565  13.856819  38.222943
MediumkWh                    10  61.036175  17.145977  97.246492

         urbanrate
             count      mean       std       max
areatype
rural           11 20.132727  4.027548 25.520000
town            22 37.951818  5.654132 48.780000
urban           11 63.060000 11.856559 87.300000

Summary

Per Capita GDP

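I split the income levels into three categories: HighGDP, MediumGDP and LowGDP.
The highest per capita income among the Sub-Saharan African countries (8654.54, in the HighGDP group) is about 32 times the highest per capita income in the LowGDP group (269.89). The highest per capita income in the MediumGDP group (713.64) is about 2.6 times the LowGDP maximum. Comparing group means instead, HighGDP countries average about 17 times the LowGDP mean.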

Per Capita Electricity Consumption

I also split electricity consumption per capita into three categories: HighkWh, MediumkWh and LowkWh.
The picture for per capita electricity consumption is similar. The highest per capita consumption in the HighkWh group (920.14) is about 24 times the highest in the LowkWh group (38.22), and the MediumkWh maximum (97.25) is about 2.5 times the LowkWh maximum. Only 22 countries in the subset have a value for this variable, compared with 44 for urbanrate.
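The ratios in the two sections above can be checked by hand from the output tables. The sketch below does that arithmetic for both the group maxima and the group means. The numbers are copied from the Output section; `ratio_to_base` is a helper name introduced here for illustration only.

```python
# Group statistics copied from the output tables
gdp = {
    'LowGDP': {'mean': 196.132710, 'max': 269.892881},
    'MediumGDP': {'mean': 436.323409, 'max': 713.639303},
    'HighGDP': {'mean': 3265.406288, 'max': 8654.536845},
}
kwh = {
    'LowkWh': {'mean': 22.610565, 'max': 38.222943},
    'MediumkWh': {'mean': 61.036175, 'max': 97.246492},
    'HighkWh': {'mean': 425.257250, 'max': 920.137600},
}


def ratio_to_base(stats, base, stat):
    """Divide each group's statistic by the same statistic of the base group."""
    return {group: vals[stat] / stats[base][stat]
            for group, vals in stats.items()}


gdp_max_ratios = ratio_to_base(gdp, 'LowGDP', 'max')
gdp_mean_ratios = ratio_to_base(gdp, 'LowGDP', 'mean')
kwh_max_ratios = ratio_to_base(kwh, 'LowkWh', 'max')
kwh_mean_ratios = ratio_to_base(kwh, 'LowkWh', 'mean')

for name, ratios in [('GDP max', gdp_max_ratios),
                     ('GDP mean', gdp_mean_ratios),
                     ('kWh max', kwh_max_ratios),
                     ('kWh mean', kwh_mean_ratios)]:
    print(name, {group: round(r, 2) for group, r in ratios.items()})
```

The figures of roughly 32 and 24 come from comparing group maxima. Comparing group means gives smaller ratios, so it is worth stating which statistic a ratio is based on.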

Urbanization

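Broadly, I classified countries into three groups by urban rate: rural, town and urban.
Because the cut points are the 25th and 75th percentiles, the groups hold 11, 22 and 11 countries by construction. In the urban group, about 63% of the population lives in urban areas on average, or roughly two-thirds. Across all 44 countries the mean urban rate is about 40%, so in the average country most of the population lives outside urban centers.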
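As a rough check on the urbanization figures, the sketch below takes the counts and mean urban rates from the output table and computes each group's share of countries and the count-weighted mean urban rate. This is a mean across countries, not a population-weighted figure, since the tables carry no population sizes.

```python
# Counts and mean urban rates copied from the output table
groups = {
    'rural': {'count': 11, 'mean': 20.132727},
    'town': {'count': 22, 'mean': 37.951818},
    'urban': {'count': 11, 'mean': 63.060000},
}

total_countries = sum(g['count'] for g in groups.values())

# Share of countries that fall in each category
share_of_countries = {name: g['count'] / total_countries
                      for name, g in groups.items()}

# Count-weighted mean of the group means (mean urban rate across countries)
overall_mean = sum(g['count'] * g['mean']
                   for g in groups.values()) / total_countries

print(share_of_countries)
print(round(overall_mean, 2))
```

The weighted mean agrees with the overall urbanrate mean of 39.774091 reported by describe() in the code above.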
