Data Analysis and Interpretation: 2016

Saturday, April 16, 2016

Module 3. Exploratory Data Analysis

Result Section

As part of Data Exploration Analysis, I created two artifacts to understand the data.
The first one is Fig 1: Pair plot for Adjusted National Income Per Capita(x11_2012) and multiple explanatory variables. This shows relationship between the response variable and a sample of the explanatory variables. A review of the plot shows that response variable - Adjusted National Income Per Capita is not normally distributed. However, Urban Population Growth (x284_2012) is.

Also, there appears to be a perfect negative correlation between Rural Population (x258_2012) and Urban Population (x283_2012. Positive linear correlation exists between Female Survival to Age 65(x274_2012) and Male Survival to Age 65(x275_2012).

Fig 2: Heat Map for Adjusted National Income Per Capita and multiple explanatory variables quantifies the behavior demonstrated in the pair plot.
Pearson's correlation coefficient between Rural Population (x258_2012) and Urban Population (x283_2012 is -1, and 0.93 between Female Survival to Age(x274_2012) and Male Survival to Age 65(x275_2012).

Fig 3 depicts a plot of regression coefficients retained from LASSO Regression Analysis.
A total of 14(16%) features were retained out of 86.

Fig. 4 depicts the mean square error graph for each of the retained variables. It shows that MSE decline from 1 to a low point of about 0.1 and start to increase thereafter. This corresponds to about 2.5 folds.

Table 1 lists all the retained features.

Fig 2: Heat Map for Adjusted National Income Per Capita and multiple explanatory variables

Fig 3

Fig 4

Tabel 1: Retain Variables

Variables	Description	Regression Coefficient
x149_2012	Health Expenditure Per Capita(Current US$)	0.629593475
x142_2012	GDP Per Capita(Current US$)	0.172713173
x49_2012	Automated Teller Machines(Per 100,000 adults)	0.081700721
x218_2012	Population Ages 65 and above(% of total)	0.073893377
x242_2012	Private Credit Bureau Coverage(% of adults)	0.045980527
x275_2012	Male Survival to ages 65 (% of Cohort)	0.026677315
x283_2012	Urban Population(% of total)	0.015867453
x161_2012	Industry Value Added(% of GDP)	0.008073948
x220_2012	Population Growth(Annual %)	0.007546031
x25_2012	Adolescent Fertility Rate(births per 1,000 women ages 15-19)	-0.004927315
x86_2012	Commercial Bank Branches(per 100,000 adults)	-0.032404314
x153_2012	Household Final Consumption Expenditure(% of GDP)	-0.041665547
x223_2012	Female Population(% of total)	-0.149783923
x219_2012	Population Density(People per sq. km of land area)	-0.150375913

Model Evaluation

MSE train: 0.083, MSE test: 0.070

R^2 train: 0.917, R^2 test: 0.930

The robustness of the model is evidenced from a lower MSE for the test data 0.070 compared to MSE for the training dataset of 0.083.

The R-squared values shows that the testing dataset explain about 93% of the variability in the response variable - Adjusted National Income Per Capita and retained explanatory variables - Health Expenditure Per Capita, GDP Per Capita, Automated Teller Machines(Per 100,000 adults), Population ages 65 and above, Private Bureau Coverage, Male Survival to ages 65, Urban Population, Industry Value Added, Population Growth, Adolescent Fertility Rate(births per 1,000 women ages 15-19), Commercial Bank Branches(per100,000 adults), Household Final Consumption Expenditure, Female Population and Population Density(People per sq. km of land area) compare to about 92% of the training dataset for the same response and explanatory variables.

Code

import pandas as pd

import os

import numpy as np

# bug fix for display format to avoid run time errors

pd.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame

pd.set_option('display.max_rows', None)

pd.set_option('display.max_columns', None)

##############################################################################

# DATA MANAGEMENT

##############################################################################

# define method to load data of interest

def load_data(data_dir, csv_file):

DATA_PATH = os.path.join(os.getcwd(), data_dir)

DATA_FILE = os.path.join(DATA_PATH, csv_file)

data = pd.read_csv(DATA_FILE, low_memory=False)

return data

# loading data

worldbank_df = load_data('data', 'worldbank.csv')

# Deleting regional and non classified coutry data

worldbank_df = worldbank_df.drop([39, 59, 60, 69, 70, 71, 75, 92, 93, 94, 95, 120, 121, 123, 130, 131, 132, 147, 148, 165, 168, 169, 171, 211, 212])

worldbank_df.drop('country', axis = 1, inplace=True)

print(worldbank_df.columns.values)

worldbank_df.dtypes

# Will need to add the column names after imputing missing values

index = ['x1_2012', 'x2_2012', 'x9_2012', 'x11_2012', 'x12_2012', 'x14_2012',

'x15_2012','x16_2012', 'x18_2012', 'x19_2012', 'x21_2012', 'x25_2012',

'x29_2012', 'x31_2012', 'x35_2012', 'x36_2012', 'x37_2012', 'x38_2012',

'x45_2012', 'x47_2012', 'x48_2012', 'x49_2012', 'x58_2012', 'x67_2012',

'x68_2012', 'x69_2012', 'x86_2012', 'x100_2012', 'x121_2012', 'x125_2012',

'x126_2012', 'x129_2012', 'x131_2012', 'x132_2012', 'x134_2012', 'x139_2012',

'x140_2012', 'x142_2012', 'x143_2012', 'x146_2012', 'x149_2012', 'x150_2012',

'x153_2012', 'x154_2012', 'x155_2012', 'x156_2012', 'x157_2012', 'x161_2012',

'x162_2012', 'x163_2012', 'x167_2012', 'x169_2012', 'x171_2012', 'x172_2012',

'x173_2012', 'x174_2012', 'x179_2012', 'x187_2012', 'x190_2012', 'x191_2012',

'x192_2012', 'x195_2012', 'x204_2012', 'x205_2012', 'x211_2012', 'x212_2012',

'x213_2012', 'x218_2012', 'x219_2012', 'x220_2012', 'x221_2012', 'x222_2012',

'x223_2012', 'x242_2012', 'x243_2012', 'x244_2012', 'x253_2012', 'x255_2012',

'x258_2012', 'x261_2012', 'x268_2012', 'x274_2012', 'x275_2012', 'x277_2012',

'x283_2012', 'x284_2012', 'x9_2013', 'x11_2013', 'x12_2013', 'x14_2013',

'x15_2013', 'x16_2013', 'x18_2013', 'x19_2013', 'x21_2013', 'x25_2013',

'x29_2013', 'x31_2013', 'x35_2013', 'x36_2013', 'x41_2013', 'x42_2013',

'x45_2013', 'x47_2013', 'x48_2013', 'x49_2013', 'x58_2013', 'x86_2013',

'x100_2013', 'x121_2013', 'x125_2013', 'x126_2013', 'x129_2013', 'x131_2013',

'x132_2013', 'x134_2013', 'x139_2013', 'x140_2013', 'x142_2013', 'x143_2013',

'x146_2013', 'x149_2013', 'x150_2013', 'x153_2013', 'x154_2013', 'x155_2013',

'x156_2013', 'x157_2013', 'x161_2013', 'x162_2013', 'x167_2013', 'x169_2013',

'x171_2013', 'x172_2013', 'x173_2013', 'x174_2013', 'x187_2013', 'x190_2013',

'x191_2013', 'x192_2013', 'x204_2013', 'x211_2013', 'x213_2013', 'x216_2013',

'x218_2013', 'x219_2013', 'x220_2013', 'x221_2013', 'x222_2013', 'x223_2013',

'x242_2013', 'x243_2013', 'x244_2013', 'x255_2013', 'x258_2013', 'x261_2013',

'x267_2013', 'x268_2013', 'x274_2013', 'x275_2013', 'x283_2013','x284_2013']

# Changing column names to human readable ones

columns = {'x1_2012' : 'Access to Electricity(% pop)', 'x2_2012' : 'Access to Non Solid Fuel(% pop)',

'x9_2012' : 'Net National Income(US$) ', 'x11_2012' : 'Net National Income Per Capita(US$)',

'x12_2012' : 'CO2 Damage(% GNI)', 'x14_2012' : 'Cons. of fixed capital(% GNI)',

'x15_2012' : 'Cons fixed capital(US$)', 'x16_2012' : 'Education Expenditure(% GNI)',

'x18_2012' : 'Energy Depletion(% GNI)', 'x19_2012' : 'Energy Depletion(US$)',

'x21_2012' : 'Nat. Resource Depletion(% GNI)', 'x25_2012' : 'Fertility Rate',

'x29_2012' : 'Age Dependency Ratio(% working-age pop)', 'x31_2012' : 'Agric Land(% land area)',

'x35_2012' : 'Agric Value Added(% GDP)', 'x36_2012' : 'Agric Value Added(Annual % Growth)',

'x37_2012' : 'Air Transport: Passengers Carried', 'x38_2012' : 'Air Transoport: Registered Carrier Departures Worldwide',

'x45_2012' : 'Arable Land(% of land area)', 'x47_2012' : 'Armed Forces Personnel(% of total labor force)',

'x48_2012' : 'Total Armed Forces Personnel', 'x49_2012' : 'ATMS(per 100K adults)', 'x58_2012' : 'Crude birth rate(per 1K ppl)',

'x67_2012' : 'Death: various causes', 'x68_2012' : 'Death: injury causes(% total)',

'x69_2012' : 'Death: Non communicable disease(% of total)', 'x86_2012' : 'Commercial bank branches(per 100K adults)',

'x100_2012' : 'Crude death rate(per 1K ppl)', 'x121_2012' : 'Exports of goods and service(% of GDP)',

'x125_2012' : 'Total fertility rate', 'x126_2012' : 'Fixed broadband subscr(per 100 ppl)',

'x129_2012' : 'Food prod. index(2004-2006=100)', 'x131_2012' : 'Foreign drct. investment, inflows(% of GDP)',

'x132_2012' : 'Foreign direct investment, net inflows(BOP, US$)', 'x134_2012' : 'Forest Area(% land area)',

'x139_2012' : 'GDP at market prices(US$)', 'x140_2012' : 'GDP Growth(annual %)', 'x142_2012' : 'GDP per capita(US$)',

'x143_2012' : 'GDP per capita growth(annual %)', 'x146_2012' : 'Gross domestic savings(% of GDP)',

'x149_2012' : 'Health exp. per capita(US$)', 'x150_2012' : 'Total health exp(% of GDP).',

'x153_2012' : 'Hshld final consumption expd.(% of GDP)', 'x154_2012' : 'Imports: goods & services(% of GDP)',

'x155_2012' : 'Imprvd Sanitation facilities(% pop w/ acess)', 'x156_2012' : 'Imprvd water sources(% pop w/ access)',

'x157_2012' : 'Incidence of TB(per 100K ppl)', 'x161_2012' : 'Industry: value added(% of GDP)',

'x162_2012' : 'Inflation(annual %)', 'x163_2012' : 'Intentional homicides(per 100K ppl)',

'x167_2012' : 'Internet Users(per 100 ppl)', 'x169_2012' : 'Female labor force(% total labor force)',

'x171_2012' : 'Female life expectancy at birth(years)', 'x172_2012' : 'Male life expectancy at birth(years)',

'x173_2012' : 'Total life expectancy at birth(years)', 'x174_2012' : 'Lifetime risk of maternal death(%)',

'x179_2012' : 'Manufg. value added(% of GDP)', 'x187_2012' : 'Mobile cellular subscr.(per 100 ppl)',

'x190_2012' : 'Infant mortality rate(per 1K live births)', 'x191_2012' : 'Neonatal mortality rate(per 1K live births',

'x192_2012' : 'Under-5 mortality rate(per 1K)', 'x195_2012' : 'Net Migration', 'x204_2012' : 'OOP health exp.(% total health expd.)',

'x205_2012' : 'Female primary education(%)', 'x211_2012' : 'Paid personal remit(US$)', 'x212_2012' : 'Rcvd personal remit(% of GDP)',

'x213_2012' : 'Rcvd personal remit(US$)', 'x218_2012' : 'Pop. ages 65+(% of Total)', 'x219_2012' : 'Pop. density',

'x220_2012' : 'Pop. growth(annual %)', 'x221_2012' : 'Pop. ages 0-14(% of total)', 'x222_2012' : 'Pop. ages 15-64(% of total)',

'x223_2012' : 'Female pop.(% total)', 'x242_2012' : 'Pvt CR bureau coverage(% adults)', 'x243_2012' : 'Prop. of women natl. plmt',

'x244_2012' : 'Public CR bureau coverage(% adults)', 'x253_2012' : 'Renewable Elect. output',

'x255_2012' : 'Freshwater resources per capita(cubic meters)', 'x258_2012' : 'Rural pop', 'x261_2012' : 'Secure internet servers(per 1M ppl)',

'x268_2012' : 'Surface area(sq. km)', 'x274_2012' : 'Female survival age 65(% cohort)', 'x275_2012' : 'Male survival age 65(% cohort)',

'x277_2012' : 'Protected areas: terrestrial & matrine(% territorial area)',

'x283_2012' : 'Urban population(% total)', 'x284_2012' : 'Urban pop growth(%)'}

# Imputtig missing values

from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='median', axis=0)

worldbank_df = imr.fit_transform(worldbank_df)

worldbank_df = pd.DataFrame(worldbank_df)

# populating the column headings

worldbank_df.columns = index

print(worldbank_df.columns.values)

# Data: 2012

X = worldbank_df.iloc[:, :86]

print(X.columns.values)

# renaming the columns

X = X.rename(index=str, columns=columns)

print(X.columns.values)

sample_cols = ['Education Expenditure(% GNI)', 'Fertility Rate',

'Foreign direct investment, net inflows(BOP, US$)', 'Fixed broadband subscr(per 100 ppl)',

'Health exp. per capita(US$)', 'Female labor force(% total labor force)', 'Infant mortality rate(per 1K live births)',

'Paid personal remit(US$)', 'Female survival age 65(% cohort)', 'Male survival age 65(% cohort)',

'Imports: goods & services(% of GDP)', 'Public CR bureau coverage(% adults)', 'Net National Income Per Capita(US$)']

import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style='whitegrid', context='notebook', font_scale=1.5)

plt.title('Pairwise Plot for Net National Per Capita Income(US$) and explanatory variables')

# Create pairwise scatter plot

sns.pairplot(X[sample_cols], size=5)

plt.show()

cm = np.corrcoef(X[sample_cols].values.T)

sns.set(font_scale=1.0)

hm = sns.heatmap(cm,

cbar=True,

annot=True,

square=True,

fmt='.2f',

annot_kws={'size': 11.0},

yticklabels=X[sample_cols].columns.values,

xticklabels=X[sample_cols].columns.values)

plt.title('Pairwise Plot for Net National Per Capita Income(US$) and explanatory variables')

plt.show()

y = X['Net National Income Per Capita(US$)']

X.drop('Net National Income Per Capita(US$)', axis=1, inplace=True)

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

#LASSO Regression

# Standardization of Features

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train_std = sc.fit_transform(X_train)

X_test_std = sc.fit_transform(X_test)

y_train_std = sc.fit_transform(y_train)

y_test_std = sc.fit_transform(y_test)

from sklearn.linear_model import LassoLarsCV

lcv = LassoLarsCV(cv=9, precompute=False)

lcv.fit(X_train_std, y_train_std)

lcv_dict = dict(zip(X_train.columns.values,lcv.coef_))

import operator

sorted(lcv_dict.items(), key=operator.itemgetter(-1), reverse=True)

# plot coefficient progression

m_log_alphas = -np.log10(lcv.alphas_)

ax = plt.gca()

plt.plot(m_log_alphas, lcv.coef_path_.T)

plt.axvline(-np.log10(lcv.alpha_), linestyle='--', color='k', label='alpha CV')

plt.ylabel('Regression Coefficients')

plt.xlabel('-log(alpha)')

plt.legend()

plt.title('Regression Coefficients for Lasso Paths')

# plot mean square error for each fold

m_log_alphascv = -np.log10(lcv.cv_alphas_)

plt.figure()

plt.plot(m_log_alphascv, lcv.cv_mse_path_, ':')

plt.plot(m_log_alphascv, lcv.cv_mse_path_.mean(axis=-1), color='k',

label='Average across the folds', linewidth=2)

plt.axvline(-np.log10(lcv.alpha_), linestyle='--', color='k',

label='alpha CV')

plt.legend()

plt.xlabel('-log(alpha)')

plt.ylabel('Mean squared error')

plt.title('Mean squared error on each fold')

# Model Evealuation MSE

y_train_pred_std = lcv.predict(X_train_std)

y_test_pred_std = lcv.predict(X_test_std)

# Residuals Plot

plt.scatter(y_train_pred_std,

y_train_pred_std - y_train_std,

c='black',

marker='o',

s=35,

alpha=0.5,

label='Training data')

from sklearn.metrics import mean_squared_error

print('MSE train: %.3f, MSE test: %.3f' % (mean_squared_error(y_train_std, y_train_pred_std), mean_squared_error(y_test_std, y_test_pred_std)))

# R^2

from sklearn.metrics import r2_score

print('R^2 train: %.3f, R^2 test: %.3f' % (r2_score(y_train_std, y_train_pred_std), r2_score(y_test_std, y_test_pred_std)))

# Pairplot and heatmap for retained features

retained_features = ['Health exp. per capita(US$)', 'GDP per capita(US$)', 'ATMS(per 100K adults)', \

'Pop. ages 65+(% of Total)' , 'Pvt CR bureau coverage(% adults)', 'Male survival age 65(% cohort)',\

'Urban population(% total)', 'Industry: value added(% of GDP)', 'Pop. growth(annual %)',\

'Fertility Rate', 'Commercial bank branches(per 100K adults)', 'Hshld final consumption expd.(% of GDP)',\

'Female pop.(% total)','Pop. density']

sns.pairplot(X[retained_features], size=5)

plt.show()

ret_feat = np.corrcoef(X[retained_features].values.T)

sns.set(font_scale=1.0)

hm = sns.heatmap(ret_feat,

cbar=True,

annot=True,

square=True,

fmt='.2f',

annot_kws={'size': 10.5},

yticklabels=X[retained_features].columns.values,

xticklabels=X[retained_features].columns.values)

plt.title('Heat Map of LASSO Regression Retained Features')

plt.show()

Saturday, April 9, 2016

Milestone Assignment 2: Methods

1. Sample

The dataset for my analysis is from the World Bank. Primarily, this is a country specific socio-economic indicators compiled from internationally recognized sources. The data represent the most current and accurate global development data available. It also includes regional and global aggregate datasets in addition to the national ones.

After accounting for regional global aggregates, the World Bank dataset consists of N = 233 countries. and territories and more than 162 features covering years 2012 and 2013.

2. Measures

The response variable is Adjusted National Income Per Capita (current US$) and the explanatory variables are a mixture of economic, social and human development index indicators.
Some of them are - Fixed Broadband Subscriptions (Per 100 People), GDP Per Capita(Current US$), Health Expenditure Per Capita (Current US$), Adjusted Savings: Consumption of Fixed Capital (Current US$), Manufacturing Value Added(% of GDP), Mobile Cellular Subscriptions (Per 100 people), Adjusted Savings: Energy Depletion (Current US$), Personal Remittances, Received (% of GDP), Population Ages 65 and above (% of Total), Population Density (People per sq. km of Land area), Adjusted Savings: Natural Resources Depletion (% of GNI), Population, Female(% of Total), Private Credit Bureau Coverage (% of Adults), Proportion of seats held by Women in National Parliaments (% Total), Adolescent Fertility Rate (Births per 1000 women ages 15-19), Terrestrial and Marine protected areas(% of territorial area), Agricultural Land(% of land area) and Agriculture, Value Added (Annual % Growth)

3. Analyses

As part of the exploratory data analysis, a pairwise scatter plot was created to visually detect the presence of outliers, the distribution of the data and the relationship between features. All missing values were imputed using media values of the features.

Fig 1: Pairwise Scatter Plot of Features

For instance, it appears from Fig 1: Pairwise Scatter Plot Features, that Adjusted Savings: Consumption of Fixed Capital(% of GNI)[x14_2012] is normally distributed. However, it is not clear whether a linear relationship exists between the response variable - Adjusted Net National Income Per Capita(current US$)[x11_2012 and sample explanatory variables.

To quantify the linear relationship between the sample features, a heatmap was created using correlation matrix, which contains Pearson product-moment correlation coefficients(Pearson's r). Pearson's r measures the linear dependence between pairs of features. The correlation coefficients are bounded to the range -1 and 1.

Fig 2: Heat Map of Correlation Matrix

From Fig 2: Heat Map of Correlation Matrix, there appears to be some linear relationship between adjusted Savings: Consumption of Fixed Capital(% of GNI)[x14_2012] and Access to Electricity (% of population)[x1_2012] , Access to Non-Solid Fuel(% of population)[x2_2012], Adjusted Savings: Consumption of fixed capital(% of GNI)[x14_2012] and Adjusted Savings: Consumption of fixed capital(current US$)[x15_2012]

OLS Summary

OLS analysis shows that a relationship exists between Adjusted National Income Per Capita (current US$) and the following statistically significant explanatory variables:

Health Expenditure, Total(% of GDP)[beta = -5.79e-09, p-value = 0.032)]
Personal Remittances Paid(Current US$) [beta = -124.9623, p-value = 0.019]
Age Dependency Ratio(% of working age population) [beta = 557.8104, p-value = 0.019]
Agriculture, Value Added(% of GDP)[beta = -79.7164, p-value = 0.053]
Air Transport, Registered Departures Worldwide[beta = -0.0055, p-value = 0.057]
Automated Teller Machines(ATM)[beta = 23.9621, p-value = 0.014]
Birth Rate, Crude(Per 1,000 people)[beta = 641.0533, p-value = 0.002]
Fixed Broadband Subscriptions(Per 100 People)[beta = -115.1073, p-value = 0.026]
Manufacturing, Value Added(% of GDP)[beta = -126.7740, p-value = 0.028]
Mortality Rate, Infant(Per 1,000 Live Births)[beta = 242.0010, p-value = 0.056]
Out-of-pocket Health Expenditure(% of Total Expenditure on Health)[beta = -51.7340, p-value = 0.016]
Population, Female(% of Total)[beta = -398.8621, p-value = 0.039]

For further analysis, I will train and test my models with 2012 dataset.
2012 data was randomly split with testing set accounting for 30% of the data(N=67) and training set made up the remainder 70% of size N=156.

My goal is to perform LASSO Regression analysis to ascertain economic and social indicators that contribute significantly to a country's economic growth and development.

Saturday, April 2, 2016

Capstone Project

Milestone Assignment 1: Title and Introduction to the Research Question

Title: The association between economic indicators and GDP Per Capita

The purpose of this study is to identify predictors - economic indicators that contribute significantly to a country's economic growth measured by GDP Growth (Annual %) - response variable. The predictors of choice among many are: Fixed Broadband Subscriptions(per 100 people), Foreign Direct Investment, net inflows(% of GDP), Health Expenditure, total(% of GDP), Household final consumption expenditure(% of GDP), Import of Goods and Services(% of GDP), Improved Sanitation Facilities(% of population with access) and Industry Value added (% of GDP)

My motivation is to identify economic activities via economic indicators that contribute to economic growth, which invariable will reduce the incidence of persistence poverty globally. Over the years, a myriad of international organization from International Monetary Fund and the World Bank have prescribed panacea to ending endemic economic stagnation globally. The result is a mix bag. While some countries have made progress over this period, other countries are still mired in stagnant economic existence.

This research will attempt to identify potential list of economic indicators that are critical to economic growth and development. This will give policy makers economic areas to invest in to actually arrest the current economic decline across the globe and in particular emerging economies.

Worldbank Data Codebook

Saturday, February 27, 2016

K-Means Cluster Analysis

Introduction

In this blog post, I will continue using gapminder data to perform k-means cluster analysis
My interest is ascertaining the contribution of each variable in the dataset to per person electricity consumption(kWh) globally.

Code

from pandas import DataFrame
import pandas as pd
import os
import numpy as np
import matplotlib.pylab as plt

from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

# bug fix for display format to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Set pandas to display all columns and rows in DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)##############################################################################
# DATA MANAGEMENT
##############################################################################

# define method to load data of interest

def load_data(data_dir, csv_file):
        DATA_PATH = os.path.join(os.getcwd(), data_dir)
        DATA_FILE = os.path.join(DATA_PATH, csv_file)
        data = pd.read_csv(DATA_FILE, low_memory=False)
        return data

# loading data
gapminder_df = load_data('data', 'gapminder.csv')

#upper-case all DataFrame column names
gapminder_df.columns = map(str.upper, gapminder_df.columns)

if __name__ == "__main__":
    cluster = gapminder_df.iloc[:, 2:]

    def convert_to_numeric(mycluster):
        for i in mycluster:
            mycluster.loc[:,i] = pd.to_numeric(mycluster.loc[:,i], errors='coerce')
        return mycluster
    numeric_df = convert_to_numeric(cluster)
    print(numeric_df)


    def map_target(row):
        if row['POLITYSCORE'] > 6:
            return 0 # Democracy
        else:
            return 1 # Dictatorship

    cluster.loc[:,'DEMO'] = cluster.apply(lambda row: map_target(row), axis=1)
    cluster.drop('POLITYSCORE', axis=1, inplace=True)

    data_clean = cluster.fillna(0)

    # standardize cluster to have mean=0 and sd=1
    def standardize_cluster(mycluster):
        print(mycluster.columns.values)
        clustervar = mycluster.copy()
        clustervar.drop('RELECTRICPERPERSON', axis=1, inplace=True)
        print(mycluster.columns.values)

        for i in clustervar:
            clustervar.loc[:,i] = preprocessing.scale(clustervar.loc[:,i].astype('float64'))
        return clustervar

    clustervar = standardize_cluster(data_clean)
    print( clustervar)

    ##############################################################################
    # END DATA MANAGEMENT
    ##############################################################################


    ###############################################################################
    # K-Means CLUSTER ANALYSIS
    ###############################################################################
    # split data into train and test sets
    clus_train, clus_test =\
                train_test_split(clustervar, test_size=.4, random_state=123)

    # k-means cluster analysis for 1-15 clusters
    from scipy.spatial.distance import cdist
    clusters = range(1,15)
    meandist = []

    for k in clusters:
        model = KMeans(n_clusters = k)
        model.fit(clus_train)
        clusassign = model.predict(clus_train)
        meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
        / clus_train.shape[0])

    """
    Plot average distance from observations from the cluster centroid
    to use the Elbow Method to identify number of clusters to choose
    """

    plt.plot(clusters, meandist)
    plt.xlabel('Number of clusters')
    plt.ylabel('Average distance')
    plt.title('Selecting k with the Elbow Method')

    # Interpret 3 cluster solution base on the Elbow method
    model3 = KMeans(n_clusters=3)
    model3.fit(clus_train)
    clusassign = model3.predict(clus_train)

    #labels
    labels = model3.labels_

    # plot clusters

    from sklearn.decomposition import PCA
    pca_2 = PCA(2)
    plot_columns = pca_2.fit_transform(clus_train)
    plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels,)
    plt.xlabel('Canonical variable 1')
    plt.ylabel('Canonical variable 2')
    plt.title('Scatterplot of Canonical Variables for 2 Clusters')
    plt.show()

    """
    BEGIN multiple steps to merge cluster assignment with clustering variables to examine
    cluster variable means by cluster
    """
    # create a unique identifier variable from the index for the
    # cluster training data to merge with the cluster assignment variable
    clus_train.reset_index(level=0, inplace=True)
    # create a list that has the new index variable
    cluslist = list(clus_train['index'])
    # create a list of cluster assignments
    # combine index variable list with cluster assignment list into a dictionary
    newlist = dict(zip(cluslist, labels))
    newlist
    # convert newlist dictionary to a dataframe
    newclus = DataFrame.from_dict(newlist, orient='index')
    newclus
    # rename the cluster assignment column
    newclus.columns = ['cluster']

    # now do the same for the cluster assignment variable
    # create a unique identifier variable from the index for the
    # cluster assignment dataframe
    # to merge with cluster training data
    newclus.reset_index(level=0, inplace=True)
    # merge the cluster assignment dataframe with the cluster training variable dataframe
    # by the index variable
    merged_train = pd.merge(clus_train, newclus, on='index')
    merged_train.head(n=100)
    # cluster frequencies
    merged_train.cluster.value_counts()

    """
    END multiple steps to merge cluster assignment with clustering variables to examine
    cluster variable means by cluster
    """

    # FINALLY calculate clustering variable means by cluster
    clustergrp = merged_train.groupby('cluster').mean()
    print ("Clustering variable means by cluster")
    print(clustergrp)


    # validate clusters in training data by examining cluster differences in RELECTRICPERPERSON using ANOVA
    # first have to merge RELELCTRICPERPERSON with clustering variables and cluster assignment data
    relectricperperson_data = data_clean['RELECTRICPERPERSON']
    # split RELECTRICPERPERSON data into train and test sets
    relectricperperson_train, relectricperperson_test =\
            train_test_split(relectricperperson_data, test_size=.4, random_state=123)

    relectricperperson_train1 = pd.DataFrame(relectricperperson_train)
    relectricperperson_train1.reset_index(level=0, inplace=True)
    merged_train_all = pd.merge(relectricperperson_train1, merged_train, on='index')

    kWh = merged_train_all[['RELECTRICPERPERSON', 'cluster']].dropna()

    import statsmodels.formula.api as smf
    import statsmodels.stats.multicomp as multi

    kWhmod = smf.ols(formula='RELECTRICPERPERSON ~ C(cluster)', data=kWh).fit()
    print(kWhmod.summary())

    print ('means for RELECTRICPERPERSON by cluster')
    kWhMean = kWh.groupby('cluster').mean()
    print (kWhMean)

    print ('standard deviations for RELECTRICPERPERSON by cluster')
    kWhSD= kWh.groupby('cluster').std()
    print (kWhSD)

    mc1 = multi.MultiComparison(kWh['RELECTRICPERPERSON'], kWh['cluster'])
    res1 = mc1.tukeyhsd()
    print(res1.summary())

    """
    VALIDATION
    BEGIN multiple steps to merge cluster assignment with clustering variables to examine
    cluster variable means by cluster in test data set
    """
    # create a variable out of the index for the cluster training dataframe to merge on
    clus_test.reset_index(level=0, inplace=True)
    # create a list that has the new index variable
    cluslistval = list(clus_test['index'])
    # create a list of cluster assignments
    labelsval=list(labels)
    # combine index variable list with labels list into a dictionary
    newlistval=dict(zip(cluslistval, labelsval))
    newlistval
    # convert newlist dictionary to a dataframe
    newclusval=DataFrame.from_dict(newlistval, orient='index')
    newclusval
    # rename the cluster assignment column
    newclusval.columns = ['cluster']
    # create a variable out of the index for the cluster assignment dataframe to merge on
    newclusval.reset_index(level=0, inplace=True)
    # merge the cluster assignment dataframe with the cluster training variable dataframe
    # by the index variable
    merged_test = pd.merge(clus_test, newclusval, on='index')
    # cluster frequencies
    merged_test.cluster.value_counts()
    """
    END multiple steps to merge cluster assignment with clustering variables to examine
    cluster variable means by cluster
    """

    # calculate test data clustering variable means by cluster
    clustergrpval = merged_test.groupby('cluster').mean()
    print ("Test data clustering variable means by cluster")
    print(clustergrpval)

    # validate clusters in testing data by examining cluster differences in RELECTRICPERPERSON using ANOVA
    # first have to merge RELELCTRICPERPERSON with clustering variables and cluster assignment data

    relectricperperson_test1 = pd.DataFrame(relectricperperson_test)
    relectricperperson_test1.reset_index(level=0, inplace=True)
    merged_test_all = pd.merge(relectricperperson_test1, merged_test, on='index')

    kWh_test = merged_test_all[['RELECTRICPERPERSON', 'cluster']].dropna()

    import statsmodels.formula.api as smf
    import statsmodels.stats.multicomp as multi

    kWh_testmod = smf.ols(formula='RELECTRICPERPERSON ~ cluster', data=kWh_test).fit()
    print(kWh_testmod.summary())

    print ('Test Data means for RELECTRICPERPERSON by cluster')
    kWh_testMean = kWh_test.groupby('cluster').mean()
    print (kWh_testMean)

    print ('Test Data standard deviations for RELECTRICPERPERSON by cluster')
    kWh_testSD= kWh_test.groupby('cluster').std()
    print (kWh_testSD)

    mc1_test = multi.MultiComparison(kWh_test['RELECTRICPERPERSON'], kWh_test['cluster'])
    res1_test = mc1_test.tukeyhsd()
    print(res1_test.summary())

Summary

I conducted a k-means cluster analysis to identify subgroups of countries that we can use per person electricity consumption as a proxy for economic attainment. This is based on 15 economic variables deemed to influence a country's economic growth.

Clustering variables includes categorical variable of a country's political maturity - Democracy or Dictatorship and quantitative variables of per capita income, alcohol consumption, rate of growth of a country's armed forces, breast cancer infection per 100 people, carbon dioxide emission, employment rate, female employment rate, HIV infection rate, internet use rate, life expectancy, oil use per person, suicide per 100 people, employment rate, urbanization rate and per person electricity consumption. I standardized/normalized all clustering variables to have a mean of 0 and standard deviation of 1

I randomly split data into training set consisting of 60% of observations (N=127) and a testing set made up 40% of observations (N=86).

Using Euclidean distance measurement, I conducted a series of k-means cluster analyses on the training dataset, specifying clusters between 1 through 15. For guidance on the number of potential clusters to interpret, I plotted the variance for each of the 15 cluster solutions in an elbow curve accounted for by the clusters (r-square). See Figure 1 below.

Figure 1: Selecting k with the Elbow Method

The Elbow Method diagram suggested that 3-clusters solutions can be interpreted.

Figure 2 below shows the result of an interpretation of a 3-cluster solution.

Figure 2: Scatterplot of Canonical Variables for 3 Clusters

Leveraging Canonical Discriminant Analysis, I reduced the variables from 15 to a few that has outsize contribution to the variance of the clustering variables. Figure 2 above is the scatter plot of the first two canonical variables.

The means of the clustering variables revealed that compared to other clusters, countries in cluster 0(green) are relatively developed economies as evidenced by higher levels of per capita income, alcohol consumption, rate of growth of a country's armed forces, breast cancer infection per 100 people, carbon dioxide emission, employment rate, female employment rate, internet use rate, life expectancy, oil use per person, suicide per 100 people, employment rate and urbanization rate. However, they have low HIV infection rate and are not a democracies.

Cluster 1(red) has both high HIV rate of infection and suicide rate per 100 person but low life expectancy and urbanization rate. Also, they have high employment in general but a particularly high female employment rate. This is a fledgling democracy.

Cluster 2(blue) is a fully fledged democracy but a developing economy nonetheless. It is least developed of the three clusters. The relative contribution by the economic variables of interest are non existent to a degree. They have relative low income per person, low employment rate, low urbanization rate and low life expectancy rate.

To externally validate clusters, I performed an Analysis of Variance(ANOVA) to test for significance differences between the three clusters on per person electricity consumption(kWh). Using Tukey test, I performed a post-hoc comparison between the three clusters. My test results indicated a statistically significance difference between the clusters on RELECTRICPERPERSON (F(2,125) = 23.31, p < 0.001). See Figure 3 below

Figure 3: OLS Regression Analysis(Training Dataset)

                               OLS Regression Results
==============================================================================
Dep. Variable:     RELECTRICPERPERSON   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     23.31
Date:                Sat, 27 Feb 2016   Prob (F-statistic):           2.54e-09
Time:                        16:24:12   Log-Likelihood:                -1071.6
No. Observations:                 127   AIC:                             2149.
Df Residuals:                     124   BIC:                             2158.
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept        1605.9341    164.969      9.735      0.000      1279.415 1932.453
C(cluster)[T.1] -1364.0482    221.964     -6.145      0.000     -1803.377 -924.719
C(cluster)[T.2] -1542.8442    292.156     -5.281      0.000     -2121.103 -964.586
==============================================================================
Omnibus:                      118.652   Durbin-Watson:                   2.178
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1454.624
Skew:                           3.287   Prob(JB):                           0.00
Kurtosis:                      18.221   Cond. No.                         3.83
==============================================================================

The Tukey post-hoc comparison revealed a significant difference between clusters on RELECTRICPERPERSON with the exception that there no significance differences between clusters 1 and 2. See Figure 4 below.

Figure 4: Tukey Multiple Comparison of Means (Training Dataset)

Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff      lower            upper            reject
---------------------------------------------------------------------------
0           1      -1364.0482    -1890.6291    -837.4672     True
0           2      -1542.8442    -2235.9471    -849.7413     True
1           2       -178.796         -850.6163     493.0242    False
-------------------------------------------------------------------------
Countries in cluster 0 have the largest per person electricity consumption globally ( mean = 1605.934133, s.d = 1780.616595) while countries in cluster 2 have the lowest per person electricity consumption globally (mean = 63.089933 , s.d = 295.918015).