Result Section
As part of the exploratory data analysis, I created two figures to understand the data.
The first is Fig 1: Pair plot for Adjusted National Income Per Capita (x11_2012) and multiple explanatory variables. It shows the relationship between the response variable and a sample of the explanatory variables. The plot shows that the response variable, Adjusted National Income Per Capita, is not normally distributed, whereas Urban Population Growth (x284_2012) appears to be.
There also appears to be a perfect negative correlation between Rural Population (x258_2012) and Urban Population (x283_2012). A positive linear correlation exists between Female Survival to Age 65 (x274_2012) and Male Survival to Age 65 (x275_2012).
The second is Fig 2: Heat Map for Adjusted National Income Per Capita and multiple explanatory variables, which quantifies the relationships seen in the pair plot.
Pearson's correlation coefficient is -1 between Rural Population (x258_2012) and Urban Population (x283_2012), and 0.93 between Female Survival to Age 65 (x274_2012) and Male Survival to Age 65 (x275_2012).
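A coefficient of exactly -1 is what you would expect if the two indicators are complementary shares of the same total. The sketch below illustrates this with made-up numbers; it assumes that both indicators are expressed as a percentage of total population (the post only states this for the urban indicator), and the array names are hypothetical.

```python
import numpy as np

# Hypothetical urban population shares (% of total) for five countries
urban_pct = np.array([20.0, 35.5, 50.0, 74.2, 90.1])
# Assumption: the rural share is the complement, so the two always sum to 100
rural_pct = 100.0 - urban_pct

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(rural_pct, urban_pct)[0, 1]
print('Pearson r between rural and urban shares: %.4f' % r)
```

If that assumption holds for the World Bank indicators, the two columns carry the same information, so at most one of them is useful as a predictor.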
Fig 3 plots the regression coefficients retained by the LASSO regression analysis.
A total of 14 (16%) of the 86 features were retained.
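To illustrate how LASSO "retains" features, here is a minimal sketch on synthetic data, not the World Bank data. The feature names, sample size, and true coefficients are invented for the example; only `LassoLarsCV` and the 9-fold setting come from the code in this post.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 10 candidate features, only the first 3 drive y
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Standardize so the coefficients are comparable across features
X_std = StandardScaler().fit_transform(X)
y_std = (y - y.mean()) / y.std()

# Cross-validated LASSO (LARS solver) with 9 folds
model = LassoLarsCV(cv=9).fit(X_std, y_std)

# A feature is "retained" when its coefficient was not shrunk to exactly zero
feature_names = ['feature_%d' % i for i in range(X.shape[1])]
retained = [(name, coef) for name, coef in zip(feature_names, model.coef_) if coef != 0]
retained.sort(key=lambda pair: pair[1], reverse=True)

print('%d of %d features retained' % (len(retained), len(feature_names)))
for name, coef in retained:
    print('%-10s %+.4f' % (name, coef))
```

The L1 penalty sets some coefficients to exactly zero, which is why the retained set can be read directly off `coef_`.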
Fig 4 plots the mean squared error (MSE) on each cross-validation fold. The MSE declines from about 1 to a low of about 0.1 and then starts to increase. The minimum occurs at about 2.5 on the horizontal axis, which the plotting code labels -log(alpha).
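The low point of such a curve can also be located numerically rather than read off the plot. The following is a rough sketch on synthetic data, assuming a recent scikit-learn release where the per-fold errors are exposed as `mse_path_`; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic data (hypothetical): 8 features, two of which drive the response
rng = np.random.RandomState(42)
X = rng.normal(size=(180, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=1.0, size=180)

model = LassoLarsCV(cv=9).fit(X, y)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = model.mse_path_.mean(axis=-1)
best = int(np.argmin(mean_mse))
best_alpha = model.cv_alphas_[best]

print('folds:', model.mse_path_.shape[1])
print('lowest fold-averaged MSE: %.3f' % mean_mse[best])
print('alpha at the minimum: %.5f (model.alpha_ = %.5f)' % (best_alpha, model.alpha_))
if best_alpha > 0:
    # Same transformation as the horizontal axis of the MSE plot
    print('position on the -log10(alpha) axis: %.2f' % -np.log10(best_alpha))
```

The alpha found this way should coincide with the dashed "alpha CV" line in the plot, since the estimator selects the alpha that minimizes the fold-averaged error.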
Table 1 lists all the retained features.
Fig 2: Heat Map for Adjusted National Income Per Capita and multiple explanatory variables
Fig 3: Regression Coefficients for Lasso Paths
Fig 4: Mean squared error on each fold
Table 1: Retained Variables
| Variables | Description | Regression Coefficient |
| x149_2012 | Health Expenditure Per Capita(Current US$) | 0.629593475 |
| x142_2012 | GDP Per Capita(Current US$) | 0.172713173 |
| x49_2012 | Automated Teller Machines(Per 100,000 adults) | 0.081700721 |
| x218_2012 | Population Ages 65 and above(% of total) | 0.073893377 |
| x242_2012 | Private Credit Bureau Coverage(% of adults) | 0.045980527 |
| x275_2012 | Male Survival to ages 65 (% of Cohort) | 0.026677315 |
| x283_2012 | Urban Population(% of total) | 0.015867453 |
| x161_2012 | Industry Value Added(% of GDP) | 0.008073948 |
| x220_2012 | Population Growth(Annual %) | 0.007546031 |
| x25_2012 | Adolescent Fertility Rate(births per 1,000 women ages 15-19) | -0.004927315 |
| x86_2012 | Commercial Bank Branches(per 100,000 adults) | -0.032404314 |
| x153_2012 | Household Final Consumption Expenditure(% of GDP) | -0.041665547 |
| x223_2012 | Female Population(% of total) | -0.149783923 |
| x219_2012 | Population Density(People per sq. km of land area) | -0.150375913 |
Model Evaluation
MSE train: 0.083, MSE test: 0.070
R^2 train: 0.917, R^2 test: 0.930
The model appears robust: the MSE on the test data (0.070) is lower than the MSE on the training data (0.083). The R-squared values show that the retained explanatory variables explain about 93% of the variability in the response variable, Adjusted National Income Per Capita, in the test dataset, compared with about 92% in the training dataset. The retained explanatory variables are Health Expenditure Per Capita, GDP Per Capita, Automated Teller Machines (per 100,000 adults), Population Ages 65 and Above, Private Credit Bureau Coverage, Male Survival to Age 65, Urban Population, Industry Value Added, Population Growth, Adolescent Fertility Rate (births per 1,000 women ages 15-19), Commercial Bank Branches (per 100,000 adults), Household Final Consumption Expenditure, Female Population, and Population Density (people per sq. km of land area).
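Note that in each split the reported MSE and R-squared add up to 1 (0.083 + 0.917 and 0.070 + 0.930). That is expected when the response has been standardized to unit variance on the same data being scored, because R^2 = 1 - MSE / Var(y). A small sketch with made-up numbers, not the World Bank data, shows the identity:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical response values and noisy predictions
rng = np.random.RandomState(1)
y = rng.normal(loc=50.0, scale=10.0, size=100)

# Standardize with the population standard deviation (ddof=0), as StandardScaler does
y_std = (y - y.mean()) / y.std()
y_pred = y_std + rng.normal(scale=0.3, size=100)

mse = mean_squared_error(y_std, y_pred)
r2 = r2_score(y_std, y_pred)

# With unit-variance y, R^2 = 1 - MSE, so the two metrics carry the same information
print('MSE: %.3f, R^2: %.3f, sum: %.3f' % (mse, r2, mse + r2))
```

So the two metrics are not independent pieces of evidence here; either one determines the other.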
Code
import pandas as pd
import os
import numpy as np
# bug fix for display format to avoid run time errors
pd.set_option('display.float_format', lambda x: '%f' % x)
# Set pandas to display all columns and rows in DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
##############################################################################
# DATA MANAGEMENT
##############################################################################
# define method to load data of interest
def load_data(data_dir, csv_file):
    DATA_PATH = os.path.join(os.getcwd(), data_dir)
    DATA_FILE = os.path.join(DATA_PATH, csv_file)
    data = pd.read_csv(DATA_FILE, low_memory=False)
    return data
# loading data
worldbank_df = load_data('data', 'worldbank.csv')
# Deleting regional and non-classified country data
worldbank_df = worldbank_df.drop([39, 59, 60, 69, 70, 71, 75, 92, 93, 94, 95, 120, 121, 123, 130, 131, 132, 147, 148, 165, 168, 169, 171, 211, 212])
worldbank_df.drop('country', axis = 1, inplace=True)
print(worldbank_df.columns.values)
worldbank_df.dtypes
# Will need to add the column names after imputing missing values
index = ['x1_2012', 'x2_2012', 'x9_2012', 'x11_2012', 'x12_2012', 'x14_2012',
'x15_2012','x16_2012', 'x18_2012', 'x19_2012', 'x21_2012', 'x25_2012',
'x29_2012', 'x31_2012', 'x35_2012', 'x36_2012', 'x37_2012', 'x38_2012',
'x45_2012', 'x47_2012', 'x48_2012', 'x49_2012', 'x58_2012', 'x67_2012',
'x68_2012', 'x69_2012', 'x86_2012', 'x100_2012', 'x121_2012', 'x125_2012',
'x126_2012', 'x129_2012', 'x131_2012', 'x132_2012', 'x134_2012', 'x139_2012',
'x140_2012', 'x142_2012', 'x143_2012', 'x146_2012', 'x149_2012', 'x150_2012',
'x153_2012', 'x154_2012', 'x155_2012', 'x156_2012', 'x157_2012', 'x161_2012',
'x162_2012', 'x163_2012', 'x167_2012', 'x169_2012', 'x171_2012', 'x172_2012',
'x173_2012', 'x174_2012', 'x179_2012', 'x187_2012', 'x190_2012', 'x191_2012',
'x192_2012', 'x195_2012', 'x204_2012', 'x205_2012', 'x211_2012', 'x212_2012',
'x213_2012', 'x218_2012', 'x219_2012', 'x220_2012', 'x221_2012', 'x222_2012',
'x223_2012', 'x242_2012', 'x243_2012', 'x244_2012', 'x253_2012', 'x255_2012',
'x258_2012', 'x261_2012', 'x268_2012', 'x274_2012', 'x275_2012', 'x277_2012',
'x283_2012', 'x284_2012', 'x9_2013', 'x11_2013', 'x12_2013', 'x14_2013',
'x15_2013', 'x16_2013', 'x18_2013', 'x19_2013', 'x21_2013', 'x25_2013',
'x29_2013', 'x31_2013', 'x35_2013', 'x36_2013', 'x41_2013', 'x42_2013',
'x45_2013', 'x47_2013', 'x48_2013', 'x49_2013', 'x58_2013', 'x86_2013',
'x100_2013', 'x121_2013', 'x125_2013', 'x126_2013', 'x129_2013', 'x131_2013',
'x132_2013', 'x134_2013', 'x139_2013', 'x140_2013', 'x142_2013', 'x143_2013',
'x146_2013', 'x149_2013', 'x150_2013', 'x153_2013', 'x154_2013', 'x155_2013',
'x156_2013', 'x157_2013', 'x161_2013', 'x162_2013', 'x167_2013', 'x169_2013',
'x171_2013', 'x172_2013', 'x173_2013', 'x174_2013', 'x187_2013', 'x190_2013',
'x191_2013', 'x192_2013', 'x204_2013', 'x211_2013', 'x213_2013', 'x216_2013',
'x218_2013', 'x219_2013', 'x220_2013', 'x221_2013', 'x222_2013', 'x223_2013',
'x242_2013', 'x243_2013', 'x244_2013', 'x255_2013', 'x258_2013', 'x261_2013',
'x267_2013', 'x268_2013', 'x274_2013', 'x275_2013', 'x283_2013','x284_2013']
# Changing column names to human readable ones
columns = {'x1_2012' : 'Access to Electricity(% pop)', 'x2_2012' : 'Access to Non Solid Fuel(% pop)',
'x9_2012' : 'Net National Income(US$) ', 'x11_2012' : 'Net National Income Per Capita(US$)',
'x12_2012' : 'CO2 Damage(% GNI)', 'x14_2012' : 'Cons. of fixed capital(% GNI)',
'x15_2012' : 'Cons fixed capital(US$)', 'x16_2012' : 'Education Expenditure(% GNI)',
'x18_2012' : 'Energy Depletion(% GNI)', 'x19_2012' : 'Energy Depletion(US$)',
'x21_2012' : 'Nat. Resource Depletion(% GNI)', 'x25_2012' : 'Fertility Rate',
'x29_2012' : 'Age Dependency Ratio(% working-age pop)', 'x31_2012' : 'Agric Land(% land area)',
'x35_2012' : 'Agric Value Added(% GDP)', 'x36_2012' : 'Agric Value Added(Annual % Growth)',
'x37_2012' : 'Air Transport: Passengers Carried', 'x38_2012' : 'Air Transport: Registered Carrier Departures Worldwide',
'x45_2012' : 'Arable Land(% of land area)', 'x47_2012' : 'Armed Forces Personnel(% of total labor force)',
'x48_2012' : 'Total Armed Forces Personnel', 'x49_2012' : 'ATMS(per 100K adults)', 'x58_2012' : 'Crude birth rate(per 1K ppl)',
'x67_2012' : 'Death: various causes', 'x68_2012' : 'Death: injury causes(% total)',
'x69_2012' : 'Death: Non communicable disease(% of total)', 'x86_2012' : 'Commercial bank branches(per 100K adults)',
'x100_2012' : 'Crude death rate(per 1K ppl)', 'x121_2012' : 'Exports of goods and service(% of GDP)',
'x125_2012' : 'Total fertility rate', 'x126_2012' : 'Fixed broadband subscr(per 100 ppl)',
'x129_2012' : 'Food prod. index(2004-2006=100)', 'x131_2012' : 'Foreign drct. investment, inflows(% of GDP)',
'x132_2012' : 'Foreign direct investment, net inflows(BOP, US$)', 'x134_2012' : 'Forest Area(% land area)',
'x139_2012' : 'GDP at market prices(US$)', 'x140_2012' : 'GDP Growth(annual %)', 'x142_2012' : 'GDP per capita(US$)',
'x143_2012' : 'GDP per capita growth(annual %)', 'x146_2012' : 'Gross domestic savings(% of GDP)',
'x149_2012' : 'Health exp. per capita(US$)', 'x150_2012' : 'Total health exp(% of GDP).',
'x153_2012' : 'Hshld final consumption expd.(% of GDP)', 'x154_2012' : 'Imports: goods & services(% of GDP)',
'x155_2012' : 'Imprvd Sanitation facilities(% pop w/ access)', 'x156_2012' : 'Imprvd water sources(% pop w/ access)',
'x157_2012' : 'Incidence of TB(per 100K ppl)', 'x161_2012' : 'Industry: value added(% of GDP)',
'x162_2012' : 'Inflation(annual %)', 'x163_2012' : 'Intentional homicides(per 100K ppl)',
'x167_2012' : 'Internet Users(per 100 ppl)', 'x169_2012' : 'Female labor force(% total labor force)',
'x171_2012' : 'Female life expectancy at birth(years)', 'x172_2012' : 'Male life expectancy at birth(years)',
'x173_2012' : 'Total life expectancy at birth(years)', 'x174_2012' : 'Lifetime risk of maternal death(%)',
'x179_2012' : 'Manufg. value added(% of GDP)', 'x187_2012' : 'Mobile cellular subscr.(per 100 ppl)',
'x190_2012' : 'Infant mortality rate(per 1K live births)', 'x191_2012' : 'Neonatal mortality rate(per 1K live births)',
'x192_2012' : 'Under-5 mortality rate(per 1K)', 'x195_2012' : 'Net Migration', 'x204_2012' : 'OOP health exp.(% total health expd.)',
'x205_2012' : 'Female primary education(%)', 'x211_2012' : 'Paid personal remit(US$)', 'x212_2012' : 'Rcvd personal remit(% of GDP)',
'x213_2012' : 'Rcvd personal remit(US$)', 'x218_2012' : 'Pop. ages 65+(% of Total)', 'x219_2012' : 'Pop. density',
'x220_2012' : 'Pop. growth(annual %)', 'x221_2012' : 'Pop. ages 0-14(% of total)', 'x222_2012' : 'Pop. ages 15-64(% of total)',
'x223_2012' : 'Female pop.(% total)', 'x242_2012' : 'Pvt CR bureau coverage(% adults)', 'x243_2012' : 'Prop. of women natl. plmt',
'x244_2012' : 'Public CR bureau coverage(% adults)', 'x253_2012' : 'Renewable Elect. output',
'x255_2012' : 'Freshwater resources per capita(cubic meters)', 'x258_2012' : 'Rural pop', 'x261_2012' : 'Secure internet servers(per 1M ppl)',
'x268_2012' : 'Surface area(sq. km)', 'x274_2012' : 'Female survival age 65(% cohort)', 'x275_2012' : 'Male survival age 65(% cohort)',
'x277_2012' : 'Protected areas: terrestrial & marine(% territorial area)',
'x283_2012' : 'Urban population(% total)', 'x284_2012' : 'Urban pop growth(%)'}
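# Imputing missing values with the column median
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='median')
worldbank_df = imr.fit_transform(worldbank_df)
worldbank_df = pd.DataFrame(worldbank_df)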
# populating the column headings
worldbank_df.columns = index
print(worldbank_df.columns.values)
# Data: 2012
X = worldbank_df.iloc[:, :86]
print(X.columns.values)
# renaming the columns
X = X.rename(index=str, columns=columns)
print(X.columns.values)
sample_cols = ['Education Expenditure(% GNI)', 'Fertility Rate',
'Foreign direct investment, net inflows(BOP, US$)', 'Fixed broadband subscr(per 100 ppl)',
'Health exp. per capita(US$)', 'Female labor force(% total labor force)', 'Infant mortality rate(per 1K live births)',
'Paid personal remit(US$)', 'Female survival age 65(% cohort)', 'Male survival age 65(% cohort)',
'Imports: goods & services(% of GDP)', 'Public CR bureau coverage(% adults)', 'Net National Income Per Capita(US$)']
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook', font_scale=1.5)
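# Create pairwise scatter plot (the `size` argument was renamed `height` in seaborn 0.9)
sns.pairplot(X[sample_cols], height=5)
# Title the pair plot figure itself; calling plt.title() beforehand would title a separate empty figure
plt.suptitle('Pairwise Plot for Net National Per Capita Income(US$) and explanatory variables', y=1.02)
plt.show()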
cm = np.corrcoef(X[sample_cols].values.T)
sns.set(font_scale=1.0)
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 11.0},
yticklabels=X[sample_cols].columns.values,
xticklabels=X[sample_cols].columns.values)
plt.title('Heat Map for Net National Per Capita Income(US$) and explanatory variables')
plt.show()
y = X['Net National Income Per Capita(US$)']
X.drop('Net National Income Per Capita(US$)', axis=1, inplace=True)
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
#LASSO Regression
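# Standardization of features and response
# Note: fit_transform is also called on the test split below, so the test data are
# scaled with their own mean and variance rather than the training statistics.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.fit_transform(X_test)
# StandardScaler expects 2-D input, so reshape the response and flatten the result
y_train_std = sc.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_std = sc.fit_transform(y_test.values.reshape(-1, 1)).ravel()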
from sklearn.linear_model import LassoLarsCV
lcv = LassoLarsCV(cv=9, precompute=False)
lcv.fit(X_train_std, y_train_std)
lcv_dict = dict(zip(X_train.columns.values,lcv.coef_))
import operator
sorted(lcv_dict.items(), key=operator.itemgetter(-1), reverse=True)
# plot coefficient progression
m_log_alphas = -np.log10(lcv.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, lcv.coef_path_.T)
plt.axvline(-np.log10(lcv.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.legend()
plt.title('Regression Coefficients for Lasso Paths')
# plot mean squared error for each fold
# (cv_mse_path_ was renamed mse_path_ in later scikit-learn releases)
m_log_alphascv = -np.log10(lcv.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, lcv.mse_path_, ':')
plt.plot(m_log_alphascv, lcv.mse_path_.mean(axis=-1), color='k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(lcv.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# Model Evaluation: MSE
y_train_pred_std = lcv.predict(X_train_std)
y_test_pred_std = lcv.predict(X_test_std)
# Residuals Plot
plt.scatter(y_train_pred_std,
y_train_pred_std - y_train_std,
c='black',
marker='o',
s=35,
alpha=0.5,
label='Training data')
from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, MSE test: %.3f' % (mean_squared_error(y_train_std, y_train_pred_std), mean_squared_error(y_test_std, y_test_pred_std)))
# R^2
from sklearn.metrics import r2_score
print('R^2 train: %.3f, R^2 test: %.3f' % (r2_score(y_train_std, y_train_pred_std), r2_score(y_test_std, y_test_pred_std)))
# Pairplot and heatmap for retained features
retained_features = ['Health exp. per capita(US$)', 'GDP per capita(US$)', 'ATMS(per 100K adults)', \
'Pop. ages 65+(% of Total)' , 'Pvt CR bureau coverage(% adults)', 'Male survival age 65(% cohort)',\
'Urban population(% total)', 'Industry: value added(% of GDP)', 'Pop. growth(annual %)',\
'Fertility Rate', 'Commercial bank branches(per 100K adults)', 'Hshld final consumption expd.(% of GDP)',\
'Female pop.(% total)','Pop. density']
sns.pairplot(X[retained_features], height=5)
plt.show()
ret_feat = np.corrcoef(X[retained_features].values.T)
sns.set(font_scale=1.0)
hm = sns.heatmap(ret_feat,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 10.5},
yticklabels=X[retained_features].columns.values,
xticklabels=X[retained_features].columns.values)
plt.title('Heat Map of LASSO Regression Retained Features')
plt.show()
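One thing to keep in mind about the listing above is that it calls `fit_transform` on the test split, which rescales the test data using its own mean and variance. A common alternative is to fit the scalers on the training split only and reuse them on the test split. The sketch below shows that pattern on synthetic data; the variable names and data are hypothetical, and the metrics reported in this post were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical) standing in for the World Bank features and response
rng = np.random.RandomState(123)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Separate scalers for features and response, both fit on the training split only
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

# The test split is transformed with the training mean and variance
X_train_std = x_scaler.transform(X_train)
X_test_std = x_scaler.transform(X_test)
y_train_std = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_std = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

print('train feature means:', X_train_std.mean(axis=0).round(3))
print('test feature means: ', X_test_std.mean(axis=0).round(3))
```

With this approach the test-set means are generally close to, but not exactly, zero, which reflects that the test data played no part in choosing the scaling.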



