Data Analysis and Interpretation: Milestone Assignment 2: Methods

1. Sample

The dataset for my analysis comes from the World Bank. It consists primarily of country-specific socio-economic indicators compiled from internationally recognized sources. The data represent the most current and accurate global development data available. In addition to the national data, it includes regional and global aggregates.

After accounting for regional and global aggregates, the World Bank dataset consists of N = 233 countries and territories and more than 162 features covering the years 2012 and 2013.

2. Measures

The response variable is Adjusted Net National Income Per Capita (Current US$), and the explanatory variables are a mixture of economic, social and human development indicators. Some of them are:

Fixed Broadband Subscriptions (Per 100 People)
GDP Per Capita (Current US$)
Health Expenditure Per Capita (Current US$)
Adjusted Savings: Consumption of Fixed Capital (Current US$)
Manufacturing, Value Added (% of GDP)
Mobile Cellular Subscriptions (Per 100 People)
Adjusted Savings: Energy Depletion (Current US$)
Personal Remittances, Received (% of GDP)
Population Ages 65 and Above (% of Total)
Population Density (People per sq. km of Land Area)
Adjusted Savings: Natural Resources Depletion (% of GNI)
Population, Female (% of Total)
Private Credit Bureau Coverage (% of Adults)
Proportion of Seats Held by Women in National Parliaments (% Total)
Adolescent Fertility Rate (Births per 1,000 Women Ages 15-19)
Terrestrial and Marine Protected Areas (% of Territorial Area)
Agricultural Land (% of Land Area)
Agriculture, Value Added (Annual % Growth)

3. Analyses

As part of the exploratory data analysis, a pairwise scatter plot was created to visually detect outliers, inspect the distribution of the data, and examine the relationships between features. All missing values were imputed using the median value of the corresponding feature.
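The imputation step can be illustrated with a minimal sketch. It assumes the features are held in a 2-D NumPy array with NaN marking missing entries; the function name `impute_median` and the toy values are hypothetical and are not taken from the World Bank data. With pandas, `DataFrame.fillna` applied to the column medians would be a comparable approach.

```python
import numpy as np

def impute_median(X):
    """Replace NaNs in each column of a 2-D array with that column's median."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)      # per-feature median, ignoring NaNs
    rows, cols = np.where(np.isnan(X))     # positions of the missing entries
    X[rows, cols] = medians[cols]          # fill each gap with its column median
    return X

# Hypothetical toy data: 4 countries x 2 indicators, NaN marks a missing value.
toy = np.array([[1.0, 10.0],
                [np.nan, 30.0],
                [3.0, np.nan],
                [5.0, 20.0]])
filled = impute_median(toy)
```

The median is less sensitive to extreme values than the mean, which may matter for indicators with a few very large countries.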

Fig 1: Pairwise Scatter Plot of Features

For instance, Fig 1 suggests that Adjusted Savings: Consumption of Fixed Capital (% of GNI) [x14_2012] is approximately normally distributed. However, it is not clear whether a linear relationship exists between the response variable, Adjusted Net National Income Per Capita (Current US$) [x11_2012], and the sample explanatory variables.

To quantify the linear relationships between the sample features, a heat map was created from the correlation matrix, which contains the Pearson product-moment correlation coefficients (Pearson's r). Pearson's r measures the linear dependence between a pair of features and is bounded between -1 and 1.
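As a rough illustration of what the correlation matrix holds, the sketch below computes Pearson's r directly from its definition (covariance divided by the product of the standard deviations) and compares it with `numpy.corrcoef`. The data are synthetic and the names `pearson_r`, `a`, `b` and `c` are hypothetical stand-ins for real indicators.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                       # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Synthetic features: b depends linearly on a, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 2.0 * a + rng.normal(scale=0.5, size=50)
c = rng.normal(size=50)

features = np.column_stack([a, b, c])
corr = np.corrcoef(features, rowvar=False)  # 3 x 3 matrix, one row/column per feature
```

The matrix `corr` is symmetric with ones on the diagonal, and it is this kind of matrix that a heat map colors cell by cell.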

Fig 2: Heat Map of Correlation Matrix

From Fig 2, there appears to be some linear relationship between Adjusted Savings: Consumption of Fixed Capital (% of GNI) [x14_2012] and each of Access to Electricity (% of Population) [x1_2012], Access to Non-Solid Fuel (% of Population) [x2_2012], and Adjusted Savings: Consumption of Fixed Capital (Current US$) [x15_2012].

OLS Summary

OLS analysis shows that a relationship exists between Adjusted Net National Income Per Capita (Current US$) and the following explanatory variables, which are statistically significant or borderline significant at the 5% level:

Health Expenditure, Total (% of GDP) [beta = -5.79e-09, p-value = 0.032]
Personal Remittances, Paid (Current US$) [beta = -124.9623, p-value = 0.019]
Age Dependency Ratio (% of Working-Age Population) [beta = 557.8104, p-value = 0.019]
Agriculture, Value Added (% of GDP) [beta = -79.7164, p-value = 0.053]
Air Transport, Registered Departures Worldwide [beta = -0.0055, p-value = 0.057]
Automated Teller Machines (ATM) [beta = 23.9621, p-value = 0.014]
Birth Rate, Crude (Per 1,000 People) [beta = 641.0533, p-value = 0.002]
Fixed Broadband Subscriptions (Per 100 People) [beta = -115.1073, p-value = 0.026]
Manufacturing, Value Added (% of GDP) [beta = -126.7740, p-value = 0.028]
Mortality Rate, Infant (Per 1,000 Live Births) [beta = 242.0010, p-value = 0.056]
Out-of-Pocket Health Expenditure (% of Total Expenditure on Health) [beta = -51.7340, p-value = 0.016]
Population, Female (% of Total) [beta = -398.8621, p-value = 0.039]
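For readers who want to see where such beta values come from, here is a minimal least-squares sketch using only NumPy. It is not the code that produced the summary above: the data are synthetic, and `ols_fit`, `X_demo` and `y_demo` are hypothetical names. It returns the coefficients, their standard errors and the t-statistics; the p-values reported in an OLS summary are derived from those t-statistics using a t distribution with n minus k degrees of freedom.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares with an intercept.

    Returns (beta, standard_errors, t_statistics); index 0 is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])        # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares coefficients
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                    # observations minus parameters
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)            # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Synthetic example: y depends on the first feature only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.0 + 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
beta, se, t_stats = ols_fit(X_demo, y_demo)
```

In this toy case the t-statistic of the first feature is far larger in magnitude than that of the second, which is the pattern a small p-value reflects.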

For further analysis, I will train and test my models on the 2012 data. The 2012 data were randomly split into a test set containing 30% of the observations (N = 67) and a training set containing the remaining 70% (N = 156).
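A random 70/30 split of this kind can be sketched as follows. The total of 223 observations is simply the sum of the two reported set sizes (67 + 156); the function name `train_test_indices`, the seed, and the choice to round the test size up are assumptions made for illustration. scikit-learn's `train_test_split` is a common ready-made alternative.

```python
import numpy as np

def train_test_indices(n, test_fraction=0.3, seed=0):
    """Shuffle row indices 0..n-1 and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)                  # random order of the rows
    n_test = int(np.ceil(test_fraction * n))       # test size, rounded up
    return shuffled[n_test:], shuffled[:n_test]

# 223 = 67 + 156, the set sizes reported in the text.
train_idx, test_idx = train_test_indices(223)
```

Fixing the seed makes the split reproducible, so the same countries land in the test set on every run.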

My goal is to perform LASSO regression analysis to identify the economic and social indicators that contribute significantly to a country's economic growth and development.
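To show why LASSO is suited to this kind of variable selection, the sketch below implements a basic coordinate-descent solver for the objective (1/(2n))·||y - Xb||² + alpha·||b||₁ on standardized features. It is only an illustration under stated assumptions: the data are synthetic, the penalty `alpha = 0.5` is arbitrary, the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and no feature may be constant (standardization would divide by zero). In practice a library estimator such as scikit-learn's `LassoCV`, which chooses the penalty by cross-validation, would be a more typical choice.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it is within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """LASSO coefficients for standardized X and centered y (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, variance 1
    y = y - y.mean()
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            # After standardizing, X[:, j] @ X[:, j] / n equals 1,
            # so the update is just the soft-thresholded rho.
            beta[j] = soft_threshold(rho, alpha)
    return beta

# Synthetic data: only the first two of five features affect the response.
rng = np.random.default_rng(2)
X_sim = rng.normal(size=(150, 5))
y_sim = 4.0 * X_sim[:, 0] - 2.0 * X_sim[:, 1] + rng.normal(scale=0.5, size=150)
beta_hat = lasso_coordinate_descent(X_sim, y_sim, alpha=0.5)
```

The L1 penalty shrinks the two informative coefficients toward zero and sets the coefficients of the irrelevant features to exactly zero, which is what lets the method act as a feature selector.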

Saturday, April 9, 2016
