Principal component analysis (PCA) of human security

   

Abstract

This research continues the work (Zgurovsky M.Z., Gvishiani A.D., 2008), in which a list of the ten most essential global threats to the future of mankind was presented. The initial data on each threat are taken from the databases of reputable international organizations. We then used cluster analysis to determine the summarized impact of the ten global threats on different countries, with the purpose of selecting groups of countries with "close" levels of summarized threats. A foresight of future global conflicts was carried out using a Minkovsky-type metric. To simplify the analysis we use Principal Component Analysis (PCA), which allows variables with many properties to be reduced to several hidden factors. The analysis shows that currently the most considerable threats for most countries are the reduction of energy security, the worsening balance between biocapacity and human demands, and income inequality between people and countries.

Keywords

Global threats, global conflicts, Minkovsky metric, cluster analysis, principal component analysis, energy security, biocapacity, income inequality.

1. Introduction

In the work (Zgurovsky M.Z., Gvishiani A.D., 2008) the impact of systemic world conflicts on sustainable development is studied in the global context. Based on an analysis of data on global conflicts from 705 B.C. to the present, the regularity of their course is determined. It is shown that the sequence of life cycles of systemic world conflicts follows the Fibonacci series, and that the intensity of these conflicts, depending on the level of technological evolution of society, grows according to a hyperbolic law. Using the revealed regularities, we attempt to foresee the upcoming world conflict, called “the conflict of the XXI century”, and analyze its nature and principal characteristics: duration, main phases of its course, and intensity.

The set of main global threats generating the conflict of the XXI century is given. These global threats are: ES – Energy Security; FB – Footprint and Biocapacity Balance; GINI – Incomes Inequality; GD – Global Diseases; CM – Child Mortality; CP – Corruption Perception; WA – Water Access; GW – Global Warming; SF – State Fragility; ND – Natural Disasters. Using cluster analysis we determine the impact of the above threats on different countries and on twelve large groups of countries (civilizations according to Huntington) united by common cultural features. Assumptions are made about possible scenarios during the conflict of the XXI century and after its termination.

Since it is difficult to analyze the security of a given country simultaneously in the space of ten global threats, we use Principal Component Analysis (PCA) to make the research more convenient and illustrative. This method makes it possible to reduce the analysis of many properties to a few hidden factors determining these properties. The security of a country may then be presented in a simplified form, not by all ten global threats but by a few most significant factors.

2. Application of the principal component method for the analysis of the impact of global threats totality on sustainable development

The example of sustainable development global simulation (System Analysis and Decisions, The example of sustainable development global simulation, 2009) presents global threats and the degree of their impact on different countries. Let us arrange the table of these data in the form of the initial data matrix \normalsize X^m_N, N=106, m=10, in such a way that its rows \normalsize X_i, i=\bar {1,N} correspond to the analyzed countries, and its columns \normalsize X^j, j=\bar {1,m} contain the values of the threats (indicators) \normalsize PX_k, k=\bar {1,m}. Then each country has a corresponding vector \normalsize X_i=\langle x_i^1, x_i^2, \ldots , x_i^m \rangle of threat values (the upper index is the threat’s ordinal number).

The purpose of this study, conducted with the principal component method, is to find and interpret latent common factors while minimizing both their number and the degree of dependence of \normalsize PX_i on their specific residual random components. Suppose that each threat \normalsize PX_i is the result of the impact of \normalsize m' hypothetical common factors and one characteristic factor (Lindsay I. Smith, 2002): \normalsize PX_i=\large \sum_{j=1}^{m'} \normalsize q_j^i \cdot F_j + e_i, i=\bar {1,m} , where \normalsize q_j^i – factor loadings; \normalsize F_j – factors to be defined; \normalsize e_i – characteristic factor for the i-th initial feature, an independent random value with zero mathematical expectation and finite variance.

The expression for \normalsize PX_i may be presented in matrix form:

\normalsize X_N^m=V \cdot Q^T +E (1)

where \normalsize V – matrix of factor scores; \normalsize Q – matrix of factor loadings; \normalsize E – matrix of residuals.

The search for principal components reduces to finding a decomposition of the matrix \normalsize X_N^m in the form (Lindsay I. Smith, 2002): \normalsize X_N^m=T \cdot P^T +E, where \normalsize T – matrix of scores of dimension \normalsize N \times m'\left(m'\leq m\right). Each row of this matrix is the projection of a data vector \normalsize X_i on the \normalsize m' principal components. The number of rows \normalsize N corresponds to the number of vectors of the initial data. The number of columns, i.e. the number of principal component vectors selected for projection, is equal to \normalsize m'. \normalsize P – loadings matrix of dimension \normalsize m \times m', where \normalsize m – number of rows (data space dimension); \normalsize m' – number of columns (number of principal component vectors selected for projection); \normalsize E – matrix of residuals.

The matrix of scores defines a set of vectors \normalsize T=\langle t_i^j \rangle , i=\bar {1,N}, j=\bar {1,m'}, which are the projections of the vectors \normalsize X_i^j, i=\bar {1,N}, j=\bar {1,m} onto the principal components space (the number of components is equal to \normalsize m'\leq m). The matrix of loadings defines the mapping of the initial space basis into the principal components space. The principal component method allows us to find a mapping \normalsize R^m \longrightarrow^{f} R^{m'} such that \normalsize m'\leq m and \normalsize \sum_i \sum_j e_{ij}^2 \rightarrow \min over all possible \normalsize T and \normalsize P (Lindsay I. Smith, 2002).

Finding the principal components involves calculating the eigenvectors of the covariance matrix (Lindsay I. Smith, 2002), (Strang, Gilbert, 2006), defined as:

\normalsize C=\left( c_{ij}:c_{ij} = cov \left( PX_i,PX_j \right) \right), i=\bar {1,m}, j=\bar {1,m}, (2)

where
\normalsize cov \left( PX_i,PX_j \right) =\frac{\sum_{k=1}^N {(x_k^i-\bar X^i) \cdot (x_k^j-\bar X^j)}}{N-1} – covariance of parameters \normalsize PX_i and \normalsize PX_j.
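As an illustration of formula (2), the following minimal Python sketch computes the sample covariance matrix of a small data matrix and then approximates its largest eigenvalue and the corresponding eigenvector by power iteration. The toy data, the function names `covariance_matrix` and `leading_eigenpair`, and the use of power iteration are assumptions made for this example only; the source does not state which numerical procedure was used.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of formula (2), with divisor N - 1."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(m)
        ]
        for i in range(m)
    ]


def leading_eigenpair(matrix, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration."""
    m = len(matrix)
    v = [1.0] * m  # arbitrary starting vector (assumption)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(v[i] * cv[i] for i in range(m))  # Rayleigh quotient
    return eigenvalue, v


# Hypothetical toy data: 4 "countries" (rows) and 2 "threats" (columns).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
C = covariance_matrix(data)
eigenvalue, direction = leading_eigenpair(C)
print(C)
print(eigenvalue, direction)
```

Because the second toy column is exactly twice the first, the covariance matrix has rank one and all of the variance lies along the first principal direction.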

For selection of sufficient number \normalsize m'\leq m of principal components a cumulative variance is often used (Jambu, M., 1991):

\normalsize D_i=\frac{\sum_{j=1}^{i} \lambda_j}{m},i=\bar {1,m}, (3)

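where \normalsize \lambda_j,j=\bar {1,m}, – eigenvalues of the covariance matrix \normalsize C. Dividing by \normalsize m presupposes standardized indicators, for which the eigenvalues sum to \normalsize m; this agrees with Table 1, where the first eigenvalue of about 5.07 corresponds to about 50.7% of the total variance.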

Preliminary analysis of principal components is given in Table 1.

Table 1. Analysis of principal components

Eigenvalues. Extraction: Principal components

| Value | Eigenvalue | % Total variance | Cumulative eigenvalue | Cumulative % |
|---|---|---|---|---|
| 1 | 5.065629 | 50.65629 | 5.065629 | 50.65629 |
| 2 | 1.331475 | 13.31475 | 6.397103 | 63.97103 |
| 3 | 1.065071 | 10.65071 | 7.462175 | 74.62175 |

Figure 1. Defining principal components by using the scree (“slide rocks”) criterion

We define the sufficient number of principal components by using the scree (“slide rocks”) criterion suggested by (Cattell, R. B, 1966). Scree is a geological term for rock debris accumulated at the foot of a rocky slope. Using this analogy, the eigenvalues presented in Table 1 can be shown graphically (Figure 1). One looks for the place in the plot where the decrease of eigenvalues from left to right slows down the most. It is assumed that only “factorial scree” lies to the right of this point. In accordance with this criterion only 2 or 3 factors should be retained.

As seen from the above data, the first three principal components (the eigenvalues corresponding to them are indicated in red) are sufficient to account for more than 74% of the data variability.
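The cumulative variance of formula (3) can be checked against Table 1 with a short Python sketch. The eigenvalues and \normalsize m=10 are taken from the text; the function name `cumulative_variance` is introduced only for this illustration.

```python
def cumulative_variance(eigenvalues, m):
    """Formula (3): D_i = (lambda_1 + ... + lambda_i) / m."""
    running_total = 0.0
    shares = []
    for lam in eigenvalues:
        running_total += lam
        shares.append(running_total / m)
    return shares


# First three eigenvalues reported in Table 1; m = 10 threats.
eigenvalues = [5.065629, 1.331475, 1.065071]
shares = cumulative_variance(eigenvalues, m=10)
for index, share in enumerate(shares, start=1):
    print(f"D_{index} = {100 * share:.2f}%")
```

The third value reproduces, up to rounding, the cumulative percentage reported for three components in Table 1.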

Definition of factor loadings

Now let us analyze the principal components and consider the solution with three factors. For this we consider the correlations between the threats and the factors (or “new” variables), which are calculated by the formula (Harman H.H, 1966):

r_{k,l}=\frac{\sum_{i=1}^{N} \left(x_i^k-\bar X^k \right) \cdot \left(x_i^l-\bar X^l \right)}{\sqrt{\sum_{i=1}^N {\left(x_i^k-\bar X^k \right)}^2} \cdot \sqrt {\sum_{i=1}^N {\left(x_i^l-\bar X^l \right)}^2}} (4)

where \normalsize r_{k,l} –  correlation coefficient of parameters \normalsize X^l and \normalsize X^k;

\normalsize \bar {X^l}, \bar {X^k} – average values of parameters \normalsize X^l and \normalsize X^k;

\normalsize \bar {X^l}=\frac{\sum_{i=1}^N x_i^l}{N}, \bar {X^k}=\frac{\sum_{i=1}^N x_i^k}{N}.

The correlation coefficient itself has no direct informal interpretation. However, its square, called the coefficient of determination, shows to what extent variations of a dependent characteristic may be explained by variations of an independent one. Correlation coefficients whose absolute value is more than 0.7 are considered to indicate a strong connection (in this case the coefficient of determination is about 50% or more, i.e. one characteristic determines the other by roughly half or more). Correlation coefficients whose absolute value is less than 0.7 but more than 0.5 indicate an average connection (the coefficient of determination is less than 50% but more than 25%). Finally, correlation coefficients whose absolute value is less than 0.5 indicate a weak connection (the coefficient of determination is less than 25%). Table 2 shows the values of the correlation coefficients between the principal factors and the initial threats. The coefficients corresponding to strong connections are indicated in red.
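A minimal Python sketch of formula (4) together with the strength bands described above. The toy data and the function names `pearson` and `strength` are assumptions for illustration; the text does not say how values exactly equal to 0.7 or 0.5 are treated, so the boundary handling here is also an assumption.

```python
import math


def pearson(x, y):
    """Correlation coefficient of formula (4)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


def strength(r):
    """Verbal band for a correlation coefficient (boundary cases are an assumption)."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "strong"
    if magnitude > 0.5:
        return "average"
    return "weak"


# Hypothetical values of two indicators for five countries.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson(x, y)
print(r, r * r, strength(r))
```

Here the squared value `r * r` is the coefficient of determination referred to in the text.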

Table 2. Correlation coefficients between principal factors and initial threats

Factor Loadings (Unrotated). Extraction: Principal components. Marked loadings (absolute value > 0.7) are shown with *.

| Variable | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|
| ES | 0.208964 | 0.817502* | 0.342974 |
| FB | -0.855800* | 0.412124 | 0.053021 |
| GINI | -0.355499 | 0.105301 | -0.716591* |
| CP | -0.856876* | 0.248258 | -0.03646 |
| NA | -0.809616* | -0.315140 | 0.210144 |
| GW | 0.723432* | -0.392527 | -0.006533 |
| CM | -0.844045* | -0.267343 | -0.024123 |
| ND | -0.326707 | -0.285766 | 0.615743 |
| SF | -0.899250* | -0.086816 | -0.005283 |
| GD | -0.788874* | -0.080839 | -0.084617 |
| Expl. Var | 5.065629 | 1.331475 | 1.065071 |
| Prp. Totl | 0.506563 | 0.133147 | 0.106507 |

Figure 2. Interpretation of threats in coordinates of principal components

Table 2 shows that the first factor correlates with the threats to a greater extent than the second and third factors. This is to be expected since, as mentioned above, the factors are defined sequentially and each contains less of the total variance than the previous one.

Interpretation of factor structure

It is convenient to carry out interpretation of factors (principal components) by using a diagram where threats are shown as vectors the coordinates of which correspond to factor loadings (Figure 2). 

In accordance with their maximum factor loadings, the threats may be divided into three categories (red, blue and green colours). The first group of threats includes: FB, CP, SF, GD, NA, CM, GW. As seen in Figure 2, these threats lie in the plane of the first and second factors. For a more detailed analysis it is therefore advisable to show them in the projection on this plane (Figure 3).

Figure 3. Projection of threats on the plane of the first and second factors  

As seen from Figure 3, the pairs of vectors SF-GD and FB-GW are practically collinear, which indicates a high degree of dependence within each pair. Interestingly, if only two factors are considered, the pair of vectors CP-GINI may also be considered collinear. It should also be noted that the vector ES is close to orthogonal to FB (and GW).

It means that:

  • the dependence between the level of energy security (ES) and both the balance of the Earth’s biological capacity and people’s needs (FB) and CO2 emissions (GW) is inconsiderable;
  • the balance between the Earth’s biological capacity and people’s needs (FB) is negatively correlated with CO2 emissions (GW);
  • the level of state fragility (SF) is closely connected with the level of vulnerability to global diseases (GD);
  • the corruption perception index (CP) is closely connected with the level of inequality between people and countries (GINI) in the context determined by the first and second factors.
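The geometric statements above can be checked with a short Python sketch that computes the cosine of the angle between loading vectors in the plane of the first and second factors. The loadings are copied from Table 2; the function name `cosine` and the idea of using the cosine as a collinearity measure are assumptions for this illustration, not necessarily the authors' procedure.

```python
import math

# (Factor 1, Factor 2) loadings from Table 2.
loadings = {
    "ES": (0.208964, 0.817502),
    "FB": (-0.855800, 0.412124),
    "GINI": (-0.355499, 0.105301),
    "CP": (-0.856876, 0.248258),
    "GW": (0.723432, -0.392527),
    "SF": (-0.899250, -0.086816),
    "GD": (-0.788874, -0.080839),
}


def cosine(u, v):
    """Cosine of the angle between two vectors: 1 or -1 means collinear, 0 orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for first, second in [("SF", "GD"), ("FB", "GW"), ("CP", "GINI"), ("ES", "FB")]:
    print(first, second, round(cosine(loadings[first], loadings[second]), 3))
```

With these loadings the SF-GD and CP-GINI cosines come out very close to 1, the FB-GW cosine very close to -1 (collinear but oppositely directed, hence the negative correlation), and the ES-FB cosine is roughly 0.2.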

Figure 4. Definition of most significant global threats

 

The most significant global threats are defined by using the factor loadings of the initial list of threats. For this it is necessary to select the threats which have the maximum loading by absolute value on the first, second and third factors. This choice ensures that the selected threats have the maximum impact on the aggregated indicator (Minkovsky norm) of these threats under the condition of their maximum independence (Figure 4).

In accordance with the indicated approach such threats are SF, ES, GINI (Figure 4), i.e. the most significant threats in descending order are state fragility, global decrease of energy security and growing inequality between people and countries.
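This selection rule can be reproduced from the Table 2 loadings with a few lines of Python. The dictionary below copies the table; the function name `most_significant` is introduced only for this sketch.

```python
# (Factor 1, Factor 2, Factor 3) loadings from Table 2.
loadings = {
    "ES": (0.208964, 0.817502, 0.342974),
    "FB": (-0.855800, 0.412124, 0.053021),
    "GINI": (-0.355499, 0.105301, -0.716591),
    "CP": (-0.856876, 0.248258, -0.03646),
    "NA": (-0.809616, -0.315140, 0.210144),
    "GW": (0.723432, -0.392527, -0.006533),
    "CM": (-0.844045, -0.267343, -0.024123),
    "ND": (-0.326707, -0.285766, 0.615743),
    "SF": (-0.899250, -0.086816, -0.005283),
    "GD": (-0.788874, -0.080839, -0.084617),
}


def most_significant(table, factor_count=3):
    """For each factor, return the threat with the largest absolute loading."""
    return [
        max(table, key=lambda name: abs(table[name][factor]))
        for factor in range(factor_count)
    ]


result = most_significant(loadings)
print(result)  # ['SF', 'ES', 'GINI']
```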

Clustering of countries by the level of global threats, and the corresponding graphic interpretation, is done in the plane of the first and second factors. For this purpose we cluster countries by the degree of their remoteness from threats (Minkovsky norm) using the K-means clustering method.
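For readers unfamiliar with K-means, the following is a minimal sketch of the standard algorithm (alternating assignment and centre update) on two-dimensional points. The sample points, the number of clusters and the naive initialization from the first k points are assumptions for illustration; the source does not specify the initialization or the number of clusters used.

```python
def kmeans(points, k, iterations=100):
    """Plain K-means with squared Euclidean distance."""
    centers = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers


# Hypothetical factor scores (Factor 1, Factor 2) of four countries.
scores = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, centers = kmeans(scores, k=2)
print(labels)
print(centers)
```

In practice the result of K-means depends on the initial centres, so several random restarts are commonly used.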

Figure 5. Interpretation of global threats in the plane of the first and second factors
 
 

As seen from Figure 5, the isolines which define the approximation of the Minkovsky norm are practically orthogonal to the first factor axis. This gives grounds to state that the values of the first factor mostly determine the countries’ remoteness from global threats.

3. Researching the dependence of countries’ national security on particular threats by using modified method of weighted local correlation

Let us consider that the quantitative value of Minkovsky norm for this or that country is an estimate of its national security level. We define the level of Minkovsky norm dependence on initial threats by calculating the corresponding correlation coefficients (Table 3):

Table 3. Correlation coefficients between Minkovsky norm and global threats

Correlations. Marked correlations are significant at p<0.05. N=105 (casewise deletion of missing data).

| Variable | ES | FB | GINI | CP | NA | GW | CM | ND | SF | GD |
|---|---|---|---|---|---|---|---|---|---|---|
| Minkovsky norm | -0.16 | 0.80 | 0.31 | 0.82 | 0.83 | -0.54 | 0.83 | 0.49 | 0.89 | 0.78 |

The calculated correlation coefficients show a high degree of dependence of the Minkovsky norm on the initial threats, but they do not show which threats pose the greatest risk to particular countries. The reason is that the correlation coefficients are averaged over the entire data sample.

For a detailed analysis of the global threats that countries may face, it is necessary to localize the sample on which the correlation is estimated. It is natural to assume that this sample should include “similar” countries, whose degree of similarity may be estimated as, for example, the Euclidean distance in the space of threats. The second assumption is that the closer a country is to the point at which the correlation is analyzed, the greater the impact of that country’s indicators on the correlation coefficient.

In accordance with the above assumptions we define the weighted mean (A MATLAB Toolbox for computing Weighted Correlation Coefficients, 2008) as:

m(X,W)=\frac{\sum_i {w_i x_i}}{\sum_i w_i} (5)

where \normalsize X – data sample; \normalsize W – weight function.

If we define \normalsize W as a function of distance, for example

\normalsize W(x,t)=e^{-\lambda d(x,t)}, (6)

where \normalsize d(x,t) – distance between points \normalsize x,t \in R, and \normalsize \lambda – distribution parameter, and substitute it in (5), then we get the expression for the weighted localized mean at point \normalsize t for sample \normalsize X:

m(X,t)=\frac{\sum_i {e^{-\lambda d(t,x_i)} \cdot {x_i}}}{\sum_i e^{-\lambda d(t,x_i)}},x_i \in X (7)

Similarly, we can define the weighted localized covariance:

cov(X,Y,t)=\frac{\sum_i {e^{-\lambda d(t,x_i)} (x_i-m(X,t)) (y_i-m(Y,t))}}{\sum_i e^{-\lambda d(t,x_i)}} (8)

And we define the weighted localized correlation (WLC):

corr(X,Y,t)=\frac{cov(X,Y,t)}{\sqrt{cov(X,X,t) \cdot cov(Y,Y,t)}} (9)
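Formulas (5)-(9) can be sketched in Python as follows. In this sketch each observation has a position (used only for the distance to the point \normalsize t) and two scalar values; treating the positions as one-dimensional and equal to the first variable is a simplifying assumption, as are the toy data and all function names.

```python
import math


def local_weights(t, positions, lam, dist):
    """Weights of formula (6): w_i = exp(-lam * d(t, p_i))."""
    return [math.exp(-lam * dist(t, p)) for p in positions]


def weighted_mean(values, w):
    """Weighted mean of formulas (5) and (7)."""
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def weighted_cov(x, y, w):
    """Weighted localized covariance of formula (8)."""
    mx, my = weighted_mean(x, w), weighted_mean(y, w)
    return sum(wi * (a - mx) * (b - my) for wi, a, b in zip(w, x, y)) / sum(w)


def wlc(x, y, w):
    """Weighted localized correlation of formula (9)."""
    return weighted_cov(x, y, w) / math.sqrt(
        weighted_cov(x, x, w) * weighted_cov(y, y, w)
    )


def distance_1d(a, b):
    return abs(a - b)


# Hypothetical data: y rises with x in the left group and falls in the right group.
x = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [0.0, 1.0, 2.0, 12.0, 11.0, 10.0]

global_r = wlc(x, y, [1.0] * len(x))  # equal weights give the ordinary correlation
near_1 = wlc(x, y, local_weights(1.0, x, lam=2.0, dist=distance_1d))
near_11 = wlc(x, y, local_weights(11.0, x, lam=2.0, dist=distance_1d))
print(global_r, near_1, near_11)
```

The global coefficient is strongly positive, while the localized values differ in sign between the two groups, which is the kind of local structure that averaging over the whole sample hides.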

The distribution parameter of weights \normalsize \lambda may be chosen so as to restrict the influence of points located at large distances. For example, we assume that points located at the mean distance from the point where WLC is calculated have weight equal to \normalsize s (distribution scale), i.e.

e^{-\lambda(t)m(d_t)}=s, then \lambda (t)=-\frac{\ln{s}}{m(d_t)}, (10)

where \normalsize m(d_t) – mean distance from the sample points to point \normalsize t. For \normalsize 0<s\leq 1 we have \normalsize \ln s \leq 0, so that \normalsize \lambda(t) \geq 0. Examples of weight distributions for different values of mean distance and distribution scale are given in Figures 6, 7.

Figure 6. Weight distributions for mean distance equal to 0.5
Figure 7. Weight distributions for distribution scale equal to 0.1
 

With distribution scale equal to 1, WLC coincides with the Pearson product-moment correlation coefficient. As seen from (10), the weight distribution parameter is calculated for each point \normalsize t, which is a sample point, and for each new point the mean distance \normalsize m(d_t) is calculated anew. Hence, the suggested method of estimating the local dependence of threats is adaptive.
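The adaptive choice of the distribution parameter can be sketched as follows. Solving \normalsize e^{-\lambda(t)m(d_t)}=s for \normalsize \lambda(t) gives \normalsize \lambda(t)=-\ln s / m(d_t). The function name `decay_parameter`, the toy points, and the decision to include the point \normalsize t itself (distance 0) in the mean are assumptions for this illustration.

```python
import math


def decay_parameter(t, points, s, dist):
    """Solve exp(-lam * mean_distance) = s for lam at the point t."""
    mean_distance = sum(dist(t, p) for p in points) / len(points)
    return -math.log(s) / mean_distance


def distance_1d(a, b):
    return abs(a - b)


points = [0.0, 1.0, 2.0, 3.0]  # hypothetical one-dimensional sample
lam = decay_parameter(0.0, points, s=0.1, dist=distance_1d)
mean_distance = sum(distance_1d(0.0, p) for p in points) / len(points)

# A point at the mean distance receives weight s.
weight_at_mean = math.exp(-lam * mean_distance)

# With s = 1 the parameter is zero, so every weight equals 1
# and the weighted correlation reduces to the ordinary one.
lam_for_unit_scale = decay_parameter(0.0, points, s=1.0, dist=distance_1d)
print(lam, weight_at_mean, lam_for_unit_scale)
```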

The interpretation of WLC values is presented in Table 4.

Table 4. Interpretation of values of weighted localized correlation (WLC)

| Value of WLC | Behavior of global threats under study | Interpretation |
|---|---|---|
| [-1.0, -0.5) | High degree of negative correlation (more than 25%). The growth of one threat is connected with the reduction of the other. | With a decrease of a particular threat the general remoteness from the totality of global threats considerably decreases. The studied threat has a low (as compared to others) contribution to the general remoteness from global threats. |
| [-0.5, -0.3) | Mean degree of negative correlation (9-25%). The growth of one threat is connected with the reduction of the other. | With a decrease of a particular threat the general remoteness from the totality of global threats decreases to a moderate degree. |
| [-0.3, 0.3] | Low degree of correlation (less than 9%). | The dependence of the degree of remoteness from the totality of global threats on the studied threat is inconsiderable. |
| (0.3, 0.5] | Mean degree of positive correlation (9-25%). The growth of one threat is connected with the growth of the other. | With a decrease of the particular threat the general remoteness from global threats increases to a moderate degree. |
| (0.5, 1.0] | High degree of positive correlation (more than 25%). The growth of one threat is connected with the growth of the other. | With a decrease of the particular threat the general remoteness from the totality of global threats considerably increases. The studied threat considerably influences the general remoteness from the totality of global threats. |
Figure 8. Values of WLC between Minkovsky norm and state fragility (SF)
Figure 9. Values of WLC between Minkovsky norm and energy security (ES)
Figure 10. Values of WLC between Minkovsky norm and population inequality (Gini)
 

Figures 8-10 present the plotted values of weighted localized correlation (WLC) between Minkovsky norm and most significant threats, respectively: SF, ES and GINI.
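When reading the plotted values, the bands of Table 4 can be expressed as a small lookup function. The interval boundaries follow Table 4; the function name `interpret_wlc` and the short labels are assumptions for this sketch.

```python
def interpret_wlc(value):
    """Map a WLC value to the band of Table 4 that contains it."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    if value < -0.5:
        return "high negative"   # [-1.0, -0.5)
    if value < -0.3:
        return "mean negative"   # [-0.5, -0.3)
    if value <= 0.3:
        return "low"             # [-0.3, 0.3]
    if value <= 0.5:
        return "mean positive"   # (0.3, 0.5]
    return "high positive"       # (0.5, 1.0]


for sample in (-0.54, -0.16, 0.31, 0.49, 0.89):
    print(sample, interpret_wlc(sample))
```

The sample values are the global coefficients of Table 3 for GW, ES, GINI, ND and SF, used here only to exercise the function; localized values for individual countries would be classified in the same way.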

As seen from Figure 8 the level of state fragility (SF) for most countries has considerable impact on their national security.

As to the impact of energy security on the level of national security (Figure 9), the following groups of countries may be identified:

  • A group of countries with a high level of ES and high values of Minkovsky norm (Canada, Sweden, Norway, Australia), for which energy security considerably influences national security;
  • An adjacent group (Finland, New Zealand, Denmark, Switzerland, Netherlands, Austria, Luxembourg, Japan), for which a mean level of dependence between energy security and Minkovsky norm is observed;
  • A group of countries for which this dependence is weak;
  • A group of countries with a mean level of national security (Belarus, Israel, Thailand, Mexico, Jamaica, Jordan, Malaysia, Tunisia, Panama, Bosnia and Herzegovina, Vietnam, Brazil, Ukraine, Colombia, Republic of Korea), for which there exist threats more serious than energy security;
  • A group of countries with a low level of national security (Kenya, Zimbabwe, Cameroon, Cambodia, Zambia, Haiti, Turkmenistan, Nigeria), for which energy security and other threats are equally important;
  • A group of the most problematic countries (Ethiopia, Mozambique), where the level of energy security determines the level of national security to the least extent.

As to the impact of population inequality on national security (Figure 10), it is possible to identify a group of countries (Canada, Sweden, Norway, Australia, Finland, New Zealand, Denmark, Switzerland, Netherlands, Austria, Luxembourg, Japan, Ireland, France, Germany, Portugal, Slovenia, Belgium) for which a mean positive correlation between this threat and Minkovsky norm is observed. For the remaining countries this correlation is insignificant.

4. Conclusions

1. Since it is very complicated to analyze the security of a given country simultaneously in the space of ten global threats, principal component analysis (PCA) was used. This method allowed the ten global threats influencing the general level of national security (in the sense of Minkovsky norm) to be reduced to three hidden factors determining this characteristic. This approach considerably facilitated the study of national security, reducing it to an analysis in the space of three determining factors.

2. Using this method, a comprehensive study of the national security of different countries was carried out in the space of three determining factors. Factor loadings were defined by calculating the coefficients of correlation between the principal factors and the initial threats. Countries were clustered according to the level of global threats, and the three most significant threats influencing the national security of most countries were identified: state fragility (SF), energy security (ES) and people’s inequality (Gini). Graphic interpretation of global threats was done in the space of three principal components. The factor structure of threats was studied, and the degrees of dependence between the main groups were defined.

3. The method of weighted localized correlation was modified, which made it possible to study the dependence of the national security level (Minkovsky norm) on particular global threats. Using this method, the dependence between Minkovsky norm and the most significant threats, in particular state fragility (SF), energy security (ES) and people’s inequality (Gini), was analyzed in detail. Recommendations were made for different countries regarding strengthening their national security.

References