Principal Components Analysis

Figure: Principal components (eigenvectors) representing the axes of inertia of the data point structure. Ref.: PCA

A common method of factor analysis of the first type (size plus shape) is Principal Components Analysis (PCA), a concept first introduced by Karl Pearson in 1901 and later extended by Harold Hotelling around 1933.

Usually the method is applied to a table in which the rows represent the objects (or subjects) and the columns represent the variables (measurements). A multivariate situation arises when there are more than two or three variables. The variables can be expressed in different units, such as in a table describing height, weight, age and blood pressure of a group of people. The data can be represented geometrically such that each object is a point in a space where each coordinate axis represents a particular variable. There are as many dimensions in such a representation as there are variables in the table. Often the situation is multidimensional and, hence, difficult or even impossible to visualize.

Fortunately, the variables are often correlated (such as in the case of height, weight and age) and this points to a degree of redundancy in the data. In such a case the number of "true" dimensions (or factors) is less than the number of observed variables in the data table. This is the Platonic view that behind the apparent and fuzzy complexity of the world one may find simple and pure structures. This is the idea of factor analysis in general, and Principal Components Analysis in particular.

Standard PCA applies standardization (centering about zero mean and normalization to unit variance) of the variables (columns) to the data table as a first step.
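As a minimal sketch of this standardization step, the Python fragment below centers each column about zero and scales it to unit variance. The function name `standardize` and the small data table are hypothetical, chosen only for illustration. The sketch divides by n when computing the variance; dividing by n - 1 is an equally common convention.

```python
from statistics import mean, pstdev

def standardize(table):
    """Center each column to zero mean and scale it to unit variance."""
    columns = list(zip(*table))
    means = [mean(col) for col in columns]
    # Population standard deviation (divides by n); this is an assumption,
    # the n - 1 convention would work equally well.
    sds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in table
    ]

# Hypothetical table: rows are people, columns are height (cm),
# weight (kg) and age (years).
table = [
    [170.0, 65.0, 30.0],
    [180.0, 80.0, 45.0],
    [160.0, 55.0, 25.0],
    [175.0, 72.0, 38.0],
]
z = standardize(table)
```

After this step the columns are unit-free, so variables measured in centimeters, kilograms and years can be compared on an equal footing.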

The next step is to extract the factors from the standardized data table (by means of a mathematical method called "eigenvector extraction"). The computed factors are referred to as the "principal components" of the data. They can be seen as the axes of inertia (or symmetry) of the geometrical structure that results from representing the objects as points in the space of the variables.
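To make "eigenvector extraction" concrete, here is a sketch that uses power iteration with deflation on a correlation matrix. This is only one simple way to obtain the eigenvectors and is not necessarily the method used in any particular PCA program; in practice a library routine such as `numpy.linalg.eigh` would normally be preferred. The function names and the example correlation matrix are hypothetical.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dominant_eigenpair(m, iterations=500):
    """Power iteration for a symmetric positive semi-definite matrix."""
    n = len(m)
    # Asymmetric start vector; the method fails if the start happens to be
    # exactly orthogonal to the dominant eigenvector.
    v = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def principal_components(corr, k):
    """Return the k largest (eigenvalue, eigenvector) pairs of corr."""
    n = len(corr)
    m = [row[:] for row in corr]
    pairs = []
    for _ in range(k):
        lam, v = dominant_eigenpair(m)
        pairs.append((lam, v))
        # Deflation: subtract the component just found so that the next
        # power iteration converges to the following eigenvector.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

# Hypothetical correlation matrix of three standardized variables.
corr = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
pairs = principal_components(corr, 3)
```

Each eigenvector gives the direction of one principal component in the space of the variables, and the matching eigenvalue is the variance of the data along that direction. For a correlation matrix the eigenvalues add up to the number of variables.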

The result of PCA is that the objects are now described by the principal components (or factors) rather than by the original variables in the data table. Usually the number of relevant principal components is much smaller than the number of variables, and a substantial reduction of the number of dimensions can be achieved. In the case of two or three relevant principal components it is possible to represent the objects and variables in Cartesian diagrams in which the axes represent the principal components.
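One way to judge how many principal components are "relevant" is to look at the share of the total variance carried by each eigenvalue. The sketch below assumes the eigenvalues are already available. The 80% cutoff and the eigenvalues themselves are made-up illustrations, not values prescribed by the method.

```python
from itertools import accumulate

def explained_variance(eigenvalues):
    """Fraction of the total variance carried by each component."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def components_needed(eigenvalues, threshold=0.8):
    """Smallest number of components whose cumulative share of the
    variance reaches the threshold (an arbitrary, assumed cutoff)."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = accumulate(explained_variance(ordered))
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues for four standardized variables (they sum to 4).
eigenvalues = [2.4, 1.2, 0.3, 0.1]
shares = explained_variance(eigenvalues)
k = components_needed(eigenvalues)
```

With these made-up eigenvalues the first two components carry about 90% of the variance, so the four original dimensions could reasonably be reduced to two.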

If one is interested in the correlations between the variables, the variables (columns) are displayed as points in a "loadings plot", in which the axes are the principal components. Alternatively, if the objective is to find clusters or groupings of the objects, the objects (rows) are represented as points in a "scores plot", in which the axes are also the principal components. The relationship between the objects (rows) and the variables (columns) of the table can also be made visible by means of a "biplot", which shows both in the same diagram.
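The coordinates used in these plots can be sketched as follows. The example assumes two standardized variables with correlation 0.8, for which the eigenvalues (1.8 and 0.2) and eigenvectors are known in closed form. Loadings are taken here as eigenvector entries scaled by the square root of the eigenvalue, which is one common convention; some texts call the unscaled eigenvectors loadings. All names and data are hypothetical.

```python
import math

def scores(z, eigenvectors):
    """Coordinates of each object (row) on each principal component."""
    return [
        [sum(x * w for x, w in zip(row, v)) for v in eigenvectors]
        for row in z
    ]

def loadings(eigenvalues, eigenvectors):
    """Coordinates of each variable (column) on each component:
    eigenvector entries scaled by sqrt(eigenvalue) (assumed convention)."""
    n_vars = len(eigenvectors[0])
    return [
        [math.sqrt(lam) * v[j] for lam, v in zip(eigenvalues, eigenvectors)]
        for j in range(n_vars)
    ]

# Hypothetical standardized table: two variables with correlation 0.8.
z = [
    [-1.0, -1.4],
    [-1.0, -0.2],
    [1.0, 0.2],
    [1.0, 1.4],
]
s = 1.0 / math.sqrt(2.0)
eigenvalues = [1.8, 0.2]             # 1 + r and 1 - r for r = 0.8
eigenvectors = [[s, s], [s, -s]]     # one eigenvector per component

object_points = scores(z, eigenvectors)               # scores plot
variable_points = loadings(eigenvalues, eigenvectors) # loadings plot
```

Plotting `object_points` gives a scores plot and plotting `variable_points` gives a loadings plot; drawing both sets of points against the same pair of axes gives a biplot.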

A search on Google for “Principal Components Analysis” yielded about 400,000 hits (in October, 2005).


December 19, 2005         Date last modified: September 6, 2006