Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA
Summary
This analysis investigates US airline profit cycles from 1995 to 2020, replicating a k-means clustering experiment that combines principal component analysis (PCA) with system dynamics modeling. It demonstrates the geometric robustness of a six-cluster taxonomy: k-means in a 3-dimensional PC score space yields cluster assignments that are bit-for-bit identical to those from the original 7-dimensional raw-variable space. Kernel PCA with six different kernels, including a linear baseline, further confirms the six-cluster assignment in 2D. A 1D diagnostic shows that the linear kernel conflates the COVID-year cluster C_3 with the peak-profit cluster C_0, whereas the non-baseline kernels correctly shift C_3 to overlap the post-financial-crisis cluster C_5, indicating an intrinsically linear manifold. However, the silhouette criterion suggests that the dataset structurally supports only three clusters, not six, and that collinearity in the 7D raw space suppresses this signal.
Key takeaway
Data scientists performing cluster analysis on complex, high-dimensional datasets should not rely solely on initial cluster assignments from raw data. Validate the chosen number of clusters with criteria such as the silhouette score, especially after dimensionality reduction, because collinearity can mask the true structural support for clusters. Consider applying kernel PCA to diagnose the intrinsic linearity of the data manifold, which can refine cluster interpretations and prevent assigning specific data points, such as outlier events, to the wrong clusters.
Key insights
Airline profit cycle cluster assignments are geometrically robust across dimensionality reductions, but the silhouette criterion supports three clusters rather than the six in the taxonomy.
Principles
- Geometric robustness of cluster assignments can be maintained across dimensionality reductions.
- Collinearity in high-dimensional data can obscure true structural cluster counts.
- Kernel PCA can reveal intrinsic linearity or non-linearity of data manifolds.
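As a rough illustration of the kernel PCA principle, the sketch below embeds a synthetic 26 x 7 collinear dataset in 2D with several kernels and compares each first component with ordinary PCA. Everything here is an assumption for illustration: the data are simulated (not the airline data), and the kernel list and hyperparameters are arbitrary scikit-learn choices, since this summary does not name the six kernels used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the study's data): 26 "years" x 7 collinear
# variables generated from 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(26, 3))
mixing = np.array([[1, 0, 0, 1, 1, 0, 1],
                   [0, 1, 0, 1, 0, 1, 1],
                   [0, 0, 1, 0, 1, 1, 1]], dtype=float)
X = latent @ mixing + rng.normal(scale=0.05, size=(26, 7))
X_std = StandardScaler().fit_transform(X)

# Assumed kernels and hyperparameters; "linear" plays the baseline role.
kernels = {
    "linear": {},
    "poly": {"degree": 3},
    "rbf": {"gamma": 0.1},
    "cosine": {},
}
embeddings = {
    name: KernelPCA(n_components=2, kernel=name, **params).fit_transform(X_std)
    for name, params in kernels.items()
}

# Sanity check: linear kernel PCA reproduces ordinary PCA scores up to sign.
pca_scores = PCA(n_components=2).fit_transform(X_std)
linear_matches_pca = np.allclose(
    np.abs(embeddings["linear"]), np.abs(pca_scores), atol=1e-6
)

# Absolute correlation of each kernel's first component with linear PC1.
pc1_corr = {
    name: abs(np.corrcoef(emb[:, 0], pca_scores[:, 0])[0, 1])
    for name, emb in embeddings.items()
}
for name, corr in pc1_corr.items():
    print(f"{name:>7}: |corr with PC1| = {corr:.3f}")
```

If the nonlinear kernels track the linear baseline closely, that is at least consistent with a roughly linear structure; large departures would be a prompt to inspect which observations move, in the spirit of the 1D diagnostic described in the summary.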
Method
The study replicates k-means clustering in 7D raw, 3D PC, and 4D PC spaces, then applies kernel PCA with six kernels for nonlinearity checks, and uses the silhouette criterion for structural cluster validation.
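The comparison of raw-space and PC-space assignments can be sketched as follows. This is a minimal example under stated assumptions, not the study's pipeline: the data are synthetic (six separated groups in a 3-factor latent space mapped to 7 collinear variables), the variables are standardized before clustering, and agreement is measured with the adjusted Rand index, which is a choice made here because it ignores arbitrary label permutations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the airline data): 26 observations in 6 groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [6, 0, 0], [0, 6, 0],
                    [0, 0, 6], [6, 6, 0], [0, 6, 6]], dtype=float)
true_labels = np.arange(26) % 6
latent = centers[true_labels] + rng.normal(scale=0.3, size=(26, 3))

# Seven observed variables that are linear mixes of three latent factors,
# so the raw 7D space is strongly collinear.
mixing = np.array([[1, 0, 0, 1, 1, 0, 1],
                   [0, 1, 0, 1, 0, 1, 1],
                   [0, 0, 1, 0, 1, 1, 1]], dtype=float)
X = latent @ mixing + rng.normal(scale=0.05, size=(26, 7))
X_std = StandardScaler().fit_transform(X)


def cluster(data, k=6):
    """Run k-means with several restarts and return the label vector."""
    return KMeans(n_clusters=k, n_init=20, random_state=0).fit_predict(data)


labels_raw = cluster(X_std)

# Repeat the clustering in 3D and 4D PC score space and compare with 7D.
agreement = {}
for n_pc in (3, 4):
    pca = PCA(n_components=n_pc).fit(X_std)
    labels_pc = cluster(pca.transform(X_std))
    agreement[n_pc] = (
        pca.explained_variance_ratio_.sum(),
        adjusted_rand_score(labels_raw, labels_pc),
    )

for n_pc, (var, ari) in agreement.items():
    print(f"{n_pc} PCs: variance kept = {var:.4f}, ARI vs raw = {ari:.3f}")
```

An adjusted Rand index of 1 means the two partitions are identical up to relabeling. Agreement is expected when the retained components carry nearly all of the variance, because a full PCA rotation preserves Euclidean distances and truncation then discards very little.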
In practice
- Validate cluster assignments across different dimensional spaces (e.g., raw vs. PCA).
- Use kernel PCA to diagnose intrinsic data manifold linearity.
- Use the silhouette criterion to assess the structural validity of the chosen number of clusters.
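The silhouette check in the last item can be sketched as a sweep over candidate cluster counts in both the raw and the PC score space. The example is illustrative only: the data are synthetic with three planted groups (not the airline data), and the range of k and the k-means settings are assumptions. On clean synthetic data the two spaces may well agree, so this shows the procedure rather than reproducing the collinearity effect reported in the summary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the airline data): 26 observations, 3 groups,
# 7 collinear variables built from 3 latent factors.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [8, 0, 8], [0, 8, 8]], dtype=float)
latent = centers[np.arange(26) % 3] + rng.normal(scale=0.5, size=(26, 3))
mixing = np.array([[1, 0, 0, 1, 1, 0, 1],
                   [0, 1, 0, 1, 0, 1, 1],
                   [0, 0, 1, 0, 1, 1, 1]], dtype=float)
X = latent @ mixing + rng.normal(scale=0.05, size=(26, 7))
X_std = StandardScaler().fit_transform(X)


def silhouette_sweep(data, k_values=range(2, 8)):
    """Return {k: mean silhouette score} for k-means fits on `data`."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=20, random_state=0).fit_predict(data)
        scores[k] = float(silhouette_score(data, labels))
    return scores


scores_raw = silhouette_sweep(X_std)
scores_pc = silhouette_sweep(PCA(n_components=3).fit_transform(X_std))

# The k with the highest mean silhouette is the best-supported count.
best_k_raw = max(scores_raw, key=scores_raw.get)
best_k_pc = max(scores_pc, key=scores_pc.get)
print("best k (raw 7D):", best_k_raw)
print("best k (3D PCs):", best_k_pc)
```

Comparing the two score curves, not just the two maxima, is usually more informative: a flat curve in one space and a sharp peak in the other would indicate that the choice of space affects how clearly the cluster count is supported.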
Topics
- Airline Profit Cycles
- K-means Clustering
- Principal Component Analysis
- Kernel PCA
- Dimensionality Reduction
- Silhouette Criterion
Best for: Research Scientist, AI Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.