Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

Source: Machine Learning · Field: Finance & Economics — Economic Analysis & Policy, Data Science & Analytics, Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

This analysis examines US airline profit cycles from 1995 to 2020, replicating a k-means clustering experiment that combines principal component analysis (PCA) with system dynamics modeling. It demonstrates the geometric robustness of a six-cluster taxonomy: k-means in the 3-dimensional PC score space yields cluster assignments that are bit-for-bit identical to those from the original 7-dimensional raw-variable space. Kernel PCA with six kernels, including a linear baseline, further confirms the six-cluster assignment in 2D. A 1D diagnostic reveals that the linear kernel conflates the COVID-year cluster C_3 with the peak-profit cluster C_0, whereas the non-baseline kernels shift C_3 to overlap the post-financial-crisis cluster C_5, indicating an intrinsically linear manifold. However, the silhouette criterion suggests that the dataset structurally supports only three clusters, not six, and that collinearity in the 7D raw space suppresses this signal.

Key takeaway

Data scientists clustering complex, high-dimensional datasets should not rely solely on cluster assignments obtained from the raw data. Validate the chosen number of clusters with a criterion such as the silhouette score, especially after dimensionality reduction, because collinearity can mask the true structural support for clusters. Consider applying kernel PCA to diagnose the intrinsic linearity of the data manifold; this can refine cluster interpretations and prevent specific data points, such as outlier events, from being attributed to the wrong cluster.
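A minimal sketch of the silhouette check described in this takeaway, using scikit-learn. The data here is a synthetic stand-in generated with `make_blobs` (26 rows, 7 variables, 3 planted groups), not the airline dataset, and the helper name `silhouette_by_k`, the z-scoring step, and the choice of 3 retained components are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X, k_values, seed=0):
    """Return {k: mean silhouette score} for k-means fitted on X."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic placeholder data (NOT the airline data): 26 rows x 7 variables.
X_raw, _ = make_blobs(n_samples=26, n_features=7, centers=3, random_state=0)
X_std = StandardScaler().fit_transform(X_raw)      # assumption: z-scored inputs
X_pc = PCA(n_components=3).fit_transform(X_std)    # 3D PC score space

k_values = range(2, 8)
raw_scores = silhouette_by_k(X_std, k_values)
pc_scores = silhouette_by_k(X_pc, k_values)

# The k with the highest mean silhouette is the count the data best supports.
best_k_raw = max(raw_scores, key=raw_scores.get)
best_k_pc = max(pc_scores, key=pc_scores.get)
print("best k (raw 7D):", best_k_raw)
print("best k (3D PC scores):", best_k_pc)
```

Comparing the two score curves side by side is one way to see whether the reduced space favors a different cluster count than the raw space does.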

Key insights

Airline profit-cycle clustering is geometrically robust across dimensionality reductions, but the silhouette criterion supports three clusters rather than the six of the original taxonomy.
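One way to see why cluster assignments can survive the move to PC score space is the orthogonality of the PCA rotation. The short derivation below is a sketch under the assumption of centered data and Euclidean k-means; the symbols ($X_c$, $V$, $z_i$, $p$, $k$) are notation introduced here, not taken from the original article.

```latex
% Centered data X_c \in \mathbb{R}^{n \times p} with SVD X_c = U \Sigma V^\top,
% where V \in \mathbb{R}^{p \times p} is orthogonal. PC scores: z_i = V^\top x_i.
\begin{align}
\|z_i - z_j\|^2
  &= (x_i - x_j)^\top V V^\top (x_i - x_j)
   = \|x_i - x_j\|^2, \\
\|x_i - x_j\|^2
  &= \underbrace{\sum_{m=1}^{k} (z_{im} - z_{jm})^2}_{\text{retained PCs}}
   + \underbrace{\sum_{m=k+1}^{p} (z_{im} - z_{jm})^2}_{\text{discarded PCs}}.
\end{align}
```

With all $p = 7$ components retained, pairwise distances are unchanged. With $k = 3$, the distances differ only by the second sum, which is small when the discarded components carry little variance, as strong collinearity among the raw variables would imply.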

Method

The study replicates k-means clustering in the 7D raw-variable space and in 3D and 4D PC score spaces, applies kernel PCA with six kernels to check for nonlinearity, and uses the silhouette criterion to validate the cluster count structurally.
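The first two steps of this method can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data is a synthetic placeholder with built-in collinearity, the inputs are z-scored, agreement is measured with the adjusted Rand index (ARI, where 1.0 means identical partitions up to label permutation), and the four kernels shown are built-in scikit-learn options, since this summary does not name the six kernels the study used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, KernelPCA
from sklearn.metrics import adjusted_rand_score, pairwise_distances
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (NOT the airline data): 26 rows, 7 collinear
# variables generated from 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(26, 3))
mixing = rng.normal(size=(3, 7))
X = latent @ mixing + 0.01 * rng.normal(size=(26, 7))
X_std = StandardScaler().fit_transform(X)  # assumption: z-scored inputs


def kmeans_labels(Z, k=6, seed=0):
    """Fit k-means on Z and return the cluster label of each row."""
    return KMeans(n_clusters=k, n_init=20, random_state=seed).fit_predict(Z)


# Step 1: k-means in the raw 7D space and in the 3D PC score space.
labels_raw = kmeans_labels(X_std)
scores_3d = PCA(n_components=3).fit_transform(X_std)
ari_pc3 = adjusted_rand_score(labels_raw, kmeans_labels(scores_3d))

# Keeping all 7 components is a pure rotation, so pairwise distances match.
scores_7d = PCA(n_components=7).fit_transform(X_std)
rotation_preserves_distances = np.allclose(
    pairwise_distances(X_std), pairwise_distances(scores_7d)
)

# Step 2: 2D kernel PCA embeddings, with the linear kernel as the baseline.
embeddings = {}
ari_kpca = {}
for kernel in ["linear", "poly", "rbf", "cosine"]:
    Z = KernelPCA(n_components=2, kernel=kernel).fit_transform(X_std)
    embeddings[kernel] = Z
    ari_kpca[kernel] = adjusted_rand_score(labels_raw, kmeans_labels(Z))

print(f"ARI raw vs 3 PCs: {ari_pc3:.3f}")
for kernel, ari in ari_kpca.items():
    print(f"ARI raw vs 2D kernel PCA ({kernel}): {ari:.3f}")
```

If the nonlinear kernels reproduce the linear baseline's partition, that is consistent with a data manifold that is close to linear; a large drop in ARI for a nonlinear kernel would be a reason to look more closely at which points moved.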

Best for: Research Scientist, AI Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.