In this article, we explore clustering of time series data using principal component analysis (PCA) for dimensionality reduction and density-based spatial clustering of applications with noise (DBSCAN) for clustering. This technique identifies patterns in time series data, such as traffic flow in a city, without requiring labeled data. We use Intel® Extension for Scikit-learn* to accelerate performance.

Time series data often exhibit repetitive patterns caused by human behavior, machinery, or other measurable sources. Identifying these patterns manually can be challenging, but unsupervised learning approaches like PCA and DBSCAN let us discover them automatically.
Methodology
Data Generation
We generate synthetic waveform data to simulate time series patterns. The data consists of three distinct waveforms, each with added noise to simulate real-world variability. We use the scikit-learn* agglomerative clustering example authored by Gaël Varoquaux (figure 1), which is available under the BSD 3-Clause or CC0 license.
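The waveform-generation step can be sketched as follows, adapted from that scikit-learn example (the exact noise levels and phase offsets follow the original; the seed is ours for reproducibility):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples_per_class, n_features = 30, 2000
t = np.pi * np.linspace(0, 1, n_features)

def sqr(x):
    # square wave derived from a cosine
    return np.sign(np.cos(x))

X, y = [], []
for i, (phi, a) in enumerate([(0.5, 0.15), (0.5, 0.6), (0.3, 0.2)]):
    for _ in range(n_samples_per_class):
        phase_noise = 0.01 * rng.normal()
        amplitude_noise = 0.04 * rng.normal()
        additional_noise = 1 - 2 * rng.rand(n_features)
        # sparsify the additive noise: keep only rare large spikes
        additional_noise[np.abs(additional_noise) < 0.997] = 0
        X.append(12 * ((a + amplitude_noise)
                       * sqr(6 * (t + phi + phase_noise))
                       + additional_noise))
        y.append(i)

X, y = np.array(X), np.array(y)  # 90 noisy square waves in three classes
```

This yields the 90 samples with 2,000 features each that we cluster below.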
Figure 1. Code and plot generated by the author from scikit-learn agglomerative clustering algorithm developed by Gaël Varoquaux.
Accelerate PCA and DBSCAN with Intel® Extension for Scikit-learn*
Both PCA and DBSCAN can be accelerated via a patching scheme using Intel Extension for Scikit-learn. Scikit-learn is a Python* module for machine learning. Intel Extension for Scikit-learn is one of the AI Tools that seamlessly accelerates scikit-learn applications on Intel CPUs and GPUs in single- and multi-node configurations. This extension dynamically patches scikit-learn estimators to improve machine learning training and inference by up to 100x with equivalent mathematical accuracy (figure 2).
Figure 2. GitHub* repository for Intel Extension for Scikit-learn
Intel Extension for Scikit-learn uses the scikit-learn API and can be enabled from the command line or by modifying a couple of lines of your Python application prior to importing scikit-learn:
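In code, the patch must run before scikit-learn is imported. This is a minimal sketch; the try/except fallback is ours so the snippet still runs where the extension is not installed:

```python
# Patch scikit-learn with the Intel extension before any sklearn imports.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # subsequent sklearn imports use accelerated estimators
except ImportError:
    pass  # extension not installed; fall back to stock scikit-learn

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
```

From the command line, the extension can instead be enabled without code changes by running your script as `python -m sklearnex my_application.py`.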
Dimensionality Reduction with PCA
Before attempting to cluster 90 samples, each containing 2,000 features, we use PCA to reduce dimensionality while retaining 99% of the variance in the dataset:
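A sketch of this step follows; the input here is stand-in random data with the same 90 x 2,000 shape. Passing a float to n_components tells scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2000))  # stand-in for the 90 x 2,000 waveform matrix

# n_components=0.99 keeps just enough components for 99% of the variance
pca = PCA(n_components=0.99)
X_pca = pca.fit_transform(X)
```

With only 90 samples, PCA can return at most 90 components regardless of the 2,000 input features, so the reduction is substantial even before the variance threshold kicks in.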
We use a pairplot to look for visible clusters in the reduced data (figure 3):
Figure 3. Looking for clusters in the data after dimensionality reduction
Cluster with DBSCAN
Based on the pairplot, the first two components (PC0 and PC1) seem to separate the clusters well, so we use them for DBSCAN clustering. The pairplot also suggests an estimate for the DBSCAN eps parameter: we chose 50 because the PC1 versus PC0 panel shows that this is a reasonable separation distance between the observed clusters:
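A sketch of the clustering step on synthetic two-dimensional data (the blob layout is ours, chosen so that eps=50 is a plausible separation distance, as in the text):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# stand-in for the first two principal components:
# three blobs roughly 200 units apart, so eps=50 separates them cleanly
centers = np.array([[0.0, 0.0], [200.0, 0.0], [0.0, 200.0]])
X_2d = np.vstack([c + 10.0 * rng.normal(size=(30, 2)) for c in centers])

db = DBSCAN(eps=50, min_samples=5).fit(X_2d)
labels = db.labels_  # cluster index per sample; -1 marks noise points
```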
We can plot the clustered data to see how well DBSCAN has identified the clusters (figure 4):
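A plot along the lines of figure 4 can be produced with matplotlib (again on stand-in blob data; the axis names assume the first two principal components):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to display interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# stand-in for the first two principal components: three separated blobs
centers = np.array([[0.0, 0.0], [200.0, 0.0], [0.0, 200.0]])
X_2d = np.vstack([c + 10.0 * rng.normal(size=(30, 2)) for c in centers])
labels = DBSCAN(eps=50, min_samples=5).fit_predict(X_2d)

# color each point by its DBSCAN cluster label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20)
plt.xlabel("PC0")
plt.ylabel("PC1")
plt.title("DBSCAN clusters in PCA space")
plt.savefig("dbscan_clusters.png")
```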
Figure 4. Plot of clustered data generated using the previous code example
Compare to Ground Truth
As figure 4 shows, DBSCAN does a good job of finding plausible clusters, and the result compares well to the original ground truth data (figure 1). In this case, the clustering recovered the underlying patterns used to generate the data perfectly. By using PCA for dimensionality reduction and DBSCAN for clustering, we can effectively identify and label patterns in time series data. This approach reveals the underlying structure of the data without the need for labeled samples.