Unsupervised Learning

Principal component analysis (PCA) and K-means clustering

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Data Preparation

  • Clean the dataset by handling missing values

  • Scale/normalize the data

  • Check for outliers

  • Separate metadata from drug sensitivity values

! curl https://naicno.github.io/BioNT_Module2_handson/_downloads/d10b2868b0538b6d8e941c3b4de06a4c/BRCA_Drug_sensitivity_test_data.csv -o BRCA_Drug_sensitivity_test_data.csv
! ls -lh BRCA_Drug_sensitivity_test_data.csv
-rw-r--r-- 1 runner docker 18K Jun  5 12:07 BRCA_Drug_sensitivity_test_data.csv
data = pd.read_csv('BRCA_Drug_sensitivity_test_data.csv')
data.head()
Patient ID bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
0 TCGA-D8-A1JU 14.821152 14.286200 14.608184 4.960653 13.530885 13.202820 13.879099 12.839373 14.103941 ... 14.778758 12.541663 12.113460 12.364817 8.913646 14.860838 13.170200 13.116395 14.548597 11.835573
1 TCGA-AC-A3QQ 14.304843 14.333355 14.960661 3.552977 13.122806 13.129905 13.911525 12.124210 14.127482 ... 15.108698 10.936052 11.802052 11.893495 8.326862 15.265313 15.341146 13.509974 14.311074 10.772699
2 TCGA-C8-A12Q 14.658052 14.649223 14.547810 3.467945 13.438683 13.441228 13.823993 12.935296 14.052508 ... 15.012134 12.837987 11.944058 11.694113 8.622052 15.205567 14.039701 13.252381 14.496764 11.387851
3 TCGA-AR-A1AY 14.342911 13.657840 14.426898 1.800027 13.242654 13.094447 13.531232 11.987928 14.046877 ... 14.811534 11.098317 11.480102 11.379716 8.080488 14.906361 13.462326 13.030773 14.466555 9.665608
4 TCGA-A8-A0A2 14.655920 14.323148 14.527811 4.491210 13.293485 13.054068 13.807173 12.446720 14.012040 ... 14.816099 12.075367 11.878005 11.800027 8.595326 14.988143 14.074985 12.959710 14.435152 11.143563

5 rows × 51 columns

print(f"Dataset shape: {data.shape}")
print(f"Patient IDs: {data['Patient ID'].nunique()} unique values")
print(f"Features: {data.shape[1]-1} drug sensitivity measurements")
Dataset shape: (25, 51)
Patient IDs: 25 unique values
Features: 50 drug sensitivity measurements

Dimensionality of the datasets

Notes: High-dimensional data

  • High-dimensional data refers to datasets where the number of features or attributes (dimensions, denoted as p) is significantly large

  • In such datasets, each observation can be thought of as residing in a high-dimensional space

  • The definition of “high-dimensional” data can vary depending on the context, the field of study, and the specific analysis being performed.

Dataset with 51 columns (drug compounds) and 25 rows (patients):

  • Contains more features (50 drug sensitivity scores) than samples (25 patients)

  • Difficulty in visualization: Identifying patterns visually becomes impractical once the data has more than a few dimensions

  • Sparsity: With 50 features but only 25 samples, data points are scattered across a vast 50-dimensional space

  • Distance metrics become less meaningful: In high dimensions, the difference between the nearest and farthest neighbors becomes less significant - nearly all points appear similarly distant from each other (see the short sketch after this list)

  • Overfitting risk: A model has many potential combinations of features to consider but only a few examples to learn from, so it has enough capacity to fit even the noise in the limited training data, leading to overfitting and poor performance
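
To make the distance point above concrete, here is a small illustrative sketch (not part of the original analysis): it compares the ratio of the farthest to the nearest pairwise distance for 25 random points in 2 dimensions versus 50 dimensions; the ratio is typically far smaller in 50 dimensions, meaning the points look almost equally far apart.

# Illustrative sketch: distance concentration in high dimensions
rng = np.random.default_rng(0)
for n_dims in (2, 50):
    points = rng.normal(size=(25, n_dims))                     # 25 random "samples"
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dists = dists[np.triu_indices(25, k=1)]                    # unique pairwise distances
    print(f"{n_dims:>2} dimensions: farthest/nearest distance ratio = {dists.max() / dists.min():.1f}")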

Clean the dataset by handling missing values

# Check for missing values
data.isnull().sum().sum()
np.int64(0)
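
This dataset has no missing values, but if it did, one simple option (a sketch of a common strategy, not a step taken in this notebook) would be to impute each drug's missing entries with that drug's median sensitivity:

# Hypothetical handling of missing values (a no-op here, since there are none)
if data.isnull().sum().sum() > 0:
    data = data.fillna(data.median(numeric_only=True))
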
# Check data types
print("\nData types:")
print("Datatypes of first 10 columns:", data.dtypes[:10])
print("Different datatypes in the dataframe:", data.drop(columns=["Patient ID"]).dtypes.unique())
Data types:
Datatypes of first 10 columns: Patient ID              object
bendamustine           float64
ML320                  float64
BRD-K14844214          float64
leptomycin B           float64
Compound 23 citrate    float64
BRD4132                float64
dabrafenib             float64
necrosulfonamide       float64
PF-543                 float64
dtype: object
Different datatypes in the dataframe: [dtype('float64')]
# Basic statistics for drug sensitivity values
data.describe(include='all').T.sort_values('mean', ascending=False).head(10)
count unique top freq mean std min 25% 50% 75% max
myricetin 25.0 NaN NaN NaN 15.126359 0.043691 15.053067 15.095327 15.120994 15.151308 15.219127
BRD-A02303741 25.0 NaN NaN NaN 15.058701 0.163287 14.70386 14.96314 15.036734 15.205567 15.398952
Compound 1541A 25.0 NaN NaN NaN 14.907687 0.136504 14.724434 14.811534 14.879828 14.981904 15.318321
carboplatin 25.0 NaN NaN NaN 14.711186 0.135093 14.459443 14.623642 14.671899 14.799047 14.999075
BRD-K27188169 25.0 NaN NaN NaN 14.624495 0.086419 14.428273 14.601246 14.623367 14.67901 14.801653
BRD-K14844214 25.0 NaN NaN NaN 14.610919 0.1589 14.393664 14.519765 14.597586 14.681205 15.019018
pifithrin-mu 25.0 NaN NaN NaN 14.589748 0.267786 14.076076 14.384124 14.600936 14.790655 15.095189
RO4929097 25.0 NaN NaN NaN 14.559876 0.181499 13.979993 14.480718 14.59874 14.66368 14.924847
RG-108 25.0 NaN NaN NaN 14.447688 0.218108 13.95736 14.283119 14.4542 14.592225 14.81662
bendamustine 25.0 NaN NaN NaN 14.447378 0.341383 13.492308 14.304843 14.46329 14.658052 15.100501
data.describe(include='all').T.sort_values('mean', ascending=False).tail(10)
count unique top freq mean std min 25% 50% 75% max
nakiterpiosin 25.0 NaN NaN NaN 10.90167 0.831147 8.701056 10.381886 11.26099 11.37343 11.92009
YM-155 25.0 NaN NaN NaN 10.166452 0.883099 7.774662 9.441895 10.42738 10.774703 11.398507
NVP-BEZ235 25.0 NaN NaN NaN 9.83664 0.38227 8.935818 9.616815 9.876253 10.104948 10.476573
belinostat 25.0 NaN NaN NaN 9.719361 0.630235 8.049804 9.406909 9.762903 9.957621 11.005536
pluripotin 25.0 NaN NaN NaN 9.469802 0.687683 7.503132 9.085914 9.568229 9.910524 10.704892
MLN2238 25.0 NaN NaN NaN 8.930093 0.741736 7.218104 8.485371 9.008484 9.357212 10.017931
KX2-391 25.0 NaN NaN NaN 8.798709 1.396529 5.071334 7.925459 9.349282 9.607232 10.796027
AT13387 25.0 NaN NaN NaN 8.378987 0.764056 6.576897 8.046521 8.281312 8.913646 9.612334
leptomycin B 25.0 NaN NaN NaN 3.473396 1.242517 0.864948 2.824049 3.549045 4.49121 5.194158
Patient ID 25 25 TCGA-D8-A1JU 1 NaN NaN NaN NaN NaN NaN NaN
# Step 2: Exploratory Data Analysis
def perform_eda(data):
    print("\n=== Exploratory Data Analysis ===")
    # Separate metadata from drug sensitivity values
    drug_data = data.drop('Patient ID', axis=1)
    
    # Distribution of drug sensitivity values
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(drug_data.values.flatten(), kde=True)
    plt.title('Distribution of Drug Sensitivity Values')
    plt.xlabel('Sensitivity Value')
    
    # Boxplot of drug sensitivity values (first 10 drugs)
    plt.subplot(1, 2, 2)
    sns.boxplot(data=drug_data.iloc[:, :10])
    plt.title('Boxplot of First 10 Drugs')
    plt.xticks(rotation=90)
    plt.tight_layout()

    return drug_data
 
drug_data = perform_eda(data)
=== Exploratory Data Analysis ===
../_images/f625cf5f35c1cc46b3951a3b581a7de7562252810bcb221afa8ef1b829e2fce3.png
drug_data.head()
bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 KX2-391 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
0 14.821152 14.286200 14.608184 4.960653 13.530885 13.202820 13.879099 12.839373 14.103941 10.796027 ... 14.778758 12.541663 12.113460 12.364817 8.913646 14.860838 13.170200 13.116395 14.548597 11.835573
1 14.304843 14.333355 14.960661 3.552977 13.122806 13.129905 13.911525 12.124210 14.127482 8.489440 ... 15.108698 10.936052 11.802052 11.893495 8.326862 15.265313 15.341146 13.509974 14.311074 10.772699
2 14.658052 14.649223 14.547810 3.467945 13.438683 13.441228 13.823993 12.935296 14.052508 9.349282 ... 15.012134 12.837987 11.944058 11.694113 8.622052 15.205567 14.039701 13.252381 14.496764 11.387851
3 14.342911 13.657840 14.426898 1.800027 13.242654 13.094447 13.531232 11.987928 14.046877 7.385516 ... 14.811534 11.098317 11.480102 11.379716 8.080488 14.906361 13.462326 13.030773 14.466555 9.665608
4 14.655920 14.323148 14.527811 4.491210 13.293485 13.054068 13.807173 12.446720 14.012040 9.540638 ... 14.816099 12.075367 11.878005 11.800027 8.595326 14.988143 14.074985 12.959710 14.435152 11.143563

5 rows × 50 columns

plt.figure(figsize=(20, 6))
sns.boxplot(data=drug_data)
plt.xticks(rotation=90);
../_images/0b2e89e686c2a1c801a0d73182520d92f7f236436a095b645a46a48333ed1525.png

Variance plots

variances = drug_data.var().sort_values(ascending=False)
plt.figure(figsize=(20, 6))
sns.barplot(x=variances.index, y=variances.values)
plt.xticks(rotation=90)
plt.title('Variance Across Columns')
plt.ylabel('Variance')
plt.show()
../_images/657a95d374dc9c19d317f03850231025848bb9fdada4d05751a2838f9042a92c.png

Normalize the data (Standardization)

PCA is sensitive to the variances of the original features, so normalizing the data before applying PCA is crucial

  • PCA works by identifying the directions (principal components) that capture the maximum variance in the data.

  • Variables with larger variances can dominate the principal components

    • This means the principal components might primarily reflect the variability of the features with the largest ranges

    • As a result, the principal components may fail to capture the underlying relationships in the data

  • Drugs with inherently higher variability in their sensitivity scores would disproportionately influence the principal components, potentially masking the contributions and relationships involving drugs with lower score variability.

Because PCA looks for directions of maximum variance, standardization ensures that each feature contributes more equally to the determination of the principal components (the sketch below illustrates what can happen without it)
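
To make this concrete, here is a quick illustrative sketch (not part of the original workflow) that fits PCA on the still unscaled drug_data and reports which drug dominates the first component; compare the result with the drug that has the largest variance in the variance plot above.

# Sketch: PCA on the *unscaled* data to show the effect of unequal variances
pca_unscaled = PCA(n_components=2)
pca_unscaled.fit(drug_data)                      # drug_data is still unscaled here
dominant_drug = drug_data.columns[np.argmax(np.abs(pca_unscaled.components_[0]))]
print("PC1 explained variance ratio (unscaled):", round(pca_unscaled.explained_variance_ratio_[0], 3))
print("Drug with the largest |loading| on PC1:", dominant_drug)
print("Drug with the largest variance:        ", drug_data.var().idxmax())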

# Create the scaler and standardize the data
scaler = StandardScaler()
drug_data = scaler.fit_transform(drug_data)
print("Type of drug_data:", type(drug_data))
print("Shape of drug_data:", drug_data.shape)
print("First 5 rows and columns of drug_data:\n", drug_data[:5, :5])
Type of drug_data: <class 'numpy.ndarray'>
Shape of drug_data: (25, 50)
First 5 rows and columns of drug_data:
 [[ 1.11745915 -0.14919196 -0.01756464  1.22165355  0.70067562]
 [-0.4261314  -0.02638244  2.24640919  0.06536914 -0.66625469]
 [ 0.62984536  0.7962644  -0.40535119 -0.00447747  0.3918278 ]
 [-0.3123208  -1.78569382 -1.18197636 -1.3745291  -0.26480228]
 [ 0.62347222 -0.0529662  -0.5338052   0.83604629 -0.09453399]]
plt.figure(figsize=(20, 6))
sns.boxplot(data=drug_data)
plt.xticks(rotation=90);
../_images/c64144c71820efdae9ded4c74865638435e542cfa81d0a4b1b9eb56a1b247d7a.png

PCA Implementation

  • Apply PCA transformation

  • Calculate explained variance

Apply PCA transformation

# Create the PCA instance and fit and transform the data with pca
pca = PCA()
pc = pca.fit_transform(drug_data)

## `fit()` in PCA:
# Calculates the principal components (eigenvectors) and their explained variances
# Determines the transformation matrix based on your data
# Stores these components internally in the PCA object
# Does not actually transform your data

## `transform()` in PCA:
# Projects your data onto the principal components using the previously calculated transformation matrix
# Actually reduces the dimensionality of your data
# Example: pca.transform(X_test) projects test data onto the same components learned from training data

## `fit_transform()` in PCA:
# Combines both operations: calculates principal components and then projects your data
# Equivalent to calling fit() followed by transform()
# More efficient than calling them separately
# Example: pca.fit_transform(X_train) learns components from training data and immediately projects it
print("Type of pc:", type(pc))
print("Shape of pc:", pc.shape)
print("First 5 rows and columns of pc:\n", pc[:5, :5])
Type of pc: <class 'numpy.ndarray'>
Shape of pc: (25, 25)
First 5 rows and columns of pc:
 [[ 3.64790724  3.78552267  1.33659718 -0.20864174 -2.17346721]
 [ 0.32125071 -5.9898321   4.18994535 -0.20697045  0.17194817]
 [ 3.83231009  1.28508967 -0.47749898 -2.19393142  3.98217509]
 [-5.63359232  3.69098317 -3.47828106 -0.8034397   0.44613498]
 [ 0.17402269  1.51797501  0.16606953  2.91234111  0.42116796]]

Note:

  • The original drug_data has 25 rows (observations), so pc also has 25 rows, one per patient

  • We didn’t specify the number of components (n_components) when creating the PCA object (pca = PCA()), so scikit-learn defaults to computing min(n_samples, n_features) components. Although the original dataset has 50 features, PCA therefore computes only 25 components, because the number of meaningful principal components is limited by the number of samples: you can’t find more independent directions of variance than you have data points (in fact, after centering, at most n_samples - 1 = 24 directions carry non-zero variance, which is why the variance of the last component is numerically zero).

The fit_transform method did two things:

  1. fit: It analyzed drug_data to find the 25 principal component directions (axes) based on the variance and correlations between your original features.

  2. transform: It then projected your original 25 samples from their original feature space (with 50 dimensions) onto this new coordinate system defined by the 25 principal components
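
As a small sanity check of these two steps (a sketch, not in the original notebook), fitting and then transforming separately should reproduce the projection returned by fit_transform:

# fit() followed by transform() should give the same result as fit_transform()
pca_two_step = PCA()
pc_two_step = pca_two_step.fit(drug_data).transform(drug_data)
print("Shapes match:", pc_two_step.shape == pc.shape)
print("Values match:", np.allclose(pc_two_step, pc))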

pc_df = pd.DataFrame(pc, columns=[f'PC_{i}' for i in range(1, pc.shape[1]+1)])
pc_df.head()
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 -0.442335 0.125566 ... 0.015509 0.053778 -0.106066 -0.353473 -0.059685 -0.655202 0.315228 -0.239004 0.257046 1.959170e-16
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 -0.674214 -0.811364 ... -0.724327 -0.503311 0.348717 -0.460284 -0.166123 0.121399 0.262114 0.042369 -0.096848 1.959170e-16
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 0.007497 0.070364 ... 0.080597 -0.748926 -0.429462 -0.277130 -0.028362 0.054581 0.009554 0.045150 0.040776 1.959170e-16
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 -1.076850 -1.359736 ... 0.029642 0.142202 0.806519 0.175962 -0.013190 -0.119514 -0.072060 0.053852 0.077116 1.959170e-16
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 -1.104615 -0.970554 ... 0.832998 -0.053671 -0.443638 -0.031360 -0.348642 0.422218 0.010565 -0.405376 -0.039028 1.959170e-16

5 rows × 25 columns

PCA context

  • PCA aims to summarize a large set of correlated variables with a smaller number of representative variables

  • The goal is to find a low-dimensional representation of the data that retains as much of the original variation as possible

  • Why focus on maximizing variance?

    • By capturing most of the variation in the first few uncorrelated PCs, PCA allows you to summarize a large set of potentially correlated variables with a smaller number of representative variables

    • This process can be viewed as identifying directions in the original feature space along which the data vary the most, or finding low-dimensional linear surfaces that are closest to the observations

    • This is valuable for dimension reduction and compression, especially when dealing with many attributes that might be redundant

  • By transforming the original correlated variables into a set of uncorrelated principal components, PCA effectively removes redundancy in the data

    • PCA transforms the original correlated variables into a new set of uncorrelated variables (the principal components), ordered by how much variance they explain. This allows you to potentially use a smaller subset of these new, uncorrelated components to represent the data while retaining most of the original variability

  • The first principal component is defined as the linear combination of the original features that has the largest sample variance

  • Subsequent principal components are then found such that they have the maximal variance out of all linear combinations that are uncorrelated with the preceding principal components

    • For example, the second principal component must be uncorrelated with the first, the third with the first two, and so on
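
The maximal-variance property described above can be illustrated with a short sketch (an addition, using the fitted pca object and the standardized drug_data array): no randomly chosen unit-norm linear combination of the drugs should have a larger variance than the first principal component.

# Sketch: PC1's variance versus the variance along random unit-norm directions
rng = np.random.default_rng(42)
random_dirs = rng.normal(size=(1000, drug_data.shape[1]))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)   # unit length
random_vars = (drug_data @ random_dirs.T).var(axis=0, ddof=1)
print("Largest variance over 1000 random directions:", round(random_vars.max(), 2))
print("Variance of the first principal component:   ", round(pca.explained_variance_[0], 2))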

drug_data_normalised = pd.DataFrame(drug_data, columns=data.columns[1:])
# Correlation heatmap
# plt.figure(figsize=(12, 10))

fig, axes = plt.subplots(1, 2, figsize=(20, 10))

mask = np.triu(np.ones_like(drug_data_normalised.corr(), dtype=bool))
sns.heatmap(drug_data_normalised.corr(), mask=mask, annot=False, cmap='coolwarm', 
            linewidths=0.5, vmin=-1, vmax=1, ax=axes[0])
axes[0].set_title('Correlation Heatmap (All Drugs)')

mask = np.triu(np.ones_like(pc_df.corr(), dtype=bool))
sns.heatmap(pc_df.corr(), mask=mask, annot=False, cmap='coolwarm', 
            linewidths=0.5, vmin=-1, vmax=1, ax=axes[1])
axes[1].set_title('Correlation Heatmap (PCA Components)')


plt.tight_layout()
    
plt.show()
plt.close()
../_images/db72dccdbc8a9915ceb381195ba82aaec2f1ed948f70ba530bba1296f2b5fdcd.png
fig, axes = plt.subplots(2, 1, figsize=(20, 10))

drug_data_normalised.plot(kind="box", title="Drug Sensitivity Boxplot", ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=90)  # Rotate x-labels for axes[0]
pc_df.plot(kind="box", title="PCA Components Boxplot", ax=axes[1])
plt.tight_layout()
../_images/c8977976b111bc0454129f42e5ec07b651fe355df44b2898f29f24e31f22b2a7.png

Note:

  • PCA successfully achieved its goal of decorrelating our dataset

  • Decorrelation:

    • The variance originally shared between features (correlated features) in the original dataset has been reorganized and captured along these new, independent PC axes

  • Correlated variables inherently contain overlapping information (shared variation)

  • When variables are correlated, it means their values tend to change together to some degree

    • This implies that they are capturing some of the same underlying patterns or variance in the data

    • Dealing with a large set of correlated variables suggests there might be redundancy

    • The correlation structure can be thought of as arising from shared underlying sources or latent variables.

  • PCA transforms these into uncorrelated components

    • The process of Principal Component Analysis explicitly finds a new set of variables - principal components (PCs), which are linear combinations of the original features

    • A key property of these principal components is that they are uncorrelated with each other

    • The second principal component is defined as having maximal variance out of all linear combinations that are uncorrelated with the first principal component, and this extends to subsequent components

  • Decorrelation ensures each PC captures a distinct aspect of variability:

    • Because each subsequent principal component is constrained to be uncorrelated with the ones already found, it is forced to capture variation in a direction that is distinct from the directions represented by the earlier components

    • The first PC captures the most variance. The second PC captures the most variance that is not captured by the first PC (due to the uncorrelated constraint), and so on.

    • This ensures that each PC adds new, non-overlapping information about the data’s variability, allowing the first few components to collectively explain a “sizable amount of the variation” and provide a lower-dimensional representation that retains “as much of the information as possible”
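
As a numerical complement to the heatmaps above (a sketch added here), the largest absolute correlation between any two distinct PCs should be essentially zero; the last component is excluded because its variance is numerically zero, which makes its correlation undefined in practice.

# Sketch: quantify the decorrelation of the PC scores
meaningful_pcs = pc_df.loc[:, pc_df.var() > 1e-10]    # drop the numerically degenerate last PC
pc_corr = meaningful_pcs.corr().to_numpy()
off_diagonal = np.abs(pc_corr[~np.eye(pc_corr.shape[0], dtype=bool)])
print("Largest |correlation| between distinct PCs:", off_diagonal.max())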

pc_df.head()
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 -0.442335 0.125566 ... 0.015509 0.053778 -0.106066 -0.353473 -0.059685 -0.655202 0.315228 -0.239004 0.257046 1.959170e-16
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 -0.674214 -0.811364 ... -0.724327 -0.503311 0.348717 -0.460284 -0.166123 0.121399 0.262114 0.042369 -0.096848 1.959170e-16
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 0.007497 0.070364 ... 0.080597 -0.748926 -0.429462 -0.277130 -0.028362 0.054581 0.009554 0.045150 0.040776 1.959170e-16
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 -1.076850 -1.359736 ... 0.029642 0.142202 0.806519 0.175962 -0.013190 -0.119514 -0.072060 0.053852 0.077116 1.959170e-16
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 -1.104615 -0.970554 ... 0.832998 -0.053671 -0.443638 -0.031360 -0.348642 0.422218 0.010565 -0.405376 -0.039028 1.959170e-16

5 rows × 25 columns

# Calculate cumulative variance = cumulative proportion of variance explained by the principal components
variances = pc_df.var(axis=0)
cumulative_variance = variances.cumsum() / variances.sum()

# Plot cumulative variance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Variance Explained by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.grid()
plt.show()
../_images/aa390baf64a7a0cac1f4fb2d5aa36871e963041e3f705469542fcbccd55b5388.png

Explained variance:

  • PCA seeks a low-dimensional representation of a dataset that captures as much as possible of the variation in the original data

  • Variance Explained by a principal component (PC):

    • Each PC captures a specific amount of variance (this is also known as variance explained by each PC)

    • For a specific PC, say the m-th one, its explained variance is essentially the variance of the scores along that PC when the data points are projected onto the principal components

  • Relationship between PC Variance and Total Variance:

    • A key property is that the sum of the variances of all principal components equals the total variance of the original data

    • This means maximizing the variance of the first n principal components is equivalent to minimizing the reconstruction error when approximating the data with those n components

  • Proportion of Variance Explained - PVE (Explained Variance Ratio):

    • The PVE of the mth principal component is calculated as the variance of the mth principal component divided by the total variance in the original data

    • The PVEs for all principal components (up to min(n-1, p)) sum to one

  • Cumulative PVE:

    • The cumulative PVE of the principal components is simply the sum of the PVEs for those components
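
These definitions map directly onto scikit-learn's attributes; the brief sketch below (added for reference) recomputes the PVE from the per-component variances and checks it against explained_variance_ratio_, which is used in the next cells.

# Sketch: PVE_m = variance of PC_m divided by the total variance
pve_manual = pca.explained_variance_ / pca.explained_variance_.sum()
print("Matches explained_variance_ratio_:", np.allclose(pve_manual, pca.explained_variance_ratio_))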

## Access the explained variance directly
explained_variance = pca.explained_variance_
print("Explained variance by component:", explained_variance)
Explained variance by component: [2.06118885e+01 9.46481592e+00 5.66619038e+00 4.25914905e+00
 2.86483250e+00 1.91158440e+00 1.59439385e+00 1.22821102e+00
 9.13563089e-01 8.09987389e-01 5.90796907e-01 4.07254006e-01
 3.73077915e-01 2.98129789e-01 2.48097781e-01 2.31592814e-01
 1.58434105e-01 1.29071221e-01 8.11724160e-02 7.48451289e-02
 6.29952247e-02 5.23843184e-02 3.73057774e-02 1.35597766e-02
 3.99827721e-32]
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained variance ratio by component:", explained_variance_ratio)
Explained variance ratio by component: [3.95748260e-01 1.81724466e-01 1.08790855e-01 8.17756618e-02
 5.50047841e-02 3.67024205e-02 3.06123619e-02 2.35816516e-02
 1.75404113e-02 1.55517579e-02 1.13433006e-02 7.81927692e-03
 7.16309597e-03 5.72409195e-03 4.76347739e-03 4.44658203e-03
 3.04193482e-03 2.47816745e-03 1.55851039e-03 1.43702647e-03
 1.20950831e-03 1.00577891e-03 7.16270926e-04 2.60347711e-04
 7.67669224e-34]
# Option 3: Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
print("Cumulative explained variance:", cumulative_explained_variance)
Cumulative explained variance: [0.39574826 0.57747273 0.68626358 0.76803924 0.82304403 0.85974645
 0.89035881 0.91394046 0.93148087 0.94703263 0.95837593 0.96619521
 0.9733583  0.9790824  0.98384587 0.98829246 0.99133439 0.99381256
 0.99537107 0.99680809 0.9980176  0.99902338 0.99973965 1.
 1.        ]

Visualize the explained variance

# Visualize the explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, label='Individual explained variance')
plt.step(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance threshold')
plt.axvline(x=10, color='r', linestyle='--')
plt.axhline(y=0.90, color='r', linestyle='dotted', label='90% explained variance threshold')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
../_images/3ad4ea091a5a9b2196091c065d01ed8de3b5292c340fecd767ddf82b5b27b201.png
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
plt.title('Cumulative Variance Explained by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid()
plt.show()
../_images/d060c8ff8d7786355365430d6d182f1458824c32639dce6b924530db5b6493db.png

Determine optimal number of components

  • Variance Threshold

  • Scree Plot (Elbow Method)

Variance Threshold

  • Find number of components that explain at least 95% (or 99%) of variance

  • Selects components that collectively explain a predefined percentage (e.g., 95%) of total variance

  • Ensures you keep enough information while reducing dimensions

  • Provides a clear cutoff criterion that doesn’t require subjective judgment (Automated selection)

n_components_95 = np.argmax(cumulative_explained_variance >= 0.95)
# np.argmax(): returns the index of the first occurrence of the maximum value in the array;
# for a boolean array this is the first True, i.e., the first component at which the threshold is reached.

print(f"Number of components for 95% variance: {n_components_95 + 1}")  # +1 because index starts at 0
Number of components for 95% variance: 11

Scree Plot

  • Helps identify the point where additional components yield diminishing returns (Visual identification)

    • Balances model complexity against information retention

  • The “elbow” often marks where principal components transition from capturing signal to capturing noise (Noise reduction)

Logic Behind Using a Scree Plot:

  1. Understanding Eigenvalues

  • Eigenvalues are the foundation of scree plots. In PCA, eigenvalues represent the amount of variance captured by each principal component

  • Eigenvalue associated with a specific eigenvector is equal to the variance of the corresponding principal component

  • Specifically:

    • Each eigenvalue corresponds to one principal component

    • The magnitude of an eigenvalue indicates how much variance that component explains

    • Eigenvalues are measured in the same units as the original data’s variance (When PCA is performed on the standardized data, eigenvalues are unitless because standardization removes units)

    • The sum of all eigenvalues equals the total variance in the dataset

  2. The Scree Plot Displays Eigenvalues

  • A scree plot displays the eigenvalues (not directly the proportion of variance explained) on the Y-axis for each principal component on the X-axis

  • The eigenvalues are plotted in descending order, creating the characteristic “scree” or slope appearance.

  3. Relationship to Proportion of Variance Explained (PVE):

  • PVE for the m-th principal component = (Eigenvalue of m-th component) / (Sum of all eigenvalues)

  • Together, all principal components explain 100% of the variance (sum of all PVEs = 1)

  4. Identifying the “Elbow”

  • The visual logic involves examining the scree plot and looking for a point where the eigenvalues drop off noticeably

  • This sharp drop-off is often referred to as an “elbow” in the plot

  • Since eigenvalues directly represent variance amounts, a steep decline indicates that subsequent components capture significantly less variance.

  5. Interpreting the Elbow

    • The idea is that the components before the elbow have large eigenvalues and therefore capture a “sizable amount of the variation,” representing the key dimensions along which the data vary the most

    • Components after the elbow have small eigenvalues and explain significantly less additional variance, making them less “interesting” or necessary for understanding the main patterns in the data.

  6. By choosing the number of components up to the elbow, you aim to retain the minimum number of components required to explain a substantial portion of the total variance while achieving dimensionality reduction.

Important Considerations:

  • This method of choosing the number of components by inspecting the scree plot is “ad hoc” and “inherently subjective”

  • There isn’t a single, universally accepted objective rule based solely on the scree plot

  • The decision often depends on the specific application and the dataset

  • In practice, analysts might look at the first few components for interesting patterns and continue examining subsequent ones until no further insights are gained

  • Alternative methods include using cumulative proportion of variance explained (e.g., retaining components that explain 80-90% of total variance) or cross-validation approaches

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'o-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue (Variance)')
plt.grid(True)
plt.show()
../_images/18e7566391389df74808f023cc93946de2f0c92fff9b6036a34f5b7a6497f16f.png

Automated Selection with PCA

  • Instead of manually determining the number of components, you can initialize PCA with a variance threshold:

# Automatically select components to explain 90% of variance
optimal_pca = PCA(n_components=0.90)  # Keep enough components to explain 90% of variance
pc_auto = optimal_pca.fit_transform(drug_data)
print(f"Number of components selected: {optimal_pca.n_components_}")

optimal_pca_df = pd.DataFrame(
    data=pc_auto,
    columns=[f'PC{i+1}' for i in range(pc_auto.shape[1])]
)
optimal_pca_df['Patient ID'] = data["Patient ID"]

print(f"Optimal PCA DataFrame: \n{optimal_pca_df.head()}")
Number of components selected: 8
Optimal PCA DataFrame: 
        PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
0  3.647907  3.785523  1.336597 -0.208642 -2.173467 -0.159574 -0.201224   
1  0.321251 -5.989832  4.189945 -0.206970  0.171948  0.960831 -0.520194   
2  3.832310  1.285090 -0.477499 -2.193931  3.982175 -0.967499 -0.351455   
3 -5.633592  3.690983 -3.478281 -0.803440  0.446135  0.610747 -2.052404   
4  0.174023  1.517975  0.166070  2.912341  0.421168 -0.887175 -1.127990   

        PC8    Patient ID  
0 -0.175144  TCGA-D8-A1JU  
1  1.106246  TCGA-AC-A3QQ  
2  0.389307  TCGA-C8-A12Q  
3  0.316239  TCGA-AR-A1AY  
4 -1.452414  TCGA-A8-A0A2  

Feature Loadings

  • Feature Loadings are

    • the weights that transform original, potentially correlated variables into a new set of principal components

    • i.e., contribution of each original feature to each principal component

  • Shows which original features most strongly influence each principal component - Feature influence

  • Helps interpret what each principal component represents - Component interpretation

  • Identifies which original features are most important for your dataset’s structure - Feature selection

  • Reveals relationships between features in your high-dimensional space - Dimensionality insights

Loading and correlation:

  • Loadings are not correlations themselves - they are the coefficients (weights) in the linear combination that defines the principal component

  • In PCA (with standardized data), loadings are proportional to correlations between original variables and PC scores

Positive Loading (e.g., +0.7):

  • This indicates a positive correlation

  • When the score of the principal component increases, the value of the original feature also tends to increase. Conversely, when the PC score decreases, the feature’s value tends to decrease.

Negative Loading (e.g., -0.6):

  • This indicates a negative correlation

  • When the score of the principal component increases, the value of the original feature tends to decrease. Conversely, when the PC score decreases, the feature’s value tends to increase

High absolute values (positive or negative) indicate strong influence on that component. A loading of 0 means no influence
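
Before extracting the loadings, here is a short check (a sketch, not part of the original notebook) of the proportionality mentioned above: for the standardized data, the correlation of each drug with the PC_1 scores divided by that drug's PC_1 loading should come out as (numerically) the same constant for every drug.

# Sketch: correlations with PC_1 are proportional to the PC_1 loadings
pc1_scores = pc[:, 0]
corr_with_pc1 = np.array([np.corrcoef(drug_data[:, j], pc1_scores)[0, 1]
                          for j in range(drug_data.shape[1])])
ratio = corr_with_pc1 / pca.components_[0]
print("corr/loading ratio: min =", round(ratio.min(), 4), "max =", round(ratio.max(), 4))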

loadings = optimal_pca.components_

# Get feature names (assuming you have them in a list or as column names)
feature_names = data.iloc[:, 1:].columns  # Or your list of feature names

# Create a DataFrame with the loadings
loadings_df = pd.DataFrame(
    loadings,
    columns=feature_names,
    index=[f'PC_{i+1}' for i in range(loadings.shape[0])]
)

loadings_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, PC_1 to PC_8
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   bendamustine                    8 non-null      float64
 1   ML320                           8 non-null      float64
 2   BRD-K14844214                   8 non-null      float64
 3   leptomycin B                    8 non-null      float64
 4   Compound 23 citrate             8 non-null      float64
 5   BRD4132                         8 non-null      float64
 6   dabrafenib                      8 non-null      float64
 7   necrosulfonamide                8 non-null      float64
 8   PF-543                          8 non-null      float64
 9   KX2-391                         8 non-null      float64
 10  ELCPK                           8 non-null      float64
 11  carboplatin                     8 non-null      float64
 12  SB-525334                       8 non-null      float64
 13  CIL41                           8 non-null      float64
 14  belinostat                      8 non-null      float64
 15  Compound 7d-cis                 8 non-null      float64
 16  lapatinib                       8 non-null      float64
 17  tacrolimus                      8 non-null      float64
 18  pifithrin-mu                    8 non-null      float64
 19  RG-108                          8 non-null      float64
 20  BRD-K97651142                   8 non-null      float64
 21  NVP-BEZ235                      8 non-null      float64
 22  BRD-K01737880                   8 non-null      float64
 23  pluripotin                      8 non-null      float64
 24  GW-405833                       8 non-null      float64
 25  masitinib                       8 non-null      float64
 26  YM-155                          8 non-null      float64
 27  MLN2238                         8 non-null      float64
 28  birinapant                      8 non-null      float64
 29  RAF265                          8 non-null      float64
 30  BRD-K27188169                   8 non-null      float64
 31  PIK-93                          8 non-null      float64
 32  N9-isopropylolomoucine          8 non-null      float64
 33  myricetin                       8 non-null      float64
 34  epigallocatechin-3-monogallate  8 non-null      float64
 35  ceranib-2                       8 non-null      float64
 36  BRD-K66532283                   8 non-null      float64
 37  elocalcitol                     8 non-null      float64
 38  RO4929097                       8 non-null      float64
 39  BRD-K02251932                   8 non-null      float64
 40  Compound 1541A                  8 non-null      float64
 41  pevonedistat                    8 non-null      float64
 42  ISOX                            8 non-null      float64
 43  tosedostat                      8 non-null      float64
 44  AT13387                         8 non-null      float64
 45  BRD-A02303741                   8 non-null      float64
 46  BRD-K99006945                   8 non-null      float64
 47  GSK525762A                      8 non-null      float64
 48  necrostatin-7                   8 non-null      float64
 49  nakiterpiosin                   8 non-null      float64
dtypes: float64(50)
memory usage: 3.2+ KB
print("Contribution of each original feature to each principal component")
loadings_df
Contribution of each original feature to each principal component
bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 KX2-391 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
PC_1 0.188893 0.158907 -0.014409 0.184616 0.188557 0.148854 0.169982 0.195003 0.114682 0.186589 ... 0.013166 0.186654 0.211779 0.153139 0.186287 0.091061 -0.092112 0.125780 0.127629 0.184714
PC_2 0.130845 -0.140145 -0.233375 0.073321 -0.021347 0.107959 -0.080697 -0.013266 -0.024182 0.121842 ... -0.272325 0.094608 0.014560 0.064277 0.026162 -0.255982 -0.258407 -0.169401 0.188433 0.083508
PC_3 -0.021312 -0.084223 0.186021 0.116362 -0.187231 0.127776 -0.129210 -0.165453 0.120737 0.124947 ... 0.026282 -0.007923 -0.035613 0.082174 -0.057860 -0.049318 0.141969 -0.145518 -0.050762 0.125907
PC_4 0.002208 0.054969 -0.168718 0.204135 0.007250 -0.127992 0.012934 -0.043828 -0.228499 0.082702 ... -0.175538 0.146641 -0.057821 0.019380 0.129659 -0.015643 0.014241 0.011453 -0.116959 0.151041
PC_5 0.032004 0.169326 -0.154553 -0.061343 -0.063709 0.255198 0.039662 -0.012972 -0.180148 -0.060818 ... -0.012056 0.095784 -0.037420 -0.213437 -0.050259 0.151289 0.119979 -0.133028 -0.047117 -0.040750
PC_6 0.026600 -0.001502 -0.203015 0.044929 0.140286 0.042471 0.153289 -0.052095 0.269030 -0.099810 ... -0.158710 -0.120513 -0.109148 0.034131 0.272596 0.106136 -0.022863 -0.083687 0.295653 -0.157441
PC_7 0.110011 0.061453 -0.006347 0.049683 -0.096940 0.049480 -0.080599 0.157775 -0.281277 -0.054539 ... 0.088570 0.008416 -0.082869 0.363718 -0.125535 0.122516 -0.145911 -0.100268 0.119357 0.078340
PC_8 -0.007307 -0.195364 -0.010461 0.000593 0.019415 0.083812 0.053971 -0.126673 0.113933 -0.000942 ... -0.087146 0.140472 -0.021129 -0.089389 -0.021654 0.044740 -0.014215 0.338453 -0.170848 0.043675

8 rows × 50 columns

Visualization - Feature Loadings

# Create a heatmap
plt.figure(figsize=(12, 8))
heatmap = sns.heatmap(
    loadings_df, 
    cmap='coolwarm',
    center=0,
    annot=False,
    fmt=".2f",
    linewidths=.5
)
plt.title('PCA Feature Loadings')
plt.tight_layout()
plt.show()
../_images/555d2bb8e7cda6a8d45b5794c47d8c310f955c830b5c23d98758d4ca28a8bc55.png
# Select top n contributing features for each component
def get_top_features(loadings_df, n=10):
    top_features = pd.DataFrame()
    
    for pc in loadings_df.index:
        pc_loadings = loadings_df.loc[pc].abs().sort_values(ascending=False)
        top_features[pc] = pc_loadings.index[:n]
        
    return top_features

top_features = get_top_features(loadings_df)
print("Top contributing features for each principal component:")
top_features
Top contributing features for each principal component:
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8
0 ISOX lapatinib NVP-BEZ235 pluripotin CIL41 RO4929097 birinapant SB-525334
1 necrosulfonamide Compound 1541A RO4929097 N9-isopropylolomoucine birinapant necrostatin-7 tosedostat GSK525762A
2 ceranib-2 BRD-K99006945 BRD-K02251932 SB-525334 NVP-BEZ235 AT13387 BRD-K66532283 RO4929097
3 MLN2238 BRD-A02303741 elocalcitol PF-543 BRD4132 PF-543 PF-543 myricetin
4 BRD-K66532283 RAF265 BRD-K27188169 birinapant RG-108 tacrolimus myricetin BRD-K97651142
5 Compound 7d-cis BRD-K14844214 YM-155 masitinib lapatinib CIL41 GW-405833 ML320
6 bendamustine PIK-93 BRD-K01737880 leptomycin B tosedostat carboplatin BRD-K97651142 epigallocatechin-3-monogallate
7 Compound 23 citrate carboplatin ELCPK BRD-K02251932 RAF265 BRD-K27188169 RO4929097 necrostatin-7
8 GW-405833 BRD-K97651142 masitinib PIK-93 belinostat BRD-K14844214 RG-108 N9-isopropylolomoucine
9 pevonedistat myricetin CIL41 myricetin PIK-93 elocalcitol necrosulfonamide BRD-K01737880

K-means Cluster analysis

Logic - Minimizing Variation:

  • The core logic of K-means is to find a partition that minimizes the within-cluster variation.

  • The algorithm aims to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is minimized.

  • The objective being minimized is the sum of squared distances between each point and the centroid of its assigned cluster

Inertia (within-cluster sum-of-squares)

  • Inertia is a mathematical measure that quantifies how tightly grouped the data points are within their assigned clusters

  • Definition: Inertia is the total within-cluster sum of squares. It is a measure of how internally coherent the clusters are

  • K-means clustering uses an iterative optimization strategy to minimize inertia

  • As K increases, inertia always decreases or stays the same; it never increases

  • The rate of decrease typically follows a curve that looks like this:

    • Rapid decrease initially (adding the first few clusters)

    • Gradually diminishing returns as K continues to increase

    • Eventually, minimal improvements with additional clusters

Why This Happens?

  • With more clusters, each data point can be assigned to a centroid that’s closer to it

  • When K=1: Maximum inertia (all points compared to global mean)

  • When K=n (number of data points): Zero inertia (each point is its own cluster)

  • The first few clusters capture the major structure in the data, while additional clusters only capture finer details (Diminishing returns)
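
Before turning to the elbow method, here is a quick numerical illustration of the two extremes mentioned above (a sketch using the standardized drug_data array from earlier):

# Sketch: inertia at K=1 equals the total sum of squares around the global mean,
# and at K = n_samples it collapses to (numerically) zero
total_ss = ((drug_data - drug_data.mean(axis=0)) ** 2).sum()
print("Total sum of squares:", round(float(total_ss), 2))
print("Inertia at K=1: ",
      round(KMeans(n_clusters=1, random_state=42, n_init=10).fit(drug_data).inertia_, 2))
print("Inertia at K=25:",
      round(KMeans(n_clusters=25, random_state=42, n_init=10).fit(drug_data).inertia_, 2))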

The Elbow Method

This behavior forms the basis of the popular “elbow method” for determining the optimal number of clusters:

  1. Plot inertia against K (for K=1,2,3…)

  2. Look for the “elbow point” where the curve bends sharply

  3. This point represents where adding more clusters stops providing significant reduction in inertia

Important Considerations:

  1. Inertia will always decrease as K increases, even if you’re adding meaningless clusters

  2. This is why we look for the elbow point rather than simply minimizing inertia

Silhouette Score

  • Evaluating clustering quality using inertia alone has some limitations

    • It measures how similar each point is to its own cluster (cohesion)

    • but it does not evaluate how different that point is from other clusters (separation)

  • The Silhouette Score measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation)

  • While Inertia is the objective function that the K-means algorithm directly tries to minimize during its iterative process, the Silhouette Score is a measure used to evaluate the quality of the clustering result after the algorithm has finished. It is not part of the K-means algorithm’s iterative assignment and update steps.

Is each data point placed in the right cluster?

  • How close a point is to the other points in its assigned cluster (cohesion, which inertia captures through distances to the cluster centroid)

  • How close that same point is to points in the nearest neighboring cluster (separation, which the silhouette score additionally takes into account)
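
By the standard definition, for a single point i, a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the point's silhouette value is then s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the score returned by silhouette_score is the average of s(i) over all points.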

Interpretation of Silhouette Score

  • The silhouette score ranges from -1 to 1:

    • Close to +1: The point is well matched to its own cluster and far from neighboring clusters

    • Close to 0: The point lies near the boundary between two clusters

    • Close to -1: The point might be assigned to the wrong cluster

  • The overall silhouette score of a clustering is simply the average of the individual silhouette scores of all points.

Finding Optimal K Using Silhouette

Unlike inertia which always decreases as K increases, the silhouette score typically:

  • Increases as K approaches the natural number of clusters

  • Reaches a maximum at or near the optimal K

  • Decreases as K becomes too large

  • This makes it particularly useful for determining the optimal number of clusters - you simply choose the K value that maximizes the average silhouette score.

# 1. Determine optimal number of clusters using the Elbow method

inertia = []
silhouette_scores = []
k_range = range(2, 11)  # Testing from 2 to 10 clusters

# Exclude the Patient ID column for clustering
pca_data = optimal_pca_df.drop('Patient ID', axis=1)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(pca_data)
    inertia.append(kmeans.inertia_)
    
    # Calculate silhouette score (only valid for k >= 2)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(pca_data, labels))
    print(f"Silhouette score for {k} clusters: {silhouette_scores[-1]}")
Silhouette score for 2 clusters: 0.315860568817066
Silhouette score for 3 clusters: 0.23399486593926483
Silhouette score for 4 clusters: 0.1923227376052971
Silhouette score for 5 clusters: 0.21280378841022102
Silhouette score for 6 clusters: 0.2215781206218201
Silhouette score for 7 clusters: 0.18825389524527888
Silhouette score for 8 clusters: 0.16503544411475235
Silhouette score for 9 clusters: 0.13979820154212966
Silhouette score for 10 clusters: 0.13772690406314397
# Plot the Elbow curve
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)

# Plot silhouette scores
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different k')
plt.grid(True)
plt.tight_layout()
plt.show()
../_images/2d3fa3dc2d2d4eade73a96b9f6a7b0c94eb5fc17c7af6d709ad9d4b51afa5292.png
# 2. Apply K-means with the optimal k
optimal_k = 3  # optimal k from the plots
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# Fit K-means with the optimal number of clusters
clusters = kmeans.fit(pca_data)

print(f"Number of clusters: {optimal_k}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Cluster labels for each patient sample: {clusters.labels_}, shape: {clusters.labels_.shape}")
Number of clusters: 3
Cluster centers:
[[ 2.42255496  1.31481132  0.22658619  0.31864774  0.11053081 -0.05490954
  -0.12686288 -0.10959923]
 [-1.34671408 -3.70301852 -0.57098639 -0.42079775 -0.54593247 -0.09185375
   0.23802588  0.04913763]
 [-8.97044194  2.06631993  0.19937064 -0.61137725  0.72118837  0.48887309
   0.0789207   0.43334165]]
Cluster labels for each patient sample: [0 1 0 2 0 1 0 0 0 0 0 1 0 0 0 2 0 1 0 1 0 2 1 1 0], shape: (25,)
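
As a follow-up to the earlier question of whether each point sits in the right cluster, this optional sketch (not in the original notebook) inspects the per-sample silhouette values of the K=3 solution; their mean equals the overall silhouette score, and negative values would flag potentially misassigned patients.

# Optional sketch: per-sample silhouette values for the final clustering
from sklearn.metrics import silhouette_samples

sample_silhouettes = silhouette_samples(pca_data, clusters.labels_)
print("Mean silhouette:", round(sample_silhouettes.mean(), 3))
print("Samples with silhouette < 0:", int((sample_silhouettes < 0).sum()))
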
# 3. Add cluster labels to the dataframe
optimal_pca_df['Cluster'] = clusters.labels_
print("First 5 rows of PCA DataFrame with cluster labels:")
optimal_pca_df.head()
First 5 rows of PCA DataFrame with cluster labels:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Patient ID Cluster
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 TCGA-D8-A1JU 0
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 TCGA-AC-A3QQ 1
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 TCGA-C8-A12Q 0
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 TCGA-AR-A1AY 2
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 TCGA-A8-A0A2 0
# Basic cluster analysis
print("Cluster distribution:")
print(optimal_pca_df['Cluster'].value_counts())
Cluster distribution:
Cluster
0    15
1     7
2     3
Name: count, dtype: int64
optimal_pca_df.groupby(["Cluster", "Patient ID"]).size().reset_index(name='Counts').drop(columns='Counts')
Cluster Patient ID
0 0 TCGA-A8-A07G
1 0 TCGA-A8-A08Z
2 0 TCGA-A8-A0A2
3 0 TCGA-AN-A0XW
4 0 TCGA-BH-A0AU
5 0 TCGA-BH-A0DV
6 0 TCGA-BH-A18K
7 0 TCGA-BH-A1F8
8 0 TCGA-C8-A12Q
9 0 TCGA-D8-A1J8
10 0 TCGA-D8-A1JN
11 0 TCGA-D8-A1JU
12 0 TCGA-EW-A1P1
13 0 TCGA-EW-A3E8
14 0 TCGA-LD-A7W6
15 1 TCGA-A2-A0CX
16 1 TCGA-A2-A0CY
17 1 TCGA-AC-A3QQ
18 1 TCGA-AO-A12F
19 1 TCGA-AR-A1AH
20 1 TCGA-E2-A10C
21 1 TCGA-E2-A1LL
22 2 TCGA-AR-A1AY
23 2 TCGA-EW-A3U0
24 2 TCGA-OL-A5D7
# Calculate mean of each feature for each cluster
cluster_means = optimal_pca_df.drop(columns="Patient ID").groupby('Cluster').mean()
print("\nCluster centers in PCA space:")
print(cluster_means)
Cluster centers in PCA space:
              PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
Cluster                                                                         
0        2.422555  1.314811  0.226586  0.318648  0.110531 -0.054910 -0.126863   
1       -1.346714 -3.703019 -0.570986 -0.420798 -0.545932 -0.091854  0.238026   
2       -8.970442  2.066320  0.199371 -0.611377  0.721188  0.488873  0.078921   

              PC8  
Cluster            
0       -0.109599  
1        0.049138  
2        0.433342  
# 4. Visualize clusters using first two principal components
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k):
    cluster_points = optimal_pca_df[optimal_pca_df['Cluster'] == cluster]
    plt.scatter(
        cluster_points['PC1'], 
        cluster_points['PC2'],
        label=f'Cluster {cluster}',
        alpha=0.7,
        s=80
    )
    for i, row in cluster_points.iterrows():
        plt.text(row['PC1'], row['PC2'], str(row['Patient ID']), fontsize=8, alpha=0.7)

centers = kmeans.cluster_centers_
plt.scatter(
    centers[:, 0], 
    centers[:, 1], 
    s=200, 
    marker='.', 
    c='red', 
    label='Centroids'
)


plt.legend(title='Cluster', fontsize=10, title_fontsize=12, loc='lower left')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Clusters Visualized on First Two Principal Components')
plt.grid(True)
plt.tight_layout()
plt.show()
../_images/2fba9c971a763b91dd2cb5559818d5107f42c3c2011f4e3052bf3fbe001a29b1.png
optimal_pca_df.head()
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Patient ID Cluster
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 TCGA-D8-A1JU 0
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 TCGA-AC-A3QQ 1
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 TCGA-C8-A12Q 0
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 TCGA-AR-A1AY 2
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 TCGA-A8-A0A2 0
features_to_analyze = optimal_pca_df.columns.drop(['Patient ID', 'Cluster'])
features_to_analyze
Index(['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8'], dtype='object')
cluster_stats = optimal_pca_df.groupby('Cluster')[['PC1', 'PC2', 'PC3']].agg(['mean', 'std', 'min', 'max'])

print("Descriptive Statistics for Each Cluster:")
cluster_stats
Descriptive Statistics for Each Cluster:
PC1 PC2 PC3
mean std min max mean std min max mean std min max
Cluster
0 2.422555 2.669334 -1.042854 6.184113 1.314811 1.812225 -2.445795 3.785523 0.226586 1.816583 -3.150876 4.187933
1 -1.346714 2.241274 -5.334288 0.359772 -3.703019 2.310308 -7.265999 -1.123308 -0.570986 3.293631 -5.974913 4.189945
2 -8.970442 3.365790 -12.364445 -5.633592 2.066320 2.685685 -1.033643 3.690983 0.199371 3.280955 -3.478281 2.826122

Overall Observations:

  • PC1 and PC2 seem to be more effective at differentiating the clusters based on their mean values than PC3

PC1 (Captures the most variance in your original data):

  • Separation (mean):

    • PC1 strongly differentiates all three clusters based on their mean values

    • Cluster 0 has a positive mean (2.423), indicating samples here score high on the trait(s) PC1 represents

    • Cluster 1 has a moderately negative mean (-1.347)

    • Cluster 2 has a very strongly negative mean (-8.970), indicating samples in this cluster score very low on trait(s) PC1 represents

  • Spread (std):

    • Cluster 0 has an intermediate spread (std = 2.670)

    • Cluster 1 is the most compact (homogeneous) along PC1 (std = 2.241) relative to Clusters 0 and 2

    • Cluster 2 is the most spread out (heterogeneous) along PC1 (std = 3.366)

PC2 (Captures the second most variance):

  • Separation (mean):

    • PC2 is also effective at differentiating the clusters, particularly separating Cluster 1 from Clusters 0 and 2.

    • Cluster 1 has a strongly negative mean (-3.703).

    • Cluster 0 (1.315) and Cluster 2 (2.066) both have positive mean PC2 scores, with Cluster 2’s mean being slightly higher.

  • Spread (std):

    • Cluster 0 is the most compact along PC2 (std = 1.812).

    • Cluster 1 has an intermediate spread (std = 2.310).

    • Cluster 2 is the most spread out along PC2 (std = 2.686).

Cluster 0:

  • Patients in this cluster strongly exhibit the characteristics represented by the positive direction of PC1 and moderately by the positive direction of PC2

  • They are quite consistent in their PC2 (and PC3) values

Cluster 1:

  • Patients are strongly characterized by the negative direction of PC2 and moderately by the negative direction of PC1

  • They are quite consistent in their PC1 values (relatively)

Cluster 2:

  • Patients are strongly characterized by the negative direction of PC1 and by the strongly positive direction of PC2

  • They are quite spread out in their PC1 and PC2 values