Unsupervised Learning

Principal component analysis (PCA) and K-means clustering

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Data Preparation

  • Clean the dataset by handling missing values

  • Scale/normalize the data

  • Check for outliers

  • Separate metadata from drug sensitivity values

! curl https://naicno.github.io/BioNT_Module2_handson/_downloads/d10b2868b0538b6d8e941c3b4de06a4c/BRCA_Drug_sensitivity_test_data.csv -o BRCA_Drug_sensitivity_test_data.csv
! ls -lh BRCA_Drug_sensitivity_test_data.csv
-rw-r--r-- 1 runner docker 18K Jun  5 12:07 BRCA_Drug_sensitivity_test_data.csv
data = pd.read_csv('BRCA_Drug_sensitivity_test_data.csv')
data.head()
Patient ID bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
0 TCGA-D8-A1JU 14.821152 14.286200 14.608184 4.960653 13.530885 13.202820 13.879099 12.839373 14.103941 ... 14.778758 12.541663 12.113460 12.364817 8.913646 14.860838 13.170200 13.116395 14.548597 11.835573
1 TCGA-AC-A3QQ 14.304843 14.333355 14.960661 3.552977 13.122806 13.129905 13.911525 12.124210 14.127482 ... 15.108698 10.936052 11.802052 11.893495 8.326862 15.265313 15.341146 13.509974 14.311074 10.772699
2 TCGA-C8-A12Q 14.658052 14.649223 14.547810 3.467945 13.438683 13.441228 13.823993 12.935296 14.052508 ... 15.012134 12.837987 11.944058 11.694113 8.622052 15.205567 14.039701 13.252381 14.496764 11.387851
3 TCGA-AR-A1AY 14.342911 13.657840 14.426898 1.800027 13.242654 13.094447 13.531232 11.987928 14.046877 ... 14.811534 11.098317 11.480102 11.379716 8.080488 14.906361 13.462326 13.030773 14.466555 9.665608
4 TCGA-A8-A0A2 14.655920 14.323148 14.527811 4.491210 13.293485 13.054068 13.807173 12.446720 14.012040 ... 14.816099 12.075367 11.878005 11.800027 8.595326 14.988143 14.074985 12.959710 14.435152 11.143563

5 rows × 51 columns

print(f"Dataset shape: {data.shape}")
print(f"Patient IDs: {data['Patient ID'].nunique()} unique values")
print(f"Features: {data.shape[1]-1} drug sensitivity measurements")
Dataset shape: (25, 51)
Patient IDs: 25 unique values
Features: 50 drug sensitivity measurements

Dimensionality of the datasets

Notes: High-dimensional data

  • High-dimensional data refers to datasets where the number of features or attributes (dimensions, denoted as p) is significantly large

  • In such datasets, each observation can be thought of as residing in a high-dimensional space

  • The definition of “high-dimensional” data can vary depending on the context, the field of study, and the specific analysis being performed.

Dataset with 51 columns (drug compounds) and 25 rows (patients):

  • Contains more features (50 drug sensitivity scores) than samples (25 patients)

  • Difficulty in visualization: Identifying patterns visually becomes impractical once the data has more than a few dimensions

  • Sparsity: With 50 features but only 25 samples, data points are scattered across a vast 50-dimensional space

  • Distance metrics become less meaningful: In high dimensions, the difference between the nearest and farthest neighbors becomes less significant - nearly all points appear similarly distant from each other (see the short sketch after this list)

  • Overfitting risk: A model has many potential combinations of features to consider but only a few examples to learn from, so it has enough capacity to fit even the noise in the limited training data, leading to overfitting and poor performance
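
To make the distance point above concrete, here is a small illustrative sketch (not part of the original analysis): it compares the ratio of the farthest to the nearest pairwise distance for 25 random points in 2 dimensions versus 50 dimensions; the ratio is typically far smaller in 50 dimensions, meaning the points look almost equally far apart.

# Illustrative sketch: distance concentration in high dimensions
rng = np.random.default_rng(0)
for n_dims in (2, 50):
    points = rng.normal(size=(25, n_dims))                     # 25 random "samples"
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dists = dists[np.triu_indices(25, k=1)]                    # unique pairwise distances
    print(f"{n_dims:>2} dimensions: farthest/nearest distance ratio = {dists.max() / dists.min():.1f}")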

Clean the dataset by handling missing values

# Check for missing values
data.isnull().sum().sum()
np.int64(0)
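
This dataset has no missing values, but if it did, one simple option (a sketch of a common strategy, not a step taken in this notebook) would be to impute each drug's missing entries with that drug's median sensitivity:

# Hypothetical handling of missing values (a no-op here, since there are none)
if data.isnull().sum().sum() > 0:
    data = data.fillna(data.median(numeric_only=True))
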
# Check data types
print("\nData types:")
print("Datatypes of first 10 columns:", data.dtypes[:10])
print("Different datatypes in the dataframe:", data.drop(columns=["Patient ID"]).dtypes.unique())
Data types:
Datatypes of first 10 columns: Patient ID              object
bendamustine           float64
ML320                  float64
BRD-K14844214          float64
leptomycin B           float64
Compound 23 citrate    float64
BRD4132                float64
dabrafenib             float64
necrosulfonamide       float64
PF-543                 float64
dtype: object
Different datatypes in the dataframe: [dtype('float64')]
# Basic statistics for drug sensitivity values
data.describe(include='all').T.sort_values('mean', ascending=False).head(10)
count unique top freq mean std min 25% 50% 75% max
myricetin 25.0 NaN NaN NaN 15.126359 0.043691 15.053067 15.095327 15.120994 15.151308 15.219127
BRD-A02303741 25.0 NaN NaN NaN 15.058701 0.163287 14.70386 14.96314 15.036734 15.205567 15.398952
Compound 1541A 25.0 NaN NaN NaN 14.907687 0.136504 14.724434 14.811534 14.879828 14.981904 15.318321
carboplatin 25.0 NaN NaN NaN 14.711186 0.135093 14.459443 14.623642 14.671899 14.799047 14.999075
BRD-K27188169 25.0 NaN NaN NaN 14.624495 0.086419 14.428273 14.601246 14.623367 14.67901 14.801653
BRD-K14844214 25.0 NaN NaN NaN 14.610919 0.1589 14.393664 14.519765 14.597586 14.681205 15.019018
pifithrin-mu 25.0 NaN NaN NaN 14.589748 0.267786 14.076076 14.384124 14.600936 14.790655 15.095189
RO4929097 25.0 NaN NaN NaN 14.559876 0.181499 13.979993 14.480718 14.59874 14.66368 14.924847
RG-108 25.0 NaN NaN NaN 14.447688 0.218108 13.95736 14.283119 14.4542 14.592225 14.81662
bendamustine 25.0 NaN NaN NaN 14.447378 0.341383 13.492308 14.304843 14.46329 14.658052 15.100501
data.describe(include='all').T.sort_values('mean', ascending=False).tail(10)
count unique top freq mean std min 25% 50% 75% max
nakiterpiosin 25.0 NaN NaN NaN 10.90167 0.831147 8.701056 10.381886 11.26099 11.37343 11.92009
YM-155 25.0 NaN NaN NaN 10.166452 0.883099 7.774662 9.441895 10.42738 10.774703 11.398507
NVP-BEZ235 25.0 NaN NaN NaN 9.83664 0.38227 8.935818 9.616815 9.876253 10.104948 10.476573
belinostat 25.0 NaN NaN NaN 9.719361 0.630235 8.049804 9.406909 9.762903 9.957621 11.005536
pluripotin 25.0 NaN NaN NaN 9.469802 0.687683 7.503132 9.085914 9.568229 9.910524 10.704892
MLN2238 25.0 NaN NaN NaN 8.930093 0.741736 7.218104 8.485371 9.008484 9.357212 10.017931
KX2-391 25.0 NaN NaN NaN 8.798709 1.396529 5.071334 7.925459 9.349282 9.607232 10.796027
AT13387 25.0 NaN NaN NaN 8.378987 0.764056 6.576897 8.046521 8.281312 8.913646 9.612334
leptomycin B 25.0 NaN NaN NaN 3.473396 1.242517 0.864948 2.824049 3.549045 4.49121 5.194158
Patient ID 25 25 TCGA-D8-A1JU 1 NaN NaN NaN NaN NaN NaN NaN
# Step 2: Exploratory Data Analysis
def perform_eda(data):
    print("\n=== Exploratory Data Analysis ===")
    # Separate metadata from drug sensitivity values
    drug_data = data.drop('Patient ID', axis=1)
    
    # Distribution of drug sensitivity values
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(drug_data.values.flatten(), kde=True)
    plt.title('Distribution of Drug Sensitivity Values')
    plt.xlabel('Sensitivity Value')
    
    # Boxplot of drug sensitivity values (first 10 drugs)
    plt.subplot(1, 2, 2)
    sns.boxplot(data=drug_data.iloc[:, :10])
    plt.title('Boxplot of First 10 Drugs')
    plt.xticks(rotation=90)
    plt.tight_layout()

    return drug_data
 
drug_data = perform_eda(data)
=== Exploratory Data Analysis ===
../_images/f625cf5f35c1cc46b3951a3b581a7de7562252810bcb221afa8ef1b829e2fce3.png
drug_data.head()
bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 KX2-391 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
0 14.821152 14.286200 14.608184 4.960653 13.530885 13.202820 13.879099 12.839373 14.103941 10.796027 ... 14.778758 12.541663 12.113460 12.364817 8.913646 14.860838 13.170200 13.116395 14.548597 11.835573
1 14.304843 14.333355 14.960661 3.552977 13.122806 13.129905 13.911525 12.124210 14.127482 8.489440 ... 15.108698 10.936052 11.802052 11.893495 8.326862 15.265313 15.341146 13.509974 14.311074 10.772699
2 14.658052 14.649223 14.547810 3.467945 13.438683 13.441228 13.823993 12.935296 14.052508 9.349282 ... 15.012134 12.837987 11.944058 11.694113 8.622052 15.205567 14.039701 13.252381 14.496764 11.387851
3 14.342911 13.657840 14.426898 1.800027 13.242654 13.094447 13.531232 11.987928 14.046877 7.385516 ... 14.811534 11.098317 11.480102 11.379716 8.080488 14.906361 13.462326 13.030773 14.466555 9.665608
4 14.655920 14.323148 14.527811 4.491210 13.293485 13.054068 13.807173 12.446720 14.012040 9.540638 ... 14.816099 12.075367 11.878005 11.800027 8.595326 14.988143 14.074985 12.959710 14.435152 11.143563

5 rows × 50 columns

plt.figure(figsize=(20, 6))
sns.boxplot(data=drug_data)
plt.xticks(rotation=90);
../_images/0b2e89e686c2a1c801a0d73182520d92f7f236436a095b645a46a48333ed1525.png

Variance plots

variances = drug_data.var().sort_values(ascending=False)
plt.figure(figsize=(20, 6))
sns.barplot(x=variances.index, y=variances.values)
plt.xticks(rotation=90)
plt.title('Variance Across Columns')
plt.ylabel('Variance')
plt.show()
../_images/657a95d374dc9c19d317f03850231025848bb9fdada4d05751a2838f9042a92c.png

Normalize the data (Standardization)

PCA is sensitive to the variances of the original features, so normalizing the data before applying PCA is crucial

  • PCA works by identifying the directions (principal components) that capture the maximum variance in the data.

  • Variables with larger variances can dominate the principal components

    • This means the principal components might primarily reflect the variability of the features with the largest ranges

    • As a result, the principal components may fail to capture the underlying relationships in the data

  • Drugs with inherently higher variability in their sensitivity scores would disproportionately influence the principal components, potentially masking the contributions and relationships involving drugs with lower score variability.

Because PCA looks for directions of maximum variance, standardization ensures that each feature contributes more equally to the determination of the principal components (the sketch below illustrates what can happen without it)
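
To make this concrete, here is a quick illustrative sketch (not part of the original workflow) that fits PCA on the still unscaled drug_data and reports which drug dominates the first component; compare the result with the drug that has the largest variance in the variance plot above.

# Sketch: PCA on the *unscaled* data to show the effect of unequal variances
pca_unscaled = PCA(n_components=2)
pca_unscaled.fit(drug_data)                      # drug_data is still unscaled here
dominant_drug = drug_data.columns[np.argmax(np.abs(pca_unscaled.components_[0]))]
print("PC1 explained variance ratio (unscaled):", round(pca_unscaled.explained_variance_ratio_[0], 3))
print("Drug with the largest |loading| on PC1:", dominant_drug)
print("Drug with the largest variance:        ", drug_data.var().idxmax())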

# Create the scaler and standardize the data
scaler = StandardScaler()
drug_data = scaler.fit_transform(drug_data)
print("Type of drug_data:", type(drug_data))
print("Shape of drug_data:", drug_data.shape)
print("First 5 rows and columns of drug_data:\n", drug_data[:5, :5])
Type of drug_data: <class 'numpy.ndarray'>
Shape of drug_data: (25, 50)
First 5 rows and columns of drug_data:
 [[ 1.11745915 -0.14919196 -0.01756464  1.22165355  0.70067562]
 [-0.4261314  -0.02638244  2.24640919  0.06536914 -0.66625469]
 [ 0.62984536  0.7962644  -0.40535119 -0.00447747  0.3918278 ]
 [-0.3123208  -1.78569382 -1.18197636 -1.3745291  -0.26480228]
 [ 0.62347222 -0.0529662  -0.5338052   0.83604629 -0.09453399]]
plt.figure(figsize=(20, 6))
sns.boxplot(data=drug_data)
plt.xticks(rotation=90);
../_images/c64144c71820efdae9ded4c74865638435e542cfa81d0a4b1b9eb56a1b247d7a.png

PCA Implementation

  • Apply PCA transformation

  • Calculate explained variance

Apply PCA transformation

# Create the PCA instance and fit and transform the data with pca
pca = PCA()
pc = pca.fit_transform(drug_data)

## `fit()` in PCA:
# Calculates the principal components (eigenvectors) and their explained variances
# Determines the transformation matrix based on your data
# Stores these components internally in the PCA object
# Does not actually transform your data

## `transform()` in PCA:
# Projects your data onto the principal components using the previously calculated transformation matrix
# Actually reduces the dimensionality of your data
# Example: pca.transform(X_test) projects test data onto the same components learned from training data

## `fit_transform()` in PCA:
# Combines both operations: calculates principal components and then projects your data
# Equivalent to calling fit() followed by transform()
# More efficient than calling them separately
# Example: pca.fit_transform(X_train) learns components from training data and immediately projects it
print("Type of pc:", type(pc))
print("Shape of pc:", pc.shape)
print("First 5 rows and columns of pc:\n", pc[:5, :5])
Type of pc: <class 'numpy.ndarray'>
Shape of pc: (25, 25)
First 5 rows and columns of pc:
 [[ 3.64790724  3.78552267  1.33659718 -0.20864174 -2.17346721]
 [ 0.32125071 -5.9898321   4.18994535 -0.20697045  0.17194817]
 [ 3.83231009  1.28508967 -0.47749898 -2.19393142  3.98217509]
 [-5.63359232  3.69098317 -3.47828106 -0.8034397   0.44613498]
 [ 0.17402269  1.51797501  0.16606953  2.91234111  0.42116796]]

Note:

  • The original drug_data has 25 rows (observations), so pc also has 25 rows, one per patient

  • We didn’t specify the number of components (n_components) when creating the PCA object (pca = PCA()), so scikit-learn defaults to computing min(n_samples, n_features) components. Although the original dataset has 50 features, PCA therefore computes only 25 components, because the number of meaningful principal components is limited by the number of samples: you can’t find more independent directions of variance than you have data points (in fact, after centering, at most n_samples - 1 = 24 directions carry non-zero variance, which is why the variance of the last component is numerically zero).

The fit_transform method did two things:

  1. fit: It analyzed drug_data to find the 25 principal component directions (axes) based on the variance and correlations between your original features.

  2. transform: It then projected your original 25 samples from their original feature space (with 50 dimensions) onto this new coordinate system defined by the 25 principal components
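
As a small sanity check of these two steps (a sketch, not in the original notebook), fitting and then transforming separately should reproduce the projection returned by fit_transform:

# fit() followed by transform() should give the same result as fit_transform()
pca_two_step = PCA()
pc_two_step = pca_two_step.fit(drug_data).transform(drug_data)
print("Shapes match:", pc_two_step.shape == pc.shape)
print("Values match:", np.allclose(pc_two_step, pc))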

pc_df = pd.DataFrame(pc, columns=[f'PC_{i}' for i in range(1, pc.shape[1]+1)])
pc_df.head()
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 -0.442335 0.125566 ... 0.015509 0.053778 -0.106066 -0.353473 -0.059685 -0.655202 0.315228 -0.239004 0.257046 1.959170e-16
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 -0.674214 -0.811364 ... -0.724327 -0.503311 0.348717 -0.460284 -0.166123 0.121399 0.262114 0.042369 -0.096848 1.959170e-16
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 0.007497 0.070364 ... 0.080597 -0.748926 -0.429462 -0.277130 -0.028362 0.054581 0.009554 0.045150 0.040776 1.959170e-16
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 -1.076850 -1.359736 ... 0.029642 0.142202 0.806519 0.175962 -0.013190 -0.119514 -0.072060 0.053852 0.077116 1.959170e-16
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 -1.104615 -0.970554 ... 0.832998 -0.053671 -0.443638 -0.031360 -0.348642 0.422218 0.010565 -0.405376 -0.039028 1.959170e-16

5 rows × 25 columns

PCA context

  • PCA aims to summarize a large set of correlated variables with a smaller number of representative variables

  • The goal is to find a low-dimensional representation of the data that retains as much of the original variation as possible

  • Why focus on maximizing variance?

    • By capturing most of the variation in the first few uncorrelated PCs, PCA allows you to summarize a large set of potentially correlated variables with a smaller number of representative variables

    • This process can be viewed as identifying directions in the original feature space along which the data vary the most, or finding low-dimensional linear surfaces that are closest to the observations

    • This is valuable for dimension reduction and compression, especially when dealing with many attributes that might be redundant

  • By transforming the original correlated variables into a set of uncorrelated principal components, PCA effectively removes redundancy in the data

    • PCA transforms the original correlated variables into a new set of uncorrelated variables (the principal components), ordered by how much variance they explain. This allows you to potentially use a smaller subset of these new, uncorrelated components to represent the data while retaining most of the original variability

  • The first principal component is defined as the linear combination of the original features that has the largest sample variance

  • Subsequent principal components are then found such that they have the maximal variance out of all linear combinations that are uncorrelated with the preceding principal components

    • For example, the second principal component must be uncorrelated with the first, the third with the first two, and so on
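
The maximal-variance property described above can be illustrated with a short sketch (an addition, using the fitted pca object and the standardized drug_data array): no randomly chosen unit-norm linear combination of the drugs should have a larger variance than the first principal component.

# Sketch: PC1's variance versus the variance along random unit-norm directions
rng = np.random.default_rng(42)
random_dirs = rng.normal(size=(1000, drug_data.shape[1]))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)   # unit length
random_vars = (drug_data @ random_dirs.T).var(axis=0, ddof=1)
print("Largest variance over 1000 random directions:", round(random_vars.max(), 2))
print("Variance of the first principal component:   ", round(pca.explained_variance_[0], 2))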

drug_data_normalised = pd.DataFrame(drug_data, columns=data.columns[1:])
# Correlation heatmap
# plt.figure(figsize=(12, 10))

fig, axes = plt.subplots(1, 2, figsize=(20, 10))

mask = np.triu(np.ones_like(drug_data_normalised.corr(), dtype=bool))
sns.heatmap(drug_data_normalised.corr(), mask=mask, annot=False, cmap='coolwarm', 
            linewidths=0.5, vmin=-1, vmax=1, ax=axes[0])
axes[0].set_title('Correlation Heatmap (All Drugs)')

mask = np.triu(np.ones_like(pc_df.corr(), dtype=bool))
sns.heatmap(pc_df.corr(), mask=mask, annot=False, cmap='coolwarm', 
            linewidths=0.5, vmin=-1, vmax=1, ax=axes[1])
axes[1].set_title('Correlation Heatmap (PCA Components)')


plt.tight_layout()
    
plt.show()
plt.close()
../_images/db72dccdbc8a9915ceb381195ba82aaec2f1ed948f70ba530bba1296f2b5fdcd.png
fig, axes = plt.subplots(2, 1, figsize=(20, 10))

drug_data_normalised.plot(kind="box", title="Drug Sensitivity Boxplot", ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=90)  # Rotate x-labels for axes[0]
pc_df.plot(kind="box", title="PCA Components Boxplot", ax=axes[1])
plt.tight_layout()
../_images/c8977976b111bc0454129f42e5ec07b651fe355df44b2898f29f24e31f22b2a7.png

Note:

  • PCA successfully achieved its goal of decorrelating our dataset

  • Decorrelation:

    • The variance originally shared between features (correlated features) in the original dataset has been reorganized and captured along these new, independent PC axes

  • Correlated variables inherently contain overlapping information (shared variation)

  • When variables are correlated, it means their values tend to change together to some degree

    • This implies that they are capturing some of the same underlying patterns or variance in the data

    • Dealing with a large set of correlated variables suggests there might be redundancy

    • The correlation structure can be thought of as arising from shared underlying sources or latent variables.

  • PCA transforms these into uncorrelated components

    • The process of Principal Component Analysis explicitly finds a new set of variables - principal components (PCs), which are linear combinations of the original features

    • A key property of these principal components is that they are uncorrelated with each other

    • The second principal component is defined as having maximal variance out of all linear combinations that are uncorrelated with the first principal component, and this extends to subsequent components

  • Decorrelation ensures each PC captures a distinct aspect of variability:

    • Because each subsequent principal component is constrained to be uncorrelated with the ones already found, it is forced to capture variation in a direction that is distinct from the directions represented by the earlier components

    • The first PC captures the most variance. The second PC captures the most variance that is not captured by the first PC (due to the uncorrelated constraint), and so on.

    • This ensures that each PC adds new, non-overlapping information about the data’s variability, allowing the first few components to collectively explain a “sizable amount of the variation” and provide a lower-dimensional representation that retains “as much of the information as possible”
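
As a numerical complement to the heatmaps above (a sketch added here), the largest absolute correlation between any two distinct PCs should be essentially zero; the last component is excluded because its variance is numerically zero, which makes its correlation undefined in practice.

# Sketch: quantify the decorrelation of the PC scores
meaningful_pcs = pc_df.loc[:, pc_df.var() > 1e-10]    # drop the numerically degenerate last PC
pc_corr = meaningful_pcs.corr().to_numpy()
off_diagonal = np.abs(pc_corr[~np.eye(pc_corr.shape[0], dtype=bool)])
print("Largest |correlation| between distinct PCs:", off_diagonal.max())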

pc_df.head()
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_16 PC_17 PC_18 PC_19 PC_20 PC_21 PC_22 PC_23 PC_24 PC_25
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 -0.442335 0.125566 ... 0.015509 0.053778 -0.106066 -0.353473 -0.059685 -0.655202 0.315228 -0.239004 0.257046 1.959170e-16
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 -0.674214 -0.811364 ... -0.724327 -0.503311 0.348717 -0.460284 -0.166123 0.121399 0.262114 0.042369 -0.096848 1.959170e-16
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 0.007497 0.070364 ... 0.080597 -0.748926 -0.429462 -0.277130 -0.028362 0.054581 0.009554 0.045150 0.040776 1.959170e-16
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 -1.076850 -1.359736 ... 0.029642 0.142202 0.806519 0.175962 -0.013190 -0.119514 -0.072060 0.053852 0.077116 1.959170e-16
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 -1.104615 -0.970554 ... 0.832998 -0.053671 -0.443638 -0.031360 -0.348642 0.422218 0.010565 -0.405376 -0.039028 1.959170e-16

5 rows × 25 columns

# Calculate cumulative variance = cumulative proportion of variance explained by the principal components
variances = pc_df.var(axis=0)
cumulative_variance = variances.cumsum() / variances.sum()

# Plot cumulative variance
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Variance Explained by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.grid()
plt.show()
../_images/aa390baf64a7a0cac1f4fb2d5aa36871e963041e3f705469542fcbccd55b5388.png

Explained variance:

  • PCA seeks a low-dimensional representation of a dataset that captures as much as possible of the variation in the original data

  • Variance Explained by a principal component (PC):

    • Each PC captures a specific amount of variance (this is also known as variance explained by each PC)

    • For a specific PC, say the m-th one, its explained variance is essentially the variance of the scores along that PC when the data points are projected onto the principal components

  • Relationship between PC Variance and Total Variance:

    • A key property is that the sum of the variances of all principal components equals the total variance of the original data

    • This means maximizing the variance of the first n principal components is equivalent to minimizing the reconstruction error when approximating the data with those n components

  • Proportion of Variance Explained - PVE (Explained Variance Ratio):

    • The PVE of the mth principal component is calculated as the variance of the mth principal component divided by the total variance in the original data

    • The PVEs for all principal components (up to min(n-1, p)) sum to one

  • Cumulative PVE:

    • The cumulative PVE of the principal components is simply the sum of the PVEs for those components
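
These definitions map directly onto scikit-learn's attributes; the brief sketch below (added for reference) recomputes the PVE from the per-component variances and checks it against explained_variance_ratio_, which is used in the next cells.

# Sketch: PVE_m = variance of PC_m divided by the total variance
pve_manual = pca.explained_variance_ / pca.explained_variance_.sum()
print("Matches explained_variance_ratio_:", np.allclose(pve_manual, pca.explained_variance_ratio_))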

## Access the explained variance directly
explained_variance = pca.explained_variance_
print("Explained variance by component:", explained_variance)
Explained variance by component: [2.06118885e+01 9.46481592e+00 5.66619038e+00 4.25914905e+00
 2.86483250e+00 1.91158440e+00 1.59439385e+00 1.22821102e+00
 9.13563089e-01 8.09987389e-01 5.90796907e-01 4.07254006e-01
 3.73077915e-01 2.98129789e-01 2.48097781e-01 2.31592814e-01
 1.58434105e-01 1.29071221e-01 8.11724160e-02 7.48451289e-02
 6.29952247e-02 5.23843184e-02 3.73057774e-02 1.35597766e-02
 3.99827721e-32]
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained variance ratio by component:", explained_variance_ratio)
Explained variance ratio by component: [3.95748260e-01 1.81724466e-01 1.08790855e-01 8.17756618e-02
 5.50047841e-02 3.67024205e-02 3.06123619e-02 2.35816516e-02
 1.75404113e-02 1.55517579e-02 1.13433006e-02 7.81927692e-03
 7.16309597e-03 5.72409195e-03 4.76347739e-03 4.44658203e-03
 3.04193482e-03 2.47816745e-03 1.55851039e-03 1.43702647e-03
 1.20950831e-03 1.00577891e-03 7.16270926e-04 2.60347711e-04
 7.67669224e-34]
# Option 3: Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
print("Cumulative explained variance:", cumulative_explained_variance)
Cumulative explained variance: [0.39574826 0.57747273 0.68626358 0.76803924 0.82304403 0.85974645
 0.89035881 0.91394046 0.93148087 0.94703263 0.95837593 0.96619521
 0.9733583  0.9790824  0.98384587 0.98829246 0.99133439 0.99381256
 0.99537107 0.99680809 0.9980176  0.99902338 0.99973965 1.
 1.        ]

Visualize the explained variance

# Visualize the explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, label='Individual explained variance')
plt.step(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance threshold')
plt.axvline(x=10, color='r', linestyle='--')
plt.axhline(y=0.90, color='r', linestyle='dotted', label='90% explained variance threshold')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
../_images/3ad4ea091a5a9b2196091c065d01ed8de3b5292c340fecd767ddf82b5b27b201.png
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
plt.title('Cumulative Variance Explained by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid()
plt.show()
../_images/d060c8ff8d7786355365430d6d182f1458824c32639dce6b924530db5b6493db.png

Determine optimal number of components

  • Variance Threshold

  • Scree Plot (Elbow Method)

Variance Threshold

  • Find number of components that explain at least 95% (or 99%) of variance

  • Selects components that collectively explain a predefined percentage (e.g., 95%) of total variance

  • Ensures you keep enough information while reducing dimensions

  • Provides a clear cutoff criterion that doesn’t require subjective judgment (Automated selection)

n_components_95 = np.argmax(cumulative_explained_variance >= 0.95)
# np.argmax(): returns the index of the first occurrence of the maximum value in the array;
# for a boolean array this is the first True, i.e., the first component at which the threshold is reached.

print(f"Number of components for 95% variance: {n_components_95 + 1}")  # +1 because index starts at 0
Number of components for 95% variance: 11

Scree Plot

  • Helps identify the point where additional components yield diminishing returns (Visual identification)

    • Balances model complexity against information retention

  • The “elbow” often marks where principal components transition from capturing signal to capturing noise (Noise reduction)

Logic Behind Using a Scree Plot:

  1. Understanding Eigenvalues

  • Eigenvalues are the foundation of scree plots. In PCA, eigenvalues represent the amount of variance captured by each principal component

  • Eigenvalue associated with a specific eigenvector is equal to the variance of the corresponding principal component

  • Specifically:

    • Each eigenvalue corresponds to one principal component

    • The magnitude of an eigenvalue indicates how much variance that component explains

    • Eigenvalues are measured in the same units as the original data’s variance (When PCA is performed on the standardized data, eigenvalues are unitless because standardization removes units)

    • The sum of all eigenvalues equals the total variance in the dataset

  2. The Scree Plot Displays Eigenvalues

  • A scree plot displays the eigenvalues (not directly the proportion of variance explained) on the Y-axis for each principal component on the X-axis

  • The eigenvalues are plotted in descending order, creating the characteristic “scree” or slope appearance.

  3. Relationship to Proportion of Variance Explained (PVE):

  • PVE for the m-th principal component = (Eigenvalue of m-th component) / (Sum of all eigenvalues)

  • Together, all principal components explain 100% of the variance (sum of all PVEs = 1)

  4. Identifying the “Elbow”

  • The visual logic involves examining the scree plot and looking for a point where the eigenvalues drop off noticeably

  • This sharp drop-off is often referred to as an “elbow” in the plot

  • Since eigenvalues directly represent variance amounts, a steep decline indicates that subsequent components capture significantly less variance.

  5. Interpreting the Elbow

    • The idea is that the components before the elbow have large eigenvalues and therefore capture a “sizable amount of the variation,” representing the key dimensions along which the data vary the most

    • Components after the elbow have small eigenvalues and explain significantly less additional variance, making them less “interesting” or necessary for understanding the main patterns in the data.

  6. By choosing the number of components up to the elbow, you aim to retain the minimum number of components required to explain a substantial portion of the total variance while achieving dimensionality reduction.

Important Considerations:

  • This method of choosing the number of components by inspecting the scree plot is “ad hoc” and “inherently subjective”

  • There isn’t a single, universally accepted objective rule based solely on the scree plot

  • The decision often depends on the specific application and the dataset

  • In practice, analysts might look at the first few components for interesting patterns and continue examining subsequent ones until no further insights are gained

  • Alternative methods include using cumulative proportion of variance explained (e.g., retaining components that explain 80-90% of total variance) or cross-validation approaches

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'o-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue (Variance)')
plt.grid(True)
plt.show()
../_images/18e7566391389df74808f023cc93946de2f0c92fff9b6036a34f5b7a6497f16f.png

Automated Selection with PCA

  • Instead of manually determining the number of components, you can initialize PCA with a variance threshold:

# Automatically select components to explain 90% of variance
optimal_pca = PCA(n_components=0.90)  # Keep enough components to explain 90% of variance
pc_auto = optimal_pca.fit_transform(drug_data)
print(f"Number of components selected: {optimal_pca.n_components_}")

optimal_pca_df = pd.DataFrame(
    data=pc_auto,
    columns=[f'PC{i+1}' for i in range(pc_auto.shape[1])]
)
optimal_pca_df['Patient ID'] = data["Patient ID"]

print(f"Optimal PCA DataFrame: \n{optimal_pca_df.head()}")
Number of components selected: 8
Optimal PCA DataFrame: 
        PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
0  3.647907  3.785523  1.336597 -0.208642 -2.173467 -0.159574 -0.201224   
1  0.321251 -5.989832  4.189945 -0.206970  0.171948  0.960831 -0.520194   
2  3.832310  1.285090 -0.477499 -2.193931  3.982175 -0.967499 -0.351455   
3 -5.633592  3.690983 -3.478281 -0.803440  0.446135  0.610747 -2.052404   
4  0.174023  1.517975  0.166070  2.912341  0.421168 -0.887175 -1.127990   

        PC8    Patient ID  
0 -0.175144  TCGA-D8-A1JU  
1  1.106246  TCGA-AC-A3QQ  
2  0.389307  TCGA-C8-A12Q  
3  0.316239  TCGA-AR-A1AY  
4 -1.452414  TCGA-A8-A0A2  

Feature Loadings

  • Feature Loadings are

    • the weights that transform original, potentially correlated variables into a new set of principal components

    • i.e., contribution of each original feature to each principal component

  • Shows which original features most strongly influence each principal component - Feature influence

  • Helps interpret what each principal component represents - Component interpretation

  • Identifies which original features are most important for your dataset’s structure - Feature selection

  • Reveals relationships between features in your high-dimensional space - Dimensionality insights

Loading and correlation:

  • Loadings are not correlations themselves - they are the coefficients (weights) in the linear combination that defines the principal component

  • In PCA (with standardized data), loadings are proportional to correlations between original variables and PC scores

Positive Loading (e.g., +0.7):

  • This indicates a positive correlation

  • When the score of the principal component increases, the value of the original feature also tends to increase. Conversely, when the PC score decreases, the feature’s value tends to decrease.

Negative Loading (e.g., -0.6):

  • This indicates a negative correlation

  • When the score of the principal component increases, the value of the original feature tends to decrease. Conversely, when the PC score decreases, the feature’s value tends to increase

High absolute values (positive or negative) indicate strong influence on that component. A loading of 0 means no influence
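
Before extracting the loadings, here is a short check (a sketch, not part of the original notebook) of the proportionality mentioned above: for the standardized data, the correlation of each drug with the PC_1 scores divided by that drug's PC_1 loading should come out as (numerically) the same constant for every drug.

# Sketch: correlations with PC_1 are proportional to the PC_1 loadings
pc1_scores = pc[:, 0]
corr_with_pc1 = np.array([np.corrcoef(drug_data[:, j], pc1_scores)[0, 1]
                          for j in range(drug_data.shape[1])])
ratio = corr_with_pc1 / pca.components_[0]
print("corr/loading ratio: min =", round(ratio.min(), 4), "max =", round(ratio.max(), 4))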

loadings = optimal_pca.components_

# Get feature names (assuming you have them in a list or as column names)
feature_names = data.iloc[:, 1:].columns  # Or your list of feature names

# Create a DataFrame with the loadings
loadings_df = pd.DataFrame(
    loadings,
    columns=feature_names,
    index=[f'PC_{i+1}' for i in range(loadings.shape[0])]
)

loadings_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, PC_1 to PC_8
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   bendamustine                    8 non-null      float64
 1   ML320                           8 non-null      float64
 2   BRD-K14844214                   8 non-null      float64
 3   leptomycin B                    8 non-null      float64
 4   Compound 23 citrate             8 non-null      float64
 5   BRD4132                         8 non-null      float64
 6   dabrafenib                      8 non-null      float64
 7   necrosulfonamide                8 non-null      float64
 8   PF-543                          8 non-null      float64
 9   KX2-391                         8 non-null      float64
 10  ELCPK                           8 non-null      float64
 11  carboplatin                     8 non-null      float64
 12  SB-525334                       8 non-null      float64
 13  CIL41                           8 non-null      float64
 14  belinostat                      8 non-null      float64
 15  Compound 7d-cis                 8 non-null      float64
 16  lapatinib                       8 non-null      float64
 17  tacrolimus                      8 non-null      float64
 18  pifithrin-mu                    8 non-null      float64
 19  RG-108                          8 non-null      float64
 20  BRD-K97651142                   8 non-null      float64
 21  NVP-BEZ235                      8 non-null      float64
 22  BRD-K01737880                   8 non-null      float64
 23  pluripotin                      8 non-null      float64
 24  GW-405833                       8 non-null      float64
 25  masitinib                       8 non-null      float64
 26  YM-155                          8 non-null      float64
 27  MLN2238                         8 non-null      float64
 28  birinapant                      8 non-null      float64
 29  RAF265                          8 non-null      float64
 30  BRD-K27188169                   8 non-null      float64
 31  PIK-93                          8 non-null      float64
 32  N9-isopropylolomoucine          8 non-null      float64
 33  myricetin                       8 non-null      float64
 34  epigallocatechin-3-monogallate  8 non-null      float64
 35  ceranib-2                       8 non-null      float64
 36  BRD-K66532283                   8 non-null      float64
 37  elocalcitol                     8 non-null      float64
 38  RO4929097                       8 non-null      float64
 39  BRD-K02251932                   8 non-null      float64
 40  Compound 1541A                  8 non-null      float64
 41  pevonedistat                    8 non-null      float64
 42  ISOX                            8 non-null      float64
 43  tosedostat                      8 non-null      float64
 44  AT13387                         8 non-null      float64
 45  BRD-A02303741                   8 non-null      float64
 46  BRD-K99006945                   8 non-null      float64
 47  GSK525762A                      8 non-null      float64
 48  necrostatin-7                   8 non-null      float64
 49  nakiterpiosin                   8 non-null      float64
dtypes: float64(50)
memory usage: 3.2+ KB
print("Contribution of each original feature to each principal component")
loadings_df
Contribution of each original feature to each principal component
bendamustine ML320 BRD-K14844214 leptomycin B Compound 23 citrate BRD4132 dabrafenib necrosulfonamide PF-543 KX2-391 ... Compound 1541A pevonedistat ISOX tosedostat AT13387 BRD-A02303741 BRD-K99006945 GSK525762A necrostatin-7 nakiterpiosin
PC_1 0.188893 0.158907 -0.014409 0.184616 0.188557 0.148854 0.169982 0.195003 0.114682 0.186589 ... 0.013166 0.186654 0.211779 0.153139 0.186287 0.091061 -0.092112 0.125780 0.127629 0.184714
PC_2 0.130845 -0.140145 -0.233375 0.073321 -0.021347 0.107959 -0.080697 -0.013266 -0.024182 0.121842 ... -0.272325 0.094608 0.014560 0.064277 0.026162 -0.255982 -0.258407 -0.169401 0.188433 0.083508
PC_3 -0.021312 -0.084223 0.186021 0.116362 -0.187231 0.127776 -0.129210 -0.165453 0.120737 0.124947 ... 0.026282 -0.007923 -0.035613 0.082174 -0.057860 -0.049318 0.141969 -0.145518 -0.050762 0.125907
PC_4 0.002208 0.054969 -0.168718 0.204135 0.007250 -0.127992 0.012934 -0.043828 -0.228499 0.082702 ... -0.175538 0.146641 -0.057821 0.019380 0.129659 -0.015643 0.014241 0.011453 -0.116959 0.151041
PC_5 0.032004 0.169326 -0.154553 -0.061343 -0.063709 0.255198 0.039662 -0.012972 -0.180148 -0.060818 ... -0.012056 0.095784 -0.037420 -0.213437 -0.050259 0.151289 0.119979 -0.133028 -0.047117 -0.040750
PC_6 0.026600 -0.001502 -0.203015 0.044929 0.140286 0.042471 0.153289 -0.052095 0.269030 -0.099810 ... -0.158710 -0.120513 -0.109148 0.034131 0.272596 0.106136 -0.022863 -0.083687 0.295653 -0.157441
PC_7 0.110011 0.061453 -0.006347 0.049683 -0.096940 0.049480 -0.080599 0.157775 -0.281277 -0.054539 ... 0.088570 0.008416 -0.082869 0.363718 -0.125535 0.122516 -0.145911 -0.100268 0.119357 0.078340
PC_8 -0.007307 -0.195364 -0.010461 0.000593 0.019415 0.083812 0.053971 -0.126673 0.113933 -0.000942 ... -0.087146 0.140472 -0.021129 -0.089389 -0.021654 0.044740 -0.014215 0.338453 -0.170848 0.043675

8 rows × 50 columns

Visualization - Feature Loadings

# Create a heatmap
plt.figure(figsize=(12, 8))
heatmap = sns.heatmap(
    loadings_df, 
    cmap='coolwarm',
    center=0,
    annot=False,
    fmt=".2f",
    linewidths=.5
)
plt.title('PCA Feature Loadings')
plt.tight_layout()
plt.show()
../_images/555d2bb8e7cda6a8d45b5794c47d8c310f955c830b5c23d98758d4ca28a8bc55.png
# Select top n contributing features for each component
def get_top_features(loadings_df, n=10):
    top_features = pd.DataFrame()
    
    for pc in loadings_df.index:
        pc_loadings = loadings_df.loc[pc].abs().sort_values(ascending=False)
        top_features[pc] = pc_loadings.index[:n]
        
    return top_features

top_features = get_top_features(loadings_df)
print("Top contributing features for each principal component:")
top_features
Top contributing features for each principal component:
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8
0 ISOX lapatinib NVP-BEZ235 pluripotin CIL41 RO4929097 birinapant SB-525334
1 necrosulfonamide Compound 1541A RO4929097 N9-isopropylolomoucine birinapant necrostatin-7 tosedostat GSK525762A
2 ceranib-2 BRD-K99006945 BRD-K02251932 SB-525334 NVP-BEZ235 AT13387 BRD-K66532283 RO4929097
3 MLN2238 BRD-A02303741 elocalcitol PF-543 BRD4132 PF-543 PF-543 myricetin
4 BRD-K66532283 RAF265 BRD-K27188169 birinapant RG-108 tacrolimus myricetin BRD-K97651142
5 Compound 7d-cis BRD-K14844214 YM-155 masitinib lapatinib CIL41 GW-405833 ML320
6 bendamustine PIK-93 BRD-K01737880 leptomycin B tosedostat carboplatin BRD-K97651142 epigallocatechin-3-monogallate
7 Compound 23 citrate carboplatin ELCPK BRD-K02251932 RAF265 BRD-K27188169 RO4929097 necrostatin-7
8 GW-405833 BRD-K97651142 masitinib PIK-93 belinostat BRD-K14844214 RG-108 N9-isopropylolomoucine
9 pevonedistat myricetin CIL41 myricetin PIK-93 elocalcitol necrosulfonamide BRD-K01737880

K-means Cluster analysis

Logic - Minimizing Variation:

  • The core logic of K-means is to find a partition that minimizes the within-cluster variation.

  • The algorithm aims to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is minimized.

  • The objective being minimized is the sum of squared distances between each point and the centroid of its assigned cluster

Inertia (within-cluster sum-of-squares)

  • Inertia is a mathematical measure that quantifies how tightly grouped the data points are within their assigned clusters

  • Definition: Inertia is the total within-cluster sum of squares. It is a measure of how internally coherent the clusters are

  • K-means clustering uses an iterative optimization strategy to minimize inertia

  • As K increases, inertia always decreases or stays the same; it never increases

  • The rate of decrease typically follows a curve that looks like this:

    • Rapid decrease initially (adding the first few clusters)

    • Gradually diminishing returns as K continues to increase

    • Eventually, minimal improvements with additional clusters

Why This Happens?

  • With more clusters, each data point can be assigned to a centroid that’s closer to it

  • When K=1: Maximum inertia (all points compared to global mean)

  • When K=n (number of data points): Zero inertia (each point is its own cluster)

  • The first few clusters capture the major structure in the data, while additional clusters only capture finer details (Diminishing returns)
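
Before turning to the elbow method, here is a quick numerical illustration of the two extremes mentioned above (a sketch using the standardized drug_data array from earlier):

# Sketch: inertia at K=1 equals the total sum of squares around the global mean,
# and at K = n_samples it collapses to (numerically) zero
total_ss = ((drug_data - drug_data.mean(axis=0)) ** 2).sum()
print("Total sum of squares:", round(float(total_ss), 2))
print("Inertia at K=1: ",
      round(KMeans(n_clusters=1, random_state=42, n_init=10).fit(drug_data).inertia_, 2))
print("Inertia at K=25:",
      round(KMeans(n_clusters=25, random_state=42, n_init=10).fit(drug_data).inertia_, 2))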

The Elbow Method

This behavior forms the basis of the popular “elbow method” for determining the optimal number of clusters:

  1. Plot inertia against K (for K=1,2,3…)

  2. Look for the “elbow point” where the curve bends sharply

  3. This point represents where adding more clusters stops providing significant reduction in inertia

Important Considerations:

  1. Inertia will always decrease as K increases, even if you’re adding meaningless clusters

  2. This is why we look for the elbow point rather than simply minimizing inertia

Silhouette Score

  • Evaluating clustering quality using inertia alone has some limitations

    • It measures how similar each point is to its own cluster (cohesion)

    • but it does not evaluate how different that point is from other clusters (separation)

  • The Silhouette Score measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation)

  • While Inertia is the objective function that the K-means algorithm directly tries to minimize during its iterative process, the Silhouette Score is a measure used to evaluate the quality of the clustering result after the algorithm has finished. It is not part of the K-means algorithm’s iterative assignment and update steps.

Is each data point placed in the right cluster?

  • How close a point is to the other points in its assigned cluster (cohesion, which inertia captures through distances to the cluster centroid)

  • How close that same point is to points in the nearest neighboring cluster (separation, which the silhouette score additionally takes into account)
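
By the standard definition, for a single point i, a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the point's silhouette value is then s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the score returned by silhouette_score is the average of s(i) over all points.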

Interpretation of Silhouette Score

  • The silhouette score ranges from -1 to 1:

    • Close to +1: The point is well matched to its own cluster and far from neighboring clusters

    • Close to 0: The point lies near the boundary between two clusters

    • Close to -1: The point might be assigned to the wrong cluster

  • The overall silhouette score of a clustering is simply the average of the individual silhouette scores of all points.

Finding Optimal K Using Silhouette

Unlike inertia which always decreases as K increases, the silhouette score typically:

  • Increases as K approaches the natural number of clusters

  • Reaches a maximum at or near the optimal K

  • Decreases as K becomes too large

  • This makes it particularly useful for determining the optimal number of clusters - you simply choose the K value that maximizes the average silhouette score.

# 1. Determine optimal number of clusters using the Elbow method

inertia = []
silhouette_scores = []
k_range = range(2, 11)  # Testing from 2 to 10 clusters

# Exclude the Patient ID column for clustering
pca_data = optimal_pca_df.drop('Patient ID', axis=1)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(pca_data)
    inertia.append(kmeans.inertia_)
    
    # Calculate silhouette score (only valid for k >= 2)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(pca_data, labels))
    print(f"Silhouette score for {k} clusters: {silhouette_scores[-1]}")
Silhouette score for 2 clusters: 0.315860568817066
Silhouette score for 3 clusters: 0.23399486593926483
Silhouette score for 4 clusters: 0.1923227376052971
Silhouette score for 5 clusters: 0.21280378841022102
Silhouette score for 6 clusters: 0.2215781206218201
Silhouette score for 7 clusters: 0.18825389524527888
Silhouette score for 8 clusters: 0.16503544411475235
Silhouette score for 9 clusters: 0.13979820154212966
Silhouette score for 10 clusters: 0.13772690406314397
# Plot the Elbow curve
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)

# Plot silhouette scores
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different k')
plt.grid(True)
plt.tight_layout()
plt.show()
../_images/2d3fa3dc2d2d4eade73a96b9f6a7b0c94eb5fc17c7af6d709ad9d4b51afa5292.png
# 2. Apply K-means with the optimal k
optimal_k = 3  # optimal k from the plots
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# Fit K-means with the optimal number of clusters
clusters = kmeans.fit(pca_data)

print(f"Number of clusters: {optimal_k}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Cluster labels for each patient sample: {clusters.labels_}, shape: {clusters.labels_.shape}")
Number of clusters: 3
Cluster centers:
[[ 2.42255496  1.31481132  0.22658619  0.31864774  0.11053081 -0.05490954
  -0.12686288 -0.10959923]
 [-1.34671408 -3.70301852 -0.57098639 -0.42079775 -0.54593247 -0.09185375
   0.23802588  0.04913763]
 [-8.97044194  2.06631993  0.19937064 -0.61137725  0.72118837  0.48887309
   0.0789207   0.43334165]]
Cluster labels for each patient sample: [0 1 0 2 0 1 0 0 0 0 0 1 0 0 0 2 0 1 0 1 0 2 1 1 0], shape: (25,)
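
As a follow-up to the earlier question of whether each point sits in the right cluster, this optional sketch (not in the original notebook) inspects the per-sample silhouette values of the K=3 solution; their mean equals the overall silhouette score, and negative values would flag potentially misassigned patients.

# Optional sketch: per-sample silhouette values for the final clustering
from sklearn.metrics import silhouette_samples

sample_silhouettes = silhouette_samples(pca_data, clusters.labels_)
print("Mean silhouette:", round(sample_silhouettes.mean(), 3))
print("Samples with silhouette < 0:", int((sample_silhouettes < 0).sum()))
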
# 3. Add cluster labels to the dataframe
optimal_pca_df['Cluster'] = clusters.labels_
print("First 5 rows of PCA DataFrame with cluster labels:")
optimal_pca_df.head()
First 5 rows of PCA DataFrame with cluster labels:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Patient ID Cluster
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 TCGA-D8-A1JU 0
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 TCGA-AC-A3QQ 1
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 TCGA-C8-A12Q 0
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 TCGA-AR-A1AY 2
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 TCGA-A8-A0A2 0
# Basic cluster analysis
print("Cluster distribution:")
print(optimal_pca_df['Cluster'].value_counts())
Cluster distribution:
Cluster
0    15
1     7
2     3
Name: count, dtype: int64
optimal_pca_df.groupby(["Cluster", "Patient ID"]).size().reset_index(name='Counts').drop(columns='Counts')
Cluster Patient ID
0 0 TCGA-A8-A07G
1 0 TCGA-A8-A08Z
2 0 TCGA-A8-A0A2
3 0 TCGA-AN-A0XW
4 0 TCGA-BH-A0AU
5 0 TCGA-BH-A0DV
6 0 TCGA-BH-A18K
7 0 TCGA-BH-A1F8
8 0 TCGA-C8-A12Q
9 0 TCGA-D8-A1J8
10 0 TCGA-D8-A1JN
11 0 TCGA-D8-A1JU
12 0 TCGA-EW-A1P1
13 0 TCGA-EW-A3E8
14 0 TCGA-LD-A7W6
15 1 TCGA-A2-A0CX
16 1 TCGA-A2-A0CY
17 1 TCGA-AC-A3QQ
18 1 TCGA-AO-A12F
19 1 TCGA-AR-A1AH
20 1 TCGA-E2-A10C
21 1 TCGA-E2-A1LL
22 2 TCGA-AR-A1AY
23 2 TCGA-EW-A3U0
24 2 TCGA-OL-A5D7
# Calculate mean of each feature for each cluster
cluster_means = optimal_pca_df.drop(columns="Patient ID").groupby('Cluster').mean()
print("\nCluster centers in PCA space:")
print(cluster_means)
Cluster centers in PCA space:
              PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
Cluster                                                                         
0        2.422555  1.314811  0.226586  0.318648  0.110531 -0.054910 -0.126863   
1       -1.346714 -3.703019 -0.570986 -0.420798 -0.545932 -0.091854  0.238026   
2       -8.970442  2.066320  0.199371 -0.611377  0.721188  0.488873  0.078921   

              PC8  
Cluster            
0       -0.109599  
1        0.049138  
2        0.433342  
# 4. Visualize clusters using first two principal components
plt.figure(figsize=(10, 8))
for cluster in range(optimal_k):
    cluster_points = optimal_pca_df[optimal_pca_df['Cluster'] == cluster]
    plt.scatter(
        cluster_points['PC1'], 
        cluster_points['PC2'],
        label=f'Cluster {cluster}',
        alpha=0.7,
        s=80
    )
    for i, row in cluster_points.iterrows():
        plt.text(row['PC1'], row['PC2'], str(row['Patient ID']), fontsize=8, alpha=0.7)

centers = kmeans.cluster_centers_
plt.scatter(
    centers[:, 0], 
    centers[:, 1], 
    s=200, 
    marker='.', 
    c='red', 
    label='Centroids'
)


plt.legend(title='Cluster', fontsize=10, title_fontsize=12, loc='lower left')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Clusters Visualized on First Two Principal Components')
plt.grid(True)
plt.tight_layout()
plt.show()
../_images/2fba9c971a763b91dd2cb5559818d5107f42c3c2011f4e3052bf3fbe001a29b1.png
optimal_pca_df.head()
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Patient ID Cluster
0 3.647907 3.785523 1.336597 -0.208642 -2.173467 -0.159574 -0.201224 -0.175144 TCGA-D8-A1JU 0
1 0.321251 -5.989832 4.189945 -0.206970 0.171948 0.960831 -0.520194 1.106246 TCGA-AC-A3QQ 1
2 3.832310 1.285090 -0.477499 -2.193931 3.982175 -0.967499 -0.351455 0.389307 TCGA-C8-A12Q 0
3 -5.633592 3.690983 -3.478281 -0.803440 0.446135 0.610747 -2.052404 0.316239 TCGA-AR-A1AY 2
4 0.174023 1.517975 0.166070 2.912341 0.421168 -0.887175 -1.127990 -1.452414 TCGA-A8-A0A2 0
features_to_analyze = optimal_pca_df.columns.drop(['Patient ID', 'Cluster'])
features_to_analyze
Index(['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8'], dtype='object')
cluster_stats = optimal_pca_df.groupby('Cluster')[['PC1', 'PC2', 'PC3']].agg(['mean', 'std', 'min', 'max'])

print("Descriptive Statistics for Each Cluster:")
cluster_stats
Descriptive Statistics for Each Cluster:
PC1 PC2 PC3
mean std min max mean std min max mean std min max
Cluster
0 2.422555 2.669334 -1.042854 6.184113 1.314811 1.812225 -2.445795 3.785523 0.226586 1.816583 -3.150876 4.187933
1 -1.346714 2.241274 -5.334288 0.359772 -3.703019 2.310308 -7.265999 -1.123308 -0.570986 3.293631 -5.974913 4.189945
2 -8.970442 3.365790 -12.364445 -5.633592 2.066320 2.685685 -1.033643 3.690983 0.199371 3.280955 -3.478281 2.826122

Overall Observations:

  • PC1 and PC2 seem to be more effective at differentiating the clusters based on their mean values than PC3

PC1 (Captures the most variance in your original data):

  • Separation (mean):

    • PC1 strongly differentiates all three clusters based on their mean values

    • Cluster 0 has a positive mean (2.423), indicating samples here score high on the trait(s) PC1 represents

    • Cluster 1 has a moderately negative mean (-1.347)

    • Cluster 2 has a very strongly negative mean (-8.970), indicating samples in this cluster score very low on trait(s) PC1 represents

  • Spread (std):

    • Cluster 0 has an intermediate spread (std = 2.670)

    • Cluster 1 is the most compact (homogeneous) along PC1 (std = 2.241) relative to Clusters 0 and 2

    • Cluster 2 is the most spread out (heterogeneous) along PC1 (std = 3.366)

PC2 (Captures the second most variance):

  • Separation (mean):

    • PC2 is also effective at differentiating the clusters, particularly separating Cluster 1 from Clusters 0 and 2.

    • Cluster 1 has a strongly negative mean (-3.703).

    • Cluster 0 (1.315) and Cluster 2 (2.066) both have positive mean PC2 scores, with Cluster 2’s mean being slightly higher.

  • Spread (std):

    • Cluster 0 is the most compact along PC2 (std = 1.812).

    • Cluster 1 has an intermediate spread (std = 2.310).

    • Cluster 2 is the most spread out along PC2 (std = 2.686).

Cluster 0:

  • Patients in this cluster strongly exhibit the characteristics represented by the positive direction of PC1 and moderately by the positive direction of PC2

  • They are quite consistent in their PC2 (and PC3) values

Cluster 1:

  • Patients are strongly characterized by the negative direction of PC2 and moderately by the negative direction of PC1

  • They are quite consistent in their PC1 values (relatively)

Cluster 2:

  • Patients are strongly characterized by the negative direction of PC1 and by the strongly positive direction of PC2

  • They are quite spread out in their PC1 and PC2 values