Unsupervised Learning

Q & A session: Principal component analysis (PCA) and K-means clustering

Instructor note

Participants were given 45-60 minutes to
- go through the Jupter notebook and
- select correct answers for the questions
Instructor narrate the answers and reasoning after the self-study time
- time: 30 minutes

Dimensionality of the datasets

Questions

Question 1:

You have a dataset with 50 columns (drug compounds) and 25 rows (patients). Is this dataset considered high-dimensional?

A) No, because 50 columns is not a large number
B) No, because the dataset is too small overall
C) Yes, because there are more features (50) than samples (25)
D) Yes, because 25 patients is insufficient for any analysis

Question 2:

Why is your dataset with 50 drug compounds and 25 patients considered high-dimensional?

A) Because 50 is the threshold for high-dimensionality
B) Because the number of features (50) significantly exceeds the number of samples (25)
C) Because drug compound data is inherently high-dimensional
D) Because 25 patients represent too many medical conditions

Question 3:

In your dataset, the ratio of features to samples is 2:1 (50 features to 25 samples). This high-dimensional characteristic is most likely to cause:

A) Faster model training due to more information per patient
B) Better generalization because of the rich feature space
C) The “curse of dimensionality” and potential overfitting issues
D) Automatic feature selection by machine learning algorithms

Standardization

Questions

Before (A) / After standardization plots (B):

A. alt text

B. alt text

Question 1: Scale Uniformity After Standardization:

Comparing the original drug sensitivity scores (A) with the standardized versions (B), what is the most important change for PCA analysis?

A) The standardized data has fewer outliers than the original data
B) All drug compounds now have similar scales (roughly -3 to +3) instead of vastly different ranges
C) The standardized data shows stronger correlations between compounds
D) The standardized data has reduced the total number of features

Question 2: PCA Component Interpretation:

After standardization, all drug compounds now have approximately the same variance. How will this affect your PCA results compared to using the original unstandardized data?

A) PC1 will still be dominated by the originally high-variance compounds like leptoporin B
B) PCA components will now reflect actual biological/chemical relationships rather than just scale differences
C) PCA will find fewer meaningful components due to the uniform scaling
D) The explained variance percentages will be identical to the unstandardized analysis

Question 3: Practical PCA Decision:

You’re about to perform PCA for drug discovery research. Based on your standardization comparison, which approach would give you more interpretable principal components for identifying drug response patterns?

A) Use original data because it preserves the natural measurement scales of each drug
B) Use standardized data because it allows PCA to focus on underlying biological patterns rather than measurement scale artifacts
C) It doesn’t matter - PCA will automatically handle scale differences
D) Use original data because standardization removes important variance information

Question 4: Variance Landscape:

Looking at your standardized data plot where all compounds now have similar variance, what does this mean for PCA’s search for “maximum variance directions”?

A) PCA will no longer work because there’s no variance to capture
B) PCA will now find directions based on correlations and biological patterns rather than being biased by high-variance features
C) PCA will randomly select components since all variances are equal
D) PCA will focus only on outlier detection

Apply PCA transformation

Questions

alt text

Question 1:

Comparing the two heatmaps, what is the most significant difference between the correlation patterns of the original drug compounds (left) and the PCA components (right)?

A) The PCA components show stronger correlations than the original compounds
B) The original compounds are uncorrelated while PCA components are highly correlated
C) The original compounds show various correlation patterns, while PCA components are essentially uncorrelated (orthogonal)
D) Both heatmaps show identical correlation structures

Question 2:

In your PCA components heatmap (right), most correlations appear to be near zero (gray/white). This pattern demonstrates which fundamental property of PCA?

A) PCA randomly shuffles the original correlations
B) PCA creates principal components that are orthogonal (perpendicular) to each other, eliminating correlation
C) PCA amplifies the strongest correlations from the original data
D) PCA preserves all original correlation relationships between features

Question 3:

The original drug compounds heatmap shows clusters of correlated compounds (red blocks), while the PCA heatmap is predominantly gray. What does this tell you about what PCA has accomplished?

A) PCA has lost important information about drug relationships
B) PCA has transformed the correlated original features into a new set of uncorrelated components that still capture the data’s variance
C) PCA has created random noise instead of meaningful components
D) PCA has made the data analysis more complicated

Explained variance ratios

Questions

alt text

Question 1:

Your plot shows that 10 principal components explain 95% of the total variance (red dashed line). For a machine learning application, which approach would be most appropriate?

A) Always use all 25 components to avoid losing any information
B) Use exactly 10 components because they reach the 95% threshold
C) Use only PC1 since it explains the most variance (40%)
D) Choose the number of components based on your specific analysis goals and computational constraints

Question 2:

Examining the individual explained variance bars (blue), there’s a sharp drop after PC1 and PC2, then the contributions become much smaller. This pattern suggests:

A) The first two components capture the main data structure, while next ffew components may represent minor patterns and last few components represent noise
B) Only PC1 and PC2 are mathematically valid
C) Components 3-25 contain no useful information
D) This indicates an error in the PCA calculation

Question 3:

In the context of your drug sensitivity dataset, what does PC1’s 40% explained variance represent biologically?

A) 40% of the drugs are important for the analysis
B) The first principal component captures a major pattern of drug response variation across patients that accounts for 40% of the total variability
C) 40% of the patients respond similarly to drugs
D) 60% of the data is noise and should be discarded

Generate scree plot

Questions

alt text

Question 1:

In your scree plot, the eigenvalues drop sharply from PC1 (~21) to PC2 (~9) to PC3 (~6), then gradually flatten out after PC6-7. What is the logic behind choosing the number of components at the “elbow” point where the curve starts to flatten?

A) Components after the elbow contain no mathematical information
B) Components before the elbow capture major data patterns, while those after the elbow likely represent noise or minor variations
C) The elbow point is randomly determined and has no statistical meaning
D) You should always choose exactly at the steepest drop point

Question 2:

Looking at your scree plot, why do we typically avoid including principal components from the flat region (PC8 onwards, where eigenvalues are close to 0)?

A) These components are mathematically incorrect and will cause errors
B) These components represent very small amounts of variance and may capture noise rather than meaningful signal
C) These components take too much computational time to calculate
D) These components are always highly correlated with the first few components

Interpretation and Analysis

Questions

alt text

Question 1:

In this PCA loadings heatmap, some drug compounds show dark red or dark blue colors while others appear white/gray. What do these color intensities tell you about each compound’s contribution to the principal components?

A) Dark colors indicate drugs that are more effective therapeutically
B) Dark red/blue indicate high absolute loading values, meaning these compounds strongly contribute to defining that principal component
C) White/gray areas indicate missing data for those drug compounds
D) Color intensity represents the correlation between different drugs

Question 2:

Looking at PC1 (top row), you can see both red (positive) and blue (negative) loadings for different drug compounds. What does this pattern of positive and negative loadings indicate?

A) Positive loadings are “good” drugs and negative loadings are “bad” drugs
B) This indicates an error in the PCA calculation since all loadings should be positive
C) Compounds with positive loadings move in the same direction as PC1, while those with negative loadings move in the opposite direction
D) Positive and negative loadings cancel each other out, making PC1 meaningless

Question 3:

To understand what PC1 represents biologically in your drug sensitivity study, which compounds should you focus on for interpretation?

A) Only the compounds with positive loadings (red colors)
B) Only the compounds with negative loadings (blue colors)
C) The compounds with the highest absolute loadings (darker shades of red and blue), regardless of sign
D) The compounds with near-zero loadings (white/gray) because they’re most stable

Question 4:

If you wanted to name or characterize what biological pathway PC1 represents, how would you use the loading information?

A) Look up the biological functions of compounds with the highest absolute loadings in PC1
B) Only consider the single compound with the highest positive loading
C) Average all the loading values to get a general interpretation
D) Focus on compounds that appear in multiple principal components