Classification

Q & A session: Logistic regression

Instructor note

Participants were given 30 minutes to
- go through the Jupyter notebook and
- select correct answers for the questions
The instructor narrates the answers and reasoning after the self-study time
- time: 30 minutes

Understanding biological context (ML-use case)

Questions

Question 1:

What type of machine learning problem is glioma grading/classification as described in this tutorial?

A) Unsupervised clustering problem
B) Binary classification problem
C) Multi-class classification problem
D) Regression problem

Question 2:

In this study, how is the target variable encoded?

A) LGG = 1, GBM = 0
B) LGG = 0, GBM = 1
C) Both are encoded as 1
D) Text labels are used without numeric encoding

Answer: B) LGG = 0, GBM = 1

Explanation: The target variable encoding is explicitly stated as 0 = “LGG” and 1 = “GBM”.
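
The encoding above can be illustrated with a minimal sketch. It assumes the grades start out as text labels in a pandas Series; the toy values and the variable names `grades` and `grade_encoded` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy labels; the notebook's real data frame is not reproduced here.
grades = pd.Series(["LGG", "GBM", "GBM", "LGG"], name="Grade")

# Map text labels to the encoding stated in the tutorial: 0 = LGG, 1 = GBM.
grade_encoded = grades.map({"LGG": 0, "GBM": 1})
print(grade_encoded.tolist())  # [0, 1, 1, 0]
```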

Question 3:

What types of features are being used in this logistic regression model?

A) Only genetic mutation data
B) Only clinical features
C) 20 most frequently mutated genes plus 3 clinical features
D) All available genetic and clinical data

Question 4:

Why is accurate glioma grading/classification clinically important?

A) It determines the research funding allocation
B) Different grades require different treatment approaches and have different prognoses
C) It’s only important for statistical purposes
D) It helps organize hospital records

Visualize data distributions

Questions

[Figure: feature distribution plots]

Question 1:

The notebook keeps the binary features as 0/1 while scaling Age_at_diagnosis. What is the primary reason why scaling binary features is generally NOT recommended in logistic regression?

A) Binary features are too simple to benefit from scaling
B) Scaling binary features would destroy their natural interpretability - coefficients would no longer represent the change from “absent” (0) to “present” (1)
C) Binary features automatically have equal variance, so scaling is unnecessary
D) Logistic regression algorithms cannot handle scaled binary features
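
One possible way to scale only the continuous column while leaving the 0/1 columns untouched is sketched below. This is an assumption about how such preprocessing could be written with scikit-learn's `ColumnTransformer`, not necessarily what the notebook does; the toy values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one continuous feature and two binary mutation flags.
X = pd.DataFrame({
    "Age_at_diagnosis": [35.0, 62.0, 48.0, 71.0],
    "IDH1": [1, 0, 1, 0],
    "GRIN2A": [0, 0, 1, 0],
})

# Standardize Age_at_diagnosis only; pass every other column through unchanged.
scale_age_only = ColumnTransformer(
    transformers=[("age", StandardScaler(), ["Age_at_diagnosis"])],
    remainder="passthrough",
)
X_scaled = scale_age_only.fit_transform(X)

# Column 0 is the scaled age; the remaining columns are still 0/1.
print(X_scaled)
```

Note that the transformed columns come first in the output, followed by the passed-through columns in their original order.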

Question 2:

Given the heavy imbalance in most binary features (most values are 0, with only a small number of observations equal to 1), what potential issue might this create during logistic regression training?

A) The model will converge faster due to the simplicity of the data
B) The model may have difficulty learning meaningful patterns from rare events (1s) and might be biased toward predicting the majority class
C) The imbalanced features will automatically be weighted equally by the algorithm

Question 3:

When interpreting logistic regression coefficients for the heavily imbalanced binary features shown in these plots, what should you be particularly cautious about?

A) Coefficients for rare events may be unstable and can lead to poor generalization
B) Coefficients will be automatically adjusted by the algorithm to account for imbalance
C) Imbalanced features always produce more reliable coefficient estimates
D) The scaling of Age_at_diagnosis will make other coefficients uninterpretable

Split original dataset

Questions

Question 1:

In the train_test_split code, test_size=0.3 means 30% of data goes to testing. For a medical dataset like glioma classification, what is the primary consideration when choosing this split ratio?

[Figure: train_test_split code]

A) Larger test sets always give better model performance
B) Balancing reliable performance evaluation with sufficient training data, especially important given limited medical data availability
C) Test size should always be exactly 30% regardless of dataset characteristics
D) Smaller test sets are always preferred to maximize training data

Question 2:

In the train_test_split code, why is the stratify=gliomas["Grade"] parameter crucial in this glioma classification problem?

A) It randomly shuffles the data for better performance
B) It ensures both training and test sets have proportional representation of LGG and GBM cases
C) It sorts the data by grade for easier processing
D) It removes outliers from the dataset
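
The effect of the two parameters discussed in this section can be checked on a toy data frame. The sketch below reuses the names `gliomas` and `Grade` from the questions, but the data, the 60/40 class ratio, and `random_state=42` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data frame: 60 LGG (0) and 40 GBM (1) cases.
gliomas = pd.DataFrame({
    "Age_at_diagnosis": [30 + (i % 50) for i in range(100)],
    "IDH1": [i % 2 for i in range(100)],
    "Grade": [0] * 60 + [1] * 40,
})

X = gliomas.drop(columns="Grade")
y = gliomas["Grade"]

# Hold out 30% for testing and keep the class ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=gliomas["Grade"], random_state=42
)

# Fraction of GBM cases in each split.
print(y_train.mean(), y_test.mean())
```

Comparing the two printed fractions with and without `stratify` is a quick way to see what the parameter changes.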

The Model Output (Probability)

Questions

[Figure: histogram of predicted probabilities]

Question 1:

The histogram shows a distinctive U-shaped distribution of predicted probabilities, with many predictions clustered near 0.0 and 1.0, and fewer predictions in the middle range (0.3-0.7). What does this pattern indicate about the model’s behavior?

A) The model is making unreliable predictions
B) The model is well-calibrated and confident in most of its predictions, clearly separating the two classes
C) The model has failed to converge properly during training
D) The sigmoid transformation is not working correctly

Question 2:

What does lr.predict_proba(X_test) output, and why are probability predictions particularly valuable in medical diagnosis like glioma classification?

A) It outputs only the predicted class labels (0 for LGG, 1 for GBM)
B) It outputs probability estimates for each class (P(LGG) and P(GBM)) for each patient, allowing clinicians to assess prediction confidence
C) It outputs the raw coefficient values for each gene and clinical feature
D) It outputs the training accuracy of the model
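
To inspect what `predict_proba` returns, a small synthetic example can help. The data below is random and only stands in for the glioma features; the name `lr` follows the questions, and everything else is a hypothetical placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 3 features, binary target (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

lr = LogisticRegression().fit(X[:150], y[:150])
proba = lr.predict_proba(X[150:])

# One row per sample, one column per class, ordered as in lr.classes_.
print(proba.shape)        # (50, 2)
print(lr.classes_)        # column 0 -> class 0, column 1 -> class 1
print(proba[:3].sum(axis=1))  # each row sums to 1
```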

Predict test datasets

Questions

Question 1:

What is the key difference between lr.predict(X_test) and lr.predict_proba(X_test) in this glioma classification model?

A) predict() gives probabilities for each class, while predict_proba() gives final class labels
B) predict() gives final class decisions (0 for LGG, 1 for GBM) by applying a 0.5 threshold to probabilities, while predict_proba() gives the actual probability values
C) predict() is more accurate than predict_proba()
D) There is no difference between the two methods
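
The relationship between the two methods can be verified directly. The sketch below uses synthetic data as a stand-in for the glioma features; the 0.3 cut-off is a hypothetical alternative threshold chosen only to show that the decision rule can be changed when working from probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a binary target (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

lr = LogisticRegression().fit(X_train, y_train)

p_class1 = lr.predict_proba(X_test)[:, 1]    # probability of class 1
labels_default = lr.predict(X_test)          # built-in decision rule
labels_manual = (p_class1 > 0.5).astype(int) # same rule applied by hand

# A lower, hypothetical threshold flags at least as many samples as class 1.
labels_cautious = (p_class1 > 0.3).astype(int)

print(np.array_equal(labels_default, labels_manual))
print(labels_default.sum(), labels_cautious.sum())
```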

Examine and understand the importance of features in predicting classes

Questions

Question 1 (Coefficient Sign Interpretation):

In this glioma classification model, what does a positive coefficient in lr.coef_ indicate for a specific gene or clinical feature?

A) The feature has no effect on glioma classification
B) The feature increases the likelihood of predicting GBM (class 1) when present or higher in value
C) The feature increases the likelihood of predicting LGG (class 0) when present or higher in value
D) The feature should be removed from the model

[Figure: feature importance plot]

Question 2 (Coefficient Sign Interpretation):

Looking at the feature importance plot, what can you conclude about the features GRIN2A (rightmost bar) and IDH1 (leftmost bar) in terms of their effect on the predicted outcome?

A) GRIN2A decreases the probability of the positive class, while IDH1 increases it
B) GRIN2A increases the probability of the positive class, while IDH1 decreases it
C) Both features have the same effect but different magnitudes
D) The sign of the coefficient doesn’t matter, only the magnitude
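
Coefficient signs can be explored on simulated data where the true direction of each effect is known. In the sketch below, `risk_marker` and `protective_marker` are hypothetical binary features, not genes from the study, and the effect sizes are made up; exponentiating a coefficient gives the corresponding odds ratio.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulate two hypothetical binary features with opposite effects on class 1.
rng = np.random.default_rng(1)
n = 500
risk_marker = rng.integers(0, 2, size=n)
protective_marker = rng.integers(0, 2, size=n)
logit = -0.2 + 2.0 * risk_marker - 2.0 * protective_marker
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({
    "risk_marker": risk_marker,
    "protective_marker": protective_marker,
})
lr = LogisticRegression().fit(X, y)

# lr.coef_ has shape (1, n_features) for a binary problem.
coefs = pd.Series(lr.coef_[0], index=X.columns).sort_values()
odds_ratios = np.exp(coefs)  # > 1 raises the odds of class 1, < 1 lowers them

print(coefs)
print(odds_ratios)
```

Sorting the coefficients, as done here, mirrors the left-to-right ordering of a typical coefficient bar plot.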

Evaluation of the model performance

Questions

[Figure: confusion matrix]

Question 1:

Based on the confusion matrix shown, what are the True Positives, False Positives, True Negatives, and False Negatives for this glioma classification model?

A) TP=124, FP=22, TN=94, FN=12
B) TP=94, FP=22, TN=124, FN=12
C) TP=22, FP=94, TN=12, FN=124
D) TP=94, FP=12, TN=124, FN=22
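
When reading a confusion matrix produced by scikit-learn, it helps to know its layout: rows are true labels, columns are predicted labels, and for labels 0 and 1 the flattened order is TN, FP, FN, TP. The sketch below uses small made-up label vectors, not the tutorial's results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = LGG (negative class), 1 = GBM (positive class).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Layout for labels [0, 1]:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 2 1 4
```

Plotting libraries may transpose or reorder the axes, so the axis labels on the figure should always be checked before reading off the four counts.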