Classification

Logistic regression

Dataset

  • Features:

    • The 20 most frequently mutated genes, and

    • 2 clinical features: gender and age at diagnosis

  • Target variable (i.e., dependent variable or response variable): glioma grade class information

    • 0 = “LGG”

    • 1 = “GBM”

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Download the test dataset
! curl https://naicno.github.io/BioNT_Module2_handson/_downloads/041231c291c25343976f8b70db54f238/TCGA_InfoWithGrade_scaled.csv -o TCGA_InfoWithGrade_scaled.csv
! ls -lh TCGA_InfoWithGrade_scaled.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 29 53487   29 15640    0     0   100k      0 --:--:-- --:--:-- --:--:--  100k
100 53487  100 53487    0     0   342k      0 --:--:-- --:--:-- --:--:--  341k
-rw-r--r-- 1 runner docker 53K Jun 10 13:52 TCGA_InfoWithGrade_scaled.csv
# Create a DataFrame from the CSV file
gliomas = pd.read_csv('TCGA_InfoWithGrade_scaled.csv')
gliomas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839 entries, 0 to 838
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Grade             839 non-null    int64  
 1   Gender            839 non-null    int64  
 2   Age_at_diagnosis  839 non-null    float64
 3   IDH1              839 non-null    int64  
 4   TP53              839 non-null    int64  
 5   ATRX              839 non-null    int64  
 6   PTEN              839 non-null    int64  
 7   EGFR              839 non-null    int64  
 8   CIC               839 non-null    int64  
 9   MUC16             839 non-null    int64  
 10  PIK3CA            839 non-null    int64  
 11  NF1               839 non-null    int64  
 12  PIK3R1            839 non-null    int64  
 13  FUBP1             839 non-null    int64  
 14  RB1               839 non-null    int64  
 15  NOTCH1            839 non-null    int64  
 16  BCOR              839 non-null    int64  
 17  CSMD3             839 non-null    int64  
 18  SMARCA4           839 non-null    int64  
 19  GRIN2A            839 non-null    int64  
 20  IDH2              839 non-null    int64  
 21  FAT4              839 non-null    int64  
 22  PDGFRA            839 non-null    int64  
dtypes: float64(1), int64(22)
memory usage: 150.9 KB
# Inspect the first few rows of the DataFrame
gliomas.head()
Grade Gender Age_at_diagnosis IDH1 TP53 ATRX PTEN EGFR CIC MUC16 ... FUBP1 RB1 NOTCH1 BCOR CSMD3 SMARCA4 GRIN2A IDH2 FAT4 PDGFRA
0 0 0 0.023233 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
1 0 0 -0.778400 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 -1.004616 1 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 1 -1.156913 1 1 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 -1.237841 1 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 23 columns

# Count the number of samples in each grade (Count observations in target variable)
gliomas.groupby('Grade').size()
Grade
0    487
1    352
dtype: int64

Visualize data distributions

# Plot the distribution of data in all columns
fig, axs = plt.subplots(nrows=6, ncols=4, figsize=(15, 20))
axs = axs.flatten()

# Iterate over each column and plot the distribution
for i, column in enumerate(gliomas.columns):
    axs[i].hist(gliomas[column], bins=20, color='skyblue', edgecolor='black')
    axs[i].set_title(column)
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
[Figure: grid of histograms showing the value distribution of each of the 23 columns]
  • Scaling binary features would destroy their natural interpretability - coefficients would no longer represent the log-odds change from “absent” (0) to “present” (1)

Binary features encoded as 0/1 have a natural, meaningful interpretation where the coefficient represents the change in log-odds when the feature goes from absent (0) to present (1). If you scale these features (e.g., to have mean 0 and standard deviation 1), a “0” might become -0.85 and a “1” might become +1.23, making the coefficient interpretation much less intuitive. You’d lose the direct interpretation of “how much does the presence of this feature change the odds?” Additionally, binary features are already on a bounded, comparable scale (0 to 1), unlike continuous features that might range from 0 to 100,000, so scaling for convergence purposes is typically unnecessary.
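To illustrate this point, here is a minimal sketch of scaling only the continuous column while leaving the 0/1 columns untouched. It is illustration only: the CSV used in this notebook already ships with Age_at_diagnosis scaled, and the unscaled DataFrame assumed here (gliomas_raw) does not exist in this lesson.

# Illustration only: selectively standardize continuous columns
from sklearn.preprocessing import StandardScaler
import pandas as pd

def scale_continuous_only(df: pd.DataFrame, continuous_cols=("Age_at_diagnosis",)) -> pd.DataFrame:
    """Standardize only the continuous columns; leave the 0/1 columns as they are."""
    out = df.copy()
    scaler = StandardScaler()
    out[list(continuous_cols)] = scaler.fit_transform(out[list(continuous_cols)])
    return out

# gliomas_scaled = scale_continuous_only(gliomas_raw)  # `gliomas_raw` is hypothetical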

gliomas.describe().T
count mean std min 25% 50% 75% max
Grade 839.0 4.195471e-01 0.493779 0.000000 0.000000 0.000000 1.000000 1.000000
Gender 839.0 4.183552e-01 0.493583 0.000000 0.000000 0.000000 1.000000 1.000000
Age_at_diagnosis 839.0 1.355028e-16 1.000596 -2.326863 -0.820775 0.039163 0.756044 2.444061
IDH1 839.0 4.815256e-01 0.499957 0.000000 0.000000 0.000000 1.000000 1.000000
TP53 839.0 4.147795e-01 0.492978 0.000000 0.000000 0.000000 1.000000 1.000000
ATRX 839.0 2.586412e-01 0.438149 0.000000 0.000000 0.000000 1.000000 1.000000
PTEN 839.0 1.680572e-01 0.374140 0.000000 0.000000 0.000000 0.000000 1.000000
EGFR 839.0 1.334923e-01 0.340309 0.000000 0.000000 0.000000 0.000000 1.000000
CIC 839.0 1.323004e-01 0.339019 0.000000 0.000000 0.000000 0.000000 1.000000
MUC16 839.0 1.168057e-01 0.321380 0.000000 0.000000 0.000000 0.000000 1.000000
PIK3CA 839.0 8.700834e-02 0.282015 0.000000 0.000000 0.000000 0.000000 1.000000
NF1 839.0 7.985697e-02 0.271233 0.000000 0.000000 0.000000 0.000000 1.000000
PIK3R1 839.0 6.436234e-02 0.245544 0.000000 0.000000 0.000000 0.000000 1.000000
FUBP1 839.0 5.363528e-02 0.225431 0.000000 0.000000 0.000000 0.000000 1.000000
RB1 839.0 4.767580e-02 0.213206 0.000000 0.000000 0.000000 0.000000 1.000000
NOTCH1 839.0 4.529201e-02 0.208068 0.000000 0.000000 0.000000 0.000000 1.000000
BCOR 839.0 3.456496e-02 0.182784 0.000000 0.000000 0.000000 0.000000 1.000000
CSMD3 839.0 3.218117e-02 0.176586 0.000000 0.000000 0.000000 0.000000 1.000000
SMARCA4 839.0 3.218117e-02 0.176586 0.000000 0.000000 0.000000 0.000000 1.000000
GRIN2A 839.0 3.218117e-02 0.176586 0.000000 0.000000 0.000000 0.000000 1.000000
IDH2 839.0 2.741359e-02 0.163383 0.000000 0.000000 0.000000 0.000000 1.000000
FAT4 839.0 2.741359e-02 0.163383 0.000000 0.000000 0.000000 0.000000 1.000000
PDGFRA 839.0 2.622169e-02 0.159889 0.000000 0.000000 0.000000 0.000000 1.000000

Split original dataset

  • The train_test_split function from scikit-learn splits a dataset into random train and test subsets

  • Default behaviour:

    • Shuffles the data before splitting

    • Does not inherently preserve the class distribution of the original dataset when splitting into train and test subsets

    • The class distribution in the train and test subsets therefore depends on the randomness of the split

  • Stratified sampling (the stratify argument of train_test_split) ensures that the class distribution of the original dataset (e.g., the LGG/GBM ratio) is preserved in both subsets; a quick check of this is sketched after the split below

X_train, X_test, y_train, y_test = train_test_split(
    gliomas.drop("Grade", axis=1),
    gliomas["Grade"],
    test_size=0.3,
    random_state=42,
    stratify=gliomas["Grade"],
)
print("X_train", X_train.shape)
print("X_test", X_test.shape)
print("y_train", y_train.shape)
print("y_test", y_test.shape)
X_train (587, 22)
X_test (252, 22)
y_train (587,)
y_test (252,)
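As a quick sanity check, the class proportions in the full dataset and in the two subsets produced above can be compared; with stratify set, they should agree almost exactly. This small snippet is a sketch using the objects already defined in this notebook.

# Class proportions before and after the stratified split
print("Full dataset:\n", gliomas["Grade"].value_counts(normalize=True).round(3))
print("Train subset:\n", y_train.value_counts(normalize=True).round(3))
print("Test subset:\n", y_test.value_counts(normalize=True).round(3))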

Train a logistic regression model

# Initialize the Logistic Regression model
lr = LogisticRegression()
# Fit the model
lr.fit(X_train, y_train)
LogisticRegression()

Logistic regression model

  • Think of logistic regression as having three distinct layers that work together

    • The Raw Model (Linear Predictor)

    • The Model Output (Probability)

    • The Decision Boundary

The Raw Model (Linear Predictor)

$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$

Where:

  • $z$ is the raw model output

    • Unbounded score that could be any real number from -∞ to +∞

  • $\beta_0$ is the intercept or bias term

  • $\beta_1, \beta_2, \beta_3$ are the coefficients or weights associated with features $x_1, x_2, x_3$, respectively

  • $x_1, x_2, x_3$ are the values of the features

Additional notes:

  • Raw model output of logistic regression is equal to the log-odds

  • Why does this matter?

    • Improves interpretability: A one-unit increase in a feature increases the log-odds by the corresponding coefficient

  • Homework: Explore the mathematical relationship between the “raw model output” and the “log-odds” (a small numerical check is sketched below)
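As a starting point for that homework, here is a minimal numerical check, using the lr model and X_train defined above, that the raw model output equals the log-odds $\log\big(p / (1 - p)\big)$ of the predicted probability of the positive class:

# Raw model output (linear predictor) for the training samples
z = lr.decision_function(X_train)

# Predicted probability of the positive class (1 = GBM)
p = lr.predict_proba(X_train)[:, 1]

# Log-odds computed from the probabilities; should match z up to floating-point error
log_odds = np.log(p / (1 - p))
print(np.allclose(z, log_odds))  # expected: True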

# Plot Raw model output histogram 
sns.histplot(lr.decision_function(X_train))
plt.title('Distribution of Raw model output')
plt.xlabel('Raw model output')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
[Figure: histogram of the raw model output (decision function) on the training set]

The decision_function() returns the raw linear combination: β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ. Since this is just a weighted sum of features, it has no mathematical bounds and can theoretically range from -∞ to +∞. The sigmoid function is then applied to transform these logits into probabilities between 0 and 1.
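To make that weighted sum explicit, a short sketch (again using lr and X_train from above) recomputes the raw output by hand from the fitted intercept and coefficients:

# Manual linear predictor: beta_0 + x . beta, for every training sample
z_manual = lr.intercept_[0] + X_train.values @ lr.coef_[0]

# Should agree with decision_function up to floating-point error
print(np.allclose(z_manual, lr.decision_function(X_train)))  # expected: True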

The Model Output (Probability)

  • The raw score then gets transformed through the sigmoid function

  • Sigmoid function: Converts our unbounded raw score into a probability between 0 and 1
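For reference, the sigmoid (logistic) function applied to the raw score $z$ is

$p = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

which maps any real-valued $z$ to a probability between 0 and 1.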

# Access prediction probabilities
probabilities = lr.predict_proba(X_test)
print("Probabilities for first 5 test samples:")
print(probabilities[:5])
print("\nNote: Column 0 = P(class=0), Column 1 = P(class=1)")
Probabilities for first 5 test samples:
[[0.2877258  0.7122742 ]
 [0.30536037 0.69463963]
 [0.86905511 0.13094489]
 [0.24988542 0.75011458]
 [0.91616818 0.08383182]]

Note: Column 0 = P(class=0), Column 1 = P(class=1)
  • predict_proba() returns a two-column array where Column 0 = P(class=0) = P(LGG) and Column 1 = P(class=1) = P(GBM) for each test patient. These probabilities sum to 1.0 for each patient

  • In medical contexts, probability outputs are crucial because they provide confidence levels (e.g., “85% chance of GBM” vs “51% chance of GBM”) that help clinicians make more informed treatment decisions and determine when additional testing might be needed.
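To illustrate how such confidence levels could be acted on, here is a small sketch of moving the decision threshold on P(GBM) away from the default 0.5 used by predict(). The 0.3 cut-off is an arbitrary example value chosen for illustration, not a clinically validated threshold.

# Probability of GBM (class 1) for each test patient
p_gbm = lr.predict_proba(X_test)[:, 1]

# Default behaviour of predict(): class 1 when P(GBM) is above 0.5
default_calls = (p_gbm > 0.5).astype(int)

# A lower, more sensitive threshold (hypothetical example value)
sensitive_calls = (p_gbm > 0.3).astype(int)

print("GBM calls at threshold 0.5:", default_calls.sum())
print("GBM calls at threshold 0.3:", sensitive_calls.sum())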

sns.histplot(lr.predict_proba(X_train))
plt.title('Distribution of Model Output (Probabilities)')
plt.xlabel('Probability')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
[Figure: histograms of the predicted probabilities for both classes on the training set]
  • If we’re looking at the probability of class 0:

    • Probability near 0.0 = low chance of being class 0 = likely class 1

    • Probability near 1.0 = high chance of being class 0 = confident class 0 prediction

  • A U-shaped probability distribution is actually desirable in logistic regression, indicating that the model can confidently distinguish between classes. Most predictions are either very likely class 0 (near 0.0) or very likely class 1 (near 1.0), with few ambiguous cases in between.

  • Looking at the histogram, more samples receive a high probability of class 0 (blue bars near 1.0) than a high probability of class 1 (orange bars near 1.0), indicating the model predicts more instances as belonging to class 0.

  • The model makes very confident predictions - most probabilities cluster at the extremes (0.0 and 1.0) with very few uncertain predictions in the middle ranges (0.3-0.7).

Predict the glioma type of new, unseen data (the test set)

# Predict classes for the test set
predicted_classes = lr.predict(X_test)
print("Predicted classes for first 10 samples:", predicted_classes[:10])
Predicted classes for first 10 samples: [1 1 0 1 0 1 1 0 1 0]

Examine and understand the importance of features in predicting classes

Coefficients of the model

  • The magnitude of a coefficient indicates the relative importance of the corresponding feature in predicting the positive class (in this case 1 = GBM)

  • Larger coefficients imply a stronger influence on the predicted probability

  • Interpretation of coefficients assumes that the features are independent of each other
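One way to probe that independence assumption is to look at pairwise correlations between the features; here is a minimal sketch using the gliomas DataFrame loaded above. Strongly correlated pairs make individual coefficients harder to interpret.

# Absolute pairwise correlations between features, with the diagonal masked out
corr = gliomas.drop("Grade", axis=1).corr().abs()
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))

# Show the most strongly correlated feature pairs
# (each pair appears twice because the correlation matrix is symmetric)
print(off_diagonal.stack().sort_values(ascending=False).head(10))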

# Get the list of features
feature_list = gliomas.columns[1:]
print("Number of Features", len(feature_list))
print("Features", feature_list)
Number of Features 22
Features Index(['Gender', 'Age_at_diagnosis', 'IDH1', 'TP53', 'ATRX', 'PTEN', 'EGFR',
       'CIC', 'MUC16', 'PIK3CA', 'NF1', 'PIK3R1', 'FUBP1', 'RB1', 'NOTCH1',
       'BCOR', 'CSMD3', 'SMARCA4', 'GRIN2A', 'IDH2', 'FAT4', 'PDGFRA'],
      dtype='object')
# Create a DataFrame of the coefficients and their corresponding features
# The `.coef_` attribute of the fitted LogisticRegression model (`lr`) contains the coefficients for each feature
coefficients = pd.DataFrame(lr.coef_.T, index=feature_list, columns=['Coefficient'])
# Sort the coefficients
coefficients.sort_values(by='Coefficient', ascending=False)
Coefficient
GRIN2A 1.055559
PIK3R1 0.947821
TP53 0.942698
PTEN 0.796895
Age_at_diagnosis 0.681210
MUC16 0.604221
PDGFRA 0.445061
BCOR 0.353871
CSMD3 0.273685
RB1 0.156207
PIK3CA 0.059562
Gender 0.033025
FAT4 -0.007310
ATRX -0.192959
FUBP1 -0.410101
SMARCA4 -0.414551
EGFR -0.538561
CIC -0.780800
NF1 -0.836918
NOTCH1 -1.459809
IDH2 -1.733926
IDH1 -3.287939
# Plot coefficients for each feature
plt.figure(figsize=(10, 8))
plt.title('Feature Importance')

plt.barh('index', 'Coefficient', align='center', data=coefficients.sort_values(by='Coefficient', ascending=True).reset_index())
<BarContainer object of 22 artists>
[Figure: horizontal bar chart of the logistic regression coefficient for each feature]
  • In logistic regression,

    • Positive coefficients (like GRIN2A ~+1.0) increase the log-odds and therefore the probability of the positive class when the feature value increases

    • Negative coefficients (like IDH1 ~-3.0) decrease the log-odds and probability of the positive class

  • The direction of the bar indicates whether the feature pushes predictions toward or away from the positive class.
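Because the coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are often easier to communicate. A short sketch using the coefficients DataFrame built above; for the binary gene features, the odds ratio compares mutated (1) to non-mutated (0) samples.

# Odds ratios: exp(beta). A coefficient of ~+1.0 (GRIN2A) corresponds to roughly a 2.9-fold
# increase in the odds of GBM, while ~-3.3 (IDH1) multiplies the odds by roughly 0.04.
odds_ratios = np.exp(coefficients["Coefficient"]).rename("Odds_ratio")
print(odds_ratios.sort_values(ascending=False))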

Evaluation of the model performance

  • Confusion matrix

Confusion matrix

# Compute confusion matrix
cm = confusion_matrix(y_test, predicted_classes)

# Plot confusion matrix as heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
[Figure: confusion matrix heatmap of true vs. predicted labels on the test set]
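As a possible follow-up (not shown above), the heatmap can be read off numerically. For a binary problem, scikit-learn orders the confusion matrix with rows as true labels and columns as predicted labels, so its four cells unpack to true negatives, false positives, false negatives, and true positives, from which simple summary metrics follow.

# Unpack the binary confusion matrix computed above
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # recall for GBM (class 1)
specificity = tn / (tn + fp)   # recall for LGG (class 0)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")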