Complete Machine Learning Workflow

Machine Learning Workflow

Instructor note

  • Participants were given 60 minutes to

    • go through the Jupyter notebook and

    • select correct answers for the questions

  • The instructor narrates the answers and reasoning after the self-study time

    • time: 45 minutes

Data exploration

Questions

Question 1:

When dealing with gene mutation features where >95% of samples are wild-type (0), what is the most important consideration?

  • A) Apply standard scaling (z-score normalization) to make features comparable

  • B) Use log transformation to reduce skewness in the distribution

  • C) Consider the impact on model training - rare events may be difficult to learn

  • D) Convert to categorical variables using one-hot encoding

Question 2:

Your dataset contains age (continuous, 20-80 range), gender (binary 0/1), and gene mutations (binary 0/1). What normalization strategy is most appropriate?

  • A) Apply Min-Max scaling to all features to get 0-1 range

  • B) Apply Z-score normalization to all features for zero mean, unit variance

  • C) Scale only the continuous features (age), leave binary features unchanged

  • D) Apply log transformation to all features to handle skewness
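
For reference, a minimal sketch of scaling only the continuous feature while leaving binary features unchanged (option C's strategy). The TP53 column here is hypothetical and the values are illustrative:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age_at_diagnosis": [25, 40, 63, 71],  # continuous, 20-80 range
    "Gender": [0, 1, 1, 0],                # binary, left unchanged
    "TP53": [0, 0, 1, 0],                  # hypothetical binary mutation flag, left unchanged
})

scaler = StandardScaler()
# Scale only the continuous column; 0/1 features keep their original meaning
df["Age_at_diagnosis"] = scaler.fit_transform(df[["Age_at_diagnosis"]])
print(df)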

Question 3:

Why is feature scaling required for Age_at_diagnosis?

  • A) To convert the Age_at_diagnosis distribution into a perfect Gaussian (normal) distribution.

  • B) To reduce the number of unique values in Age_at_diagnosis, thereby simplifying the model.

  • C) To ensure that Age_at_diagnosis values are transformed to be either 0 or 1, matching the gene features.

  • D) To prevent Age_at_diagnosis from disproportionately influencing the model’s parameter estimation due to its larger numerical range compared to the binary (0/1) features.

Question 4:

In this glioma classification dataset, what should be your primary concern regarding the rare gene mutations?

  • A) The computational cost will be too high with so many zero values

  • B) Rare mutations might be the most clinically important but hardest to detect

  • C) Binary features don’t need any preprocessing in machine learning

  • D) The dataset is too small and needs data augmentation

Question 5:

You’re analyzing a dataset with 10,000 patients where a particular gene mutation occurs in only 50 patients (0.5% prevalence). When should you be most concerned about this feature imbalance?

  • A) Always - any feature with <5% prevalence will bias the logistic regression model

  • B) Never - logistic regression inherently handles sparse features well

  • C) Only when the 50 mutation carriers don’t provide sufficient statistical power to reliably estimate the gene’s effect

  • D) Only when the mutation is randomly distributed and not associated with the outcome
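
A quick way to surface such low-prevalence features before modelling; a hedged sketch assuming `gliomas` is the notebook's DataFrame with 0/1 mutation columns and the non-mutation column names used below:

# Prevalence of each binary mutation feature (the mean of a 0/1 column is its prevalence)
mutation_cols = gliomas.columns.drop(["Grade", "Age_at_diagnosis", "Gender"], errors="ignore")
prevalence = gliomas[mutation_cols].mean().sort_values()

print(prevalence.head())                                   # rarest mutations first
print((prevalence < 0.01).sum(), "features below 1% prevalence")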

Question 6:

In disease genomics, why might completely removing very rare genetic variants (occurring in <1% of samples) be problematic from a biological perspective?

  • A) Rare variants always have larger effect sizes than common variants

  • B) Some rare variants may be clinically actionable, even if statistically underpowered in the current sample

  • C) Removing sparse features will always improve model generalization

  • D) Feature selection should only be based on statistical criteria, not biological knowledge

Missing data handling

Questions

Question 1:

Your target variable (Grade) has missing values in 0.119% of samples. What is the most appropriate approach?

  • A) Impute the missing grades using the mode (most frequent class)

  • B) Use a sophisticated imputation method like KNN to predict missing grades

  • C) Remove these samples entirely from both training and testing datasets

  • D) Replace missing grades with a third category “Unknown” and make it a 3-class problem
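
For reference, a minimal sketch of option C (dropping the few samples whose label is missing), assuming `gliomas` is the notebook's DataFrame:

n_before = len(gliomas)
# Drop the ~0.119% of rows where the target itself is missing
gliomas = gliomas.dropna(subset=["Grade"])
print(f"Removed {n_before - len(gliomas)} samples with a missing Grade")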

Question 2:


What does the following code do?

threshold = int(0.95 * len(gliomas))
# Keep columns with at least 95% non-missing values
gliomas.dropna(thresh=threshold, axis=1, inplace=True)

  • A) Drops columns with more than 5% missing values using a 95% completeness threshold: ATRX_xNA (25% missing) and IDH1_xNA (80% missing) are dropped, while all other features (0.119% or 0% missing) are kept

  • B) Drops only IDH1_xNA

  • C) Drop all columns with missing values

  • D) Keep columns with 0% missing values

Question 3:

When should this column-dropping step be performed in your ML pipeline?

  • A) Before removing samples (rows) with missing target variables

  • B) After removing samples with missing target variables but before train/test split

  • C) After train/test split but before feature scaling

  • D) After model training to remove unimportant features

Train-test split and standardization

Questions

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stratify on Grade so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    gliomas.drop("Grade", axis=1),
    gliomas["Grade"],
    test_size=0.3,
    random_state=42,
    stratify=gliomas["Grade"],
)

scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the test data
X_train['Age_at_diagnosis'] = scaler.fit_transform(X_train[['Age_at_diagnosis']])
X_test['Age_at_diagnosis'] = scaler.transform(X_test[['Age_at_diagnosis']])
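
The same leakage-safe behaviour can also be expressed declaratively; a sketch using a scikit-learn ColumnTransformer inside a Pipeline, assuming the unscaled X_train/X_test from the split above (the max_iter value is an assumption):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale only Age_at_diagnosis; pass the binary features through untouched
preprocess = ColumnTransformer(
    [("age", StandardScaler(), ["Age_at_diagnosis"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])

# fit() learns the scaling parameters from X_train only; score() reuses them on X_test
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))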

Question 1:

Why do we use fit_transform() on training data but only transform() on test data?

  • A) fit_transform() is faster than transform() for larger datasets

  • B) To prevent data leakage by ensuring scaling parameters come only from training data

  • C) transform() automatically applies different scaling to test data for better performance

  • D) It’s a coding convention but doesn’t impact model performance

Question 2:

What would happen if you calculated scaling parameters (mean and standard deviation) using the entire dataset before splitting?

  • A) The model would perform better due to more stable scaling parameters

  • B) It would create data leakage because test set statistics influence training preprocessing

  • C) Nothing significant - the difference in scaling parameters would be minimal

  • D) The model would be more generalizable to new data

Question 3:

If a new patient has age = 85 years (outside the training age range of 20-80), what should happen during prediction?

  • A) Reject the prediction because the age is out of range

  • B) Retrain the scaler including this new data point

  • C) Apply the same training scaler transformation, even if it results in an extreme scaled value

  • D) Use a different scaling method specifically for this outlier
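
A sketch of option C in code, reusing the scaler fitted on X_train above: the out-of-range age simply maps to an extreme but valid z-score.

import pandas as pd

new_patient = pd.DataFrame({"Age_at_diagnosis": [85]})
z = scaler.transform(new_patient)   # same mean/std learned from the training data
print(z)                            # extreme scaled value, but a legitimate input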

Question 4:

[Figure: the Age_at_diagnosis distribution in the training and test sets after the split]

What is the most important reason for a machine learning practitioner to perform such a visual check after splitting the data?

  • A) To ensure that no data points were lost during the train_test_split operation.

  • B) To confirm that the Age_at_diagnosis feature has been transformed to a normal distribution in both sets.

  • C) To verify that the feature distributions are reasonably similar between the training and test sets, which helps ensure that the test set provides a fair and representative evaluation of the model’s performance.

  • D) To decide if the Age_at_diagnosis feature should be used as the target variable instead of “Grade”.
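
A minimal sketch of the visual check the figure refers to, assuming matplotlib and the X_train/X_test split above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(X_train["Age_at_diagnosis"], bins=30, alpha=0.5, density=True, label="train")
ax.hist(X_test["Age_at_diagnosis"], bins=30, alpha=0.5, density=True, label="test")
ax.set_xlabel("Age_at_diagnosis")
ax.set_ylabel("density")
ax.legend()
plt.show()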

Cross-validation

Questions

Question 1:

You split your dataset into 70% train/30% test, getting a test precision of 87% (recall 85%). What’s the main limitation of this single performance estimate?

  • A) A precision of 87% (recall 85%) is too low for medical applications

  • B) The estimate could vary with different random splits of the same data

  • C) 30% test size is too large and wastes training data

  • D) Precision and recall are the wrong metrics for binary classification problems

Question 2:

Suppose your single test set has 252 samples (30% of 840). If you want to evaluate model performance on rare glioma subtypes that represent 5% of cases, how many samples would you have?

  • A) About 42 samples - sufficient for reliable performance estimation

  • B) About 13 samples - too few for meaningful statistical conclusions

  • C) About 126 samples - more than adequate for analysis

  • D) The number doesn’t matter if the model is well-trained

Question 3:

A hospital wants to deploy your glioma classifier but asks: “How confident are you that this precision of 87% (recall 85%) will hold for our patient population?” With only a single holdout test, what’s your most honest answer?

  • A) “Very confident - the 87% precision and 85% recall are the true values since we used a proper train/test split”

  • B) “Moderately confident - the precision and recall could realistically range from 80-94% based on this single test”

  • C) “Cannot provide confidence bounds - need cross-validation or multiple test sets for reliability estimates”

  • D) “Completely confident - precision and recall don’t vary between hospitals”
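
For reference, a hedged sketch of how cross-validation turns the single point estimate into a spread. It assumes `gliomas` is the notebook's cleaned DataFrame and that the positive class in Grade is encoded as 1; the per-fold scaling covered in the next section is omitted for brevity:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = gliomas.drop("Grade", axis=1)
y = gliomas["Grade"]

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="precision")
print(f"precision: {scores.mean():.2f} ± {scores.std():.2f} across 5 folds")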

Cross-validation: Code

Questions

Question 1:

(A)

kf = KFold(n_splits=5, shuffle=True, random_state=1111)
splits = kf.split(gliomas.drop("Grade", axis=1))

# Initialize the scaler and the model ONCE, before the loop
scaler = StandardScaler()
lr_cv = LogisticRegression()
fold = 1

for train_index, val_index in splits:
    ...
    # Note: lr_cv is reused here across all folds; it is never re-initialized

    X_train_scaled['Age_at_diagnosis'] = scaler.fit_transform(X_train[['Age_at_diagnosis']]).flatten()
    X_val_scaled['Age_at_diagnosis'] = scaler.transform(X_val[['Age_at_diagnosis']]).flatten()

    # Use the SCALED data for training and prediction
    lr_cv.fit(X_train_scaled, y_train)  # ← Now using scaled data
    predictions = lr_cv.predict(X_val_scaled)  # ← Now using scaled data

(B)

kf = KFold(n_splits=5, shuffle=True, random_state=1111)
splits = kf.split(gliomas.drop("Grade", axis=1))

# Initialize scaler
scaler = StandardScaler()
fold = 1

for train_index, val_index in splits:
    ...
    # Initialize the Logistic Regression model with cross-validation
    lr_cv = LogisticRegression()

    X_train_scaled['Age_at_diagnosis'] = scaler.fit_transform(X_train[['Age_at_diagnosis']]).flatten()
    X_val_scaled['Age_at_diagnosis'] = scaler.transform(X_val[['Age_at_diagnosis']]).flatten()

    # Use the SCALED data for training and prediction
    lr_cv.fit(X_train_scaled, y_train)  # ← Now using scaled data
    predictions = lr_cv.predict(X_val_scaled)  # ← Now using scaled data

In code-block (A), the same lr_cv object is used across all 5 folds. What potential issue could this create?

  • A) Each fold builds upon the previous fold’s learned parameters, creating data leakage

  • B) The model’s internal state gets reset automatically, so there’s no issue

  • C) Memory usage increases exponentially with each fold

  • D) The model converges faster in later folds due to better initialization

Question 2:

The code declares scaler = StandardScaler() before the loop and reuses it in both code-blocks (A) and (B). What happens when you call fit_transform() on the same scaler object multiple times?

  • A) It accumulates statistics across all folds, causing data leakage

  • B) It overwrites previous statistics with new fold’s statistics - no leakage

  • C) It averages statistics across folds for more stable scaling

  • D) It causes an error because you can only fit once

Question 3:

This code uses KFold instead of StratifiedKFold. In a medical dataset where GBM (aggressive cancer) represents 40% of cases, what could go wrong?

  • A) Some folds might have 60% GBM while others have 20%, creating inconsistent evaluation conditions

  • B) The total number of samples evaluated will be different across folds

  • C) Cross-validation will take significantly longer to complete

  • D) The precision and recall calculations will become invalid
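
A minimal sketch of the stratified alternative, which keeps the GBM proportion consistent across folds (it assumes GBM is encoded as 1 in Grade):

from sklearn.model_selection import StratifiedKFold

X = gliomas.drop("Grade", axis=1)
y = gliomas["Grade"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1111)
# split() needs y as well, so each fold preserves the ~40%/60% class ratio
for train_index, val_index in skf.split(X, y):
    print(f"GBM fraction in this fold: {y.iloc[val_index].mean():.2f}")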

Hyperparameter tuning

Questions

Question 1:

You train a logistic regression model for glioma classification using default scikit-learn parameters and achieve 78% F1-score. After hyperparameter tuning with GridSearchCV, you achieve 85% F1-score. What does this improvement primarily demonstrate?

  • A) Default parameters are intentionally set to poor values to encourage tuning

  • B) Machine learning algorithms need parameter optimization to match specific dataset characteristics

  • C) Hyperparameter tuning always guarantees at least 7% improvement in any metric

  • D) The original 78% score was due to a coding error in the implementation
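
A minimal GridSearchCV sketch of the tuning step described above, using the X_train/y_train split from earlier. This toy grid is an assumption; the notebook's actual grid appears in the next question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)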

Question 2:

# original grid in the notebook
param_grid = [
    # For l2 penalty 
    {
        'penalty': ['l2'],
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
        'max_iter': [1000, 5000],  # Higher iterations for sparse data
        'class_weight': [None, 'balanced']  # Handle class imbalance if present
    },
    # For l1 penalty (best for sparse features)
    {
        'penalty': ['l1'],
        'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],  # Extended lower range
        'solver': ['liblinear', 'saga'],
        'max_iter': [1000, 5000],
        'class_weight': [None, 'balanced']
    },
    # For elasticnet penalty
    {
        'penalty': ['elasticnet'],
        'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],  # Finer granularity
        'solver': ['saga'],
        'C': [0.001, 0.01, 0.1, 1, 10],  # Extended range
        'max_iter': [1000, 5000],
        'class_weight': [None, 'balanced']
    }
]

# Alternative: Focused grid for very sparse genomics data
sparse_focused_grid = [
    # Emphasize L1 and ElasticNet for feature selection
    {
        'penalty': ['l1'],
        'C': [0.001, 0.01, 0.1, 1],  # Focus on stronger regularization
        'solver': ['liblinear'],
        'max_iter': [5000]
    },
    {
        'penalty': ['elasticnet'],
        'l1_ratio': [0.5, 0.7, 0.9],  # Favor L1 component
        'solver': ['saga'],
        'C': [0.01, 0.1, 1],
        'max_iter': [5000]
    }
]

The sparse_focused_grid excludes L2 regularization and only includes L1 and ElasticNet penalties. Why is this choice particularly effective for sparse genomics data?

  • A) L2 regularization is computationally too expensive for high-dimensional sparse data

  • B) L1 and ElasticNet can set coefficients to exactly zero, performing automatic feature selection on irrelevant sparse features (while addressing multicollinearity)

  • C) L2 regularization requires balanced features and cannot handle any level of sparsity

  • D) L1 and ElasticNet converge faster than L2 when most features are sparse
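
A sketch illustrating option B's mechanism: with an L1 penalty, many coefficients land at exactly zero, effectively deselecting uninformative sparse features. It assumes the scaled X_train/y_train from earlier in this workflow, and C=0.1 is an arbitrary choice here:

import numpy as np
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=5000)
l1_model.fit(X_train, y_train)

n_zero = int(np.sum(l1_model.coef_ == 0))
print(f"{n_zero} of {l1_model.coef_.size} coefficients are exactly zero")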