Data Preparation
The process of collecting, cleaning, and organizing large datasets of variants along with their corresponding pathogenicity annotations
Critical for ensuring the quality and reliability of the model’s predictions
Note
Dataset Splitting
Training data:
Train the AlphaMissense model
Benign: missense variants frequently observed in human and primate populations
Pathogenic: missense variants absent from human and primate populations
Validation data:
Tune model hyperparameters
Held-out data - Pathogenic missense variants in ClinVar and benign variants from population databases
Test data:
Evaluate the model’s performance on unseen data
Held-out data - Pathogenic missense variants in ClinVar and other selected studies, and benign variants from population databases
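The training labels described above can be illustrated with a minimal sketch. This is not the AlphaMissense pipeline: the `Variant` class, the `COMMON_AF_CUTOFF` value, and the helper names are hypothetical and chosen only for illustration, since the text gives no numeric definition of "frequently observed". The sketch assigns a benign label to frequently observed variants and a pathogenic label to absent ones, and drops any variant reserved for validation or testing.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical cutoff for "frequently observed"; the text gives no number.
COMMON_AF_CUTOFF = 0.001


@dataclass(frozen=True)
class Variant:
    variant_id: str          # hypothetical identifier
    allele_frequency: float  # 0.0 means never observed in the populations


def weak_label(variant: Variant, cutoff: float = COMMON_AF_CUTOFF) -> Optional[str]:
    """Return a training label based on population frequency alone."""
    if variant.allele_frequency == 0.0:
        return "pathogenic"  # absent from the populations
    if variant.allele_frequency >= cutoff:
        return "benign"      # frequently observed
    return None              # rare but observed: left unlabelled in this sketch


def build_training_set(
    variants: Iterable[Variant], held_out_ids: Set[str]
) -> List[Tuple[str, str]]:
    """Label variants and exclude those reserved for validation or testing."""
    labelled = []
    for v in variants:
        if v.variant_id in held_out_ids:
            continue  # keep held-out variants out of the training data
        label = weak_label(v)
        if label is not None:
            labelled.append((v.variant_id, label))
    return labelled


variants = [
    Variant("v1", 0.0),
    Variant("v2", 0.05),
    Variant("v3", 0.00001),
    Variant("v4", 0.0),
]
training_set = build_training_set(variants, held_out_ids={"v4"})
print(training_set)
```

Here `v3` is dropped because it is neither absent nor common, and `v4` is dropped because it is held out, so only `v1` and `v2` reach the training set.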
Attention
Validation data vs test data
- Validation data:
Used during model training to guide many model-building decisions
- Test data:
Completely unseen dataset used at the very end to get an unbiased evaluation of the fully trained model
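The difference between the two sets can be sketched with toy numbers. The scores, labels, and candidate thresholds below are made up for illustration and do not come from AlphaMissense; the model-building decision here is simply the choice of a classification threshold. The validation set is used to make that choice, and the test set is touched only once, after the choice is fixed.

```python
from typing import List, Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of variants whose thresholded score matches the label (1 = pathogenic)."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def pick_threshold(
    val_scores: Sequence[float], val_labels: Sequence[int], candidates: List[float]
) -> float:
    """Model-building decision: choose the threshold that does best on validation data."""
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


# Toy validation data (illustrative values only).
val_scores = [0.1, 0.4, 0.6, 0.9]
val_labels = [0, 0, 1, 1]

# Toy test data, never consulted while choosing the threshold.
test_scores = [0.2, 0.8, 0.55, 0.45]
test_labels = [0, 1, 1, 1]

best_threshold = pick_threshold(val_scores, val_labels, candidates=[0.3, 0.5, 0.7])
test_accuracy = accuracy(test_scores, test_labels, best_threshold)

print(best_threshold)  # 0.5
print(test_accuracy)   # 0.75
```

Because the threshold was selected to maximise validation accuracy, the validation figure is optimistic; the test figure is the one to report.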