Model training and validation
This step uses the AlphaFold model, which is already trained to predict protein structures (pre-trained). The pre-trained model is then trained further on the dataset prepared in the previous step.
Note
Benign-labelled variants:
Human variants: gnomAD
Primate variants: Great Ape project and a few other sources
Remove training variants at protein positions that also appear in the validation or test sets
Low-frequency variants are more likely to be pathogenic than frequent ones
Uses a machine learning technique that adjusts model parameters according to variant frequency, so that frequent variants contribute more strongly to the training signal
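The two benign-set steps above (dropping training variants at held-out positions, and weighting by frequency) can be sketched as follows. This is a minimal illustration, not the AlphaMissense implementation: the record fields (`protein`, `position`), the weight function `frequency_weight`, and its `cap` and `floor` parameters are all hypothetical and chosen only to show the idea.

```python
import math

def remove_leaky_variants(train, held_out):
    """Drop training variants whose (protein, position) occurs in validation/test."""
    held_out_sites = {(v["protein"], v["position"]) for v in held_out}
    return [v for v in train
            if (v["protein"], v["position"]) not in held_out_sites]

def frequency_weight(allele_frequency, cap=0.01, floor=0.1):
    """Hypothetical weight in [floor, 1] that grows with allele frequency."""
    scaled = min(allele_frequency / cap, 1.0)
    return floor + (1.0 - floor) * scaled

def weighted_bce(labels, probs, weights):
    """Weighted binary cross-entropy; label 1 = pathogenic, 0 = benign."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        total += -w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / sum(weights)
```

With weights like these, a common benign variant pulls the loss (and therefore the parameter updates) harder than a rare one, which is the behaviour the note describes.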
Pathogenic-labelled variants:
Unobserved in human and primate populations (unobserved data)
Training data
Balanced dataset reflecting the true distribution of pathogenic and benign missense variants in the population
The pathogenic set contains the same number of variants for each trinucleotide context as the benign set
The probability of sampling a pathogenic variant depends on the abundance of its protein in the benign set
Variants in proteins that play a role in critical cellular functions are likely to be pathogenic. Minimize bias from the functional context of proteins in the training set
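A sketch of how the pathogenic set could be drawn so that it matches the benign set per trinucleotide context, with sampling probability tied to how often each protein appears in the benign set. The field names (`context`, `protein`), the `+ 1` pseudocount, and the use of Efraimidis-Spirakis keys for weighted sampling without replacement are assumptions made for illustration; the notes do not say how the sampling is actually implemented.

```python
import random
from collections import Counter, defaultdict

def sample_pathogenic(benign, unobserved, seed=0):
    """Sample unobserved variants to mirror the benign set per context."""
    rng = random.Random(seed)
    per_context = Counter(v["context"] for v in benign)
    per_protein = Counter(v["protein"] for v in benign)

    # Group candidate (unobserved) variants by trinucleotide context.
    pool = defaultdict(list)
    for v in unobserved:
        pool[v["context"]].append(v)

    sampled = []
    for context, n_benign in per_context.items():
        candidates = pool.get(context, [])
        # Weighted sampling without replacement: key = u ** (1 / weight),
        # keep the largest keys. Weight = benign count of the protein,
        # plus a pseudocount of 1 (an assumption) so no weight is zero.
        keyed = [
            (rng.random() ** (1.0 / (per_protein[v["protein"]] + 1)), i)
            for i, v in enumerate(candidates)
        ]
        keyed.sort(reverse=True)
        sampled.extend(candidates[i] for _, i in keyed[:n_benign])
    return sampled
```

If a context has fewer unobserved candidates than benign variants, this sketch simply returns all of them; a real pipeline would need to decide how to handle that case.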
Effects of using balanced training and test dataset
Unobserved data (the pathogenic variants in the training data) can include likely benign variants. The following technique can be used to filter out likely benign variants from the unobserved data
Use the initial AlphaMissense model, which has already learned to differentiate pathogenic and benign variants
Predict the pathogenicity of unobserved variants using the initial model
Use the new scores to remove likely benign variants from the unobserved data
This minimizes the probability of including likely benign unobserved variants in downstream AlphaMissense model training rounds
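The filtering round described above can be sketched as a simple score-and-threshold pass. Here `score_fn` stands in for the initial model's pathogenicity prediction and `benign_threshold` is a hypothetical cutoff; neither the function signature nor the value 0.2 comes from the source.

```python
def filter_unobserved(unobserved, score_fn, benign_threshold=0.2):
    """Keep unobserved variants the initial model does not score as likely benign.

    score_fn: callable returning a pathogenicity score in [0, 1]
              (placeholder for the initial model's prediction).
    benign_threshold: hypothetical cutoff; scores below it count as likely benign.
    """
    kept = []
    for variant in unobserved:
        if score_fn(variant) >= benign_threshold:
            kept.append(variant)
    return kept
```

The returned list would then serve as the pathogenic-labelled pool for the next training round, in place of the unfiltered unobserved set.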