Hands-on experiment :mega:

Objective

Annotate variants in VCF file with AlphaMissense and RAVEL datasets, and then compare likely pathogenic variants identified in both methods

Main steps

Read AlphaMissense_hg38 missense proteome data (71 million missense variant predictions covering the human proteome)
Read REVEL score dataset (ensemble method for combing scores from 13 tools predicting the pathogenicity of missense variants)
Reformat REVEL dataframe so that same operations can be performed for both datasets
Read variants in a VCF file and select missense variants. Then create a dataframe with variant info
Identify likely pathogenic VCF-variants based on AlphaMissense classifications
Identify likely pathogenic VCF-variants based on REVEL scores

import pandas as pd
import pysam
import seaborn as sns

Read AlphaMissense_hg38 `missense proteome data` (71 million missense variant predictions covering the human proteome)

Source: https://console.cloud.google.com/storage/browser/dm_alphamissense;tab=objects?prefix=&forceOnObjectsSortingFiltering=false

am = pd.read_csv("AlphaMissense_hg38.tsv", sep="\t", skiprows=3)
am.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71697556 entries, 0 to 71697555
Data columns (total 10 columns):
 #   Column            Dtype
---  ------            -----
 0   #CHROM            object
 1   POS               int64
 2   REF               object
 3   ALT               object
 4   genome            object
 5   uniprot_id        object
 6   transcript_id     object
 7   protein_variant   object
 8   am_pathogenicity  float64
 9   am_class          object
dtypes: float64(1), int64(1), object(8)
memory usage: 5.3+ GB

Read REVEL score dataset (ensemble method for combing scores from 13 tools predicting the pathogenicity of missense variants)

Source: https://sites.google.com/site/revelgenomics/downloads

ravel = pd.read_csv("revel_with_transcript_ids")
ravel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82100677 entries, 0 to 82100676
Data columns (total 9 columns):
 #   Column                Dtype
---  ------                -----
 0   chr                   object
 1   hg19_pos              int64
 2   grch38_pos            object
 3   ref                   object
 4   alt                   object
 5   aaref                 object
 6   aaalt                 object
 7   REVEL                 float64
 8   Ensembl_transcriptid  object
dtypes: float64(1), int64(1), object(7)
memory usage: 5.5+ GB

Reformat REVEL dataframe so that same operations can be performed for both datasets

ravel = ravel.rename(
    columns={"chr": "#CHROM", "grch38_pos": "POS", "ref": "REF", "alt": "ALT"}
)
ravel.drop(["hg19_pos", "aaref", "aaalt", "Ensembl_transcriptid"], axis=1)
ravel = ravel.drop(
    ["hg19_pos", "aaref", "aaalt", "Ensembl_transcriptid"], axis=1
).copy()
ravel["#CHROM"] = ravel["#CHROM"].apply(lambda x: "chr{}".format(x))

Read variants in a VCF file and select missense variants. Then create a dataframe with variant info

Chromosome: #CHROM
Genomic loci: POS
Reference allele: REF
Alternate allele: ALT

## Functions to read vcf file
def read_vcf_file(vcf_file):
    # Open the VCF file
    vcf_file = pysam.VariantFile(vcf_file)
    vc_dict = {"#CHROM": [], "POS": [], "REF": [], "ALT": []}
    # Iterate over each variant in the VCF file
    for variant in vcf_file:
        # Get the variant position and alleles
        chrom = variant.chrom
        position = variant.pos
        alleles = variant.alleles
        # Print the variant position and alleles
        if (len(alleles) == 2) and (len(alleles[0]) == 1) and (len(alleles[1]) == 1):
            vc_dict["#CHROM"].append(chrom)
            vc_dict["POS"].append(position)
            vc_dict["REF"].append(alleles[0])
            vc_dict["ALT"].append(alleles[1])
    return vc_dict

## Set the VCF file as a variable
vcf_file = (
    "Garvan_NA12878_HG001_HiSeq_Exome_bwa-mem_CPU.sam.bam_markdup.bam_BQSR.bam.vcf"
)

## Read the VCF file and create a dataframe with missense variant info
vc_dict = read_vcf_file(vcf_file)
vc_df = pd.DataFrame.from_dict(vc_dict)

vc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708845 entries, 0 to 708844
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   #CHROM  708845 non-null  object
 1   POS     708845 non-null  int64
 2   REF     708845 non-null  object
 3   ALT     708845 non-null  object
dtypes: int64(1), object(3)
memory usage: 21.6+ MB

vc_df.head()

	#CHROM	POS	REF	ALT
0	chr1	13684	C	T
1	chr1	13813	T	G
2	chr1	13838	C	T
3	chr1	14522	G	A
4	chr1	14542	A	G

## Identify likely pathogenic VCF-variants
##        based on AlphaMissense classifications

am_vc = pd.merge(vc_df, am, on=["#CHROM", "POS", "REF", "ALT"])
am_vc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10328 entries, 0 to 10327
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   #CHROM            10328 non-null  object
 1   POS               10328 non-null  int64
 2   REF               10328 non-null  object
 3   ALT               10328 non-null  object
 4   genome            10328 non-null  object
 5   uniprot_id        10328 non-null  object
 6   transcript_id     10328 non-null  object
 7   protein_variant   10328 non-null  object
 8   am_pathogenicity  10328 non-null  float64
 9   am_class          10328 non-null  object
dtypes: float64(1), int64(1), object(8)
memory usage: 807.0+ KB

am_vc["am_class"].groupby(am_vc["am_class"]).size()

ambiguous             235
likely_benign        9805
likely_pathogenic     288
Name: am_class, dtype: int64

Count the number of `likely_pathogenic` Missense variants in the VCF based on AlphaMissense classification

len(am_vc[am_vc["am_class"]=="likely_pathogenic"])

Identify likely pathogenic VCF-variants based on with REVEL scores

ravel_cv = pd.merge(vc_df, ravel, on=["#CHROM", "POS", "REF", "ALT"])
ravel_cv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9522 entries, 0 to 9521
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   #CHROM  9522 non-null   object
 1   POS     9522 non-null   object
 2   REF     9522 non-null   object
 3   ALT     9522 non-null   object
 4   REVEL   9522 non-null   float64
dtypes: float64(1), object(4)
memory usage: 372.1+ KB

Count the number of `likely_pathogenic` Missense variants in the VCF based on REVEL scores

Use less stringent and more stringent thresholds recommended in REVEL publication

#### Count the number of `likely_pathogenic` Missense variants less stringent threshold
len(ravel_cv[ravel_cv["REVEL"]>0.5])

REVEL publication

REVEL score above 0.5, corresponding to a sensitivity of 0.754 and specificity of 0.891
REVEL recommends more stringent REVEL score threshold - 0.75 that lead to higher specificity but lower sensitivity, with 52.1% of disease mutations, 3.3% of neutral variants

sns.displot(ravel_cv[ravel_cv["REVEL"]>0.5]["REVEL"])

#### Count the number of `likely_pathogenic` Missense variants more stringent threshold
len(ravel_cv[ravel_cv["REVEL"]>0.75]["REVEL"])

Compare VCF variant annotations - AlphaMissense classifications vs REVEL scores

### List of AlphaMissense annotated likely pathogenic variants
am_vc_list = am_vc[am_vc["am_class"]=="likely_pathogenic"].apply(lambda x: "{}_{}".format(x["#CHROM"],x["POS"]), axis=1).to_list()
len(am_vc_list)

### List of RAVEL score annotated likely pathogenic variants
ravel_cv_list = ravel_cv[ravel_cv["REVEL"]>0.75].apply(lambda x: "{}_{}".format(x["#CHROM"],x["POS"]), axis=1).to_list()
len(ravel_cv_list)

### Import plotting modules
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

Venn diagram showing the common and unique variant counts in AlphaMissense and RAVEL annotations

venn2([set(am_vc_list), set(ravel_cv_list)], set_labels = ('AlphaMissense annotations', 'RAVEL annotations'))
plt.show()

Hands-on experiment :mega:

Objective

Main steps

Read AlphaMissense_hg38 missense proteome data (71 million missense variant predictions covering the human proteome)

Read REVEL score dataset (ensemble method for combing scores from 13 tools predicting the pathogenicity of missense variants)

Reformat REVEL dataframe so that same operations can be performed for both datasets

Read variants in a VCF file and select missense variants. Then create a dataframe with variant info

Count the number of likely_pathogenic Missense variants in the VCF based on AlphaMissense classification

Identify likely pathogenic VCF-variants based on with REVEL scores

Count the number of likely_pathogenic Missense variants in the VCF based on REVEL scores

REVEL publication

Compare VCF variant annotations - AlphaMissense classifications vs REVEL scores

Venn diagram showing the common and unique variant counts in AlphaMissense and RAVEL annotations

Read AlphaMissense_hg38 `missense proteome data` (71 million missense variant predictions covering the human proteome)

Count the number of `likely_pathogenic` Missense variants in the VCF based on AlphaMissense classification

Count the number of `likely_pathogenic` Missense variants in the VCF based on REVEL scores