In a benchmarking, it’s often that we need to calculate the
concordance rate between the test set and truth set. In the truth VCF
file, there are always true genotypes GT otherwise we
can’t validate the test VCF. A test VCF may be from the variant caller
or the genotype imputation program, and the format can be variable, e.g
GP, DS and GT. Hence,
this article guides you how to use the vcfppR::vcfcomp
function to rapidly examine various statistics for different scenarios
and formats, such as the Pearson correlation of genotyping (stats=“r2”),
the Non-Reference Concordance (stats=“nrc”), the F1-score (stats=“f1”)
or the Phasing Switch Error (stats=“pse”).
We normally get genotype posterior GP and genotype
dosage DS from the diploid imputation software, eg QUILT
and GLIMPSE. To examine the imputation accuracy, we calculate the
Pearson correlation between the imputed genotype dosage and the true
genotypes. With vcfcomp, we need to specify the desired
stats="r2" and formats=c("DS","GT"), which
will extract the respective FORMAT items for the testvcf
and truthvcf.
Besides, the QUILT2-nipt method outputs MDS and
FDS for both maternal and fetal genotype dosages in
constract to the DS in diploid mode. To assess the
imputation accuracy of both the maternal and fetal, we only need to
specify the corresponding formats.
In this case, we are interested in the called genotype concordance
and the sensitivity / specificity in genotype calling. In addition to
stats="r2", we choose stats="f1" or
stats="nrc" and specify the
formats=c("GT", "GT"). Normally, we want the results for
each sample, which can be achieved by using
by.sample=TRUE.
In this case, we look for a functionality to assess the phasing switch error(PSE). First of all, we need the two VCF files to contain the phased GT, which is represented through the ‘|’. We can choose to return the sites that have PSE.
Note: Currently, the pse statatics is a
simple form that doesn’t take the completeness and quality into
account.
In the comprehensive benchmarking, we often run many tests against
the same true sets. In this scenario, we can save the truth object and
reuse it. Actually, both test and truth can
take as input a vcftable object or a RDS file. The RDS file
stores an object that returned by vcftable.
saveRDS(vcftable(truthvcf), "truth.rds")
vcfcomp(test=testvcf1, truth="truth.rds")
vcfcomp(test=testvcf2, truth="truth.rds")
vcfcomp(test=testvcf3, truth="truth.rds")Note: Since the input is R objects instead of VCF
path, then you can not use the formats option in
vcfcomp to specify new formats other than the ones in your
input R objects.
In default, the allele frequency of each vairant is calculated on the
fly from the truth VCF based on GT. Yes, you can specify an external
file that stores allele frequencies from a database. Currently,
vcfcomp can only take a space-separated text file with five
columns and a header named with “chr” “pos” “ref” “alt” “af”.
If one is certain about the samples name in the testvcf can be
replaced with other names that can match the samples in the truthvcf.
One can use the names option to specify a vector of new
names that can be found in the truthvcf.
bins to bigger intervals because there will be NAs
if no variants exist in the specified intervals bin.