complete workflow

From raw VCF to comprehensive genomic insights for rare variant discovery

Upload any VCF, optionally specify a target phenotype, and let Genetase handle the rest. The engine delivers a prioritized report with an interactive genome browser, with the annotation stage completing in about one minute for a whole-genome sample. Future updates will include expanded AI-assisted interpretation capabilities.

📁

Upload VCF

Any VCF 4.x
GATK, DeepVariant, FreeBayes, DRAGEN

→

🎯

Specify Phenotype
(optional)

Target phenotype to prioritize
matching genes and variants

→

⚙️
Preprocessing
Reference detection + liftover
Normalization + Haplotype decomposition

→

⚡

Ultra-fast Annotation

10+ databases in parallel
ClinVar, gnomAD, Pangolin, AlphaMissense...

→

📊
Report + Genome Browser
Prioritized variants
Interactive filtering & visualization

🎯

Phenotype-targeted prioritization

Specify a target phenotype and Genetase runs a parallel analysis to:

Highlight variants in genes previously reported in association with the phenotype in public databases
Prioritize variants in genes with curated gene–disease associations from reference resources
Rank variants using annotation-based scoring, including reported clinical classifications, computational impact predictions, and phenotype-related evidence

📋

Intelligent Report Features

Your report summarizes variant-level annotations and scoring, including:

Top-ranked variant – The highest-scoring variant based on annotation-derived impact metrics within genes with reported disease associations.
Flagged variants for review – Variants highlighted for further investigation or manual curation.
Inheritance-aware annotation – Incorporates inheritance and mechanism context: loss-of-function (LoF), gain-of-function (GoF), and dominant-negative (DN).
Compound heterozygosity detection – Identifies and classifies heterozygous, homozygous, and compound heterozygous variant combinations.
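
The zygosity classification described in the last item can be sketched as follows. This is a minimal illustration, not Genetase's implementation: the dictionary keys (`id`, `gene`, `gt`) and the label names are hypothetical, and without phasing or parental data two heterozygous variants in one gene are only a candidate compound heterozygote.

```python
from collections import defaultdict


def classify_zygosity(variants):
    """Label variants as homozygous, heterozygous, or candidate compound het.

    `variants` is a list of dicts with hypothetical keys 'id', 'gene', 'gt'
    (gt is a VCF-style genotype string such as '0/1' or '1|1').
    """
    labels = {}
    het_by_gene = defaultdict(list)
    for v in variants:
        alleles = v["gt"].replace("|", "/").split("/")
        alt_alleles = [a for a in alleles if a not in ("0", ".")]
        if not alt_alleles:
            labels[v["id"]] = "reference_or_missing"
        elif len(alleles) == 1:
            labels[v["id"]] = "hemizygous"
        elif alleles[0] == alleles[1]:
            labels[v["id"]] = "homozygous"
        else:
            labels[v["id"]] = "heterozygous"
            het_by_gene[v["gene"]].append(v["id"])
    # Two or more heterozygous variants in the same gene: candidate only,
    # because the sketch does not check that they sit on different haplotypes.
    for ids in het_by_gene.values():
        if len(ids) >= 2:
            for vid in ids:
                labels[vid] = "candidate_compound_het"
    return labels
```

A real implementation would additionally use phase information (for example `|`-separated genotypes or trio data) to confirm that the two variants are in trans.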

🧬

Interactive Genome & Regulatory Browser

Explore and analyze variants using flexible filters, annotations, and scoring tracks:

Variant filtering – By predicted functional impact, inheritance mode, and gene-level annotation signals.
Rich gene information – Function, inheritance patterns, expression, biological pathways, curated disease and phenotype annotations, cancer relevance, and drug metabolism.
Regulatory scores – PhyloP100, PhastCons100, and ReMM, all togglable for flexible analysis.
Inheritance-aware filters – Apply LoF (recessive/dominant), GoF, or DN variant filters.
Data export – Export prioritized variant lists for downstream analysis.

technical deep‑dive

Why preprocessing matters: the foundation of accuracy

Before annotation, every VCF undergoes a 4-stage preprocessing pipeline. This ensures that complex variants are properly interpreted, regardless of the caller or reference genome used, and prepares the data for sub-minute WGS annotation without compromising accuracy.

1. Raw VCF Input

Accepts any VCF 4.x format from any caller: GATK, DeepVariant, FreeBayes, DRAGEN, and more. Handles bgzip/gzip compression and multi-sample VCFs.
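
As a rough sketch of what transparent input handling can look like, the snippet below opens plain, gzip, or bgzip VCFs and reads the sample columns of a multi-sample file. The function names are hypothetical. It relies on the fact that bgzip output is a series of gzip members, which Python's standard `gzip` module reads as one stream.

```python
import gzip


def open_vcf(path):
    """Open a plain, gzip, or bgzip VCF as a text stream."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes, also used by bgzip
        return gzip.open(path, "rt")
    return open(path, "rt")


def sample_names(path):
    """Return the sample columns from the #CHROM header line."""
    with open_vcf(path) as fh:
        for line in fh:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields.
                return line.rstrip("\n").split("\t")[9:]
    return []
```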

2. Reference Detection + Liftover

Auto-detects the reference build (hg19/GRCh37 or hg38/GRCh38) and automatically lifts over to the build used by our annotation databases. No manual intervention required.
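
One common way to detect the build, shown here purely as an illustrative sketch, is to compare the chromosome 1 length declared in the VCF header against the known lengths for GRCh37 and GRCh38. Whether Genetase uses this heuristic is an assumption; the function name is hypothetical, and headers without `##contig` lines would need a different approach.

```python
import re

# Length of chromosome 1 in each build.
CHR1_LENGTHS = {249250621: "GRCh37", 248956422: "GRCh38"}


def detect_build(header_lines):
    """Guess the reference build from ##contig header lines."""
    for line in header_lines:
        if not line.startswith("##contig="):
            continue
        contig = re.search(r"ID=([^,>]+)", line)
        length = re.search(r"length=(\d+)", line)
        if contig and length and contig.group(1) in ("1", "chr1"):
            return CHR1_LENGTHS.get(int(length.group(1)), "unknown")
    return "unknown"
```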

3. VCF Normalization

Left-aligns variants and splits multi-allelic sites into biallelic records. Ensures consistent representation across all downstream analyses.
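
The two normalization steps can be sketched as below. This is a simplified illustration that follows the widely used trim-and-extend procedure for left alignment; the function names are hypothetical, and the reference is passed as an in-memory string (`ref_seq`) rather than an indexed FASTA. It assumes each ALT differs from REF.

```python
def left_align(pos, ref, alt, ref_seq):
    """Left-align and trim one biallelic variant (pos is 1-based)."""
    while True:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of contig start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def normalize_record(pos, ref, alts, ref_seq):
    """Split a multi-allelic site, then left-align each biallelic record."""
    return [left_align(pos, ref, alt, ref_seq) for alt in alts]
```

For example, on the toy reference `TGCACACAG`, a CA deletion and a CA insertion reported inside the repeat at position 6 both shift to the leftmost equivalent position 2.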

4. Haplotype Decomposition

Splits MNVs and complex indels while preserving haplotype context via CMPLX_id tracking. Enables accurate annotation of every complex variant.

🔬 Complex variant decomposition example

# Original complex record (haplotype block) detected and annotated
9	133255928	.	CCCCCCAG	GCCCCCAT	.	PASS	CMPLX_id=9:133255928-CCCCCCAG>GCCCCCAT;CHILD=9:133255928-C>G,9:133255935-G>T	GT	1/1

# Decomposed into atomic variants
9	133255928	.	C	G	.	PASS	CMPLX_id=9:133255928-CCCCCCAG>GCCCCCAT	GT	1/1
9	133255935	.	G	T	.	PASS	CMPLX_id=9:133255928-CCCCCCAG>GCCCCCAT	GT	1/1

# Original complex record (insertion / unequal-length indel) detected and annotated
9	135927964	.	GAGCACACACG	CAGAGCACACACGCA	.	PASS	CMPLX_id=9:135927964-GAGCACACACG>CAGAGCACACACGCA;CHILD=9:135927963-A>ACA,9:135927974-G>GCA	GT	0/1

# Decomposed into atomic variants
9	135927963	.	A	ACA	.	PASS	CMPLX_id=9:135927964-GAGCACACACG>CAGAGCACACACGCA	GT	0/1
9	135927974	.	G	GCA	.	PASS	CMPLX_id=9:135927964-GAGCACACACG>CAGAGCACACACGCA	GT	0/1

✔ Equal-length variants → decomposed into consecutive SNVs (MNV resolution) for more accurate consequence prediction
✔ Unequal-length indels → resolved via pairwise alignment and reconstruction into atomic changes for improved functional annotation
✔ Relationship preservation → retains original CMPLX_id links to support pedigree, phasing, and haplotype consistency
✔ Maximized annotation coverage → attempts to match both the original complex variant and its decomposed components against annotation databases
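
The equal-length case can be sketched in a few lines. The function name and dictionary layout are hypothetical; the `CMPLX_id` format simply mirrors the example records above. Unequal-length alleles are rejected here because, as noted, they require pairwise alignment.

```python
def decompose_mnv(chrom, pos, ref, alt):
    """Split an equal-length complex record into SNVs that share a CMPLX_id."""
    if len(ref) != len(alt):
        raise ValueError("equal-length alleles expected; indels need alignment")
    cmplx_id = f"{chrom}:{pos}-{ref}>{alt}"
    children = []
    for offset, (r, a) in enumerate(zip(ref, alt)):
        if r != a:  # keep only the positions that actually change
            children.append(
                {"chrom": chrom, "pos": pos + offset, "ref": r, "alt": a,
                 "CMPLX_id": cmplx_id}
            )
    return children
```

Applied to the first example record (`CCCCCCAG` to `GCCCCCAT` at 9:133255928), this yields the C>G change at 133255928 and the G>T change at 133255935, both carrying the parent identifier.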

💡

Without proper decomposition, complex variants may be misinterpreted or misannotated:

Multi-nucleotide variants (MNVs) or small indels can be incorrectly annotated if treated as separate SNVs without considering their haplotype context.
Functional impact predictions may be inaccurate if the variant context is not considered.
Cross-referencing with databases that store variants in atomic form may fail unless variants are decomposed.

Genetase's haplotype-aware preprocessing helps ensure accurate interpretation of complex variants.

performance engineering

Algorithmic efficiency & stratified data processing

Sub-minute WGS analysis is achieved through a combination of algorithmic complexity reduction, a stratified filtering model, and parallel I/O. The following outlines the core principles that enable scalable, rapid variant interpretation.

🗄️

Optimized Data Topology & Indexing

Many traditional variant pipelines rely on full-table scans or row-wise processing, which can become a bottleneck for large whole-genome datasets. Our system achieves sub-minute rare variant analysis by combining:

Efficient interval-based indexing – Genomic positions are indexed with high-performance data structures, enabling rapid overlap and positional queries.
Cached precomputed scores – Variant annotations are pre-aggregated to reduce repeated computation.
Rapid preliminary filtering with BigWig – Precomputed genomic tracks enable quick funneling of variants, eliminating low-priority candidates before full annotation.
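
To illustrate the first item, here is a minimal positional index built on binary search. It is a sketch under a simplifying assumption (intervals on one chromosome that do not overlap), and the class name is hypothetical; the source does not say which data structure Genetase uses, and overlapping features would call for an interval tree or similar.

```python
import bisect


class IntervalIndex:
    """Point lookup over sorted, non-overlapping intervals in O(log n)."""

    def __init__(self, intervals):
        # intervals: iterable of (start, end, label), 1-based inclusive.
        self._intervals = sorted(intervals)
        self._starts = [start for start, _, _ in self._intervals]

    def lookup(self, pos):
        """Return the label of the interval containing pos, or None."""
        i = bisect.bisect_right(self._starts, pos) - 1
        if i >= 0:
            start, end, label = self._intervals[i]
            if start <= pos <= end:
                return label
        return None
```

Each lookup costs a binary search rather than a scan over every annotated region, which is the property that matters when millions of variants are queried.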

⚡

Stratified Funnel Filtering (Early Termination)

Instead of a monolithic annotation pass, the pipeline applies a cascade of increasingly expensive operations, with the cheapest filters applied first.

This stratified approach reduces the number of variants requiring full, computationally expensive annotation by >99%, directly decreasing overall runtime while maintaining high sensitivity for variants of potential biological interest.
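
The cascade idea can be sketched generically. The stages below (an allele-frequency cutoff followed by a conservation cutoff) and their thresholds are illustrative assumptions only; the actual filter criteria are not described here.

```python
def funnel(variants, stages):
    """Apply filters cheapest-first and record how many variants survive each.

    stages: list of (name, predicate) pairs ordered from cheap to expensive.
    """
    counts = [("input", len(variants))]
    survivors = variants
    for name, keep in stages:
        survivors = [v for v in survivors if keep(v)]
        counts.append((name, len(survivors)))
    return survivors, counts


# Hypothetical stages and thresholds, for illustration only.
EXAMPLE_STAGES = [
    ("rare", lambda v: v["af"] < 0.01),
    ("conserved", lambda v: v["phylop"] >= 2.0),
]
```

Because each stage only sees the survivors of the previous one, the expensive steps at the end of the list run on a small fraction of the input.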

🔄

Parallelism & Asynchronous I/O

Hardware-level concurrency is leveraged to minimize idle time:

Parallel chromosome processing – Each autosome, sex chromosome, and mitochondrial genome is processed independently, allowing near-linear speedup with multiple CPU cores.
Batch variant annotation – Variants are grouped (1,000–5,000 per batch) and processed together, reducing per-variant overhead.
Concurrent annotation queries – Multiple annotation sources are queried simultaneously using non-blocking operations to minimize waiting time.
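
The batching and concurrent-query pattern can be sketched with `asyncio`. Everything here is a stand-in: `query_source` is a hypothetical placeholder for a real non-blocking lookup, and the source names passed in are just labels.

```python
import asyncio


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


async def query_source(name, batch):
    """Placeholder for a non-blocking annotation lookup (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return {variant: f"{name}:{variant}" for variant in batch}


async def annotate(variants, sources, batch_size=1000):
    """Query every source concurrently for each batch of variants."""
    results = {variant: {} for variant in variants}
    for batch in batched(variants, batch_size):
        answers = await asyncio.gather(
            *(query_source(source, batch) for source in sources)
        )
        for source, answer in zip(sources, answers):
            for variant, value in answer.items():
                results[variant][source] = value
    return results
```

Usage would look like `asyncio.run(annotate(variant_ids, ["clinvar", "gnomad"]))`; the wait for one source overlaps with the wait for the others instead of adding up.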

📊 Complexity & Benchmark Overview

Conventional Pipeline

VCF Parsing: O(N) → Normalization: O(N) → Full Annotation: O(N · D) → Total: ~O(N · D)

N = total number of variants, D = number of annotation databases. All variants are passed to the annotation stage.

Genetase Optimized Pipeline

Streaming Preprocessing: O(N) → Filtering (N reduced to Nf): O(N) → Targeted Annotation: O(Nf · D) → Total: ~O(N + Nf · D)

Nf ≪ N (typically <0.1%), so only a small subset is fully annotated.

* Empirical performance: The annotation stage completes in ~1 minute for a whole-genome sample (~4.5M variants, 30× coverage) on consumer-grade hardware. Preprocessing time (e.g., parsing and normalization) depends on input characteristics and is not included.

🎯

Scientific principle: reduce complexity of the most expensive operations

By filtering early, the pipeline shifts annotation cost from O(N · D) to O(N + Nf · D), where Nf ≪ N; parallelization then reduces wall-clock time further. This minimizes costly full annotations, enabling rapid, scalable whole-genome analysis without sacrificing the depth of variant interpretation.
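
A back-of-the-envelope calculation makes the difference concrete. The sketch below counts abstract operations, treating every per-variant step and every database lookup as one unit, which is a simplifying assumption; the function name is hypothetical.

```python
def operation_counts(n_variants, n_databases, surviving_fraction):
    """Compare N * D with N + Nf * D for a given surviving fraction."""
    conventional = n_variants * n_databases
    n_filtered = round(n_variants * surviving_fraction)
    stratified = n_variants + n_filtered * n_databases
    return conventional, stratified
```

With N = 4.5 million variants, D = 10 databases, and 0.1% of variants surviving the filter, this gives 45,000,000 operations for the conventional layout against 4,545,000 for the stratified one, roughly a tenfold reduction, with the remaining cost dominated by the single linear pass.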

intelligent prioritization

Machine learning–driven variant ranking

Genetase uses a machine learning model to automatically rank variants by impact and potential biological relevance. Model performance is continuously improved through updates to underlying datasets, annotations, and training signals.

ML-based ranking algorithm

A gradient-boosted model (XGBoost) learns feature weights from curated variant annotation datasets. Each variant receives a Genetase Priority Score based on:

Reported clinical annotations – Variant classifications from ClinVar with confidence weighting
Population rarity – Allele frequency with adaptive thresholds
Functional impact – Missense predictors, Pangolin scores, loss-of-function signals (splice donor/acceptor, frameshift, start/stop gained or lost), ReMM scores
Genomic context – Coding vs. non-coding regions, splice proximity, functional domains

Dynamic thresholds

Thresholds adapt automatically based on dataset composition and annotation context.

📚 Annotation databases

ClinVar (July 2025) – Retained all Pathogenic/Likely Pathogenic variants with AF <5% to ensure inclusion of rare, high-confidence annotated variants that may be enriched in specific populations (e.g., ΔF508 in Europeans or malaria-associated variants in African populations). Data from ClinVar, NCBI; public domain to the extent possible; see ClinVar Data Use
gnomAD v4.1 – Allele frequencies stratified by ancestry and gene constraint metrics from a broad aggregation of exome and genome data. Data from gnomAD (Broad Institute), open access; see gnomAD website for terms
dbSNP Build 157 – Allele frequency estimates aggregated with gnomAD. Source: dbSNP (NCBI), use subject to NCBI data use terms; see dbSNP Data Use
dbNSFP v4.9c – Curated aggregation of missense functional prediction scores from multiple published algorithms, used for variant effect prioritization (MutationAssessor, M-CAP, MetaRNN, BayesDel, LIST-S2, PHACTboost, PrimateAI, EVE, MutPred, MutFormer, gMVP, fathmm-XF, Eigen PC, deogen2, VARITY, SIFT4G, ESM1b, PROVEAN)
Ensembl VEP Release 112 – Consequence terms, transcript impact, MANE Select, regulatory annotations.
REVEL – Rare variant pathogenicity scores for missense variants. Source: REVEL, © Mount Sinai, available under the ODbL 1.0 licence
ClinPred – Machine learning-based computational tool designed to predict the pathogenicity of missense variants. Alirezaie N, Kernohan KD, Hartley T, Majewski J, Hocking TD. “ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants.” American Journal of Human Genetics. 2018 Oct 4;103(4):474-483. PMID: 30220433
AlphaMissense – Predicts the pathogenicity of all possible human single amino acid substitutions, with scores ranging from 0 to 1. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, Pritzel A, Wong LH, Zielinski M, Sargeant T, Schneider RG, Senior AW, Jumper J, Hassabis D, Kohli P, Avsec Ž. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023 Sep 22;381(6664):eadg7492. doi: 10.1126/science.adg7492. Epub 2023 Sep 22. PMID: 37733863, available under CC BY 4.0
PhyloP100 – Basewise conservation scores across 100 vertebrate species. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. doi: 10.1101/gr.097857.109. Epub 2009 Oct 26. PMID: 19858363; PMCID: PMC2798823.
PhyloP447 – Basewise conservation scores across 447 mammalian species. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21. doi: 10.1101/gr.097857.109. Epub 2009 Oct 26. PMID: 19858363; PMCID: PMC2798823.
PhastCons100 – Scores measuring evolutionary conservation across 100 vertebrate species. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. doi: 10.1101/gr.3715005. Epub 2005 Jul 15. PMID: 16024819; PMCID: PMC1182216.
ReMM – Predicts the pathogenicity of non-coding variants in the human genome. Damian Smedley, Max Schubach, Julius OB Jacobsen, Sebastian Köhler, Tomasz Zemojtel, Malte Spielmann, Marten Jäger, Harry Hochheiser, Nicole L Washington, Julie A McMurry, Melissa A Haendel, Christopher J Mungall, Suzanna E Lewis, Tudor Groza, Giorgio Valentini, Peter N Robinson. (2016). A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. The American Journal of Human Genetics, 99(3), 595–606. http://doi.org/10.1016/j.ajhg.2016.07.005. Licensed under the MIT License.
Pangolin – Precomputed variant impact scores on mRNA splicing. Wagner, N., & Neverov, A. (2025). Pangolin precomputed scores [Data set]. Zenodo. Licensed under CC BY 4.0.
VEP (Variant Effect Predictor) v.115 – Predicts functional effects of genomic variants. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4. PMID: 27268795; PMCID: PMC4893825.
UniProt – Protein sequence and function database. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D609–D617, https://doi.org/10.1093/nar/gkae1010. Licensed under CC BY 4.0.
GTEx – Adult tissue gene expression database. The GTEx Consortium, The Genotype-Tissue Expression (GTEx) Project, GTEx Portal. Data accessed on 01/07/2025. dbGaP accession: phs000424.vN.pN. Usage subject to the GTEx Portal Data License: Terms and Conditions. The GTEx Project does not endorse products/services derived from these data.
ClinGen – Database for clinical relevance of genes. ClinGen: The Clinical Genome Resource. Heidi L. Rehm, Ph.D., Jonathan S. Berg, M.D., Ph.D., Lisa D. Brooks, Ph.D., Carlos D. Bustamante, Ph.D., James P. Evans, M.D., Ph.D., Melissa J. Landrum, Ph.D., David H. Ledbetter, Ph.D., Donna R. Maglott, Ph.D., Christa Lese Martin, Ph.D., Robert L. Nussbaum, M.D., Sharon E. Plon, M.D., Ph.D., Erin M. Ramos, Ph.D., Stephen T. Sherry, Ph.D., and Michael S. Watson, Ph.D., for ClinGen. N Engl J Med 2015; 372:2235-2242 June 4, 2015 DOI: 10.1056/NEJMsr1406261.
GenCC – Global initiative that standardizes and collects evidence on gene-disease relationships DiStefano MT, Goehringer S, Babb L, Alkuraya FS, Amberger J, Amin M, Austin-Tse C, Balzotti M, Berg JS, Birney E, Bocchini C, Bruford EA, Coffey AJ, Collins H, Cunningham F, Daugherty LC, Einhorn Y, Firth HV, Fitzpatrick DR, Foulger RE, Goldstein J, Hamosh A, Hurles MR, Leigh SE, Leong IUS, Maddirevula S, Martin CL, McDonagh EM, Olry A, Puzriakova A, Radtke K, Ramos EM, Rath A, Riggs ER, Roberts AM, Rodwell C, Snow C, Stark Z, Tahiliani J, Tweedie S, Ware JS, Weller P, Williams E, Wright CF, Yates TM, Rehm HL. The Gene Curation Coalition: A global effort to harmonize gene-disease evidence resources. Genet Med. 2022 Aug;24(8):1732-1742. doi: 10.1016/j.gim.2022.04.017. Epub 2022 May 4. PMID: 35507016; PMCID: PMC7613247.
Reactome – Biological pathways database Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, Haw R, Jassal B, Matthews L, May B, Petryszak R, Ragueneau E, Rothfels K, Sevilla C, Shamovsky V, Stephan R, Tiwari K, Varusai T, Weiser J, Wright A, Wu G, Stein L, Hermjakob H, D’Eustachio P. The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025.

Allele frequency strategy: Frequencies are drawn from gnomAD (primary) and dbSNP. When both report a value for a variant, the gnomAD frequency takes precedence for higher accuracy.
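
The precedence rule stated above can be written as a small helper. This sketch covers only the "gnomAD first, dbSNP as fallback" part; how the two sources are weighted or combined beyond that is not specified, and the function name is hypothetical.

```python
def resolve_allele_frequency(gnomad_af=None, dbsnp_af=None):
    """Return (frequency, source), preferring gnomAD when it is available."""
    if gnomad_af is not None:
        return gnomad_af, "gnomAD"
    if dbsnp_af is not None:
        return dbsnp_af, "dbSNP"
    return None, None
```

Note the explicit `is not None` checks: a frequency of 0.0 is a valid observation and must not be treated as missing.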

🧬 Gene information used

pLI, pNull, pRec, LOEUF, missense Z-score
Biological pathways (Reactome)
Protein function (UniProt)
Gene expression (GTEx adult tissues + UniProt)
Haploinsufficiency (ClinGen)
Disease associations (GenCC)
Drug and cancer relevance (text mining)

📍 Variant position within gene

Coding region
Splice region (±10 bp around exon-intron boundaries)
5′/3′ UTR regions
Regulatory regions if annotated by relevant expression/function datasets
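
The ±10 bp splice-region rule can be sketched as a distance check against internal exon boundaries. The function names and the exon representation are hypothetical, and measuring distance to the boundary base is an approximation of how transcript-aware tools define splice regions.

```python
def splice_boundaries(exons):
    """Return exon-intron boundary positions for one transcript.

    exons: list of (start, end), 1-based inclusive. The outer ends of the
    first and last exon are transcript ends, not splice junctions.
    """
    ordered = sorted(exons)
    boundaries = []
    for i, (start, end) in enumerate(ordered):
        if i > 0:
            boundaries.append(start)
        if i < len(ordered) - 1:
            boundaries.append(end)
    return boundaries


def in_splice_region(pos, exons, window=10):
    """True if pos lies within `window` bp of an exon-intron boundary."""
    return any(abs(pos - b) <= window for b in splice_boundaries(exons))
```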

📊 Final output: Genetase Priority Score

Each variant is assigned a priority score reflecting predicted impact:

Pathogenic – Reported pathogenic in ClinVar by more than one submitter without conflict
VH (Very High) – Very high impact prediction; prioritized for further review
H (High) – High impact prediction; considered VUS with stronger supporting evidence of impact
M (Moderate) – Moderate impact prediction; considered VUS with moderate supporting evidence of impact
L (Low) – Low impact prediction; typically deprioritized unless phenotype-specific relevance is observed
VL (Very Low) – Very low impact prediction; generally deprioritized
Benign – Reported benign in ClinVar by more than one submitter without conflict

Variants with Pathogenic or VH scores are surfaced at the top of analysis outputs for manual inspection. Scores of H and M indicate intermediate evidence strength and may require additional contextual evaluation.
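
The ordering of the seven tiers can be expressed as a simple sort key. This sketch only encodes the ordering and the review rule described above; the dictionary keys and function names are hypothetical, and it says nothing about how the tiers themselves are assigned.

```python
# Tiers from highest to lowest priority, as listed above.
TIER_ORDER = ["Pathogenic", "VH", "H", "M", "L", "VL", "Benign"]
TIER_RANK = {tier: rank for rank, tier in enumerate(TIER_ORDER)}


def sort_by_priority(variants):
    """Sort variant dicts (hypothetical 'tier' key) from highest priority down."""
    return sorted(variants, key=lambda v: TIER_RANK[v["tier"]])


def surfaced_for_inspection(tier):
    """Pathogenic and VH variants are surfaced for manual inspection."""
    return tier in ("Pathogenic", "VH")
```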

Explore Genetase for research

Try our interactive demo to explore variant scoring and annotations. Genetase is currently provided for research purposes only. Learn how it can support genomic studies and share your feedback or interest in future access.


Inside Genetase: Ultra‑fast rare variant pipeline