How can I turn on or customize forum notifications?
By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email,...
What is "Phone Home" and how does it affect me?
1. What it is and how it helps us improve the GATK Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the...
Collected FAQs about VCF files
1. What file formats do you support for variant callsets? We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported. 2. How can I know if my VCF file is valid?...
Why are some of the annotation values different with VariantAnnotator...
As featured in this forum question. Two main things account for these kinds of differences, both linked to default behaviors of the tools: The tools downsample to different depths of coverage The tools...
What is Map/Reduce and why are GATK tools called "walkers"?
Overview One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and...
How do I submit a detailed bug report?
Note: only do this if you have been explicitly asked to do so. Scenario: You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit...
What is GATK-Lite and how does it relate to "full" GATK 2.x? [RETIRED]
Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here. You probably know by now that GATK-Lite is a free-for-everyone and completely...
How can I invoke read filters and their arguments?
Most GATK tools apply several read filters by default. You can look up exactly what the defaults are for each tool in its Technical Documentation page. But sometimes you want to specify...
What are the prerequisites for running GATK?
1. Operating system The GATK runs natively on most if not all flavors of UNIX, which includes MacOSX, Linux and BSD. It is possible to get it running on Windows using Cygwin, but we don't provide any...
Can I use different versions of the GATK at different steps of my analysis?
Short answer: NO. Medium answer: no, at least not if you want to run a low-risk pipeline. Long answer: see below for details. The rationale There are several reasons why you might want to do this:...
What types of variants can GATK tools detect / handle?
The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation. Variant manipulation GATK variant manipulation tools are able to recognize...
Where can I get the GATK source code?
We distinguish "Classic GATK" (major versions 1 through 3) and GATK 4, the next generation of GATK tools. "Classic GATK" (major versions 1 through 3) (current distribution) We provide the current GATK...
What do the VariantEval modules do?
VariantEval accepts two types of modules: stratification and evaluation modules. Stratification modules will stratify (group) the variants based on certain properties. Evaluation modules will compute...
Collected FAQs about input files for sequence read data (BAM/CRAM)
1. What file formats do you support for sequence data input? The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample...
What is the structure of a GATK command?
Overview This document describes how GATK commands are structured and how to add arguments to basic command examples. Basic java syntax Commands for GATK always follow the same basic syntax: java [Java...
Where can I get a gene list in RefSeq format?
1. About the RefSeq Format From the NCBI RefSeq website The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including...
Which tools use pedigree information?
There are two types of GATK tools that are able to use pedigree (family structure) information: Tools that require a pedigree to operate PhaseByTransmission and CalculateGenotypePosteriors will not run...
I'm new to GATK. Where do I start?
If this is your first rodeo, you're probably asking yourself: What can GATK do for me? Identify variants in a bunch of sample sequences, with great sensitivity and specificity. How do I get GATK to do...
Can I use GATK on non-diploid organisms?
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order...
What is the GATKReport file format?
A GATKReport is simply a text document that contains a well-formatted, easy-to-read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to...
What do I need to do before attending a workshop hands-on session?
So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the...
How can I use parallelism to make GATK tools run faster?
This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results. Overview As explained in the primer on...
What input files does the GATK accept / require?
All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs: Reference genome sequence Sequencing reads Intervals of interest Reference-ordered data...
What is uBAM and why is it better than FASTQ for storing unmapped sequence data?
Most sequencing providers generate FASTQ files with the raw unmapped read sequences, so that is the most common form in which the data is input into the mapping step of the pre-processing pipeline....
What is a GVCF and how is it different from a 'regular' VCF?
Overview GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra...
How should I cite GATK in my own publications?
To date we have published three papers on GATK (citation details below). The ideal way to cite the GATK is to use all three as a triple citation, as in: We sequenced 10 samples on 10 lanes on an Illumina...
What should I use as known variants/sites for running tool X?
1. Notes on known sites Why are they important? Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is...
Should I analyze my samples alone or together?
Together is (almost always) better than alone We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That...
I have multiple read groups for 1 sample. How should I pre-process them?
Things can get a bit messy when you have multiple libraries (or read groups) for a sample. You may not know how to organize the data for the pre-processing steps or how to feed the data into Haplotype...
Lane, Library, Sample and Cohort -- what do they mean and why are they...
There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation: Lane: The basic machine unit for sequencing. The lane reflects the...
What is the difference between QUAL and GQ annotations?
There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference. The basic difference is that QUAL refers to the variant site whereas GQ...
What is a VCF and how should I interpret it?
This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document. Contents What is...
How should I pre-process data from multiplexed sequencing and multi-library...
Our Best Practices Pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per...
Collected FAQs about interval lists
1. What file formats do you support for interval lists? We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists,...
Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?
Use HaplotypeCaller! The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call...
When should I use -L to pass in a list of intervals?
The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for...
How can I prepare a FASTA file to use as reference?
This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial. Why these steps are...
Which training sets / arguments should I use for running VQSR?
This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on...
What's in the resource bundle and how can I get it?
1. Obtaining the bundle Inside of the Broad, the latest bundle will always be available in: /humgen/gsa-hpprojects/GATK/bundle/current with a subdirectory containing for each reference sequence and...
Can I apply the germline variant joint calling workflow to my RNAseq data?
We have not yet validated the joint genotyping methods (HaplotypeCaller in -ERC GVCF mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq...
How should I select samples for a Panel of Normals for somatic analysis?
The Panel of Normals (PoN) plays two important roles in somatic variant analysis: Exclude germline variant sites that are found in the normals to avoid calling them as potential somatic variants in the...
What do GATK workshops cover?
This is a summary description of our standard 3-day workshop, with optional 1-day pipelining add-on at the discretion of the organizer. Overview This workshop formula focuses on the core steps involved...