SimGBS
A Julia package for simulating Genotyping-by-Sequencing (GBS) data.
Highlights
SimGBS provides a simple and efficient method of simulating Genotyping-by-Sequencing (GBS) data. It provides valuable resources for evaluating and developing bioinformatics pipelines and statistical methods.
SimGBS generates GBS data from any reference genome (in FASTA format) using common restriction enzyme(s).
SimGBS defines large, complex population following the gene-drop method.
SimGBS samples genetic and GBS-specific variations from statistical models to reflect randomness at both individual and population level.
SimGBS is capable of simulating GBS data from thousands of samples under customizable settings.
Installation
SimGBS is registered in the General
registry, it can be directly installed in Julia
using Pkg.add
julia> using Pkg
julia> Pkg.add("SimGBS")
or
julia> ]
pkg> add SimGBS
For more information about how to install Julia package, please visit Getting Started with Julia.
Overview
SimGBS is designed to simulate GBS data in three simple steps:
Step One: Generate GBS Fragments
SimGBS generates raw GBS fragments through in silico digestion. It cuts the reference genome into smaller fragments using common restriction enzyme(s).
An optional fragment size-selection step is available to filter out GBS fragments where length falls outside the user-defined thresholds.
Step Two: Define Population Structure
SimGBS first defines a founder population, where variants (i.e., SNPs and QTL) of each founder are sampled based on sampled allele frequencies.
By taking the gene drop approach , SimGBS models crossing overs between any pair of individuals, and sample recombination sites randomly across the genome. Haplotypes of their progeny can be then defined by tracking these recombination events (including the origin of maternal and parental chromosomes, as well as the recombination sites) at each generation.
The demographic history of the final population can be defined by controlling the size of founder population, manipulating the growth rate of the intermediate population.
Once the base population becomes available, users can supply a pedigree file to generate a breeding population.
Step Three: Simulate GBS Process
With selected GBS fragments (after fragment size-selection) and sampled SNP positions, SimGBS can determine short haplotypes by selecting only those variants that are captured by GBS fragments and keeping GBS fragments that contains variant(s).
For each retained GBS fragment (after size-selection), two GBS reads are generated for every diploid individual based on its short haplotypes. A unique short DNA sequence (barcode, as a sample identifier) will be attached to the 5' end of each GBS read.
To capture variation in sequencing depth, a realised depth matrix is generated by sampling per-sample and per-locus depth from independent Gamma models respectively. Both Gamma models have identical means, equal to the average total sequencing depth, and the per-locus depth model comes with larger variance than per-sample depth model, as GBS protocols have been optimised to minimise between-sample depth variations. Actual read counts of each sample at per locus basis are then drawn from a Negative Binomial distribution with mean equal to its realised depth score.
Two GBS reads generated from a single locus for each sample (i.e., each carries maternal/paternal haplotype) will share the read counts at equal chance (i.e., Binomial sampling with probability equals 0.5).
Once the extact copies of each GBS read become known, SimGBS replicates each GBS read at the specified level. A header, which stores the information of sequencing, and a string of quality scores are then added to each GBS read to generate a GBS sequence. All simulated GBS reads are in FASTQ format.
Getting Started
Once installed, we can import SimGBS in Julia
julia> using SimGBS
Required Input
The following files are required for running SimGBS
A compressed reference genome (e.g., ref.fa.gz)
GBS barcodes (e.g., GBS_Barcodes.txt)
To help users with testing SimGBS, an example script and associated input files can be found on GitHub or figshare.
Note that input files must be stored in the working directory, please use the following commands to check/change your current working directory
julia> pwd()
julia> cd()
Usage and Options
The main functionalities of SimGBS have been organised into three wrapper functions, where each function corresponds to one step as described in the previous section.
Step One: Generate GBS Fragments
SimGBS.digestGenome
— MethoddigestGenome(genofile, re, useChr, useChrLen, lower ,upper, plotOutput, writeOutput)
Perform virtual digestion using restriction enzyme and generate GBS fragments.
This function uses specified restriction enzyme(s) to digest genome and therefore generate GBS fragments. Fragment size-selection step is also included.
Arguments
genofile
: file containing the reference genomere
: restriction enzyme(s) to be useduseChr
: either the number of chromosome or a set of chromosome(s) to be simulateduseChrLen
: length of chromosome in cM to be used in simulation, otherwise using entire chromosomelower
: lower threshold of fragment size-selectionupper
: upper threshold of fragment size-selectionwinSize
: size of window used for calculating average genomic coverageplotOutput
: set to true if graphical outputs are requiredwriteOutput
: set to true if text outputs are required
...
Examples
julia> digestGenome("ref.fa.gz", [SimGBS.ApeKI], [1], Array{Float64}(undef,0), 65 ,195, 1000000, false, true)
Step Two: Define Population Structure
SimGBS.definePopulation
— MethoddefinePopulation(numFounders, endSize, numGenCha, numGenCon, numGenFinal, numInd, useWeights, usePedigree, pedFile, pedOutput);
Create population structure for simulation.
This function generates different population structure, and there is an option to follow a user-defined pedigree.
Arguments
numFounders
: number of founders in the base populationendSize
: number of individuals to end up in the changingPopSize stepnumInd
: number of individuals to be simulatednumGenCha
: number of generations for changingPopSize functionnumGenCon
: number of generations for constantPopSize functionnumGenFinal
: number of final generations to be used to select individualuseWeights
: weights of each contributing genetarion in the final population compositionusePedigree
: set to false if you don't use pedigree, otherwise specify the pedigree file to be usedpedFile
: pedigree filepedOutput
: set to true if return pedigree output
Notes
- Please consider modifyin the combination of constant and changing population (as well as combining multiple populations) when defining complicated population structure.
Examples
julia> definePopulation(100, 500, 20, 100, 4, 96, Array{Float64}(undef,0), false, "sim.ped", false);
Step Three: Simulate GBS Process
SimGBS.GBS
— MethodGBS(totalQTL, totalSNP, muDensity, sigmasqDensity, winSize, muAlleleFreq, sigmasqAlleleFreq, re, meanDepth, barcodeFile, useChr, plotOutput, writeOutput, onlyOutputGBS)
Simulate Genotyping-by-Sequencing (GBS) data.
This function generates GBS reads by inserting genomic variants into in silico digested genomic fragments, ligates the polymorphic sequence with barcodes and replicates based on sequencing depth.
Arguments
totalQTL
: total number of QTL to be simulatedtotalSNP
: total number of SNPs to be simulated (set to "0" if sampling SNP positions based on density)muDensity
: location parameter of log-Laplace distribution (for sampling SNP density)sigmasqDensity
: scale parameter of log-Laplace distribution (for sampling SNP density)winSize
: Size of window and bin for sampling SNP positionsmuAlleleFreq
: mean of sampled allele frequencysigmasqAlleleFreq
: variance of sampled allele frequencyre
: restriction enzyme(s) to be usedbarcodeFile
: file containing GBS barcodesuseChr
: either the number of chromosomes or a set of chromosome(s) to be simulatedplotOutput
: set to true if graphical outputs are requiredwriteOutput
: set to true if text outputs are requiredonlyOutputGBS
: set to true if only GBS data is kept
Examples
julia> GBS(100, 0, -2.0, 0.001, 1000000, 0.5, 0.001, [SimGBS.ApeKI], 20.0, "GBS_Barcodes.txt", [1], false, true, true)
After importing SimGBS, users can execute all three functions sequentially to generate GBS data. Alternatively, users may run the Julia script directly
$ julia example.jl
Note that each function can also be executed independently to carry out specific task (e.g., run digestGenome
for testing GBS coverage under different choice of restriction enzyme).
Expected Output
Following outputs will be returned
GBS Fragments
- RawFrag.txt: raw GBS fragments following in silico digestion
- GBSFrag.txt: selected GBS fragments after fragment size-selection
- GBSCoverage.txt: genomic coverage of GBS fragments
- snpFragGBS.txt: GBS fragments that contain SNPs
Variants
qtlGeno.txt
: QTL genotype matrix (number of individual x number of QTL)snpGeno.txt
: SNP genotype matrix (number of individual x number of SNP)qtlInfo.txt
: information about QTL, including chromosome, position and allele frequency
QTL ID | Chromosome | Position | Sampled Allele Frequnecy |
---|---|---|---|
QTL_1 | 1 | 2064912 | 0.003 |
QTL_2 | 1 | 3088811 | 0.15 |
QTL_3 | 1 | 4974206 | 0.36 |
snpInfo.txt
: information about SNPs, including chromosome, position and allele frequency
SNP ID | Chromosome | Position | Sampled Allele Frequnecy |
---|---|---|---|
SNP_1 | 1 | 109078 | 0.27 |
SNP_2 | 1 | 109083 | 0.02 |
SNP_3 | 1 | 247119 | 0.07 |
shortHap.txt
: a (number of haplotypes (2 x number of individuals for diploid species) x number of SNP) matrix of short haplotypes generated by GBS fragments (i.e., SNPs captured within each GBS fragment)readDepth.txt
: a (number of individuals x number of SNP) matrix records the number of copies per GBS fragment
GBS Data
keyFile\_ABC12AAXX\_1.txt
: pseudo-information about GBS sample, e.g.
Flowcell | Lane | Barcode | Sample Name | Plate | Row | Column |
---|---|---|---|---|---|---|
ABC12AAXX | 1 | CCAATCAGA | Ind_1 | Plate1 | A | 1 |
ABC12AAXX | 1 | CCAATCAGA | Ind_2 | Plate1 | A | 2 |
ABC12AAXX | 1 | CCAATCAGA | Ind_3 | Plate1 | A | 3 |
ABC12AAXX\_1\_fastq.txt.gz
: simulated GBS sequences
What's Next?
The following tools are recommended for downstream analyses of GBS data,
snpGBS: a simple bioinformatics workflow to identify single nucleotide polymorphism (SNP) from Genotyping-by-Sequencing (GBS) data.
KGD: R code for the analysis of genotyping-by-sequencing (GBS) data, primarily to construct a genomic relationship matrix for the genotyped individuals.
GUSLD: An R package for estimating linkage disequilibrium using low and/or high coverage sequencing data without requiring filtering with respect to read depth.
SMAP a software package that analyzes read mapping distributions and performs haplotype calling to create multi-allelic molecular markers.