SimGBS

A Julia package for simulating Genotyping-by-Sequencing (GBS) data.

Highlights

SimGBS provides a simple and efficient method of simulating Genotyping-by-Sequencing (GBS) data. It provides valuable resources for evaluating and developing bioinformatics pipelines and statistical methods.

SimGBS generates GBS data from any reference genome (in FASTA format) using common restriction enzyme(s).

SimGBS defines large, complex population following the gene-drop method.

SimGBS samples genetic and GBS-specific variations from statistical models to reflect randomness at both individual and population level.

SimGBS is capable of simulating GBS data from thousands of samples under customizable settings.

Installation

SimGBS is registered in the General registry, it can be directly installed in Julia using Pkg.add

julia> using Pkg
julia> Pkg.add("SimGBS")

julia> ]
pkg> add SimGBS

For more information about how to install Julia package, please visit Getting Started with Julia.

Overview

SimGBS is designed to simulate GBS data in three simple steps:

Step One: Generate GBS Fragments

SimGBS generates raw GBS fragments through in silico digestion. It cuts the reference genome into smaller fragments using common restriction enzyme(s).

An optional fragment size-selection step is available to filter out GBS fragments where length falls outside the user-defined thresholds.

Step Two: Define Population Structure

SimGBS first defines a founder population, where variants (i.e., SNPs and QTL) of each founder are sampled based on sampled allele frequencies.

By taking the gene drop approach , SimGBS models crossing overs between any pair of individuals, and sample recombination sites randomly across the genome. Haplotypes of their progeny can be then defined by tracking these recombination events (including the origin of maternal and parental chromosomes, as well as the recombination sites) at each generation.

The demographic history of the final population can be defined by controlling the size of founder population, manipulating the growth rate of the intermediate population.

Once the base population becomes available, users can supply a pedigree file to generate a breeding population.

Step Three: Simulate GBS Process

With selected GBS fragments (after fragment size-selection) and sampled SNP positions, SimGBS can determine short haplotypes by selecting only those variants that are captured by GBS fragments and keeping GBS fragments that contains variant(s).

For each retained GBS fragment (after size-selection), two GBS reads are generated for every diploid individual based on its short haplotypes. A unique short DNA sequence (barcode, as a sample identifier) will be attached to the 5' end of each GBS read.

To capture variation in sequencing depth, a realised depth matrix is generated by sampling per-sample and per-locus depth from independent Gamma models respectively. Both Gamma models have identical means, equal to the average total sequencing depth, and the per-locus depth model comes with larger variance than per-sample depth model, as GBS protocols have been optimised to minimise between-sample depth variations. Actual read counts of each sample at per locus basis are then drawn from a Negative Binomial distribution with mean equal to its realised depth score.

Two GBS reads generated from a single locus for each sample (i.e., each carries maternal/paternal haplotype) will share the read counts at equal chance (i.e., Binomial sampling with probability equals 0.5).

Once the extact copies of each GBS read become known, SimGBS replicates each GBS read at the specified level. A header, which stores the information of sequencing, and a string of quality scores are then added to each GBS read to generate a GBS sequence. All simulated GBS reads are in FASTQ format.

Getting Started

Once installed, we can import SimGBS in Julia

julia> using SimGBS

Required Input

The following files are required for running SimGBS

A compressed reference genome (e.g., ref.fa.gz)
GBS barcodes (e.g., GBS_Barcodes.txt)

To help users with testing SimGBS, an example script and associated input files can be found on GitHub or figshare.

Note that input files must be stored in the working directory, please use the following commands to check/change your current working directory

julia> pwd()
julia> cd()

Usage and Options

The main functionalities of SimGBS have been organised into three wrapper functions, where each function corresponds to one step as described in the previous section.

Step One: Generate GBS Fragments

SimGBS.digestGenome — Method

digestGenome(genofile, re, useChr, useChrLen, lower ,upper, plotOutput, writeOutput)

Perform virtual digestion using restriction enzyme and generate GBS fragments.

This function uses specified restriction enzyme(s) to digest genome and therefore generate GBS fragments. Fragment size-selection step is also included.

Arguments

genofile: file containing the reference genome
re: restriction enzyme(s) to be used
useChr: either the number of chromosome or a set of chromosome(s) to be simulated
useChrLen: length of chromosome in cM to be used in simulation, otherwise using entire chromosome
lower: lower threshold of fragment size-selection
upper: upper threshold of fragment size-selection
winSize: size of window used for calculating average genomic coverage
plotOutput: set to true if graphical outputs are required
writeOutput: set to true if text outputs are required

...

Examples

julia> digestGenome("ref.fa.gz", [SimGBS.ApeKI], [1], Array{Float64}(undef,0), 65 ,195, 1000000, false, true)

source

Step Two: Define Population Structure

SimGBS.definePopulation — Method

definePopulation(numFounders, endSize, numGenCha, numGenCon, numGenFinal, numInd, useWeights, usePedigree, pedFile, pedOutput);

Create population structure for simulation.

This function generates different population structure, and there is an option to follow a user-defined pedigree.

Arguments

numFounders: number of founders in the base population
endSize: number of individuals to end up in the changingPopSize step
numInd: number of individuals to be simulated
numGenCha: number of generations for changingPopSize function
numGenCon: number of generations for constantPopSize function
numGenFinal: number of final generations to be used to select individual
useWeights: weights of each contributing genetarion in the final population composition
usePedigree: set to false if you don't use pedigree, otherwise specify the pedigree file to be used
pedFile: pedigree file
pedOutput: set to true if return pedigree output

Notes

Please consider modifyin the combination of constant and changing population (as well as combining multiple populations) when defining complicated population structure.

Examples

julia> definePopulation(100, 500, 20, 100, 4, 96,  Array{Float64}(undef,0), false, "sim.ped", false);

source

Step Three: Simulate GBS Process

SimGBS.GBS — Method

GBS(totalQTL, totalSNP, muDensity, sigmasqDensity, winSize, muAlleleFreq, sigmasqAlleleFreq, re, meanDepth, barcodeFile, useChr, plotOutput, writeOutput, onlyOutputGBS)

Simulate Genotyping-by-Sequencing (GBS) data.

This function generates GBS reads by inserting genomic variants into in silico digested genomic fragments, ligates the polymorphic sequence with barcodes and replicates based on sequencing depth.

Arguments

totalQTL: total number of QTL to be simulated
totalSNP: total number of SNPs to be simulated (set to "0" if sampling SNP positions based on density)
muDensity: location parameter of log-Laplace distribution (for sampling SNP density)
sigmasqDensity: scale parameter of log-Laplace distribution (for sampling SNP density)
winSize: Size of window and bin for sampling SNP positions
muAlleleFreq: mean of sampled allele frequency
sigmasqAlleleFreq: variance of sampled allele frequency
re: restriction enzyme(s) to be used
barcodeFile: file containing GBS barcodes
useChr: either the number of chromosomes or a set of chromosome(s) to be simulated
plotOutput: set to true if graphical outputs are required
writeOutput: set to true if text outputs are required
onlyOutputGBS: set to true if only GBS data is kept

Examples

julia> GBS(100, 0, -2.0, 0.001, 1000000, 0.5, 0.001, [SimGBS.ApeKI], 20.0, "GBS_Barcodes.txt", [1], false, true, true)

source

After importing SimGBS, users can execute all three functions sequentially to generate GBS data. Alternatively, users may run the Julia script directly

$ julia example.jl

Note that each function can also be executed independently to carry out specific task (e.g., run digestGenome for testing GBS coverage under different choice of restriction enzyme).

Expected Output

Following outputs will be returned

GBS Fragments

RawFrag.txt: raw GBS fragments following in silico digestion
GBSFrag.txt: selected GBS fragments after fragment size-selection
GBSCoverage.txt: genomic coverage of GBS fragments
snpFragGBS.txt: GBS fragments that contain SNPs

Variants

qtlGeno.txt: QTL genotype matrix (number of individual x number of QTL)
snpGeno.txt: SNP genotype matrix (number of individual x number of SNP)
qtlInfo.txt: information about QTL, including chromosome, position and allele frequency

QTL ID	Chromosome	Position	Sampled Allele Frequnecy
QTL_1	1	2064912	0.003
QTL_2	1	3088811	0.15
QTL_3	1	4974206	0.36

snpInfo.txt: information about SNPs, including chromosome, position and allele frequency

SNP ID	Chromosome	Position	Sampled Allele Frequnecy
SNP_1	1	109078	0.27
SNP_2	1	109083	0.02
SNP_3	1	247119	0.07

shortHap.txt: a (number of haplotypes (2 x number of individuals for diploid species) x number of SNP) matrix of short haplotypes generated by GBS fragments (i.e., SNPs captured within each GBS fragment)
readDepth.txt: a (number of individuals x number of SNP) matrix records the number of copies per GBS fragment

GBS Data

keyFile\_ABC12AAXX\_1.txt: pseudo-information about GBS sample, e.g.

Flowcell	Lane	Barcode	Sample Name	Plate	Row	Column
ABC12AAXX	1	CCAATCAGA	Ind_1	Plate1	A	1
ABC12AAXX	1	CCAATCAGA	Ind_2	Plate1	A	2
ABC12AAXX	1	CCAATCAGA	Ind_3	Plate1	A	3

ABC12AAXX\_1\_fastq.txt.gz: simulated GBS sequences

What's Next?

The following tools are recommended for downstream analyses of GBS data,

snpGBS: a simple bioinformatics workflow to identify single nucleotide polymorphism (SNP) from Genotyping-by-Sequencing (GBS) data.
KGD: R code for the analysis of genotyping-by-sequencing (GBS) data, primarily to construct a genomic relationship matrix for the genotyped individuals.
GUSLD: An R package for estimating linkage disequilibrium using low and/or high coverage sequencing data without requiring filtering with respect to read depth.
SMAP a software package that analyzes read mapping distributions and performs haplotype calling to create multi-allelic molecular markers.