SimGBS

A Julia package for simulating Genotyping-by-Sequencing (GBS) data.

Highlights

SimGBS provides a simple and efficient method of simulating Genotyping-by-Sequencing (GBS) data. It provides valuable resources for evaluating and developing bioinformatics pipelines and statistical methods.

SimGBS generates GBS data from any reference genome (in FASTA format) using common restriction enzyme(s).

SimGBS defines large, complex population following the gene-drop method.

SimGBS samples genetic and GBS-specific variations from statistical models to reflect randomness at both individual and population level.

SimGBS is capable of simulating GBS data from thousands of samples under customizable settings.

Installation

SimGBS is registered in the General registry, it can be directly installed in Julia using Pkg.add

julia> using Pkg
julia> Pkg.add("SimGBS")

or

julia> ]
pkg> add SimGBS

For more information about how to install Julia package, please visit Getting Started with Julia.

Overview

SimGBS is designed to simulate GBS data in three simple steps:

Step One: Generate GBS Fragments

SimGBS generates raw GBS fragments through in silico digestion. It cuts the reference genome into smaller fragments using common restriction enzyme(s).

An optional fragment size-selection step is available to filter out GBS fragments where length falls outside the user-defined thresholds.

Step Two: Define Population Structure

SimGBS first defines a founder population, where variants (i.e., SNPs and QTL) of each founder are sampled based on sampled allele frequencies.

By taking the gene drop approach , SimGBS models crossing overs between any pair of individuals, and sample recombination sites randomly across the genome. Haplotypes of their progeny can be then defined by tracking these recombination events (including the origin of maternal and parental chromosomes, as well as the recombination sites) at each generation.

The demographic history of the final population can be defined by controlling the size of founder population, manipulating the growth rate of the intermediate population.

Once the base population becomes available, users can supply a pedigree file to generate a breeding population.

Step Three: Simulate GBS Process

With selected GBS fragments (after fragment size-selection) and sampled SNP positions, SimGBS can determine short haplotypes by selecting only those variants that are captured by GBS fragments and keeping GBS fragments that contains variant(s).

For each retained GBS fragment (after size-selection), two GBS reads are generated for every diploid individual based on its short haplotypes. A unique short DNA sequence (barcode, as a sample identifier) will be attached to the 5' end of each GBS read.

To capture variation in sequencing depth, a realised depth matrix is generated by sampling per-sample and per-locus depth from independent Gamma models respectively. Both Gamma models have identical means, equal to the average total sequencing depth, and the per-locus depth model comes with larger variance than per-sample depth model, as GBS protocols have been optimised to minimise between-sample depth variations. Actual read counts of each sample at per locus basis are then drawn from a Negative Binomial distribution with mean equal to its realised depth score.

Two GBS reads generated from a single locus for each sample (i.e., each carries maternal/paternal haplotype) will share the read counts at equal chance (i.e., Binomial sampling with probability equals 0.5).

Once the extact copies of each GBS read become known, SimGBS replicates each GBS read at the specified level. A header, which stores the information of sequencing, and a string of quality scores are then added to each GBS read to generate a GBS sequence. All simulated GBS reads are in FASTQ format.

Getting Started

Once installed, we can import SimGBS in Julia

julia> using SimGBS

Required Input

The following files are required for running SimGBS

  • A compressed reference genome (e.g., ref.fa.gz)

  • GBS barcodes (e.g., GBS_Barcodes.txt)

To help users with testing SimGBS, an example script and associated input files can be found on GitHub or figshare.

Note that input files must be stored in the working directory, please use the following commands to check/change your current working directory

julia> pwd()
julia> cd()

Usage and Options

The main functionalities of SimGBS have been organised into three wrapper functions, where each function corresponds to one step as described in the previous section.

Step One: Generate GBS Fragments

SimGBS.digestGenomeMethod
digestGenome(genofile, re, useChr, useChrLen, lower ,upper, plotOutput, writeOutput)

Perform virtual digestion using restriction enzyme and generate GBS fragments.

This function uses specified restriction enzyme(s) to digest genome and therefore generate GBS fragments. Fragment size-selection step is also included.

Arguments

  • genofile: file containing the reference genome
  • re: restriction enzyme(s) to be used
  • useChr: either the number of chromosome or a set of chromosome(s) to be simulated
  • useChrLen: length of chromosome in cM to be used in simulation, otherwise using entire chromosome
  • lower: lower threshold of fragment size-selection
  • upper: upper threshold of fragment size-selection
  • winSize: size of window used for calculating average genomic coverage
  • plotOutput: set to true if graphical outputs are required
  • writeOutput: set to true if text outputs are required

...

Examples

julia> digestGenome("ref.fa.gz", [SimGBS.ApeKI], [1], Array{Float64}(undef,0), 65 ,195, 1000000, false, true)
source

Step Two: Define Population Structure

SimGBS.definePopulationMethod
definePopulation(numFounders, endSize, numGenCha, numGenCon, numGenFinal, numInd, useWeights, usePedigree, pedFile, pedOutput);

Create population structure for simulation.

This function generates different population structure, and there is an option to follow a user-defined pedigree.

Arguments

  • numFounders: number of founders in the base population
  • endSize: number of individuals to end up in the changingPopSize step
  • numInd: number of individuals to be simulated
  • numGenCha: number of generations for changingPopSize function
  • numGenCon: number of generations for constantPopSize function
  • numGenFinal: number of final generations to be used to select individual
  • useWeights: weights of each contributing genetarion in the final population composition
  • usePedigree: set to false if you don't use pedigree, otherwise specify the pedigree file to be used
  • pedFile: pedigree file
  • pedOutput: set to true if return pedigree output

Notes

  • Please consider modifyin the combination of constant and changing population (as well as combining multiple populations) when defining complicated population structure.

Examples

julia> definePopulation(100, 500, 20, 100, 4, 96,  Array{Float64}(undef,0), false, "sim.ped", false);
source

Step Three: Simulate GBS Process

SimGBS.GBSMethod
GBS(totalQTL, totalSNP, muDensity, sigmasqDensity, winSize, muAlleleFreq, sigmasqAlleleFreq, re, meanDepth, barcodeFile, useChr, plotOutput, writeOutput, onlyOutputGBS)

Simulate Genotyping-by-Sequencing (GBS) data.

This function generates GBS reads by inserting genomic variants into in silico digested genomic fragments, ligates the polymorphic sequence with barcodes and replicates based on sequencing depth.

Arguments

  • totalQTL: total number of QTL to be simulated
  • totalSNP: total number of SNPs to be simulated (set to "0" if sampling SNP positions based on density)
  • muDensity: location parameter of log-Laplace distribution (for sampling SNP density)
  • sigmasqDensity: scale parameter of log-Laplace distribution (for sampling SNP density)
  • winSize: Size of window and bin for sampling SNP positions
  • muAlleleFreq: mean of sampled allele frequency
  • sigmasqAlleleFreq: variance of sampled allele frequency
  • re: restriction enzyme(s) to be used
  • barcodeFile: file containing GBS barcodes
  • useChr: either the number of chromosomes or a set of chromosome(s) to be simulated
  • plotOutput: set to true if graphical outputs are required
  • writeOutput: set to true if text outputs are required
  • onlyOutputGBS: set to true if only GBS data is kept

Examples

julia> GBS(100, 0, -2.0, 0.001, 1000000, 0.5, 0.001, [SimGBS.ApeKI], 20.0, "GBS_Barcodes.txt", [1], false, true, true)
source

After importing SimGBS, users can execute all three functions sequentially to generate GBS data. Alternatively, users may run the Julia script directly

$ julia example.jl

Note that each function can also be executed independently to carry out specific task (e.g., run digestGenome for testing GBS coverage under different choice of restriction enzyme).

Expected Output

Following outputs will be returned

GBS Fragments

  • RawFrag.txt: raw GBS fragments following in silico digestion
  • GBSFrag.txt: selected GBS fragments after fragment size-selection
  • GBSCoverage.txt: genomic coverage of GBS fragments
  • snpFragGBS.txt: GBS fragments that contain SNPs

Variants

  • qtlGeno.txt: QTL genotype matrix (number of individual x number of QTL)
  • snpGeno.txt: SNP genotype matrix (number of individual x number of SNP)
  • qtlInfo.txt: information about QTL, including chromosome, position and allele frequency
QTL IDChromosomePositionSampled Allele Frequnecy
QTL_1120649120.003
QTL_2130888110.15
QTL_3149742060.36
  • snpInfo.txt: information about SNPs, including chromosome, position and allele frequency
SNP IDChromosomePositionSampled Allele Frequnecy
SNP_111090780.27
SNP_211090830.02
SNP_312471190.07
  • shortHap.txt: a (number of haplotypes (2 x number of individuals for diploid species) x number of SNP) matrix of short haplotypes generated by GBS fragments (i.e., SNPs captured within each GBS fragment)

  • readDepth.txt: a (number of individuals x number of SNP) matrix records the number of copies per GBS fragment

GBS Data

  • keyFile\_ABC12AAXX\_1.txt: pseudo-information about GBS sample, e.g.
FlowcellLaneBarcodeSample NamePlateRowColumn
ABC12AAXX1CCAATCAGAInd_1Plate1A1
ABC12AAXX1CCAATCAGAInd_2Plate1A2
ABC12AAXX1CCAATCAGAInd_3Plate1A3
  • ABC12AAXX\_1\_fastq.txt.gz: simulated GBS sequences

What's Next?

The following tools are recommended for downstream analyses of GBS data,

  • snpGBS: a simple bioinformatics workflow to identify single nucleotide polymorphism (SNP) from Genotyping-by-Sequencing (GBS) data.

  • KGD: R code for the analysis of genotyping-by-sequencing (GBS) data, primarily to construct a genomic relationship matrix for the genotyped individuals.

  • GUSLD: An R package for estimating linkage disequilibrium using low and/or high coverage sequencing data without requiring filtering with respect to read depth.

  • SMAP a software package that analyzes read mapping distributions and performs haplotype calling to create multi-allelic molecular markers.