vcfsim

vcfsim is a command-line tool for generating simulated VCFs (Variant Call Format files, which encode genetic variation data). It combines a coalescent simulation backend (msprime) with clean and efficient postprocessing to produce a wide variety of biologically realistic VCFs with parameterized levels of missing data. Realistic VCFs can be simulated with just a few command-line arguments.

Authors

Paimon Goulart (UC Riverside), Kieran Samuk (UC Riverside)

Installation

First create and activate a conda environment for vcfsim:

conda create -n vcfsim_env python=3.10
conda activate vcfsim_env

vcfsim is available on bioconda and can be installed with the following command:

conda install bioconda::vcfsim

For more detailed installation instructions, please visit:
https://bioconda.github.io/recipes/vcfsim/README.html?highlight=vcfsi#package-package%20'vcfsim'

Arguments

The required and optional arguments for running vcfsim are listed below.

Required

--seed [SEED] Random seed for vcfsim to use

--percent_missing_sites [PERCENT_MISSING_SITES] Percent of rows missing from your VCF

--percent_missing_genotypes [PERCENT_MISSING_GENOTYPES] Percent of samples missing from your VCF

One of the following three options must also be provided to set the samples:

--sample_size [SAMPLE_SIZE] Number of samples drawn from the population and written to the VCF
--samples [SAMPLES ...] Custom sample names, space-separated (e.g. A1 B1 C1)
--samples_file [SAMPLES_FILE] File containing one whitespace-separated line of custom sample names

Optional

--chromosome [CHROMOSOME] Chromosome name/label

--replicates [REPLICATES] Number of replicate VCFs to produce (with varying seeds)

--sequence_length [SEQUENCE_LENGTH] Length of the chromosome to be simulated, in basepairs

--ploidy [PLOIDY] Ploidy for your VCF

--Ne [NE] Effective population size of the simulated population(s)

--mu [MU] Mutation rate in the simulated population(s)

--output_file [OUTPUT_FILE] Filename of the output VCF; the seed is appended to it automatically

--chromosome_file [CHROMOSOME_FILE] Specified file for multiple chromosome inputs (see below for details)

--population_mode [1|2] Number of populations to simulate. 1 = single population (default), 2 = two populations with a shared history (C splits into A & B).

--time [TIME] Split time for population mode 2 (e.g. generations before present). Required if --population_mode 2 is specified.
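The argument rules above can be summarized as a small check. This is a minimal sketch only, not vcfsim's own argument parser: the function name check_args and the use of a plain dict are hypothetical, and it only encodes the rules stated in this README (three required arguments, one of the three sample options, and --time when --population_mode is 2).

```python
def check_args(args):
    """Return a list of problems found in a dict of vcfsim-style arguments.

    Hypothetical helper for illustration; vcfsim performs its own parsing.
    """
    problems = []

    # The three arguments the README lists as required.
    for name in ("seed", "percent_missing_sites", "percent_missing_genotypes"):
        if args.get(name) is None:
            problems.append(f"--{name} is required")

    # One of the three sample options must be given.
    sample_opts = [n for n in ("sample_size", "samples", "samples_file")
                   if args.get(n) is not None]
    if not sample_opts:
        problems.append(
            "provide one of --sample_size, --samples, --samples_file")

    # Population mode 2 needs a split time; mode 1 is the default.
    if args.get("population_mode", 1) == 2 and args.get("time") is None:
        problems.append("--time is required with --population_mode 2")

    return problems
```

An empty list means the combination satisfies the rules listed here; it says nothing about whether the individual values are sensible.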

Usage

Typical usage for vcfsim is as follows:

vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10

This will create a VCF named "myvcf1234.vcf", i.e. "myvcf" followed by the seed given as input.
If more than one replicate is requested (for example, --replicates 2), vcfsim creates one output file per replicate, here myvcf1234 and myvcf1235, adding one to the seed after every run.
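The naming rule described above can be sketched in a few lines. This is an illustration of the rule as stated in this README, not code taken from vcfsim; the function name expected_outputs is hypothetical, and the ".vcf" extension is assumed from the "myvcf1234.vcf" example.

```python
def expected_outputs(prefix, seed, replicates):
    """List the output filenames implied by the README's naming rule.

    Assumption: each replicate uses seed, seed + 1, ... and the file is
    named prefix + seed + ".vcf".
    """
    return [f"{prefix}{seed + i}.vcf" for i in range(replicates)]


print(expected_outputs("myvcf", 1234, 2))
# → ['myvcf1234.vcf', 'myvcf1235.vcf']
```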

NOTE: An output file does not need to be specified. If no output file is given, the VCF is written to STDOUT.

[Screenshot of output file]

Using custom sample names

Instead of --sample_size, you can provide explicit sample names:

vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples A1 B1 C1 D1

This will automatically set the sample size to 4 and label the VCF columns A1 B1 C1 D1.

You can also read the names from a file containing a single whitespace-separated line:

vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples_file names.txt

Where names.txt might contain:

A1 B1 C1 D1 E1

If no custom names are given, sample identifiers default to tsk_0, ..., tsk_n.
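A sample names file of the kind described above can be produced with a short script. This is a minimal sketch: the helper names write_samples_file and read_samples_file are hypothetical, and the only format assumed is the one stated here (a single line of whitespace-separated names).

```python
from pathlib import Path


def write_samples_file(path, names):
    """Write sample names as one space-separated line."""
    for name in names:
        # A name containing whitespace would be split into several names.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid sample name: {name!r}")
    Path(path).write_text(" ".join(names) + "\n")


def read_samples_file(path):
    """Read the names back by splitting on any whitespace."""
    return Path(path).read_text().split()
```

For example, write_samples_file("names.txt", ["A1", "B1", "C1", "D1", "E1"]) writes the line shown in the names.txt example, which can then be passed with --samples_file names.txt.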

Simulating a structured population split (population_mode = 2)

To simulate a demographic split between populations A and B from an ancestral population C:

vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10 --population_mode 2 --time 1000

Multiple chromosome inputs

vcfsim can also read the parameters for multiple chromosomes from a file.

The input file should be a text file with one row per run.

The columns are, in order: chromosome, ploidy, sequence length, population size, mutation rate.
Each row represents a separate run of vcfsim; the output of all runs is concatenated into a single file.
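A file with this column order could be generated as follows. This is a sketch under stated assumptions: the helper name write_chromosome_file is hypothetical, and the space delimiter and the absence of a header row are guesses, since this README states only the column order. Check the delimiter against a file known to work with your vcfsim version.

```python
from pathlib import Path

# Column order as given in the README: chromosome, ploidy, sequence length,
# population size, mutation rate.
COLUMNS = ("chromosome", "ploidy", "sequence_length", "Ne", "mu")


def write_chromosome_file(path, rows, sep=" "):
    """Write one line per run, columns in README order.

    Assumption: columns are separated by `sep` (a space by default) and
    there is no header line.
    """
    lines = [sep.join(str(row[col]) for col in COLUMNS) for row in rows]
    Path(path).write_text("\n".join(lines) + "\n")
```

Mutation rates are best passed as strings (for example "0.000001") so they are written exactly as typed rather than in Python's default float notation.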

The following command should be used when running vcfsim in this way:

vcfsim --seed 1234 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10 --chromosome_file input.txt

You can also combine a chromosome file with custom sample names:

vcfsim --seed 1234 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples_file names.txt --chromosome_file input.txt

[Screenshot: output when run this way]

[Screenshot: the concatenated VCF]
