vcfsim is a command-line tool for generating simulated VCF's (variant call format files for encoding genetic data). It combines a coalescent simulation backend (msprime) with clean and efficient postprocessing to produce a wide variety of biologically realistic VCFs, with parameterized levels of missing data. Realistic VCF's can now be easily simulated with just a few command line arguments!
Paimon Goulart (UC Riverside), Kieran Samuk (UC Riverside)
First create and activate a conda environment for vcfsim:
conda create -n vcfsim_env python=3.10
conda activate vcfsim_envvcfsim is currently available on bioconda, and can be installed by using the following command:
conda install bioconda::vcfsimFor more detailed installation instructions, please visit:
https://bioconda.github.io/recipes/vcfsim/README.html?highlight=vcfsi#package-package%20'vcfsim'
Here is the list of required/optional arguments to run vcfsim
--seed [SEED] Random seed for vcfsim to use
--percent_missing_sites [PERCENT_MISSING_SITES] Percent of rows missing from your VCF
--percent_missing_genotypes [PERCENT_MISSING_GENOTYPES] Percent of samples missing from your VCF
One of the following three options must also be provided to set the samples:
- --sample_size [SAMPLE_SIZE] Amount of samples from population in VCF
- --samples [SAMPLES ...] Custom sample names, space separated (e.g. A1 B1 C1)
- --samples_file [SAMPLES_FILE] File containing one whitespace separated line of custom sample names
--chromosome [CHROMOSOME] Chromosome name/label
--replicates [REPLICATES] Number of replicate VCFs to produce (with varying seeds)
--sequence_length [SEQUENCE_LENGTH] Length of the chromosome to be simulated, in basepairs
--ploidy [PLOIDY] Ploidy for your VCF
--Ne [NE] Effective population size of the simulated population(s)
--mu [MU] Mutation rate in the simulated population(s)
--output_file [OUTPUT_FILE] Filename of outputed vcf, will automatically be followed by seed
--chromosome_file [CHROMOSOME_FILE] Specified file for multiple chromosome inputs (see below for details)
--population_mode [1|2] Number of populations simulate. 1 = single population (default), 2 = two populations with a shared history (C splits into A & B).
--time [TIME] Split time for population mode 2 (e.g. generations before present). Required if --population_mode 2 is specified.
Typical usage for vcfsim is as follows:
vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10This will create a VCV with the name "myvcf1234.vcf", i.e. "myvcf" followed by the seed given for the input.
If input for replicates were requested higher number than 1, 2 for example, then vcfsim will create two output files by the name of myvcf1234 and myvcf1235, adding one to the seed after every run.
NOTE: An output file doesn't needed to be specified. If no output file is specified, then the vcf will be redirected to STDOUT.
Instead of --sample_size, you can provide explicit sample names:
vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples A1 B1 C1 D1This will automatically set the sample size to 4 and label the VCF columns A1 B1 C1 D1.
You can also read the names from a file containing a single whitespace separated line:
vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples_file names.txtWhere names.txt might contain:
A1 B1 C1 D1 E1
Otherwise, sample identifiers will default to tsk_0,...,tsk_n
To simulate a demographic split between populations A and B from an ancestral population C:
vcfsim --chromosome 1 --replicates 1 --seed 1234 --sequence_length 10000 --ploidy 2 --Ne 100000 --mu .000001 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10 --population_mode 2 --time 1000Another way vcfsim can be used is by providing a file for multiple chromosome inputs.
Your input file should be in the form of a text file, and should be formatted as such:

The columns are in the order of: chromosone, ploidy, sequence length, population size, mutation rate.
Each row will represent a seperate run of vcfsim, all these runs will be concatenated to the same file in the end.
The following command should be used when running vcfsim in this way:
vcfsim --seed 1234 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --sample_size 10 --chromosome_file input.txtYou can also combine a param file with custom names:
vcfsim --seed 1234 --percent_missing_sites 0 --percent_missing_genotypes 0 --output_file myvcf --samples_file names.txt --chromosome_file input.txt

