Commit 26360ab

Merge pull request #84 from J35P312/master
TIDDIT 3.0.0
2 parents bd1e6a8 + d6432fb commit 26360ab

221 files changed: +1832 -79134 lines changed

README.md

Lines changed: 43 additions & 124 deletions
@@ -1,31 +1,40 @@
11
DESCRIPTION
22
==============
33
TIDDIT is a tool used to identify chromosomal rearrangements using Mate Pair or Paired End sequencing data. TIDDIT identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions, using supplementary alignments as well as discordant pairs.
4-
4+
TIDDIT searches for discordant reads and split reads (supplementary alignments). The supplementary alignments are assembled and aligned using a fermikit-like workflow.
5+
Next all signals (contigs, split-reads, and discordant pairs) are clustered using DBSCAN. The resulting clusters are filtered and annotated, and reported as SV depending on the statistics.
56
TIDDIT has two analysis modules: the sv mode, which is used to search for structural variants, and the cov mode, which analyses the read depth of a bam file and generates a coverage report.
6-
7+
On a 30X human genome, the TIDDIT SV module typically completes within 5 hours, and requires less than 10 GB of RAM.
78

89
INSTALLATION
910
==============
10-
TIDDIT requires standard c++/c libraries, python 2.7 or 3.6, cython, and Numpy. To compile TIDDIT, cmake must be installed.
11-
samtools is reuquired for reading cram files (but not for reading bam).
11+
TIDDIT requires python3, cython, pysam, and Numpy; as well as bwa and fermikit (fermi2 and ropebwt2).
12+
13+
Installation
1214

1315
```
1416
git clone https://github.com/SciLifeLab/TIDDIT.git
1517
```
1618

1719
To install TIDDIT:
1820
```
19-
cd TIDDIT
20-
./INSTALL.sh
21+
cd tiddit
22+
pip install -e .
23+
```
24+
25+
Next, install fermikit; I recommend using conda:
26+
27+
```
28+
conda install fermikit
2129
```
22-
The install script will compile python and use pip to install the python dependencies
23-
TIDDIT is run via the TIDDIT.py script:
30+
31+
You may also compile bwa, fermi2, and ropebwt2 yourself. Remember to add the executables to your PATH, or provide their paths through the command line parameters.
32+
2433
```
2534
26-
python TIDDIT.py --help
27-
python TIDDIT.py --sv --help
28-
python TIDDIT.py --cov --help
35+
tiddit --help
36+
tiddit --sv --help
37+
tiddit --cov --help
2938
```
3039

3140
TIDDIT may be installed using bioconda:
@@ -38,110 +47,52 @@ Next, you may run TIDDIT like this:
3847
tiddit --sv
3948
tiddit --cov
4049

41-
TIDDIT is also distributed with a Singularity environment (http://singularity.lbl.gov/index.html). Type the following command to download the container:
50+
TIDDIT is also distributed with a Docker container (http://singularity.lbl.gov/index.html). Type the following command to download the container:
4251

43-
singularity pull --name TIDDIT.simg shub://J35P312/TIDDIT:latest
52+
singularity pull --name TIDDIT.simg
4453

4554
Type the following to run tiddit:
4655

47-
singularity exec TIDDIT.simg TIDDIT.py
48-
49-
You may also build it yourself (if you have sudo permisions)
56+
singularity exec TIDDIT.simg tiddit
5057

51-
sudo singularity build TIDDIT.simg Singularity
52-
53-
The singularity container will download and install the latest commit on the scilifelab branch of TIDDIT.
54-
The "versioned_singularity" folder contains singularity recipes for installing certain releases of TIDDIT.
55-
These releases may also be downloaded through singularity hub
56-
57-
singularity pull --name TIDDIT.simg shub://J35P312/TIDDIT:2.7.1
5858

5959
The SV module
6060
=============
6161
The main TIDDIT module detects structural variants using discordant pairs, split reads, and coverage information
6262

63-
python TIDDIT.py --sv [Options] --bam in.bam
64-
65-
66-
TIDDIT support streaming of the bam file:
67-
68-
samtools view -buh in.bam | python TIDDIT.py --sv [Options] --bam /dev/stdin
69-
70-
Optionally, TIDDIT acccepts a reference fasta for GC correction:
71-
72-
python TIDDIT.py --sv [Options] --bam bam --ref reference.fasta
73-
74-
75-
Reference is required for analysing cram files:
76-
77-
python TIDDIT.py --sv [Options] --bam in.cram --ref reference.fasta
78-
79-
80-
Where bam is the input bam or cran file. And reference.fasta is the reference fasta used to align the sequencing data: TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position.
81-
82-
The reference is required for analysing cram files.
83-
84-
NOTE: It is important that you use the TIDDIT.py wrapper for SV detection. The TIDDIT binary in the TIDDIT/bin folder does not perform any clustering, it simply extract SV signatures into a tab file.
63+
python tiddit --sv [Options] --bam in.bam --ref reference.fa
8564

65+
Where in.bam is the input bam or cram file, and reference.fa is the reference fasta used to align the sequencing data. TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position.
8666

8767
TIDDIT may be fine-tuned by altering these optional parameters:
8868

8969
-o output prefix(default=output)
90-
91-
-i paired reads maximum allowed insert size. Pairs aligning
92-
on the same chr at a distance higher than this are
93-
considered candidates for SV (default= 99.9th percentile of insert size)
94-
95-
-d expected reads orientations, possible values "innie" (-> <-) or "outtie" (<- ->).
96-
Default: major orientation within the dataset
97-
98-
-p Minimum number of supporting pairs in order to call a variation event (default 3)
99-
100-
-r Minimum number of supporting split reads to call a small variant (default 3)
101-
102-
-q Minimum mapping quality to consider an alignment (default= 5)
103-
104-
-Q Minimum regional mapping quality (default 20)
105-
70+
-i paired reads maximum allowed insert size. Pairs aligning on the same chr at a distance higher than this are considered candidates for SV (default= 99.9th percentile of insert size)
71+
-d expected reads orientations, possible values "innie" (-> <-) or "outtie" (<- ->). Default: major orientation within the dataset
72+
-p Minimum number of supporting pairs in order to call a variant (default 3)
73+
-r Minimum number of supporting split reads to call a variant (default 3)
74+
-q Minimum mapping quality to consider an alignment (default 5)
10675
-n the ploidy of the organism,(default = 2)
107-
108-
-e clustering distance parameter, discordant pairs closer
109-
than this distance are considered to belong to the same
110-
variant(default = sqrt(insert-size*2)*12)
111-
76+
-e clustering distance parameter, discordant pairs closer than this distance are considered to belong to the same variant(default = sqrt(insert-size*2)*12)
77+
-c average coverage, overwrites the estimated average coverage (useful for exome or panel data)
11278
-l min-pts parameter (default=3),must be set >= 2
113-
11479
-s Number of reads to sample when computing library statistics(default=25000000)
115-
116-
-z minimum variant size (default=100), variants smaller than
117-
this will not be printed ( z < 10 is not recomended)
118-
119-
--force_ploidy force the ploidy to be set to -n across the entire genome
120-
(i.e skip coverage normalisation of chromosomes)
121-
122-
--no_cluster Run only the TIDDIT signal extraction
123-
124-
--debug rerun the tiddit clustering procedure
125-
80+
-z minimum variant size (default=50), variants smaller than this will not be printed (z < 10 is not recommended)
81+
--force_ploidy force the ploidy to be set to -n across the entire genome (i.e skip coverage normalisation of chromosomes)
12682
--n_mask exclude regions from coverage calculation if they contain more than this fraction of N (default = 0.5)
127-
128-
--ref reference fasta, used for GC correction and for reading cram
129-
130-
--p_ratio minimum discordant pair/normal pair ratio at the breakpoint junction(default=0.2)
131-
83+
--bwa path to bwa executable file(default=bwa)
84+
--fermi2 path to fermi2 executable file (default=fermi2)
85+
--ropebwt2 path to ropebwt2 executable file (default=ropebwt2)
86+
--p_ratio minimum discordant pair/normal pair ratio at the breakpoint junction(default=0.1)
13287
--r_ratio minimum split read/coverage ratio at the breakpoint junction(default=0.1)
133-
88+
--max_coverage filter call if X times higher than chromosome average coverage (default=4)
89+
--min_contig Skip calling on small contigs (default < 10000 bp)
13490

13591

13692

13793
output:
13894

139-
TIDDIT SV module produces three output files, a vcf file containing SV calls, a tab file describing the coverage across the genome in bins of size 50 bp, and a tab file dscribing the estimated ploidy and coverage across each contig.
140-
141-
Useful settings:
142-
143-
144-
In noisy datasets you may get too many small variants. If this is the case, then you may increase the -l parameter, or set the -i parameter to a high value (such as 2000) (on 10X linked read data, I usually set -l to 5).
95+
TIDDIT SV module produces two output files: a vcf file containing SV calls, and a tab file describing the estimated ploidy and coverage across each contig.
14596

14697

14798
The cov module
@@ -168,8 +119,6 @@ TIDDIT uses four different filters to detect low quality calls. The filter field
168119
The number of discordant pairs supporting the variant is too low compared to the number of discordant pairs within that genomic region.
169120
Unexpectedcoverage
170121
High coverage
171-
Smear
172-
The two windows that define the regions next to the breakpoints overlap.
173122

174123
Failed variants may be removed using tools such as VCFtools or grep. Removing these variants greatly improves the precision of TIDDIT, but may reduce the sensitivity. It is advised to remove filtered variants or prioritize the variants that have passed the quality checks.
175124
This command may be used to filter the TIDDIT vcf:
@@ -183,36 +132,6 @@ The variant support of each call is compared to these values, and the quality co
183132

184133
Note: SVs usually occur in repetitive regions, hence these scores are expected to be relatively low. A true variant may have a low score, and the score itself depends on the input data (mate-pair vs pe for instance).
185134

186-
Contents of the VCF INFO field
187-
=============
188-
The INFO field of the VCF contains the following entries:
189-
190-
SVTYPE
191-
Type of structural variant(DEL,DUP,BND,INV,TDUP)
192-
END
193-
End position of an intra-chromosomal variant
194-
LFA
195-
The number of discordant pairs at the the first breakpoint of the variant
196-
LFB
197-
The number of discordant pairs at the the second breakpoint of the variant
198-
LTE
199-
The number of discordnat pairs that form the structural variant.
200-
COVA
201-
Coverage on window A
202-
COVM
203-
The coverage between A and B
204-
COVB
205-
Coverage on window B
206-
CIPOS
207-
start and stop positon of window A
208-
CIEND
209-
start and stop position of window B
210-
QUALA
211-
The average mapping quality of the reads in window A
212-
QUALB
213-
The average mapping quality of the reads in window B
214-
215-
The content of the INFO field can be used to filter out false positives and to gain more understanding of the structure of the variant. More info is found in the vcf file.
216135
Merging the vcf files
217136
=====================
218137
I usually merge vcf files using SVDB (https://github.com/J35P312)
@@ -247,8 +166,8 @@ genes may be annotated using vep or snpeff. NIRVANA may be used for annotating C
247166
Algorithm
248167
=========
249168

250-
Discordant pairs and split reads (supplementary alignments) are extracted and stored in the ".signals.tab" file. A discordant pair is any pair having a larger insert size than the -i paramater, or a pair where the reads map to different chromosomes.
251-
supplementary alignments and discordant pairs are only extracted if their mapping quality exceed the -q parameter.
169+
Discordant pairs, split reads (supplementary alignments), and contigs are extracted. A discordant pair is any pair having a larger insert size than the -i parameter, or a pair where the reads map to different chromosomes.
170+
Supplementary alignments and discordant pairs are only extracted if their mapping quality exceeds the -q parameter. Contigs are generated by assembling all reads with supplementary alignments using fermi2.
252171

253172
The most recent version of TIDDIT uses an algorithm similar to DBSCAN: A cluster is formed if -l or more signals are located within the -e distance. Once a cluster is formed, more signals may be added if these signals are within the
254173
-e distance of -l signals within a cluster.
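
Editorial illustration (not part of the commit): the clustering rule described above can be sketched as a small DBSCAN-style routine. This is a minimal sketch under stated assumptions, not TIDDIT's actual implementation. The function names (`default_epsilon`, `neighbours`, `cluster_signals`), the representation of a signal as a `(pos_a, pos_b)` breakpoint pair, and the choice to require both breakpoints to lie within the -e distance are all hypothetical. Only the default formula `sqrt(insert-size*2)*12` and the roles of -e and -l come from the README text.

```python
import math


def default_epsilon(insert_size):
    """Default -e as stated in the README: sqrt(insert-size*2)*12."""
    return math.sqrt(insert_size * 2) * 12


def neighbours(signals, i, eps):
    """Indices of signals (including i) whose two breakpoints both lie
    within eps of signal i. The distance rule is an assumption."""
    a, b = signals[i]
    return [j for j, (x, y) in enumerate(signals)
            if abs(x - a) <= eps and abs(y - b) <= eps]


def cluster_signals(signals, eps, min_pts=3):
    """DBSCAN-style labelling: returns one label per signal,
    a cluster id (0, 1, ...) or -1 for unclustered signals."""
    labels = [None] * len(signals)  # None = not visited yet
    cluster_id = 0
    for i in range(len(signals)):
        if labels[i] is not None:
            continue
        seeds = neighbours(signals, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1  # too few signals nearby (-l not reached)
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border signal joins the cluster
                continue
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            nbrs = neighbours(signals, j, eps)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # dense signal: keep expanding
        cluster_id += 1
    return labels


# Hypothetical signals: three nearby breakpoint pairs and one isolated pair.
example = [(1000, 5000), (1010, 5020), (990, 4980), (50000, 90000)]
print(default_epsilon(450))                          # 360.0
print(cluster_signals(example, eps=100, min_pts=3))  # [0, 0, 0, -1]
```

In this sketch the isolated signal is labelled -1 and would not contribute to any call, which mirrors the role of the -l (min-pts) parameter in the text above.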

TIDDIT.py

Lines changed: 0 additions & 133 deletions
This file was deleted.

lib/CMakeLists.txt

Lines changed: 0 additions & 13 deletions
This file was deleted.
