Build phylogenetic trees guided by hierarchical lineage structures.
Tidy Tree constructs phylogenetic trees by building subtrees for individual lineages and stitching them together according to a lineage guide tree. This approach is useful when you have:
- A hierarchical classification of sequences into lineages
- Representative founder sequences for each lineage
- A known or hypothesized relationship between lineages
The tool works in three main steps:
- Assign sequences to lineages: Each input sequence is assigned to its lineage based on the provided mapping table.
- Build subtrees: For each lineage, a subtree is constructed containing:
  - All sequences belonging to that lineage
  - The founder sequence of that lineage
  - The founder sequences of its child lineages (connection points)
- Stitch subtrees: Subtrees are combined following the lineage guide tree structure. Child lineage subtrees are grafted onto parent subtrees at the positions of the child founder sequences.
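The first two steps can be sketched in plain Python. This is a minimal illustration, not the tool's actual code: the children mapping (parent lineage to child lineages) is a hypothetical stand-in for the parsed guide tree, the function name subtree_members is made up for this example, and founder sequences are assumed to be named after their lineage.

```python
from collections import defaultdict

def subtree_members(assignments, children):
    """Return, per lineage, the sequence IDs that would go into its subtree.

    assignments: dict mapping sequence ID -> lineage name
    children:    dict mapping lineage name -> list of child lineage names
                 (hypothetical stand-in for the parsed guide tree)
    """
    # Step 1: group the input sequences by their assigned lineage
    by_lineage = defaultdict(list)
    for seq_id, lineage in assignments.items():
        by_lineage[lineage].append(seq_id)

    # Collect every lineage mentioned as a parent, a child, or an assignment
    lineages = set(children) | set(by_lineage)
    for kids in children.values():
        lineages.update(kids)

    # Step 2: a subtree holds the lineage's sequences, its own founder,
    # and the founders of its child lineages (the connection points)
    members = {}
    for lineage in sorted(lineages):
        members[lineage] = (
            sorted(by_lineage.get(lineage, []))
            + [lineage]
            + list(children.get(lineage, []))
        )
    return members

assignments = {"seq001": "LineageA", "seq002": "LineageA", "seq003": "LineageB"}
children = {"LineageA": ["LineageB"]}
members = subtree_members(assignments, children)
print(members["LineageA"])  # ['seq001', 'seq002', 'LineageA', 'LineageB']
```

The child founder LineageB appears in the LineageA subtree, which is what later gives the stitching step a position to graft onto.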
The subtrees are built using IQ-TREE, a fast and accurate phylogenetic inference tool.
- Python 3.7+
- BioPython
- IQ-TREE (must be installed separately)
- Clone or download this repository
- Install Python dependencies:
  pip install -r requirements.txt
- Install IQ-TREE:
  - Download from http://www.iqtree.org/
  - Or install via conda:
    conda install -c bioconda iqtree
  - Or install via apt:
    sudo apt install iqtree
python tidy_tree.py \
--alignment aligned_sequences.fasta \
--founder-sequences aligned_founders.fasta \
--guide-tree lineage_tree.newick \
--lineage-assignments assignments.tsv \
--output-tree output_tree.newick

- --alignment, -s: Aligned input sequences in FASTA format
- --founder-sequences, -f: Aligned lineage founder sequences in FASTA format
- --guide-tree, -g: Lineage guide tree in Newick format
- --lineage-assignments, -a: Tab-separated file mapping sequences to lineages
- --output-tree, -o: Output file for the final tree (Newick format)

- --seq-id-column: Column name for sequence IDs in assignments file (default: seq_id)
- --lineage-column: Column name for lineage in assignments file (default: lineage)
- --iqtree-path: Path to IQ-TREE executable (default: iqtree)
- --model: Substitution model for IQ-TREE (default: GTR+G)
- --threads: Number of CPU threads for IQ-TREE (default: 1)
- --iqtree-args: Additional arguments to pass to IQ-TREE
Standard FASTA format with aligned sequences:
>seq001
ATGCGATCGATCG---ATCG
>seq002
ATGCGATCGATCGATGATCG
>seq003
ATGCGAT---TCGATGATCG
Important: All sequences must be pre-aligned and have the same length.
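A quick way to verify the equal-length requirement before running the tool is sketched below. It uses a small hand-written FASTA reader so it has no dependencies; the function names read_fasta and check_alignment are hypothetical and not part of tidy_tree.py.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of sequence ID -> sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records

def check_alignment(records):
    """Return the shared alignment length, or raise ValueError on a mismatch."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError(f"Sequence length mismatch: found lengths {sorted(lengths)}")
    return lengths.pop() if lengths else 0

fasta_text = """>seq001
ATGCGATCGATCG---ATCG
>seq002
ATGCGATCGATCGATGATCG
>seq003
ATGCGAT---TCGATGATCG
"""
records = read_fasta(fasta_text.splitlines())
print(check_alignment(records))  # 20
```

The same check can be run on the founder FASTA file, and the two lengths compared, since founders must share the alignment of the input sequences.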
FASTA format with one founder sequence per lineage:
>LineageA
ATGCGATCGATCGATGATCG
>LineageB
ATGCGAT---TCGATGATCG
>LineageC
ATGCGATCG---GATGATCG
Important:
- Founder sequence IDs must match the lineage names in the guide tree
- Founders must be aligned with the same alignment as the input sequences
Newick format describing the hierarchical relationship between lineages:
(LineageA,(LineageB,LineageC));
Or with branch lengths:
(LineageA:0.1,(LineageB:0.05,LineageC:0.05):0.05);
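Because founder IDs must match the lineage names in the guide tree, it can help to compare the two before a run. The sketch below pulls labels out of a Newick string with a regular expression. It is a simplification that assumes unquoted labels, and it would also pick up numeric support values written after a closing parenthesis; the function names are hypothetical.

```python
import re

def newick_names(newick):
    """Collect node labels from a Newick string (unquoted labels only)."""
    # A label follows '(', ',' or ')' and runs until the next structural
    # character; branch lengths are skipped because they follow ':'.
    return re.findall(r"[(,)]\s*([^(),:;\s]+)", newick)

def missing_founders(newick, founder_ids):
    """Return guide-tree names that have no matching founder sequence ID."""
    return sorted(set(newick_names(newick)) - set(founder_ids))

guide = "(LineageA:0.1,(LineageB:0.05,LineageC:0.05):0.05);"
print(newick_names(guide))                                # ['LineageA', 'LineageB', 'LineageC']
print(missing_founders(guide, ["LineageA", "LineageB"]))  # ['LineageC']
```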
Tab-separated file mapping sequence IDs to lineage names. Must include a header row with column names:
seq_id lineage
seq001 LineageA
seq002 LineageA
seq003 LineageB
seq004 LineageB
seq005 LineageC
Important:
- The file must have a header row with column names
- Default column names are seq_id and lineage, but you can specify different names using --seq-id-column and --lineage-column
- Lines starting with # are treated as comments and ignored
Example with custom column names:
sequence_name clade
seq001 LineageA
seq002 LineageA
Then use: --seq-id-column sequence_name --lineage-column clade
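The format rules above (header row, configurable column names, # comments) can be illustrated with the standard csv module. This is a sketch of one way to read such a file, not the tool's own parser; read_assignments is a hypothetical name.

```python
import csv
import io

def read_assignments(handle, seq_id_column="seq_id", lineage_column="lineage"):
    """Read a tab-separated assignments file into a dict of sequence ID -> lineage."""
    # Drop comment and blank lines before handing the rest to the CSV reader
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    fieldnames = reader.fieldnames or []
    for column in (seq_id_column, lineage_column):
        if column not in fieldnames:
            raise ValueError(f"Column '{column}' not found in header: {fieldnames}")
    return {row[seq_id_column]: row[lineage_column] for row in reader}

example = io.StringIO(
    "# a comment line\n"
    "sequence_name\tclade\n"
    "seq001\tLineageA\n"
    "seq002\tLineageA\n"
)
mapping = read_assignments(example, seq_id_column="sequence_name", lineage_column="clade")
print(mapping)  # {'seq001': 'LineageA', 'seq002': 'LineageA'}
```

Checking the header up front gives a clear error when the column options do not match the file, rather than a KeyError on the first data row.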
See the examples/ directory for a complete example with sample data:
python tidy_tree.py \
--alignment examples/sequences.fasta \
--founder-sequences examples/founders.fasta \
--guide-tree examples/guide_tree.nwk \
--lineage-assignments examples/assignments.tsv \
--output-tree examples/output_tree.nwk \
--model GTR+F+I+G4 \
--threads 4
Common models for DNA sequences:
- GTR+G: General time reversible with gamma rate heterogeneity (default)
- GTR+F+I+G4: GTR with empirical base frequencies, invariant sites, and 4 rate categories
- HKY+G: Hasegawa-Kishino-Yano with gamma rates
- JC: Jukes-Cantor (simplest model)
For model selection, you can use TEST or MFP:
--model TEST
Pass extra arguments to IQ-TREE using --iqtree-args:
--iqtree-args "-bb 1000 -alrt 1000"
This would add 1000 ultrafast bootstrap replicates and SH-aLRT branch support.
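To show how these options might translate into an IQ-TREE call, here is a hedged sketch that assembles an argument list. The flags -s (alignment), -m (model) and -nt (threads) are IQ-TREE options, but how tidy_tree.py actually builds its command is an assumption here, and both build_iqtree_command and the file name lineageA.fasta are hypothetical.

```python
import shlex

def build_iqtree_command(alignment, iqtree_path="iqtree", model="GTR+G",
                         threads=1, iqtree_args=""):
    """Assemble an IQ-TREE argument list (assumed layout, for illustration only)."""
    cmd = [iqtree_path, "-s", alignment, "-m", model, "-nt", str(threads)]
    # Split the extra-arguments string the way a shell would, respecting quotes
    cmd += shlex.split(iqtree_args)
    return cmd

cmd = build_iqtree_command(
    "lineageA.fasta",  # hypothetical per-lineage alignment file
    model="GTR+F+I+G4",
    threads=4,
    iqtree_args="-bb 1000 -alrt 1000",
)
print(" ".join(cmd))
# iqtree -s lineageA.fasta -m GTR+F+I+G4 -nt 4 -bb 1000 -alrt 1000
```

Passing the result as a list to subprocess.run avoids shell quoting problems with model names and file paths.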
The program outputs a single Newick tree file containing:
- All input sequences
- All lineage founder sequences
- Tree topology reflecting both within-lineage relationships and between-lineage relationships
- Alignment quality: The quality of the output tree depends heavily on the quality of the input alignment.
- Sequence count: Each lineage should have at least one assigned sequence. A subtree is only built when the lineage has at least 3 sequences in total (assigned sequences plus its founder and child founders); lineages with fewer are skipped.
- Grafting approach: The current grafting method replaces founder sequence leaves in parent trees with their corresponding subtrees. This is a simplified approach and may not be biologically optimal in all cases.
- Branch lengths: Branch lengths from separately inferred subtrees are combined by the stitching process, so they may not be directly comparable across the whole tree. Consider this when interpreting divergence times.
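The grafting idea described above, replacing a founder leaf in the parent tree with the child lineage's subtree, can be illustrated on a toy tree structure. This sketch represents trees as nested lists of leaf names and ignores branch lengths entirely, so it is a simplification of whatever the tool does internally; graft and to_newick are hypothetical names.

```python
def graft(tree, founder, subtree):
    """Return a new tree in which the leaf named `founder` is replaced by `subtree`.

    Trees are nested lists whose leaves are strings (no branch lengths).
    """
    if isinstance(tree, str):
        return subtree if tree == founder else tree
    return [graft(child, founder, subtree) for child in tree]

def to_newick(tree):
    """Serialize a nested-list tree to Newick without the trailing semicolon."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"

# Parent subtree contains the child founder LineageB as a connection point
parent = ["seq001", ["seq002", "LineageB"]]
# Child subtree contains its own founder plus its sequences
child = ["LineageB", ["seq003", "seq004"]]

stitched = graft(parent, "LineageB", child)
print(to_newick(stitched) + ";")  # (seq001,(seq002,(LineageB,(seq003,seq004))));
```

Because the child subtree carries its own founder, the founder still appears exactly once in the stitched result.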
IQ-TREE not found:
Use --iqtree-path to specify the full path to the IQ-TREE executable
Sequence length mismatch:
All sequences must be pre-aligned with the same length
Lineage not found:
Check that lineage names match exactly between the guide tree, founders, and assignments files
MIT License - see LICENSE file for details