diff --git a/README.md b/README.md index b466d0e763..3bbf2d4d42 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,11 @@ ### Genome assembly evaluation tool -QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics. +QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome/diploid assemblies by computing various metrics. The current QUAST toolkit includes the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets, -QUAST-LG, the extension for large genomes (e.g., mammalians), and Icarus, the interactive visualizer for these tools. +QUAST-LG, the extension for large genomes (e.g., mammalians), +dipQUAST, the extension for diploid assemblies, and Icarus, the interactive visualizer for these tools. The QUAST package works both with and without reference genomes. However, it is much more informative if at least a close reference genome is provided along with the assemblies. @@ -99,10 +100,12 @@ or MetaGeneMark (for metagenomes). When a reference is given: -* Numbers of misassemblies of different kinds (inversions, relocations, translocations, interspecies translocations (metaQUAST only) or local). +* Numbers of misassemblies of different kinds (inversions, relocations, translocations, interspecies translocations (metaQUAST only), interhaplotype translocations (dipQUAST only) or local). * Number and total length of unaligned contigs. -* Numbers of mismatches and indels, over the assembly and per 100 kb. -* Genome fraction %, assembled part of the reference. +* Numbers of mismatches and indels, over the assembly and per 100 kb. +* Number of cases of missed heterozygosity (dipQUAST only). +* Genome fraction %, assembled part of the reference. +* Genome fraction % of haplotype 1 / haplotype 2, assembled parts of each haplotype reference (dipQUAST only). * Duplication ratio, the total number of aligned bases in the assembly divided by the total number of those in the reference. If the assembly contains many contigs that cover the same regions, its duplication ratio will significantly exceed 1. This occurs due to multiple reasons, including overestimating repeat multiplicities and overlaps between contigs. diff --git a/dipquast.py b/dipquast.py new file mode 120000 index 0000000000..9ac9fca5f1 --- /dev/null +++ b/dipquast.py @@ -0,0 +1 @@ +quast.py \ No newline at end of file diff --git a/manual.html b/manual.html index 358bd79d70..d8e3134279 100644 --- a/manual.html +++ b/manual.html @@ -71,7 +71,7 @@

QUAST 5.2.0 manual

QUAST stands for QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics. This document provides instructions for the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets, - QUAST-LG, the extension for large genomes (e.g., mammalians), and Icarus, the interactive visualizer for these tools. + QUAST-LG, the extension for large genomes (e.g., mammalians), dipQUAST, the extension for diploid assemblies, and Icarus, the interactive visualizer for these tools.

You can find key project news and the latest version of the tool @@ -96,6 +96,7 @@

QUAST 5.2.0 manual

Krona tools, BLAST, and SILVA 16S rRNA database. + dipQUAST utilizes Mash.
Almost all tools listed above are built in into the QUAST package which is ready for use by academic, non-profit institutions and U.S. Government agencies. If you are not in one of these categories please refer to LICENSE section 'Third-party tools incorporated into QUAST' @@ -120,6 +121,7 @@

Contents

  • Input data
  • Command line options
  • Metagenomic assemblies
  • +
  • Diploid assemblies
  • QUAST output @@ -495,6 +497,16 @@

    2.3 Command line options

    Comma-separated list of contig length thresholds (in bp). Used in # contigs ≥ x and total length (≥ x) metrics (see section 3). The default value is 0,1000,5000,10000,25000,50000. +
    + --ploid +
    +Genome is diploid. Use optimal parameters for evaluation of diploid assemblies. Affects calculation of metrics and adds new metrics specific for diploid case. In particular, adds new type of misassemblies (interhaplotype translocations) and # missed heterozygosity. + +
    + --ploid-assembly-type <consensus|phased> +
    +Specifies type of diploid genome assembly: consensus or phased. The default value is phased. +
    --x-for-Nx <int>
    @@ -958,6 +970,16 @@

    2.4 Metagenomic assemblies

    (2) may change the stage #2 results, especially when used together with --unique-mapping. + +

    2.5 Diploid assemblies

    +

    + The dipquast.py script accepts diploid reference genomes. + Another way is to use quast.py with --ploid option. + +

    General usage: +

    +    python dipquast.py contigs_1 contigs_2 ... -r diploid_reference ...
    +

    3. QUAST output

    @@ -1058,6 +1080,8 @@

    3.1.1 Summary report

    Reference length is the total number of bases in the reference genome.

    +

    Reference length of haplotype 1 / haplotype 2 is the total number of bases in the corresponding haplotype reference (dipQUAST only).

    +

    GC (%) is the total number of G and C nucleotides in the assembly, divided by the total length of the assembly.

    @@ -1080,7 +1104,8 @@

    3.1.1 Summary report

  • the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference;
  • flanking sequences overlap on more than 1 kbp;
  • flanking sequences align to different strands or different chromosomes; -
  • flanking sequences align on different reference genomes (MetaQUAST only). +
  • flanking sequences align on different reference genomes (MetaQUAST only); +
  • flanking sequences align on different haplotype references (dipQUAST only). This metric requires a reference genome. Note that default threshold of 1 kbp is increased to 7 kbp in QUAST-LG. @@ -1149,6 +1174,9 @@

    3.1.1 Summary report

    A base in the reference genome is aligned if there is at least one contig with at least one alignment to this base. Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times (see --ambiguity-usage).

    +

    Genome fraction (%) of haplotype 1 / haplotype 2 is the percentage of aligned bases in the corresponding haplotype reference. + See Genome fraction (%) for details.

    +

    Duplication ratio is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome (see Genome fraction (%) for the 'aligned base' definition). If the assembly contains many contigs that cover the same @@ -1168,6 +1196,8 @@

    3.1.1 Summary report

    # indels per 100 kbp is the average number of indels per 100,000 aligned bases in the assembly. Several consecutive single nucleotide indels are counted as one indel.

    +

    # missed heterozygosity is the number of events when a single nucleotide polymorphism (SNP) matches a nucleotide in a corresponding position of another haplotype reference.

    +

    # genomic features is the number of genomic features (genes, CDS, etc) in the assembly (complete and partial), based on a user-provided list of genomic features positions in the reference genome. A feature is 'partially covered' if the assembly contains at least 100 bp of this feature but not the whole one.
    @@ -1234,6 +1264,7 @@

    3.1.2 Misassemblies report

    # misassemblies is the same as # misassemblies from section 3.1.1. However, this report also contains a classification of all misassembly events into three groups: relocations, translocations, and inversions (see below). For metagenomic assemblies, this classification also includes interspecies translocations. + For diploid assemblies it includes interhaplotype translocations. We also separately count breakpoints containing scaffold-like gaps (at least 10 consecutive N's) and breakpoints without them. The former are called scaffold misassemblies (relocations, inversions, etc) and the latter are called contig misassemblies (relocations, inversions, etc). See more details in the related github feature request and @@ -1241,7 +1272,6 @@

    3.1.2 Misassemblies report

    -

    Relocation is a misassembly event (breakpoint) where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference genome, or they overlap by more than 1 kbp, and both flanking sequences align on the same chromosome. Note that default threshold of 1 kbp can be changed by --extensive-mis-size.

    @@ -1250,6 +1280,8 @@

    3.1.2 Misassemblies report

    Interspecies translocation is a misassembly event (breakpoint) where the flanking sequences align on different reference genomes (MetaQUAST only).

    +

    Interhaplotype translocation is a misassembly event (breakpoint) where the flanking sequences align on the homologous chromosomes of different haplotypes (dipQUAST only).

    +

    Inversion is a misassembly event (breakpoint) where the flanking sequences align on opposite strands of the same chromosome.

    diff --git a/quast.py b/quast.py index db432bde1e..a4c825580e 100755 --- a/quast.py +++ b/quast.py @@ -12,7 +12,7 @@ import sys import shutil -from quast_libs import qconfig +from quast_libs import qconfig, diputils qconfig.check_python_version() @@ -132,6 +132,12 @@ def main(args): qconfig.assemblies_fpaths = contigs_fpaths + # Fill dip_dict and calculate length of haplotypes + if qconfig.ploid_mode: + diputils.dip_genome_by_chr = diputils.fill_dip_dict_by_chromosomes(ref_fpath) + diputils.length_of_haplotypes = diputils.get_haplotypes_len(ref_fpath) + diputils.ploid_aligned = dict(zip(diputils.dip_genome_by_chr.keys(), [0]*len(diputils.dip_genome_by_chr.keys()))) + # Where all pdfs will be saved all_pdf_fpath = None if qconfig.draw_plots and plotter.can_draw_plots: @@ -161,9 +167,59 @@ def main(args): ######################################################################## from quast_libs import contigs_analyzer is_cyclic = qconfig.prokaryote and not qconfig.check_for_fragmented_ref + + # Counting cases of missed heterozygosity + if qconfig.ploid_mode and (not qconfig.large_genome) and qconfig.show_snps: + aux_files_dirpath = os.path.join(output_dirpath, 'dipquast_auxiliary_files') + if not os.path.isdir(aux_files_dirpath): + os.mkdir(aux_files_dirpath) + + # Creating separate reference files for each haplotype + ref1_fpath, ref2_fpath = diputils.split_ref_file_by_haplotypes(ref_fpath, diputils.dip_genome_by_chr, + aux_files_dirpath) + ref1_fpaths, old_ref1_fpaths = qutils.correct_contigs([ref1_fpath], corrected_dirpath, ['haplotype_1'], + reporting) + ref2_fpaths, old_ref2_fpaths = qutils.correct_contigs([ref2_fpath], corrected_dirpath, ['haplotype_2'], + reporting) + ref1_fpath, ref2_fpath = ref1_fpaths[0], ref2_fpaths[0] + + # Making initial .used_snps file + snps_fpath = contigs_analyzer.do_ref_to_ref_align( + ref_fpath, contigs_fpaths, is_cyclic, + os.path.join(aux_files_dirpath, qconfig.detailed_contigs_reports_dirname), + old_contigs_fpaths, qconfig.bed) + + # Changing ploid_mode temporarily in order to align haplotypes to one another + qconfig.ploid_mode = False # Check! + + # Aligning haplotype_1 to haplotype_2: + snps_ref1_to_ref2_fpath = contigs_analyzer.do_ref_to_ref_align( + ref2_fpath, ref1_fpaths, is_cyclic, + os.path.join(aux_files_dirpath, qconfig.detailed_contigs_reports_dirname + '_ref1_to_ref2'), + old_ref1_fpaths, qconfig.bed) + + # Aligning haplotype_2 to haplotype_1: + snps_ref2_to_ref1_fpath = contigs_analyzer.do_ref_to_ref_align( + ref1_fpath, ref2_fpaths, is_cyclic, + os.path.join(aux_files_dirpath, qconfig.detailed_contigs_reports_dirname + '_ref2_to_ref1'), + old_ref2_fpaths, qconfig.bed) + + # Remove ref to ref alignments from assembly_fpaths + reporting.assembly_fpaths.remove(ref1_fpath) + reporting.assembly_fpaths.remove(ref2_fpath) + + # Counting cases of missed heterozygosity + diputils.n_missed_heterozygosity = diputils.count_missed_heterozygosity(snps_fpath[0], + snps_ref1_to_ref2_fpath[0], + snps_ref2_to_ref1_fpath[0]) + + # Changing ploid_mode back + qconfig.ploid_mode = True + aligner_statuses, aligned_lengths_per_fpath = contigs_analyzer.do( - ref_fpath, contigs_fpaths, is_cyclic, os.path.join(output_dirpath, qconfig.detailed_contigs_reports_dirname), - old_contigs_fpaths, qconfig.bed) + ref_fpath, contigs_fpaths, is_cyclic, + os.path.join(output_dirpath, qconfig.detailed_contigs_reports_dirname), old_contigs_fpaths, qconfig.bed) + for contigs_fpath in contigs_fpaths: if aligner_statuses[contigs_fpath] == contigs_analyzer.AlignerStatus.OK: aligned_contigs_fpaths.append(contigs_fpath) @@ -303,7 +359,6 @@ def main(args): cleanup(corrected_dirpath) return logger.finish_up(check_test=qconfig.test) - if __name__ == '__main__': try: return_code = main(sys.argv[1:]) diff --git a/quast_libs/basic_stats.py b/quast_libs/basic_stats.py index 1c470a6216..5d340bfa4e 100644 --- a/quast_libs/basic_stats.py +++ b/quast_libs/basic_stats.py @@ -14,6 +14,8 @@ from quast_libs import fastaparser, qconfig, qutils, reporting, plotter from quast_libs.circos import set_window_size from quast_libs.log import get_logger +from quast_libs.diputils import length_of_haplotypes + logger = get_logger(qconfig.LOGGER_DEFAULT_NAME) MIN_HISTOGRAM_POINTS = 5 MIN_GC_WINDOW_SIZE = qconfig.GC_window_size // 2 @@ -324,6 +326,12 @@ def do(ref_fpath, contigs_fpaths, output_dirpath, results_dir): if ref_fpath: report.add_field(reporting.Fields.REFLEN, int(reference_length)) report.add_field(reporting.Fields.REF_FRAGMENTS, reference_fragments) + + if qconfig.ploid_mode: + report.add_field(reporting.Fields.REFLEN_HAPLOTYPES, + [int(l) for l in length_of_haplotypes.values()]) + + if not qconfig.is_combined_ref: report.add_field(reporting.Fields.REFGC, ('%.2f' % reference_GC if reference_GC is not None else None)) elif reference_length: diff --git a/quast_libs/ca_utils/analyze_contigs.py b/quast_libs/ca_utils/analyze_contigs.py index ab3651a1c9..8474160267 100644 --- a/quast_libs/ca_utils/analyze_contigs.py +++ b/quast_libs/ca_utils/analyze_contigs.py @@ -10,6 +10,7 @@ from quast_libs.ca_utils.analyze_misassemblies import process_misassembled_contig, IndelsInfo, find_all_sv, Misassembly from quast_libs.ca_utils.best_set_selection import get_best_aligns_sets, get_used_indexes, score_single_align from quast_libs.ca_utils.misc import ref_labels_by_chromosomes +from quast_libs.diputils import dip_genome_by_chr, l_names_ambiguity_contigs def add_potential_misassembly(ref, misassemblies_by_ref, refs_with_translocations): @@ -192,16 +193,45 @@ def analyze_contigs(ca_output, contigs_fpath, unaligned_fpath, unaligned_info_fp for align in top_aligns: ca_output.stdout_f.write('\t\t\tSkipping alignment ' + str(align) + '\n') elif qconfig.ambiguity_usage == "one": - ca_output.stdout_f.write('\t\tUsing only first of these alignment (option --ambiguity-usage is set to "one"):\n') - ca_output.stdout_f.write('\t\t\tAlignment: %s\n' % str(top_aligns[0])) - ca_output.icarus_out_f.write(top_aligns[0].icarus_report_str() + '\n') - ref_aligns.setdefault(top_aligns[0].ref, []).append(top_aligns[0]) - aligned_lengths.append(top_aligns[0].len2) - contigs_aligned_lengths[-1] = top_aligns[0].len2 - ca_output.coords_filtered_f.write(top_aligns[0].coords_str() + '\n') - top_aligns = top_aligns[1:] - for align in top_aligns: - ca_output.stdout_f.write('\t\t\tSkipping alignment ' + str(align) + '\n') + if qconfig.ploid_mode: + ploidy = len(dip_genome_by_chr) + ca_output.stdout_f.write( + f'\t\tThere are {ploidy} haplotypes. Using no more than one alignment for each haplotype\n') + used_haplotypes = [] + skipped_aligns = [] + while len(top_aligns): + if len(used_haplotypes) == ploidy: + skipped_aligns += top_aligns + break + for key, value in dip_genome_by_chr.items(): # Create method for this later! + if top_aligns[0].ref in value: + haplotype = key + if haplotype not in used_haplotypes: + ca_output.stdout_f.write('\t\t\tAlignment: %s\n' % str(top_aligns[0])) + ca_output.icarus_out_f.write(top_aligns[0].icarus_report_str() + '\n') + ref_aligns.setdefault(top_aligns[0].ref, []).append(top_aligns[0]) + aligned_lengths.append(top_aligns[0].len2) + contigs_aligned_lengths[-1] = top_aligns[0].len2 + ca_output.coords_filtered_f.write(top_aligns[0].coords_str() + '\n') + used_haplotypes.append(haplotype) + if qconfig.ploid_assembly_type == 'phased' and len(used_haplotypes) > 1 and top_aligns[0].contig not in l_names_ambiguity_contigs: + l_names_ambiguity_contigs.append(top_aligns[0].contig) + else: + skipped_aligns.append(top_aligns[0]) + top_aligns = top_aligns[1:] + for align in skipped_aligns: + ca_output.stdout_f.write('\t\t\tSkipping alignment ' + str(align) + '\n') + else: + ca_output.stdout_f.write('\t\tUsing only first of these alignment (option --ambiguity-usage is set to "one"):\n') + ca_output.stdout_f.write('\t\t\tAlignment: %s\n' % str(top_aligns[0])) + ca_output.icarus_out_f.write(top_aligns[0].icarus_report_str() + '\n') + ref_aligns.setdefault(top_aligns[0].ref, []).append(top_aligns[0]) + aligned_lengths.append(top_aligns[0].len2) + contigs_aligned_lengths[-1] = top_aligns[0].len2 + ca_output.coords_filtered_f.write(top_aligns[0].coords_str() + '\n') + top_aligns = top_aligns[1:] + for align in top_aligns: + ca_output.stdout_f.write('\t\t\tSkipping alignment ' + str(align) + '\n') elif qconfig.ambiguity_usage == "all": ca_output.stdout_f.write('\t\tUsing all these alignments (option --ambiguity-usage is set to "all"):\n') # we count only extra bases, so we shouldn't include bases in the first alignment diff --git a/quast_libs/ca_utils/analyze_misassemblies.py b/quast_libs/ca_utils/analyze_misassemblies.py index a69c398cf1..c56ba3de54 100644 --- a/quast_libs/ca_utils/analyze_misassemblies.py +++ b/quast_libs/ca_utils/analyze_misassemblies.py @@ -14,7 +14,7 @@ from quast_libs.log import get_logger logger = get_logger(qconfig.LOGGER_DEFAULT_NAME) from quast_libs.qutils import correct_name - +from quast_libs.diputils import is_homologous_ref class Misassembly: LOCAL = 0 @@ -22,6 +22,7 @@ class Misassembly: RELOCATION = 2 TRANSLOCATION = 3 INTERSPECTRANSLOCATION = 4 # for metaquast, if translocation occurs between chromosomes of different references + INTERHAPLOTRANSLOCATION = 200 # for dipquast SCAFFOLD_GAP = 5 LOCAL_SCAFFOLD_GAP = 6 FRAGMENTED = 7 @@ -34,6 +35,7 @@ class Misassembly: SCF_RELOCATION = 13 SCF_TRANSLOCATION = 14 SCF_INTERSPECTRANSLOCATION = 15 + SCF_INTERHAPLOTRANSLOCATION = 16 class StructuralVariations(object): @@ -43,9 +45,10 @@ def __init__(self): self.inversions = [] self.relocations = [] self.translocations = [] + self.inter_haplotype_translocations = [] def get_count(self): - return len(self.inversions) + len(self.relocations) + len(self.translocations) + return len(self.inversions) + len(self.relocations) + len(self.translocations) + len(self.inter_haplotype_translocations) class Mapping(object): @@ -289,7 +292,10 @@ def find_all_sv(bed_fpath): align1 = Mapping(s1=int(fs[1]), e1=int(fs[2]), ref=correct_name(fs[0]), sv_type=fs[6]) align2 = Mapping(s1=int(fs[4]), e1=int(fs[5]), ref=correct_name(fs[3]), sv_type=fs[6]) if align1.ref != align2.ref: - region_struct_variations.translocations.append((align1, align2)) + if qconfig.ploid_mode and is_homologous_ref(align1.ref, align2.ref): + region_struct_variations.inter_haplotype_translocations.append((align1, align2)) + else: + region_struct_variations.translocations.append((align1, align2)) elif 'INV' in fs[6]: region_struct_variations.inversions.append((align1, align2)) elif 'DEL' in fs[6] or 'INS' in fs[6] or 'BND' in fs[6]: @@ -469,6 +475,8 @@ def process_misassembled_contig(sorted_aligns, is_cyclic, aligned_lengths, regio if prev_align.ref != next_align.ref: # if chromosomes from different references if qconfig.is_combined_ref and prev_ref != next_ref: misassembly_type = 'interspecies translocation' + elif qconfig.ploid_mode and is_homologous_ref(prev_align.ref, next_align.ref): + misassembly_type = 'interhaplotype translocation' else: misassembly_type = 'translocation' elif abs(aux_data["inconsistency"]) > qconfig.extensive_misassembly_threshold: @@ -544,6 +552,8 @@ def process_misassembled_contig(sorted_aligns, is_cyclic, aligned_lengths, regio misassembly_id = Misassembly.INTERSPECTRANSLOCATION istranslocations_by_ref[prev_ref][next_ref] += 1 istranslocations_by_ref[next_ref][prev_ref] += 1 + elif misassembly_type == 'interhaplotype translocation': + misassembly_id = Misassembly.INTERHAPLOTRANSLOCATION elif misassembly_type == 'translocation': misassembly_id = Misassembly.TRANSLOCATION elif misassembly_type == 'relocation': diff --git a/quast_libs/ca_utils/best_set_selection.py b/quast_libs/ca_utils/best_set_selection.py index 76b6a7c17d..8f3c7e99c7 100644 --- a/quast_libs/ca_utils/best_set_selection.py +++ b/quast_libs/ca_utils/best_set_selection.py @@ -14,6 +14,7 @@ from quast_libs.ca_utils.analyze_misassemblies import is_misassembly, exclude_internal_overlaps, Misassembly, \ is_fragmented_ref_fake_translocation from quast_libs.ca_utils.misc import is_same_reference +from quast_libs.diputils import is_homologous_ref class ScoredSet(object): @@ -199,7 +200,7 @@ def get_best_aligns_sets(sorted_aligns, ctg_len, stdout_f, seq, ref_lens, is_cyc break cur_set_aligns = [sorted_aligns[i] for i in scored_set.indexes] + [align] score, uncovered = get_score(scored_set.score, cur_set_aligns, ref_lens, is_cyclic, scored_set.uncovered, - seq, region_struct_variations, penalties) + seq, region_struct_variations, penalties, ctg_len) if score is None: # incorrect set, i.e. internal overlap excluding resulted in incorrectly short alignment continue if score > local_max_score: @@ -262,7 +263,7 @@ def get_best_aligns_sets(sorted_aligns, ctg_len, stdout_f, seq, ref_lens, is_cyc break cur_set_aligns = [sorted_aligns[i] for i in scored_set.indexes] + [align] score, uncovered = get_score(scored_set.score, cur_set_aligns, ref_lens, is_cyclic, scored_set.uncovered, - seq, region_struct_variations, penalties) + seq, region_struct_variations, penalties, ctg_len) if score is not None: putative_predecessors[scored_set] = (score, uncovered) if score > local_max_score: @@ -303,7 +304,7 @@ def get_added_len(set_aligns, cur_align): return added_right + added_left -def get_score(score, aligns, ref_lens, is_cyclic, uncovered_len, seq, region_struct_variations, penalties): +def get_score(score, aligns, ref_lens, is_cyclic, uncovered_len, seq, region_struct_variations, penalties, ctg_len): if len(aligns) > 1: align1, align2 = aligns[-2], aligns[-1] = aligns[-2].clone(), aligns[-1].clone() is_fake_translocation = is_fragmented_ref_fake_translocation(align1, align2, ref_lens) @@ -326,6 +327,8 @@ def get_score(score, aligns, ref_lens, is_cyclic, uncovered_len, seq, region_str if align1.ref != align2.ref: if qconfig.is_combined_ref and not is_same_reference(align1.ref, align2.ref): misassembly = Misassembly.INTERSPECTRANSLOCATION + elif qconfig.ploid_mode and is_homologous_ref(align1.ref, align2.ref): + misassembly = max(Misassembly.INTERHAPLOTRANSLOCATION, ctg_len * 0.001) else: misassembly = Misassembly.TRANSLOCATION elif abs(aux_data["inconsistency"]) > qconfig.extensive_misassembly_threshold: diff --git a/quast_libs/ca_utils/save_results.py b/quast_libs/ca_utils/save_results.py index 06e77f3af3..d840135c60 100644 --- a/quast_libs/ca_utils/save_results.py +++ b/quast_libs/ca_utils/save_results.py @@ -8,7 +8,7 @@ import os from collections import defaultdict -from quast_libs import qconfig, qutils, reporting +from quast_libs import qconfig, qutils, reporting, diputils from quast_libs.ca_utils.analyze_misassemblies import Misassembly from quast_libs.ca_utils.misc import print_file, intergenomic_misassemblies_by_asm, ref_labels_by_chromosomes @@ -26,11 +26,13 @@ def print_results(contigs_fpath, log_out_f, used_snps_fpath, total_indels_info, log_out_f.write('\tLocal Misassemblies: %d\n' % region_misassemblies.count(Misassembly.LOCAL)) log_out_f.write('\tMisassemblies: %d\n' % (region_misassemblies.count(Misassembly.RELOCATION) + region_misassemblies.count(Misassembly.INVERSION) + region_misassemblies.count(Misassembly.TRANSLOCATION) + - region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION))) + region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION) + region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION))) log_out_f.write('\t\tRelocations: %d\n' % region_misassemblies.count(Misassembly.RELOCATION)) log_out_f.write('\t\tTranslocations: %d\n' % region_misassemblies.count(Misassembly.TRANSLOCATION)) if qconfig.is_combined_ref: log_out_f.write('\t\tInterspecies translocations: %d\n' % region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION)) + if qconfig.ploid_mode: + log_out_f.write('\t\tInterhaplotype translocations: %d\n' % region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION)) log_out_f.write('\t\tInversions: %d\n' % region_misassemblies.count(Misassembly.INVERSION)) if qconfig.is_combined_ref: log_out_f.write('\tPotentially Misassembled Contigs (i/s translocations): %d\n' % region_misassemblies.count(Misassembly.POTENTIALLY_MIS_CONTIGS)) @@ -97,7 +99,7 @@ def save_result(result, report, fname, ref_fpath, genome_size): report.add_field(reporting.Fields.MISLOCAL, region_misassemblies.count(Misassembly.LOCAL)) report.add_field(reporting.Fields.MISASSEMBL, region_misassemblies.count(Misassembly.RELOCATION) + region_misassemblies.count(Misassembly.INVERSION) + region_misassemblies.count(Misassembly.TRANSLOCATION) + - region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION)) + region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION) + region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION)) report.add_field(reporting.Fields.MISCONTIGS, len(misassembled_contigs)) report.add_field(reporting.Fields.MISCONTIGSBASES, misassembled_bases) report.add_field(reporting.Fields.MISINTERNALOVERLAP, misassembly_internal_overlap) @@ -128,10 +130,22 @@ def save_result(result, report, fname, ref_fpath, genome_size): report.add_field(reporting.Fields.INDELSERROR, "%.2f" % (float(report.get_field(reporting.Fields.INDELS)) * 100000.0 / float(aligned_assembly_bases))) + if qconfig.ploid_mode: + genome_fraction_by_haplotypes = {} + for haplotype in diputils.length_of_haplotypes.keys(): + genome_fraction_by_haplotypes[haplotype] = genome_fraction_by_haplotypes.get(haplotype, 0) + round(diputils.ploid_aligned[haplotype] * + 100 / diputils.length_of_haplotypes[haplotype], 2) + + report.add_field(reporting.Fields.MAPPEDGENOME_BY_HAPLOTYPES, + [float(l) for l in genome_fraction_by_haplotypes.values()]) + report.add_field(reporting.Fields.N_MISSED_HETEROZYGOSITY, + diputils.n_missed_heterozygosity) + + # for misassemblies report: report.add_field(reporting.Fields.MIS_ALL_EXTENSIVE, region_misassemblies.count(Misassembly.RELOCATION) + region_misassemblies.count(Misassembly.INVERSION) + region_misassemblies.count(Misassembly.TRANSLOCATION) + - region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION)) + region_misassemblies.count(Misassembly.INTERSPECTRANSLOCATION) + region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION)) report.add_field(reporting.Fields.MIS_RELOCATION, region_misassemblies.count(Misassembly.RELOCATION)) report.add_field(reporting.Fields.MIS_TRANSLOCATION, region_misassemblies.count(Misassembly.TRANSLOCATION)) report.add_field(reporting.Fields.MIS_INVERTION, region_misassemblies.count(Misassembly.INVERSION)) @@ -141,7 +155,7 @@ def save_result(result, report, fname, ref_fpath, genome_size): # special case for separating contig and scaffold misassemblies report.add_field(reporting.Fields.SCF_MIS_ALL_EXTENSIVE, region_misassemblies.count(Misassembly.SCF_RELOCATION) + region_misassemblies.count(Misassembly.SCF_INVERSION) + region_misassemblies.count(Misassembly.SCF_TRANSLOCATION) + - region_misassemblies.count(Misassembly.SCF_INTERSPECTRANSLOCATION)) + region_misassemblies.count(Misassembly.SCF_INTERSPECTRANSLOCATION) + region_misassemblies.count(Misassembly.SCF_INTERHAPLOTRANSLOCATION)) report.add_field(reporting.Fields.SCF_MIS_RELOCATION, region_misassemblies.count(Misassembly.SCF_RELOCATION)) report.add_field(reporting.Fields.SCF_MIS_TRANSLOCATION, region_misassemblies.count(Misassembly.SCF_TRANSLOCATION)) report.add_field(reporting.Fields.SCF_MIS_INVERTION, region_misassemblies.count(Misassembly.SCF_INVERSION)) @@ -188,6 +202,16 @@ def save_result(result, report, fname, ref_fpath, genome_size): report.add_field(reporting.Fields.MIS_LOCAL_SCAFFOLDS_GAP, region_misassemblies.count(Misassembly.LOCAL_SCAFFOLD_GAP)) if qconfig.check_for_fragmented_ref: report.add_field(reporting.Fields.MIS_FRAGMENTED, region_misassemblies.count(Misassembly.FRAGMENTED)) + + if qconfig.ploid_mode: + report.add_field(reporting.Fields.MIS_IHTRANSLOCATIONS, + region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION)) + report.add_field(reporting.Fields.SCF_MIS_IHTRANSLOCATIONS, + region_misassemblies.count(Misassembly.SCF_INTERHAPLOTRANSLOCATION)) + report.add_field(reporting.Fields.CTG_MIS_IHTRANSLOCATIONS, + region_misassemblies.count(Misassembly.INTERHAPLOTRANSLOCATION) - + region_misassemblies.count(Misassembly.SCF_INTERHAPLOTRANSLOCATION)) + # for unaligned report: report.add_field(reporting.Fields.UNALIGNED_FULL_CNTGS, unaligned) report.add_field(reporting.Fields.UNALIGNED_FULL_LENGTH, fully_unaligned_bases) diff --git a/quast_libs/contigs_analyzer.py b/quast_libs/contigs_analyzer.py index e15ae0a7ce..0b58c8c79c 100644 --- a/quast_libs/contigs_analyzer.py +++ b/quast_libs/contigs_analyzer.py @@ -23,7 +23,7 @@ from collections import defaultdict from os.path import join, dirname -from quast_libs import reporting, qconfig, qutils, fastaparser +from quast_libs import reporting, qconfig, qutils, fastaparser, diputils from quast_libs.ca_utils import misc from quast_libs.ca_utils.analyze_contigs import analyze_contigs from quast_libs.ca_utils.analyze_misassemblies import Mapping, IndelsInfo @@ -100,9 +100,7 @@ def analyze_coverage(ref_aligns, reference_chromosomes, ns_by_chromosomes, used_ genome_mapping[align.ref][pos] = 1 for i in ns_by_chromosomes[align.ref]: genome_mapping[align.ref][i] = 0 - - covered_ref_bases = sum([sum(genome_mapping[chrom]) for chrom in genome_mapping]) - return covered_ref_bases, indels_info + return genome_mapping, indels_info # former plantagora and plantakolya @@ -192,11 +190,22 @@ def align_and_analyze(is_cyclic, index, contigs_fpath, output_dirpath, ref_fpath log_out_f.write('Analyzing contigs...\n') result, ref_aligns, total_indels_info, aligned_lengths, misassembled_contigs, misassemblies_in_contigs, aligned_lengths_by_contigs =\ analyze_contigs(ca_output, contigs_fpath, unaligned_fpath, unaligned_info_fpath, aligns, ref_features, reference_chromosomes, is_cyclic) + if qconfig.ploid_mode and qconfig.ploid_assembly_type == 'phased': + # drop not the best alignment for ambiguity contigs in ref_aligns + diputils.leave_best_alignment_for_ambiguity_contigs(ref_aligns, reference_chromosomes, ca_output) log_out_f.write('Analyzing coverage...\n') if qconfig.show_snps: log_out_f.write('Writing SNPs into ' + used_snps_fpath + '\n') - aligned_ref_bases, indels_info = analyze_coverage(ref_aligns, reference_chromosomes, ns_by_chromosomes, used_snps_fpath) + genome_mapping, indels_info = analyze_coverage(ref_aligns, reference_chromosomes, ns_by_chromosomes, used_snps_fpath) + + if qconfig.ploid_mode: + for key, val in diputils.dip_genome_by_chr.items(): + for chrom in val: + diputils.ploid_aligned[key] += sum(genome_mapping[chrom]) + + aligned_ref_bases = sum([sum(genome_mapping[chrom]) for chrom in genome_mapping]) + total_indels_info += indels_info cov_stats = {'SNPs': total_indels_info.mismatches, 'indels_list': total_indels_info.indels_list, 'aligned_ref_bases': aligned_ref_bases} result.update(cov_stats) @@ -246,6 +255,95 @@ def align_and_analyze(is_cyclic, index, contigs_fpath, output_dirpath, ref_fpath else: return AlignerStatus.OK, result, aligned_lengths, misassemblies_in_contigs, aligned_lengths_by_contigs +def align_and_analyze_ref_to_ref(is_cyclic, index, contigs_fpath, output_dirpath, ref_fpath, + reference_chromosomes, ns_by_chromosomes, old_contigs_fpath, bed_fpath, threads=1): + tmp_output_dirpath = create_minimap_output_dir(output_dirpath) + assembly_label = qutils.label_from_fpath(contigs_fpath) + corr_assembly_label = qutils.label_from_fpath_for_fname(contigs_fpath) + out_basename = join(tmp_output_dirpath, corr_assembly_label) + + logger.info(' ' + qutils.index_to_str(index) + assembly_label) + + log_out_fpath = '/dev/null' + log_err_fpath = '/dev/null' + icarus_out_fpath = '/dev/null' + misassembly_fpath = '/dev/null' + unaligned_info_fpath = '/dev/null' + + icarus_out_f = open(icarus_out_fpath, 'w') + icarus_header_cols = ['S1', 'E1', 'S2', 'E2', 'Reference', 'Contig', 'IDY', 'Ambiguous', 'Best_group'] + icarus_out_f.write('\t'.join(icarus_header_cols) + '\n') + misassembly_f = open(misassembly_fpath, 'w') + + if not qconfig.space_efficient: + logger.info(' ' + qutils.index_to_str(index) + 'Logging to files ' + log_out_fpath + + ' and ' + os.path.basename(log_err_fpath) + '...') + else: + logger.info(' ' + qutils.index_to_str(index) + 'Logging is disabled.') + + coords_fpath, coords_filtered_fpath, unaligned_fpath, used_snps_fpath = get_aux_out_fpaths(out_basename) + status = align_contigs(coords_fpath, out_basename, ref_fpath, contigs_fpath, old_contigs_fpath, index, threads, + log_out_fpath, log_err_fpath) + if status != AlignerStatus.OK: + with open(log_err_fpath, 'a') as log_err_f: + if status == AlignerStatus.ERROR: + logger.error(' ' + qutils.index_to_str(index) + + 'Failed aligning contigs ' + qutils.label_from_fpath(contigs_fpath) + + ' to the reference (non-zero exit code). ' + + ('Run with the --debug flag to see additional information.' if not qconfig.debug else '')) + elif status == AlignerStatus.FAILED: + log_err_f.write(qutils.index_to_str(index) + 'Alignment failed for ' + contigs_fpath + ':' + coords_fpath + 'doesn\'t exist.\n') + logger.info(' ' + qutils.index_to_str(index) + 'Alignment failed for ' + '\'' + assembly_label + '\'.') + elif status == AlignerStatus.NOT_ALIGNED: + log_err_f.write(qutils.index_to_str(index) + 'Nothing aligned for ' + contigs_fpath + '\n') + logger.info(' ' + qutils.index_to_str(index) + 'Nothing aligned for ' + '\'' + assembly_label + '\'.') + return status, {}, [], [], [] + + log_out_f = open(log_out_fpath, 'a') + # Loading the alignment files + log_out_f.write('Parsing coords...\n') + aligns = {} + with open(coords_fpath) as coords_file: + for line in coords_file: + mapping = Mapping.from_line(line) + if not qconfig.alignments_for_reuse_dirpath or mapping.ref in reference_chromosomes.keys(): + aligns.setdefault(mapping.contig, []).append(mapping) + + # Loading the reference sequences + log_out_f.write('Loading reference...\n') # TODO: move up + ref_features = {} + + # Loading the regions (if any) + regions = {} + total_reg_len = 0 + total_regions = 0 + # # TODO: gff + # log_out_f.write('Loading regions...\n') + # log_out_f.write('\tNo regions given, using whole reference.\n') + for name, seq_len in reference_chromosomes.items(): + log_out_f.write('\tLoaded [%s]\n' % name) + regions.setdefault(name, []).append([1, seq_len]) + total_regions += 1 + total_reg_len += seq_len + log_out_f.write('\tTotal Regions: %d\n' % total_regions) + log_out_f.write('\tTotal Region Length: %d\n' % total_reg_len) + + ca_output = CAOutput(stdout_f=log_out_f, misassembly_f=misassembly_f, coords_filtered_f=open(coords_filtered_fpath, 'w'), + icarus_out_f=icarus_out_f) + + log_out_f.write('Analyzing contigs...\n') + result, ref_aligns, total_indels_info, aligned_lengths, misassembled_contigs, misassemblies_in_contigs, aligned_lengths_by_contigs =\ + analyze_contigs(ca_output, contigs_fpath, unaligned_fpath, unaligned_info_fpath, aligns, ref_features, reference_chromosomes, is_cyclic) + if qconfig.ploid_mode and qconfig.ploid_assembly_type == 'phased': + # drop not the best alignment for ambiguity contigs in ref_aligns + diputils.leave_best_alignment_for_ambiguity_contigs(ref_aligns, reference_chromosomes, ca_output) + + log_out_f.write('Analyzing coverage...\n') + if qconfig.show_snps: + log_out_f.write('Writing SNPs into ' + used_snps_fpath + '\n') + _, _ = analyze_coverage(ref_aligns, reference_chromosomes, ns_by_chromosomes, used_snps_fpath) + + return used_snps_fpath def do(reference, contigs_fpaths, is_cyclic, output_dir, old_contigs_fpaths, bed_fpath=None): if not os.path.isdir(output_dir): @@ -314,3 +412,27 @@ def do(reference, contigs_fpaths, is_cyclic, output_dir, old_contigs_fpaths, bed logger.main_info('Failed aligning the contigs for all the assemblies. Only basic stats are going to be evaluated.') return aligner_statuses, aligned_lengths_per_fpath + +def do_ref_to_ref_align(reference, contigs_fpaths, is_cyclic, output_dir, old_contigs_fpaths, bed_fpath=None): + if not os.path.isdir(output_dir): + os.mkdir(output_dir) + + logger.print_timestamp() + logger.main_info('Running Contig analyzer...') + success_compilation = compile_aligner(logger) + if not success_compilation: + logger.main_info('Failed aligning the contigs for all the assemblies. Only basic stats are going to be evaluated.') + return dict(zip(contigs_fpaths, [AlignerStatus.FAILED] * len(contigs_fpaths))), None + + create_minimap_output_dir(output_dir) + n_jobs = min(len(contigs_fpaths), qconfig.max_threads) + threads = max(1, qconfig.max_threads // n_jobs) + + genome_size, reference_chromosomes, ns_by_chromosomes = get_genome_stats(reference, skip_ns=True) + threads = qconfig.max_threads if qconfig.memory_efficient else threads + args = [(is_cyclic, i, contigs_fpath, output_dir, reference, reference_chromosomes, ns_by_chromosomes, + old_contigs_fpath, bed_fpath, threads) + for i, (contigs_fpath, old_contigs_fpath) in enumerate(zip(contigs_fpaths, old_contigs_fpaths))] + snps_ref_to_ref_fpath = run_parallel(align_and_analyze_ref_to_ref, args, n_jobs) + + return snps_ref_to_ref_fpath \ No newline at end of file diff --git a/quast_libs/diputils.py b/quast_libs/diputils.py new file mode 100644 index 0000000000..fbd86d82c3 --- /dev/null +++ b/quast_libs/diputils.py @@ -0,0 +1,276 @@ +from collections import defaultdict +import os +import subprocess +import shutil +import re + +from quast_libs import qconfig +from quast_libs.fastaparser import get_chr_lengths_from_fastafile, read_fasta + + +ploid_aligned = {} +dip_genome_by_chr = {} +length_of_haplotypes = {} +homologous_chroms = defaultdict(list) +l_names_ambiguity_contigs = [] +n_missed_heterozygosity = None + + +def execute(execute_that): + PIPE = subprocess.PIPE + p = subprocess.Popen(execute_that, shell=True, stdin=PIPE, stdout=PIPE, stderr=subprocess.STDOUT, close_fds=True) + p.communicate() + + +def run_mash(fasta_fpath): + tool_dirpath = shutil.which('mash') + if tool_dirpath is None: + if qconfig.platform_name == 'macosx': + mash_file = os.path.join('mash', 'mash_osx') + else: + mash_file = os.path.join('mash', 'mash_linux') + tool_dirpath = os.path.join(qconfig.LIBS_LOCATION, mash_file) + mash_command = f'{tool_dirpath} dist -i {fasta_fpath} {fasta_fpath} > tmp_mash_res.txt' + execute(mash_command) + + with open('tmp_mash_res.txt') as inf: + for line in inf: + line = line.strip('\n').split('\t') + if line[0] == line[1]: + continue + if float(line[3]) < 0.05: # p-value + homologous_chroms[line[0]].append(line[1]) + os.remove('tmp_mash_res.txt') + + +def get_max_n_haplotypes(): + n_max_haplotypes = 0 + for key, val in homologous_chroms.items(): + if len(val) + 1 > n_max_haplotypes: + n_max_haplotypes = len(val) + 1 + return n_max_haplotypes + + +def fill_dip_dict_by_chromosomes_for_unknown_names(ref_fpath): + run_mash(ref_fpath) + check_added_chroms = [] + counter_haplotypes = 1 + for idx in range(get_max_n_haplotypes()): + dip_genome_by_chr[f'haplotype_{idx + 1}'] = [] + + homologous_chroms_sorted = dict(sorted(homologous_chroms.items())) + for chrom in homologous_chroms_sorted.keys(): + if chrom not in check_added_chroms: + dip_genome_by_chr[f'haplotype_{counter_haplotypes}'].append(chrom) + check_added_chroms.append(chrom) + counter_haplotypes += 1 + for other_chr in homologous_chroms_sorted[chrom]: + if other_chr not in check_added_chroms: + dip_genome_by_chr[f'haplotype_{counter_haplotypes}'].append(other_chr) + check_added_chroms.append(other_chr) + counter_haplotypes += 1 + else: + continue + counter_haplotypes = 1 + return dip_genome_by_chr + + +def fill_dip_dict_by_chromosomes_for_conservative_names(ref_fpath): + for chrom, _ in read_fasta(ref_fpath): + haplotype = re.findall(r'(haplotype\d+)', chrom)[0] + if haplotype not in dip_genome_by_chr.keys(): + dip_genome_by_chr[haplotype] = [] + dip_genome_by_chr[haplotype].append(chrom) + return dip_genome_by_chr + + +def fill_dip_dict_by_chromosomes(ref_fpath): + is_conservative_chr_names = True + for name, _ in read_fasta(ref_fpath): + haplotype = re.findall(r'(haplotype\d+)', name) + if haplotype == []: + is_conservative_chr_names = False + break + if is_conservative_chr_names: + dip_genome_by_chr = fill_dip_dict_by_chromosomes_for_conservative_names(ref_fpath) + else: + dip_genome_by_chr = fill_dip_dict_by_chromosomes_for_unknown_names(ref_fpath) + return dict(sorted(dip_genome_by_chr.items())) + + +def get_haplotypes_len(fpath): + chr_len_d = get_chr_lengths_from_fastafile(fpath) + for key, val in dip_genome_by_chr.items(): + for chrom in val: + length_of_haplotypes[key] = length_of_haplotypes.get(key, 0) + chr_len_d[chrom] + return dict(sorted(length_of_haplotypes.items())) + + +def is_homologous_ref(align1, align2): + return align2 in homologous_chroms[align1] + + +def genome_coverage_for_unambiguity_contigs(ref_aligns, reference_chromosomes): + genome_mapping = {} + for chr_name, chr_len in reference_chromosomes.items(): + genome_mapping[chr_name] = [0] * (chr_len + 1) + for chr_name, aligns in ref_aligns.items(): + for align in aligns: + if align.contig not in l_names_ambiguity_contigs: + if align.s1 < align.e1: + for pos in range(align.s1, align.e1 + 1): + genome_mapping[align.ref][pos] = 1 + else: + for pos in range(align.s1, len(genome_mapping[align.ref])): + genome_mapping[align.ref][pos] = 1 + for pos in range(1, align.e1 + 1): + genome_mapping[align.ref][pos] = 1 + return genome_mapping + + +def find_ambiguity_alignments(ref_aligns): + ambiguity_contigs = defaultdict(list) + for key in ref_aligns.keys(): + for contig in ref_aligns[key]: + if contig.contig in l_names_ambiguity_contigs: + ambiguity_contigs[key].append(contig) + for key in ambiguity_contigs.keys(): + ambiguity_contigs[key] = sorted(ambiguity_contigs[key], key=lambda ctg: ctg.s1) + return ambiguity_contigs + + +def find_contig_alignments_to_haplotypes(ambiguity_contigs_pos): + alignments_by_contigs = defaultdict(list) + for chrom in ambiguity_contigs_pos.keys(): + for align in ambiguity_contigs_pos[chrom]: + alignments_by_contigs[align.contig].append(align) + return alignments_by_contigs + + +def get_coords_for_non_overlapping_seq(ambiguity_contigs, ctg_under_study): + l_non_overlapping_pos = [] + coords_to_ignore = [] + coords_of_study_ctg = [[ctg_under_study.s1, ctg_under_study.e1]] # [[start_coord, stop_coord],...] + for study_ctg_coord in coords_of_study_ctg: + for compared_ctg in ambiguity_contigs[ctg_under_study.ref]: + if compared_ctg.contig != ctg_under_study.contig: + if study_ctg_coord[0] >= compared_ctg.s1: # example: start_interest 20 >= start_compared 10 + if study_ctg_coord[0] <= compared_ctg.e1: + if study_ctg_coord[1] > compared_ctg.e1: + study_ctg_coord[0] = compared_ctg.e1 + 1 + elif study_ctg_coord[1] <= compared_ctg.e1: + coords_to_ignore.append(study_ctg_coord) + break + + elif study_ctg_coord[0] < compared_ctg.s1: # example: start_interest 20 < start_compared 30 + if study_ctg_coord[1] >= compared_ctg.s1: + if study_ctg_coord[1] <= compared_ctg.e1: + study_ctg_coord[1] = compared_ctg.s1 - 1 + elif study_ctg_coord[1] > compared_ctg.e1: + coords_of_study_ctg.append([study_ctg_coord[0], compared_ctg.s1 - 1]) + coords_of_study_ctg.append([compared_ctg.e1 + 1, study_ctg_coord[1]]) + coords_to_ignore.append(study_ctg_coord) + break + + for coord in coords_of_study_ctg: + if coord not in coords_to_ignore: + start_coord, stop_coord = coord + l_non_overlapping_pos += list(set(range(start_coord, stop_coord + 1))) + return l_non_overlapping_pos + + +def leave_best_alignment_for_ambiguity_contigs(ref_aligns, reference_chromosomes, ca_output): + genome_mapping = genome_coverage_for_unambiguity_contigs(ref_aligns, reference_chromosomes) + ambiguity_contigs = find_ambiguity_alignments(ref_aligns) + alignments_by_contigs = find_contig_alignments_to_haplotypes(ambiguity_contigs) # {ambiguity_contig_name: [alignment_haplotype_1, alignment_haplotype_2, ..., alignment_haplotype_n]} + ca_output.stdout_f.write('\nNOTICE: used just one alignment of ambiguity contigs to the best haplotype for the analysis.\n') + + for contig in alignments_by_contigs.keys(): + l_of_alignment_to_not_the_best_haplotypes = [] + contribution_of_non_overlapping_seq = {} # {ref_align: [contribution_of_contig_to_genome_mapping, len_align_to_ref]} + for align in alignments_by_contigs[contig]: + contribution_of_non_overlapping_seq[align.ref] = [0] + l_non_overlapping_pos = get_coords_for_non_overlapping_seq(ambiguity_contigs, align) + for position in l_non_overlapping_pos: + if genome_mapping[align.ref][position] == 0: + contribution_of_non_overlapping_seq[align.ref][0] += 1 + contribution_of_non_overlapping_seq[align.ref][0] *= align.idy / 100 + contribution_of_non_overlapping_seq[align.ref].append(align.e1 - align.s1 + 1) + + ref_of_best_alignment_of_ctg, _ = sorted(contribution_of_non_overlapping_seq.items(), key=lambda x: x[1])[::-1][0] + for align in alignments_by_contigs[contig]: + if align.ref == ref_of_best_alignment_of_ctg: + for pos in range(align.s1, align.e1 + 1): + genome_mapping[align.ref][pos] = 1 + ca_output.stdout_f.write( + f'\tContig {align.contig}: best alignment to {ref_of_best_alignment_of_ctg}. ') + else: + l_of_alignment_to_not_the_best_haplotypes.append(align.ref) + for ambiguity_contig in ambiguity_contigs[align.ref]: + if ambiguity_contig.contig == contig: + ambiguity_contigs[align.ref].remove(ambiguity_contig) + break + for ambiguity_contig in ref_aligns[align.ref]: + if ambiguity_contig.contig == contig: + ref_aligns[align.ref].remove(ambiguity_contig) + break + alignment_to_not_the_best_haplotypes = ', '.join(l_of_alignment_to_not_the_best_haplotypes) + ca_output.stdout_f.write(f'Skipping alignment to {alignment_to_not_the_best_haplotypes}.\n') + + +def count_missed_heterozygosity(used_snps_fpath, ref1_to_ref2_fpath, ref2_to_ref1_fpath): + # Creating dictionary for file ref2_to_ref1 + # Dictionary contains positions of SNPs on ref2 and corresponding nucleotides on ref1 + ref2_to_ref1_dict = {} + with open(ref2_to_ref1_fpath) as file: + for _, line in enumerate(file): + if len(line.strip()) != 0: + ref1_name, _, ref1_pos, ref1_nucl, ref2_nucl, _ = line.strip().split('\t') + ref2_to_ref1_dict[ref1_pos] = ref2_nucl + + # Creating dictionary for file ref2_to_ref1 + # Dictionary contains positions of SNPs on ref2 and corresponding nucleotides on ref1 + ref1_to_ref2_dict = {} + with open(ref1_to_ref2_fpath) as file: + for _, line in enumerate(file): + if len(line.strip()) != 0: + ref2_name, _, ref2_pos, ref2_nucl, ref1_nucl, _ = line.strip().split('\t') + ref1_to_ref2_dict[ref2_pos] = ref1_nucl + + # Counting cases of lost heterozygosity + n_missed_heterozygosity = 0 + with open(used_snps_fpath) as used_snps_file: + for l_number, line in enumerate(used_snps_file): + if len(line.strip()) != 0: + ref_name, ctg_name, ref_pos, ref_nucl, ctg_nucl, ctg_pos = line.strip().split('\t') + if ref_name == ref1_name: # Searching in the ref2_to_ref1_dict + if ref_pos in ref2_to_ref1_dict.keys() and ctg_nucl == ref2_to_ref1_dict[ref_pos]: + n_missed_heterozygosity += 1 + elif ref_name == ref2_name: # Searching in the ref1_to_ref2_dict + if ref_pos in ref1_to_ref2_dict.keys() and ctg_nucl == ref1_to_ref2_dict[ref_pos]: + n_missed_heterozygosity += 1 + + return n_missed_heterozygosity + + +def split_ref_file_by_haplotypes(ref_fpath, dip_genome_by_chr, aux_files_dirpath): + ref1, ref2 = dip_genome_by_chr.keys() + ref1_fpath = os.path.join(aux_files_dirpath, ref1 + '.fa') + ref2_fpath = os.path.join(aux_files_dirpath, ref2 + '.fa') + open(ref1_fpath, 'w').close() + open(ref2_fpath, 'w').close() + + with open(ref_fpath, 'r') as ref_file: + list_of_ref_chr = ref_file.read().split('>') + + for fasta_entry in list_of_ref_chr: + for key, value in dip_genome_by_chr.items(): + for chrom in value: + if fasta_entry.startswith(chrom): + if key == ref1: + with open(ref1_fpath, 'a') as ref1_file: + ref1_file.write('>' + fasta_entry) + elif key == ref2: + with open(ref2_fpath, 'a') as ref2_file: + ref2_file.write('>' + fasta_entry) + return ref1_fpath, ref2_fpath \ No newline at end of file diff --git a/quast_libs/mash/LICENSE.txt b/quast_libs/mash/LICENSE.txt new file mode 100644 index 0000000000..3f8586c5da --- /dev/null +++ b/quast_libs/mash/LICENSE.txt @@ -0,0 +1,61 @@ +PURPOSE + +Mash is a fast sequence distance estimator that uses the MinHash +algorithm and is designed to work with genomes and metagenomes in the +form of assemblies or reads. It is implemented in C++ and is +distributed with: + +KSeq + lh3lh3.users.sourceforge.net/kseq.shtml + MIT License + +MurmurHash3 + code.google.com/p/smhasher/wiki/MurmurHash3 + Public domain + +Open Bloom Filter + https://code.google.com/p/bloom/source/browse/trunk/bloom_filter.hpp + Common Public License + +Robin_Hood Unordered Map and Set + https://github.com/martinus/robin-hood-hashing + MIT License + +COPYRIGHT LICENSE + +Copyright © 2015, Battelle National Biodefense Institute (BNBI); +all rights reserved. Authored by: Brian Ondov, Todd Treangen, +Sergey Koren, and Adam Phillippy + +This Software was prepared for the Department of Homeland Security +(DHS) by the Battelle National Biodefense Institute, LLC (BNBI) as +part of contract HSHQDC-07-C-00020 to manage and operate the National +Biodefense Analysis and Countermeasures Center (NBACC), a Federally +Funded Research and Development Center. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + +1. Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its +contributors may be used to endorse or promote products derived from +this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/quast_libs/mash/mash_linux b/quast_libs/mash/mash_linux new file mode 100755 index 0000000000..cf2aa0e20d Binary files /dev/null and b/quast_libs/mash/mash_linux differ diff --git a/quast_libs/mash/mash_osx b/quast_libs/mash/mash_osx new file mode 100755 index 0000000000..74575bafec Binary files /dev/null and b/quast_libs/mash/mash_osx differ diff --git a/quast_libs/options_parser.py b/quast_libs/options_parser.py index b40b3d021d..284554b748 100644 --- a/quast_libs/options_parser.py +++ b/quast_libs/options_parser.py @@ -274,6 +274,7 @@ def parse_options(logger, quast_args): mode = get_mode(quast_args[0]) is_metaquast = True if mode == 'meta' else False qconfig.large_genome = True if mode == 'large' else False + qconfig.ploid_mode = True if mode == 'ploid' else False if '-h' in quast_args or '--help' in quast_args or '--help-hidden' in quast_args: qconfig.usage('--help-hidden' in quast_args, mode=mode, short=False) @@ -741,7 +742,20 @@ def parse_options(logger, quast_args): (['--agb'], dict( dest='is_agb_mode', action='store_true') - ) + ), + (['--ploid'], dict( + dest='ploid_mode', + action='store_true') + ), + (['--ploid-assembly-type'], dict( + dest='ploid_assembly_type', + type='string', + default=qconfig.ploid_assembly_type, + action='callback', + callback=check_str_arg_value, + callback_args=(logger,), + callback_kwargs={'available_values': ['consensus', 'phased']}) + ), ] if is_metaquast: options += [ diff --git a/quast_libs/qconfig.py b/quast_libs/qconfig.py index 1e7270cbd1..b6bf4be2d3 100644 --- a/quast_libs/qconfig.py +++ b/quast_libs/qconfig.py @@ -83,8 +83,11 @@ all_labels_from_dirs = False run_busco = False large_genome = False +ploid_mode = False +ploid_assembly_type = 'phased' use_kmc = False report_all_metrics = False +var_haplotypes = [1,2,3,4,5,6,7,8] # ideal assembly section optimal_assembly = False @@ -354,6 +357,8 @@ def get_mode(binary_path=None): mode = 'default' if basename(binary_path).startswith("metaquast"): mode = 'meta' + elif basename(binary_path).startswith("dipquast") or ploid_mode: + mode = 'ploid' elif basename(binary_path).startswith("quast-lg") or large_genome: mode = 'large' return mode @@ -365,6 +370,8 @@ def print_version(mode=None): full_version = 'QUAST v' + quast_version() if mode == 'meta': full_version += ' (MetaQUAST mode)' + elif mode == 'ploid': + full_version += ' (dipQUAST mode)' elif mode == 'large': full_version += ' (QUAST-LG mode)' sys.stdout.write(full_version + '\n') @@ -378,6 +385,8 @@ def usage(show_hidden=False, mode=None, short=True, stream=sys.stdout): meta = True if mode == 'meta' else False if mode == 'meta': stream.write('MetaQUAST: Quality Assessment Tool for Metagenome Assemblies\n') + elif mode == 'ploid': + stream.write('dipQUAST: Quality Assessment Tool for Diploid Genome Assemblies\n') elif mode == 'large': stream.write('QUAST-LG: Quality Assessment Tool for Large Genome Assemblies\n') else: @@ -452,6 +461,10 @@ def usage(show_hidden=False, mode=None, short=True, stream=sys.stdout): stream.write(" --blast-db Custom BLAST database (.nsq file). By default, MetaQUAST searches references in SILVA database\n") stream.write(" --use-input-ref-order Use provided order of references in MetaQUAST summary plots (default order: by the best average value)\n") stream.write(" --contig-thresholds Comma-separated list of contig length thresholds [default: %s]\n" % contig_thresholds) + if mode != 'ploid': + stream.write(" --ploid Use to evaluate the assembly quality of ploid genomes.\n" + " Works best with --ambiguity-usage one\n") + stream.write(" --ploid-assembly-type