Commit 1617b87

this will become version 3 at arXiv
1 parent 2191ac5 commit 1617b87

1 file changed

+34
-19
lines changed

tex/minimap2.tex

Lines changed: 34 additions & 19 deletions
@@ -34,7 +34,7 @@ \section{Motivation:} Recent advances in sequencing technologies promise
 length. Existing alignment programs are unable or inefficient to process such data
 at scale, which presses for the development of new alignment algorithms.
 
-\section{Results:} Minimap2 is a general-purpose mapper to align DNA or long
+\section{Results:} Minimap2 is a general-purpose alignment program to map DNA or long
 mRNA sequences against a large reference database. It works with accurate short
 reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%,
 full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
@@ -66,15 +66,22 @@ \section{Introduction}
 approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
 \citet{Suzuki130633} extended our work with a fast and novel algorithm on
 generating base-level alignment, which in turn inspired us to develop minimap2
-towards higher accuracy and more practical functionality.
+with added functionality.
 
 Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
 traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
 optimized for long noisy sequence reads and are tens of times slower than
 dedicated long-read aligners. When developing minimap2 initially for aligning
-genomic DNA only, we realized minor modifications could make it competitive for
-aligning mRNAs as well. Minimap2 is a first RNA-seq aligner specifically
-designed for long noisy reads.
+genomic DNA only, we realized minor modifications could enable the base
+algorithm to map mRNAs as well. Minimap2 becomes a first RNA-seq aligner
+specifically designed for long noisy reads. We have also extended the original
+algorithm to map short reads at a speed faster than several mainstream
+short-read mappers.
+
+In this article, we will describe the minimap2 algorithm and its applications
+to different types of input sequences. We will evaluate the performance and
+accuracy of minimap2 on several simulated and real data sets and demonstrate
+the versatility of minimap2.
 
 \begin{methods}
 \section{Methods}
@@ -366,12 +373,12 @@ \subsection{Aligning spliced sequences}
 
 In the spliced alignment mode, minimap2 further increases the density of
 minimizers and disables banded alignment. Together with the two-round DP-based
-alignment, spliced alignment is several times slower than DNA sequence
+alignment, spliced alignment is several times slower than genomic DNA
 alignment.
 
 \subsection{Aligning short paired-end reads}
 
-During chainging, minimap2 takes a pair of reads as one read with a gap of
+During chainging, minimap2 takes a pair of reads as one fragment with a gap of
 unknown length in the middle. It applies a normal gap cost between seeds on the
 same read but is a more permissive gap cost between seeds on different reads.
 More precisely, the gap cost during chaining is:
@@ -423,9 +430,7 @@ \subsection{Aligning long genomic reads}
 and LAMSA~\citep{Liu:2017aa} because they either
 crashed or produced malformatted output. In this evaluation, minimap2 has
 higher power to distinguish unique and repetitive hits, and achieves overall
-higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
-even if we skip DP-based alignment (data not shown), confirming chaining alone
-is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
+higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
 NGMLR provide better mapping quality estimate: they rarely give repetitive hits
 high mapping quality. Apparently, other aligners may
 occasionally miss close suboptimal hits and be overconfident in wrong mappings.
@@ -498,10 +503,10 @@ \subsection{Aligning long spliced reads}
 We have also evaluated spliced aligners on public Iso-Seq data (human Alzheimer
 brain from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The
 observation is similar: minimap2 is faster at higher junction accuracy.
-On a private Nanopore Direct RNA data set with $>$20\% sequencing error rate
-(M\"{u}ller et al, personal communication), minimap2 aligned 940,346 introns
-from 239,976 mapped reads with 88.5\% of them consistent with human gene
-annotations. In comparison, only 40.3\% of GMAP introns found in known gene
+On a private Nanopore Direct RNA data set with $\sim$17\% sequencing error rate
+(N. Loman, personal communication), minimap2 aligned 96\,467 introns
+from 37\,068 mapped reads with 95.4\% of them consistent with human gene
+annotations. In comparison, only 74.8\% of GMAP introns found in known gene
 annotations.
 
 We noted that GMAP and SpAln have not been optimized for noisy reads. We are
@@ -551,24 +556,23 @@ \subsection{Aligning short genomic reads}
 ERR1341796. In this evaluation, minimap2 has higher SNP false negative rate
 (FNR; 2.5\% of minimap2 vs 2.2\% of BWA-MEM), but fewer false positive SNPs per
 million bases (FPPM; 3.0 vs 3.9), lower 2--50bp INDEL FNR (7.3\% vs 7.5\%) and
-similar INDEL FPPM (both 1.0). In comparison, Bowtie2 has a SNP FNR of 4.7\%
-and INDEL FNR of 10.4\%. Minimap2 is broadly similar to BWA-MEM in the context
-of small variant calling.
+similar INDEL FPPM (both 1.0). Minimap2 is broadly similar to BWA-MEM in the
+context of small variant calling.
 
 \subsection{Other applications}
 
 Minimap2 retains minimap's functionality to find overlaps between long reads
 and to search against large multi-species databases such as \emph{nt} from
 NCBI. Minimap2 can also align similar genomes or different assemblies of the
 same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
-SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times as fast as
+SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster
 MUMmer4~\citep{Kurtz:2004zr}.
 
 \section{Discussions}
 
 Minimap2 is a versatile mapper and pairwise aligner for nucleotide sequences.
 It works with short reads, assembly contigs and long noisy genomic and RNA-seq
-reads. It can be used as a read mapper, long-read overlapper or a full-genome
+reads, and can be used as a read mapper, long-read overlapper or a full-genome
 aligner. Minimap2 is also accurate and efficient, often outperforming other
 domain-specific alignment tools in terms of both speed and accuracy.
 
@@ -586,6 +590,17 @@ \section{Discussions}
 spliced reads and multiple reads per fragment. This gives us the opportunity to
 extend the same base algorithm to a variety of use cases.
 
+Modern mainstream aligners often use a full-text index, such as suffix array or
+FM-index, to index reference sequences. An advantage of this approach is that
+we can use exact seeds of arbitrary lengths, which helps to increase seed
+uniqueness and reduce unsuccessful extensions. Minimap2 indexes reference
+k-mers with a hash table instead. Such fixed-length seeds are inferior to
+variable-length seeds in theory, but can be computed much more efficiently in
+practice. When a query sequence has multiple seed hits, we can afford to skip
+some highly repetitive seeds without affecting the final accuracy. This further
+alleviates the concern with the uniqueness of seeds. Hash table is the ideal
+data structure for mapping long query sequences.
+
 \section*{Acknowledgements}
 We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
 masterpiece and insightful notes before formal publication. We thank M.
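
Note on the paragraph added to the Discussions section: the idea of indexing fixed-length k-mers in a hash table and skipping highly repetitive seeds can be illustrated with a minimal Python sketch. This is not minimap2's implementation. It indexes every k-mer rather than sampled minimizers, ignores reverse complements, and the names index_kmers, seed_hits and max_occ are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def index_kmers(ref, k):
    """Map every k-mer of `ref` to the list of its start positions.

    Simplification: a real mapper would sample k-mers (e.g. minimizers)
    rather than store all of them.
    """
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def seed_hits(query, index, k, max_occ):
    """Collect (query_pos, ref_pos) seed hits for `query`.

    Seeds that occur more than `max_occ` times in the reference are
    skipped; `max_occ` is a hypothetical threshold for this sketch.
    """
    hits = []
    for j in range(len(query) - k + 1):
        positions = index.get(query[j:j + k], [])
        if len(positions) > max_occ:
            continue  # highly repetitive seed: skip it
        hits.extend((j, p) for p in positions)
    return hits
```

For example, with the reference "ACGTACGTTTTTTTT" and k = 4, the seed "TTTT" occurs five times. Querying "CGTTTT" with max_occ = 2 keeps only the hits from the two unique seeds "CGTT" and "GTTT", which are still enough to place the query.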
