-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
As far I can see mergelist
works (beautifully!) as one would hope/expect: it is (or at least resembles) a recursive x[i]
join across a list of tables, where the column order reflects the left-to-right order of the tables. In other words each merge is at least analogous to:
mergelist(list(LHS,RHS), how="right", ...)
<---> LHS[i=RHS, ...)]
mergelist(list(LHS,RHS), how="left", ...)
<---> RHS[i=LHS, j=<LHS-then-RHS-cols>, ...)]
All other cases (how
s) are like how="left"
. I'm abstracting here from details like having to complete the full join with how="full"
, and the fact that mult
seems to have some specialised behaviour in inner/outer joins (see note at the end of the documentation).
This basic description of things (even if it's only an analogy to whatever the internals are) accounts for column and row order, the "direction of action" of mult
, and the key
that gets used with implicit on
.
library(data.table)
# ________________________
# row and column order
# ________________________
LHS <- data.table(id=1:3)[, a:=paste0("LHS",.I)]
RHS <- data.table(id=4:2)[, b:=paste0("RHS",.I)]
LHS
RHS
# all TRUE
identical(
mergelist(list(LHS,RHS), on="id", how="left"),
RHS[LHS, on="id", c("id","a","b")]
)
identical(
mergelist(list(LHS,RHS),on="id",how="inner"),
RHS[LHS, on="id", c("id","a","b"), nomatch=NULL]
)
identical(
mergelist(list(LHS,RHS), on="id", how="right"),
LHS[RHS, on="id"]
)
# ________________________
# mult
# ________________________
LHSm <- data.table(id=rep(1:3, each=2))[, a:=paste0("LHS",.I)][]
RHSm <- data.table(id=rep(4:2, each=2))[, b:=paste0("RHS",.I)][]
LHSm
RHSm
# all TRUE
identical(
mergelist(list(LHSm,RHSm), on="id", mult="first", how="left"),
RHSm[LHSm, on="id", c("id","a","b"), mult="first"]
)
identical(
mergelist(list(LHSm,RHSm), on="id", mult="first", how="right"),
LHSm[RHSm, on="id", mult="first"]
)
# ________________________
# key, implicit `on`
# ________________________
LHSk <- setkey(copy(LHS), "id")
# errors (DT/join-to not keyed)
try(mergelist(list(LHSk,RHS), how="left"))
try(RHS[LHSk, c("id","a","b")])
# TRUE
identical(
mergelist(list(LHSk,RHS), on="id", mult="first", how="right"),
LHSk[RHS]
)
Current documentation and "join-from"/"join-to"
The documentation uses "join-from" and "join-to" to refer to (the equivalents of) i
and x
respectively. I have some worries about this. First, to a data.table user this seems like obfuscation - why not just say i
and x
(or e.g. the "i
-equivalent" and "x
-equivalent" tables)? Second, to an outside user (who might be dipping into data.table precisely to use mergelist
), there is a risk that these terms will mean nothing at all and just end up being mysterious intermediate variables that obscure the documentation. Why can't we just say what goes on (with respect to row order, direction of mult
, on
), pointing out in each case that things are the other way round in a right-join? Also, for me at least, the terms don't make sense: in my mind it is i
that is joined to from DT
(because i
presents an ordered list of demands that DT
supplies where it can). This is not worth arguing about because it's so subjective, but the point is that neither the terms themselves nor the words "baseline" or "authoritative" can hope to convey very much.
I'm also puzzled about the content of the Details section that explains this distinction:
The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns. Whether each refers to the "left" or "right" table of a pair depends on the
how
argument:
how %in% c("left", "semi", "anti")
: join-to is RHS, join-from is LHS.how %in% c("inner", "full", "cross")
: LHS and RHS tables are treated equally, so that the terms are interchangeable.how == "right"
: join-to is LHS, join-from is RHS.
This clearly says that the identity of the "join-from"/"join-to" tables determines column and row order in mergelist
. But
- with respect to column order, it doesn't, because column order is always LHS-then-RHS (see above).
- with respect to row order, case 2 is logically impossible .
i
retains its row order because it is iterated over, butDT
will only do so if:- the join columns have the same order in both tables
- a row of
i
never joins with multiple rows ofDT
(we mustn't have a pattern ofDT
's row indices like 1,2,3,2,3,4,...) - in a full join, the value of the join columns in all of
DT
's non-matching rows (which get appended at the bottom) is above the maximum in any ofi
s rows
In fact, inner, full and cross joins behave like how="left"
(LHS as the "join-from"/i
-equivalent table).
Proposed improvement
If I have got all this right (and I may well be missing something), my suggestion is:
-
Eliminate "join-from"/"join-to" and just say what happens in relation to implicit
on
, row order, andmult
inhow="right
vs the other cases (after fixing the problems I've noted). This directly documents howmergelist
works in these respects without these baffling intermediate variables. -
Add a note for the benefit of data.table users explicitly spelling out the connection (or analogy, whichever it is) between
mergelist
andx[i]
.
I think this documentation would be much clearer and more direct to both "audiences".
Delete the "join-from"/"join-to" section in Details and replace with:
The columns of the result reflect the left-right order of the tables in
l
, i.e. columns of LHS are placed before columns of RHS in the output of each merge. The rows reflect the row orders of LHS then RHS in all cases excepthow="right"
, where they reflect RHS then LHS.
Then remove all references to "join-from"/"join-to" as follows:
- Definition of
on
:
on
character vector of column names to merge on; when missing, thekey
of the RHS table is used (LHS table ifhow="right"
).
- Definition of
mult
(first sentence):
mult Character string, controls how to proceed when a row in the LHS table encounters multiple matching rows in the RHS table (vice versa with
how="right"
).
Details re. mult
:
Using
mult="error"
will throw an error when a row in the LHS table encounters multiple matching rows in the RHS table (vice versa withhow="right"
).
(NB I've slightly altered the baseline wording of these two.)
Add a separate section, something like:
Users familiar with data.table
x[i]
join syntax may find it helpful to think of each merge inmergelist
as being equivalent toLHS[RHS, ...]
whenhow="right"
, andRHS[LHS, j=<LHS-then-RHS-cols>, ...]
in all other cases. The row order, direction of action ofmult
, and thekey
used in impliciton
, all reflect the "usual rules" ofx[i]
. Please note, however, thatmergelist
differs fromx[i]
joins in the specific behaviour ofmult
.