Skip to content

Revise explanation of 'join-to'/'join-from' in mergelist documentation #7190

@trobx

Description

@trobx

As far I can see mergelist works (beautifully!) as one would hope/expect: it is (or at least resembles) a recursive x[i] join across a list of tables, where the column order reflects the left-to-right order of the tables. In other words each merge is at least analogous to:

mergelist(list(LHS,RHS), how="right", ...) <---> LHS[i=RHS, ...)]

mergelist(list(LHS,RHS), how="left", ...) <---> RHS[i=LHS, j=<LHS-then-RHS-cols>, ...)]

All other cases (hows) are like how="left". I'm abstracting here from details like having to complete the full join with how="full", and the fact that mult seems to have some specialised behaviour in inner/outer joins (see note at the end of the documentation).

This basic description of things (even if it's only an analogy to whatever the internals are) accounts for column and row order, the "direction of action" of mult, and the key that gets used with implicit on.

library(data.table)

# ________________________
# row and column order
# ________________________
LHS <- data.table(id=1:3)[, a:=paste0("LHS",.I)]
RHS <- data.table(id=4:2)[, b:=paste0("RHS",.I)]
LHS
RHS

# all TRUE
identical(
  mergelist(list(LHS,RHS), on="id", how="left"),
  RHS[LHS, on="id", c("id","a","b")]
)
identical(
  mergelist(list(LHS,RHS),on="id",how="inner"),
  RHS[LHS, on="id", c("id","a","b"), nomatch=NULL]
)
identical(
  mergelist(list(LHS,RHS), on="id", how="right"),
  LHS[RHS, on="id"]
)

# ________________________
# mult
# ________________________
LHSm <- data.table(id=rep(1:3, each=2))[, a:=paste0("LHS",.I)][]
RHSm <- data.table(id=rep(4:2, each=2))[, b:=paste0("RHS",.I)][]
LHSm
RHSm

# all TRUE
identical(
  mergelist(list(LHSm,RHSm), on="id", mult="first", how="left"),
  RHSm[LHSm, on="id", c("id","a","b"), mult="first"]
)
identical(
  mergelist(list(LHSm,RHSm), on="id", mult="first", how="right"),
  LHSm[RHSm, on="id", mult="first"]
)

# ________________________
# key, implicit `on`
# ________________________
LHSk <- setkey(copy(LHS), "id")

# errors (DT/join-to not keyed)
try(mergelist(list(LHSk,RHS), how="left"))
try(RHS[LHSk, c("id","a","b")])

# TRUE
identical(
  mergelist(list(LHSk,RHS), on="id", mult="first", how="right"),
  LHSk[RHS]
)

Current documentation and "join-from"/"join-to"

The documentation uses "join-from" and "join-to" to refer to (the equivalents of) i and x respectively. I have some worries about this. First, to a data.table user this seems like obfuscation - why not just say i and x (or e.g. the "i-equivalent" and "x-equivalent" tables)? Second, to an outside user (who might be dipping into data.table precisely to use mergelist), there is a risk that these terms will mean nothing at all and just end up being mysterious intermediate variables that obscure the documentation. Why can't we just say what goes on (with respect to row order, direction of mult, on), pointing out in each case that things are the other way round in a right-join? Also, for me at least, the terms don't make sense: in my mind it is i that is joined to from DT (because i presents an ordered list of demands that DT supplies where it can). This is not worth arguing about because it's so subjective, but the point is that neither the terms themselves nor the words "baseline" or "authoritative" can hope to convey very much.

I'm also puzzled about the content of the Details section that explains this distinction:

The terms join-to and join-from indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns. Whether each refers to the "left" or "right" table of a pair depends on the how argument:

  1. how %in% c("left", "semi", "anti"): join-to is RHS, join-from is LHS.
  2. how %in% c("inner", "full", "cross"): LHS and RHS tables are treated equally, so that the terms are interchangeable.
  3. how == "right": join-to is LHS, join-from is RHS.

This clearly says that the identity of the "join-from"/"join-to" tables determines column and row order in mergelist. But

  • with respect to column order, it doesn't, because column order is always LHS-then-RHS (see above).
  • with respect to row order, case 2 is logically impossible . i retains its row order because it is iterated over, but DT will only do so if:
    • the join columns have the same order in both tables
    • a row of i never joins with multiple rows of DT (we mustn't have a pattern of DT's row indices like 1,2,3,2,3,4,...)
    • in a full join, the value of the join columns in all of DT's non-matching rows (which get appended at the bottom) is above the maximum in any of is rows

In fact, inner, full and cross joins behave like how="left" (LHS as the "join-from"/i-equivalent table).

Proposed improvement

If I have got all this right (and I may well be missing something), my suggestion is:

  • Eliminate "join-from"/"join-to" and just say what happens in relation to implicit on, row order, and mult in how="right vs the other cases (after fixing the problems I've noted). This directly documents how mergelist works in these respects without these baffling intermediate variables.

  • Add a note for the benefit of data.table users explicitly spelling out the connection (or analogy, whichever it is) between mergelist and x[i].

I think this documentation would be much clearer and more direct to both "audiences".

Delete the "join-from"/"join-to" section in Details and replace with:

The columns of the result reflect the left-right order of the tables in l, i.e. columns of LHS are placed before columns of RHS in the output of each merge. The rows reflect the row orders of LHS then RHS in all cases except how="right", where they reflect RHS then LHS.

Then remove all references to "join-from"/"join-to" as follows:

  • Definition of on:

on character vector of column names to merge on; when missing, the key of the RHS table is used (LHS table if how="right").

  • Definition of mult (first sentence):

mult Character string, controls how to proceed when a row in the LHS table encounters multiple matching rows in the RHS table (vice versa with how="right").

Details re. mult:

Using mult="error" will throw an error when a row in the LHS table encounters multiple matching rows in the RHS table (vice versa with how="right").

(NB I've slightly altered the baseline wording of these two.)

Add a separate section, something like:

Users familiar with data.table x[i] join syntax may find it helpful to think of each merge in mergelist as being equivalent to LHS[RHS, ...] when how="right", and RHS[LHS, j=<LHS-then-RHS-cols>, ...] in all other cases. The row order, direction of action of mult, and the key used in implicit on, all reflect the "usual rules" of x[i]. Please note, however, that mergelist differs from x[i] joins in the specific behaviour of mult.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions