Revise explanation of 'join-to'/'join-from' in `mergelist` documentation

As far I can see `mergelist` works (beautifully!) as one would hope/expect: it is (or at least resembles) a recursive `x[i]` join across a list of tables, where the column order reflects the left-to-right order of the tables. In other words each merge is at least analogous to:

`mergelist(list(LHS,RHS), how="right", ...)` <---> `LHS[i=RHS, ...)]`

`mergelist(list(LHS,RHS), how="left", ...)`  <---> `RHS[i=LHS, j=<LHS-then-RHS-cols>, ...)]`

All other cases (`how`s) are like `how="left"`. I'm abstracting here from details like having to complete the full join with `how="full"`, and the fact that `mult` seems to have some specialised behaviour in inner/outer joins (see note at the end of the documentation).

This basic description of things (even if it's only an analogy to whatever the internals are) accounts for column and row order, the "direction of action" of `mult`, and the `key` that gets used with implicit `on`.

```
library(data.table)

# ________________________
# row and column order
# ________________________
LHS <- data.table(id=1:3)[, a:=paste0("LHS",.I)]
RHS <- data.table(id=4:2)[, b:=paste0("RHS",.I)]
LHS
RHS

# all TRUE
identical(
  mergelist(list(LHS,RHS), on="id", how="left"),
  RHS[LHS, on="id", c("id","a","b")]
)
identical(
  mergelist(list(LHS,RHS),on="id",how="inner"),
  RHS[LHS, on="id", c("id","a","b"), nomatch=NULL]
)
identical(
  mergelist(list(LHS,RHS), on="id", how="right"),
  LHS[RHS, on="id"]
)

# ________________________
# mult
# ________________________
LHSm <- data.table(id=rep(1:3, each=2))[, a:=paste0("LHS",.I)][]
RHSm <- data.table(id=rep(4:2, each=2))[, b:=paste0("RHS",.I)][]
LHSm
RHSm

# all TRUE
identical(
  mergelist(list(LHSm,RHSm), on="id", mult="first", how="left"),
  RHSm[LHSm, on="id", c("id","a","b"), mult="first"]
)
identical(
  mergelist(list(LHSm,RHSm), on="id", mult="first", how="right"),
  LHSm[RHSm, on="id", mult="first"]
)

# ________________________
# key, implicit `on`
# ________________________
LHSk <- setkey(copy(LHS), "id")

# errors (DT/join-to not keyed)
try(mergelist(list(LHSk,RHS), how="left"))
try(RHS[LHSk, c("id","a","b")])

# TRUE
identical(
  mergelist(list(LHSk,RHS), on="id", mult="first", how="right"),
  LHSk[RHS]
)

```
## Current documentation and "join-from"/"join-to"

The documentation uses "join-from" and "join-to" to refer to (the equivalents of) `i` and `x` respectively. I have some worries about this. First, to a data.table user this seems like obfuscation - why not just say `i` and `x` (or e.g. the "`i`-equivalent" and "`x`-equivalent" tables)? Second, to an outside user (who might be dipping into data.table precisely to use `mergelist`), there is a risk that these terms will mean nothing at all and just end up being mysterious intermediate variables that obscure the documentation. Why can't we just say what goes on (with respect to row order, direction of `mult`, `on`), pointing out in each case that things are the other way round in a right-join? Also, for me at least, the terms don't make sense: in my mind it is `i` that is joined to from `DT` (because `i` presents an ordered list of demands that `DT` supplies where it can). This is not worth arguing about because it's so subjective, but the point is that neither the terms themselves nor the words "baseline" or "authoritative" can hope to convey very much.

I'm also puzzled about the content of the Details section that explains this distinction:

> The terms _join-to_ and _join-from_ indicate which in a pair of tables is the "baseline" or "authoritative" source – this governs the ordering of rows and columns. Whether each refers to the "left" or "right" table of a pair depends on the `how` argument:
>
> 1. `how %in% c("left", "semi", "anti")`: _join-to_ is _RHS_, _join-from_ is _LHS_.
> 2. `how %in% c("inner", "full", "cross")`: _LHS_ and _RHS_ tables are treated equally, so that the terms are interchangeable.
> 3. `how == "right"`: _join-to_ is _LHS_, _join-from_ is _RHS_.

This clearly says that the identity of the "join-from"/"join-to" tables determines column and row order in `mergelist`. But

- with respect to column order, it doesn't, because column order is always LHS-then-RHS (see above).
- with respect to row order, case 2 is logically impossible . `i` retains its row order because it is iterated over, but `DT` will only do so if:
  - the join columns have the same order in both tables
  - a row of `i` never joins with multiple rows of `DT` (we mustn't have a pattern of `DT`'s row indices like 1,2,3,2,3,4,...)
  - in a full join, the value of the join columns in all of `DT`'s non-matching rows (which get appended at the bottom) is above the maximum in any of `i`s rows

In fact, inner, full and cross joins behave like `how="left"` (*LHS* as the "join-from"/`i`-equivalent table).

## Proposed improvement

If I have got all this right (and I may well be missing something), my suggestion is:

- Eliminate "join-from"/"join-to" and just say what happens in relation to implicit `on`, row order, and `mult` in `how="right` vs the other cases (after fixing the problems I've noted). This directly documents how `mergelist` works in these respects without these baffling intermediate variables.

- Add a note for the benefit of data.table users explicitly spelling out the connection (or analogy, whichever it is) between `mergelist` and `x[i]`.

I think this documentation would be much clearer and more direct to both "audiences".

Delete the "join-from"/"join-to" section in Details and replace with:

> The columns of the result reflect the left-right order of the tables in `l`, i.e. columns of *LHS* are placed before columns of *RHS* in the output of each merge. The rows reflect the row orders of *LHS* then *RHS* in all cases except `how="right"`, where they reflect *RHS* then *LHS*.

Then remove all references to "join-from"/"join-to" as follows:

- Definition of `on`:

> `on`  character vector of column names to merge on; when missing, the `key` of the *RHS* table is used (*LHS* table if `how="right"`).

- Definition of `mult` (first sentence):

> mult  Character string, controls how to proceed when a row in the *LHS* table encounters multiple matching rows in the *RHS* table (vice versa with `how="right"`). 

Details re. `mult`:

> Using `mult="error"` will throw an error when a row in the *LHS* table encounters multiple matching rows in the *RHS* table (vice versa with `how="right"`).

(NB I've slightly altered the baseline wording of these two.)

Add a separate section, something like:

> Users familiar with data.table `x[i]` join syntax may find it helpful to think of each merge in `mergelist` as being equivalent to `LHS[RHS, ...]` when `how="right"`, and `RHS[LHS, j=<LHS-then-RHS-cols>, ...]` in all other cases. The row order, direction of action of `mult`, and the `key` used in implicit `on`, all reflect the "usual rules" of `x[i]`. Please note, however, that `mergelist` differs from `x[i]` joins in the specific behaviour of `mult`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Revise explanation of 'join-to'/'join-from' in `mergelist` documentation #7190

Current documentation and "join-from"/"join-to"

Proposed improvement

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Revise explanation of 'join-to'/'join-from' in mergelist documentation #7190

Description

Current documentation and "join-from"/"join-to"

Proposed improvement

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Revise explanation of 'join-to'/'join-from' in `mergelist` documentation #7190