Skip to content

Conversation

venom1204
Copy link
Contributor

Closes #5397.

This PR adds a warning when by/keyby are supplied but j is a simple column selection vector (character/numeric). Previously this was accepted silently; this change now warns the user and guides them to use .SD if grouping was intended.
The warning is suppressed when with=FALSE is explicitly provided, respecting user intent. New tests and documentation updates are included.

hi @tdhock — please take a look when you have time.
Thanks!

Copy link

github-actions bot commented Aug 16, 2025

  • HEAD=issue__5379 slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit b150b0a

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 45 seconds
Installing different package versions 20 seconds
Running and plotting the test cases 2 minutes and 35 seconds

@@ -97,8 +97,8 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac

See \href{../doc/datatable-intro.html}{\code{vignette("datatable-intro")}} and \code{example(data.table)}.}

\item{by}{ Column names are seen as if they are variables (as in \code{j} when \code{with=TRUE}). The \code{data.table} is then grouped by the \code{by} and \code{j} is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. \code{by} accepts:

\item{by}{ Column names are seen as if they are variables (as in \code{j} when \code{with=TRUE}). \emph{Note that `by` and `keyby` are ignored when `j` is a character or numeric vector used for selecting columns (i.e., when the internal `with=FALSE` is triggered).} The \code{data.table} is then grouped by the \code{by} and \code{j} is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. \code{by} accepts:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use ` in Rd files like this, use \code{} instead

@venom1204 venom1204 requested a review from jangorecki August 16, 2025 19:03
R/data.table.R Outdated
@@ -739,6 +739,13 @@ replace_dot_alias = function(e) {
if (!length(j) && !notj) return( null.data.table() )
if (is.factor(j)) j = as.character(j) # fix for FR: #358
if (is.character(j)) {
if (!missingby) {
warning(
"`by` or `keyby` is ignored when `j` is a character vector used for column selection. ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not use backticks in warnings (only in markdown), please remove.
also please use one long character string (easier to translate), instead of breaking into several strings/lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly it's not so simple:

l = make_linter_from_xpath(
  "
    //SYMBOL_FUNCTION_CALL[
      starts-with(text(), 'stop')
      or starts-with(text(), 'warning')
      or starts-with(text(), 'message')
      or starts-with(text(), 'packageStartupMessage')
    ]
      /parent::expr[
        following-sibling::expr
        //STR_CONST[contains(text(), '`')]
      ]
  ",
  'message with backtick'
)
lint_dir("R", l())

Gives hits

data.table.R:315:9: warning: [l] message with backtick
        stopf("Invalid use of `:=` inside `{}`. `:=` must be the only expression inside `{}` when used in `j`. Instead of: DT[{tmp1 <- ...; tmp2 <- ...; someCol := tmp1 * tmp2}], Use: DT[, someCol := {tmp1 <- ...; tmp2 <- ...; tmp1 * tmp2}]")
        ^~~~~
data.table.R:724:31: warning: [l] message with backtick
      if (jsub %iscall% ":=") stopf("`:=` is only supported under with=TRUE, see ?`:=`.")
                              ^~~~~
data.table.R:2671:40: warning: [l] message with backtick
  if (identical(name, quote(`*tmp*`))) stopf("setalloccol attempting to modify `*tmp*`")
                                       ^~~~~
data.table.R:2887:3: warning: [l] message with backtick
  stopf('Check that is.data.table(DT) == TRUE. Otherwise, `:=` is defined for use in j, once only and in particular ways. See help(":=", "data.table"). A common reason for this error is allocating a new column in `j` and using `<-` instead of `:=`; e.g., `DT[, new_col <- 1]` should be `DT[, new_col := 1]`. Another is using `:=` in a multi-statement `{...}` block; please use `:=` as the only statement in `j`.', class="dt_invalid_let_error")
  ^~~~~
data.table.R:3241:5: warning: [l] message with backtick
    stopf("It looks like you re-used `:=` in argument %d a functional assignment call -- use `=` instead: %s(col1=val1, col2=val2, ...)", jj-1L, call_name)
    ^~~~~
foverlaps.R:3:47: warning: [l] message with backtick
  if (!is.data.table(y) || !is.data.table(x)) stopf("y and x must both be data.tables. Use `setDT()` to convert list/data.frames to data.tables by reference or as.data.table() to convert to data.tables by copying.")
                                              ^~~~~
groupingsets.R:72:5: warning: [l] message with backtick
    stopf("When using `id=TRUE` the 'x' data.table must not have a column named 'grouping'.")
    ^~~~~
groupingsets.R:76:5: warning: [l] message with backtick
    warningf("'sets' contains a duplicate (i.e., equivalent up to sorting) element at index %d; as such, there will be duplicate rows in the output -- note that grouping by A,B and B,A will produce the same aggregations. Use `sets=unique(lapply(sets, sort))` to eliminate duplicates.", idx)
    ^~~~~~~~
groupingsets.R:121:5: warning: [l] message with backtick
    stopf("When using `id=TRUE` the 'j' expression must not evaluate to a column named 'grouping'.")
    ^~~~~
groupingsets.R:123:5: warning: [l] message with backtick
    stopf("There exists duplicated column names in the results, ensure the column passed/evaluated in `j` and those in `by` are not overlapping.")
    ^~~~~
merge.R:32:5: warning: [l] message with backtick
    stopf("`by.x` and `by.y` must be of same length.")
    ^~~~~
merge.R:34:5: warning: [l] message with backtick
    warningf("Supplied both `by` and `by.x`/`by.y`. `by` argument will be ignored.")
    ^~~~~~~~
merge.R:37:7: warning: [l] message with backtick
      stopf("A non-empty vector of column names is required for `by.x` and `by.y`.", class="dt_invalid_input_error")
      ^~~~~
merge.R:39:7: warning: [l] message with backtick
      stopf("The following columns listed in `%s` are missing from %s: %s", "by.x", "x", brackify(by.x[!idx]))
      ^~~~~
merge.R:42:7: warning: [l] message with backtick
      stopf("The following columns listed in `%s` are missing from %s: %s", "by.y", "y", brackify(by.y[!idx]))
      ^~~~~
merge.R:54:7: warning: [l] message with backtick
      stopf("A non-empty vector of column names for `by` is required.")
      ^~~~~
merge.R:56:7: warning: [l] message with backtick
      stopf("The following columns listed in `%s` are missing from %s: %s", "by", "x", brackify(by[!idx]))
      ^~~~~
merge.R:59:7: warning: [l] message with backtick
      stopf("The following columns listed in `%s` are missing from %s: %s", "by", "y", brackify(by[!idx]))
      ^~~~~
rowwiseDT.R:4:5: warning: [l] message with backtick
    stopf("Must provide at least one column (use `name=`). See ?rowwiseDT for details")
    ^~~~~
setops.R:197:7: warning: [l] message with backtick
      warningf("Argument 'tolerance' was forced to lowest accepted value `sqrt(.Machine$double.eps)` from provided %s", format(tolerance, scientific=FALSE))
      ^~~~~~~~
xts.R:11:33: warning: [l] message with backtick
  if (index_nm %chin% names(x)) stopf("Input xts object should not have '%s' column because it would result in duplicate column names. Rename '%s' column in xts or use `keep.rownames` to change the index column name.", index_nm, index_nm)
                                ^~~~~
xts.R:21:36: warning: [l] message with backtick
  if (!xts::is.timeBased(x[[1L]])) stopf("data.table must have a time based column in first position, use `setcolorder` function to change the order, or see ?timeBased for supported types")
                                   ^~~~~

Someday, we should make a guide around message styling (how to refer to functions, arguments, R values, etc) and try to enforce it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch, thanks

R/data.table.R Outdated
@@ -762,6 +769,12 @@ replace_dot_alias = function(e) {
# else the NA in ansvals are for join inherited scope (test 1973), and NA could be in irows from join and data in i should be returned (test 1977)
# in both cases leave to the R-level subsetting of i and x together further below
} else if (is.numeric(j)) {
if (!missingby) {
warning(
"`by` or `keyby` is ignored when `j` is a numeric vector used for column selection. ", "Perhaps you intended to use `.SD`? For example: DT[, .SD[, ", deparse(jsub), "], by = ...]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same as above? can this duplication be avoided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried implementing it .

Copy link

codecov bot commented Aug 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@c27ec26). Learn more about missing BASE report.
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #7254   +/-   ##
=========================================
  Coverage          ?   98.79%           
=========================================
  Files             ?       81           
  Lines             ?    15261           
  Branches          ?        0           
=========================================
  Hits              ?    15077           
  Misses            ?      184           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@venom1204 venom1204 requested a review from tdhock August 20, 2025 11:51
R/data.table.R Outdated
if (is.character(j) || (is.numeric(j) && !is.logical(j))) {
if (!missingby) {
j_type = if (is.character(j)) "a character" else "a numeric"
warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have warningf function that could be used here nicely

@venom1204
Copy link
Contributor Author

venom1204 commented Aug 20, 2025

hi @tdhock can you please tell me the further improvements needed for the pr .
thanks

@venom1204 venom1204 requested a review from jangorecki August 29, 2025 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

keyby/key ignored if numeric indices used in j
4 participants