
Conversation


@badesra badesra commented Aug 20, 2025

Adds a new read option spark.bigtable.read.max.versions to limit the number of cell versions returned per column (equivalent to cbt ... cells-per-column=N). This prevents oversized-row read failures when many historical versions exist.

Fixes this error:

Caused by: com.google.cloud.spark.bigtable.repackaged.io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Error while reading table 'projects/***/instances/store/tables/data_lifecycle' : Read returned 269MiB from row '0RGWN8RPA2P3#PRODUCT_DUMP' which exceeds the limit of 256MiB.
Make sure you are setting an appropriate request filter to retrieve only recent versions and only the columns you want. If columns are accumulating more versions than you need to read, you can also create a garbage collection policy: https://cloud.google.com/bigtable/docs/configuring-garbage-collection#versions
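
A minimal usage sketch, assuming the connector's documented catalog/project/instance read options (only spark.bigtable.read.max.versions is new in this PR; the catalog JSON and the other option names are illustrative assumptions):

```scala
// Run in spark-shell, where `spark` is the active SparkSession.
val catalogJson =
  """{"table":{"name":"data_lifecycle"},"rowkey":"key",
    |"columns":{"key":{"cf":"rowkey","col":"key","type":"string"}}}""".stripMargin

val df = spark.read
  .format("bigtable")
  .option("catalog", catalogJson)
  .option("spark.bigtable.project.id", "my-project")   // assumed standard option
  .option("spark.bigtable.instance.id", "store")       // assumed standard option
  .option("spark.bigtable.read.max.versions", "1")     // new: like cbt ... cells-per-column=1
  .load()
```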

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @badesra, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new optional configuration for the Spark Bigtable connector, allowing users to limit the number of cell versions returned per column during reads. This enhancement directly addresses issues with oversized rows in Bigtable, which can lead to FAILED_PRECONDITION errors, by providing a mechanism to fetch only the most relevant data.

Highlights

  • New Configuration Option: Introduced the spark.bigtable.read.max.versions configuration option to control the number of cell versions retrieved per column.
  • Problem Solved: Mitigates FAILED_PRECONDITION errors and performance issues caused by reading excessively large Bigtable rows with many historical cell versions.
  • Implementation Detail: Implemented using Bigtable's FILTERS.limit().cellsPerColumn(N) to ensure efficient server-side filtering of cell versions (see the sketch after this list).
  • Validation: Added validation to ensure the max.versions configuration value is a positive integer.
  • Documentation Update: Updated the README.md with clear usage examples for both Scala and PySpark, demonstrating how to apply the new configuration.
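
As a rough illustration of the implementation and validation highlights above (not the connector's actual code; the helper and its option map are hypothetical), the setting could be parsed and pushed down as a server-side filter like this:

```scala
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS
import com.google.cloud.bigtable.data.v2.models.Query

// Hypothetical helper: validate the option and attach the
// cells-per-column limit to the Bigtable read Query.
def applyMaxVersions(query: Query, options: Map[String, String]): Query =
  options.get("spark.bigtable.read.max.versions") match {
    case Some(raw) =>
      val n = raw.toInt // a non-numeric value fails fast here
      require(n > 0, s"spark.bigtable.read.max.versions must be a positive integer, got: $raw")
      // Bigtable drops older cell versions server-side, shrinking row payloads.
      query.filter(FILTERS.limit().cellsPerColumn(n))
    case None => query
  }
```
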
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature to limit the number of cell versions returned from Bigtable, which helps prevent oversized row errors. The implementation is solid, with good documentation in the README, proper configuration handling, and dedicated unit tests. My main feedback is to ensure the new maxVersions setting is correctly applied in the RDD API, as it currently seems to be ignored.

Collaborator

@mutianf mutianf left a comment


Thanks for the PR! Left some nits, but otherwise LGTM!

@badesra badesra requested a review from mutianf October 7, 2025 18:23
@igorbernstein2
Collaborator

Hi,
I'd like to understand your use case a bit better. For which API would you like to add support for limiting cell versions: DataFrames, RDD, or both?
If you are interested in RDD, then I think I'd prefer to generalize this to have the RDD take a full-blown Bigtable Filter instance instead of exposing a subset using stringly typed options.
If it's just DataFrames, then I think the correct behavior is to hardcode a cells-per-column limit of 1, since DataFrames can't expose multiple values per column (please correct me if I'm wrong). Also, I'd be interested to know whether you have a use case for specifying a generalized filter for the DataFrame APIs as well.

Thanks!

Collaborator

@igorbernstein2 igorbernstein2 left a comment


Waiting on an answer about the use case.

@badesra
Author

badesra commented Oct 14, 2025

> Hi, I'd like to understand your use case a bit better. For which API would you like to add support for limiting cell versions: DataFrames, RDD, or both? If you are interested in RDD, then I think I'd prefer to generalize this to have the RDD take a full-blown Bigtable Filter instance instead of exposing a subset using stringly typed options. If it's just DataFrames, then I think the correct behavior is to hardcode a cells-per-column limit of 1, since DataFrames can't expose multiple values per column (please correct me if I'm wrong). Also, I'd be interested to know whether you have a use case for specifying a generalized filter for the DataFrame APIs as well.
>
> Thanks!

Sorry, I was away for the last few days. We would like to add support for DataFrames only, as we are not using RDDs.

Use case: Our current Bigtable implementation has accumulated many historical cell versions over time. Without server-side filtering, rows exceed Bigtable's size limits, causing errors like: Error while reading table 'projects/***/instances/store/tables/data_lifecycle': Read returned 269MiB from row. You are correct that DataFrames can't expose multiple versions. However, without the FILTERS.limit().cellsPerColumn() filter applied server-side, Bigtable still tries to return ALL versions, causing the 269MiB error before the data reaches Spark. I believe the HBase connector already hardcodes this as a server-side filter to return just the latest version.
I wanted to maintain consistency between the DataFrame and RDD APIs, so I added this stringly typed option, but after reading your comment I feel it would be easier to just hardcode this in the DataFrame API.
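
For what it's worth, hardcoding the latest-version behavior could look roughly like the sketch below. The filter builders come from the Bigtable Java client; userFilter and the surrounding plumbing are assumptions for illustration.

```scala
import com.google.cloud.bigtable.data.v2.models.Filters
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS

// Cap cells per column at 1 so Bigtable trims historical versions
// server-side, before the per-row response size limit is checked.
val latestOnly = FILTERS.limit().cellsPerColumn(1)

// Hypothetical: merge with any user-supplied filter via a chain.
def withLatestOnly(userFilter: Option[Filters.Filter]): Filters.Filter =
  userFilter match {
    case Some(f) => FILTERS.chain().filter(f).filter(latestOnly)
    case None    => latestOnly
  }
```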

@igorbernstein2
Collaborator

igorbernstein2 commented Oct 15, 2025

I should've mentioned this earlier, but first of all, thank you for contributing, and I apologize for the feature gap and suboptimal state of the connector.

I'm not a spark expert so please let me know if this is misguided, but I think it would be better to decouple this into 2-3 features:

  1. DataFrame always appends a max cells-per-column = 1 filter to the filter list - solves the immediate problem
  2. RDD allows passing a Filter during construction - could be used as an implementation detail for feature 1
  3. DataFrame takes an option with a stringified representation of the filter (either proto text format or JSON proto format) - aligns the feature set of DataFrames & RDDs
    3b. Add a utility to the connector to convert a Filter into a string to be parsed - ergonomics (sketched below)
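
If it helps the discussion, 3b could look roughly like this. Filter.toProto() and protobuf's TextFormat are existing APIs, while the helper object itself is hypothetical:

```scala
import com.google.bigtable.v2.RowFilter
import com.google.cloud.bigtable.data.v2.models.Filters
import com.google.protobuf.TextFormat

// Hypothetical round-trip helpers for passing a Filter as a string option.
object FilterStrings {
  // Filter -> proto text format (human-readable, diff-friendly).
  def toText(filter: Filters.Filter): String =
    TextFormat.printer().printToString(filter.toProto)

  // Proto text format -> RowFilter proto; the connector would attach
  // this to the read request it builds for each partition.
  def parse(text: String): RowFilter = {
    val builder = RowFilter.newBuilder()
    TextFormat.merge(text, builder)
    builder.build()
  }
}
```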

My main hesitation with the current approach is that it seems like it's re-creating a subset of the Bigtable filter API as a flattened bag of strings.

Looking forward to hearing your thoughts

@mutianf
Collaborator

mutianf commented Oct 16, 2025

Closing this PR in favor of #97

@mutianf mutianf closed this Oct 16, 2025
