
Conversation


@badesra badesra commented Aug 20, 2025

Adds a new read option spark.bigtable.read.max.versions to limit the number of cell versions returned per column (equivalent to cbt ... cells-per-column=N). This prevents oversized-row read failures when many historical versions exist.

Fixes this error:

Caused by: com.google.cloud.spark.bigtable.repackaged.io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Error while reading table 'projects/***/instances/store/tables/data_lifecycle' : Read returned 269MiB from row '0RGWN8RPA2P3#PRODUCT_DUMP' which exceeds the limit of 256MiB.
Make sure you are setting an appropriate request filter to retrieve only recent versions and only the columns you want. If columns are accumulating more versions than you need to read, you can also create a garbage collection policy: https://cloud.google.com/bigtable/docs/configuring-garbage-collection#versions
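
A minimal usage sketch, assuming the connector's documented catalog/project/instance read options (only spark.bigtable.read.max.versions is new in this PR; the catalog JSON and the other option names are illustrative assumptions):

```scala
// Run in spark-shell, where `spark` is the active SparkSession.
val catalogJson =
  """{"table":{"name":"data_lifecycle"},"rowkey":"key",
    |"columns":{"key":{"cf":"rowkey","col":"key","type":"string"}}}""".stripMargin

val df = spark.read
  .format("bigtable")
  .option("catalog", catalogJson)
  .option("spark.bigtable.project.id", "my-project")   // assumed standard option
  .option("spark.bigtable.instance.id", "store")       // assumed standard option
  .option("spark.bigtable.read.max.versions", "1")     // new: like cbt ... cells-per-column=1
  .load()
```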

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @badesra, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new optional configuration for the Spark Bigtable connector, allowing users to limit the number of cell versions returned per column during reads. This enhancement directly addresses issues with oversized rows in Bigtable, which can lead to FAILED_PRECONDITION errors, by providing a mechanism to fetch only the most relevant data.

Highlights

  • New Configuration Option: Introduced the spark.bigtable.read.max.versions configuration option to control the number of cell versions retrieved per column.
  • Problem Solved: Mitigates FAILED_PRECONDITION errors and performance issues caused by reading excessively large Bigtable rows with many historical cell versions.
  • Implementation Detail: Implemented using Bigtable's FILTERS.limit().cellsPerColumn(N) to ensure efficient server-side filtering of cell versions (see the sketch after this list).
  • Validation: Added validation to ensure the max.versions configuration value is a positive integer.
  • Documentation Update: Updated the README.md with clear usage examples for both Scala and PySpark, demonstrating how to apply the new configuration.
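
As a rough illustration of the implementation and validation highlights above (not the connector's actual code; the helper and its option map are hypothetical), the setting could be parsed and pushed down as a server-side filter like this:

```scala
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS
import com.google.cloud.bigtable.data.v2.models.Query

// Hypothetical helper: validate the option and attach the
// cells-per-column limit to the Bigtable read Query.
def applyMaxVersions(query: Query, options: Map[String, String]): Query =
  options.get("spark.bigtable.read.max.versions") match {
    case Some(raw) =>
      val n = raw.toInt // a non-numeric value fails fast here
      require(n > 0, s"spark.bigtable.read.max.versions must be a positive integer, got: $raw")
      // Bigtable drops older cell versions server-side, shrinking row payloads.
      query.filter(FILTERS.limit().cellsPerColumn(n))
    case None => query
  }
```
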
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature to limit the number of cell versions returned from Bigtable, which helps prevent oversized row errors. The implementation is solid, with good documentation in the README, proper configuration handling, and dedicated unit tests. My main feedback is to ensure the new maxVersions setting is correctly applied in the RDD API, as it currently seems to be ignored.

Collaborator

@mutianf mutianf left a comment


Thanks for the PR! Left some nits, but otherwise LGTM!

@badesra badesra requested a review from mutianf October 7, 2025 18:23
@igorbernstein2
Collaborator

Hi,
I'd like to understand your use case a bit better. For which API would you like to add support for limiting cell versions: DataFrames, RDD, or both?
If you are interested in RDD, then I think I'd prefer to generalize this to have the RDD take a full-blown Bigtable Filter instance instead of exposing a subset using stringly typed options.
If it's just DataFrames, then I think the correct behavior is to hardcode a cells-per-column limit of 1, since DataFrames can't expose multiple values per column (please correct me if I'm wrong). Also, I'd be interested to know whether you have a use case for specifying a generalized filter for the DataFrame APIs as well.

Thanks!

Collaborator

@igorbernstein2 igorbernstein2 left a comment


Waiting on an answer about the use case.

@badesra
Author

badesra commented Oct 14, 2025

> Hi, I'd like to understand your use case a bit better. For which API would you like to add support for limiting cell versions: DataFrames, RDD, or both? If you are interested in RDD, then I think I'd prefer to generalize this to have the RDD take a full-blown Bigtable Filter instance instead of exposing a subset using stringly typed options. If it's just DataFrames, then I think the correct behavior is to hardcode a cells-per-column limit of 1, since DataFrames can't expose multiple values per column (please correct me if I'm wrong). Also, I'd be interested to know whether you have a use case for specifying a generalized filter for the DataFrame APIs as well.
>
> Thanks!

Sorry, I was away for the last few days. We would like to add support for DataFrames only, as we are not using RDDs.

Use case: Our current Bigtable implementation has accumulated many historical cell versions over time. Without server-side filtering, rows exceed Bigtable's size limits, causing errors like: Error while reading table 'projects/***/instances/store/tables/data_lifecycle': Read returned 269MiB from row. You are correct that DataFrames can't expose multiple versions. However, without the FILTERS.limit().cellsPerColumn() filter applied server-side, Bigtable still tries to return ALL versions, causing the 269MiB error before the data reaches Spark. I believe the HBase connector already hardcodes this as a server-side filter to return just the latest version.
I wanted to maintain consistency between the DataFrame and RDD APIs, so I added this stringly typed option, but after reading your comment I feel it would be easier to just hardcode this in the DataFrame API.
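
For what it's worth, hardcoding the latest-version behavior could look roughly like the sketch below. The filter builders come from the Bigtable Java client; userFilter and the surrounding plumbing are assumptions for illustration.

```scala
import com.google.cloud.bigtable.data.v2.models.Filters
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS

// Cap cells per column at 1 so Bigtable trims historical versions
// server-side, before the per-row response size limit is checked.
val latestOnly = FILTERS.limit().cellsPerColumn(1)

// Hypothetical: merge with any user-supplied filter via a chain.
def withLatestOnly(userFilter: Option[Filters.Filter]): Filters.Filter =
  userFilter match {
    case Some(f) => FILTERS.chain().filter(f).filter(latestOnly)
    case None    => latestOnly
  }
```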

@igorbernstein2
Collaborator

igorbernstein2 commented Oct 15, 2025

I should've mentioned this earlier, but first of all, thank you for contributing, and I apologize for the feature gap and suboptimal state of the connector.

I'm not a spark expert so please let me know if this is misguided, but I think it would be better to decouple this into 2-3 features:

  1. DataFrame always appends a max cells-per-column = 1 filter to the filter list - solves the immediate problem
  2. RDD allows passing a Filter during construction - could be used as an implementation detail for feature 1
  3. DataFrame takes an option with a stringified representation of the filter (either proto text format or JSON proto format) - aligns the feature set of DataFrames & RDDs
    3b. Add a utility to the connector to convert a Filter into a string to be parsed - ergonomics (sketched below)
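
If it helps the discussion, 3b could look roughly like this. Filter.toProto() and protobuf's TextFormat are existing APIs, while the helper object itself is hypothetical:

```scala
import com.google.bigtable.v2.RowFilter
import com.google.cloud.bigtable.data.v2.models.Filters
import com.google.protobuf.TextFormat

// Hypothetical round-trip helpers for passing a Filter as a string option.
object FilterStrings {
  // Filter -> proto text format (human-readable, diff-friendly).
  def toText(filter: Filters.Filter): String =
    TextFormat.printer().printToString(filter.toProto)

  // Proto text format -> RowFilter proto; the connector would attach
  // this to the read request it builds for each partition.
  def parse(text: String): RowFilter = {
    val builder = RowFilter.newBuilder()
    TextFormat.merge(text, builder)
    builder.build()
  }
}
```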

My main hesitation with the current approach is that it seems like it's re-creating a subset of the Bigtable filter API as a flattened bag of strings.

Looking forward to hearing your thoughts

@mutianf
Collaborator

mutianf commented Oct 16, 2025

Closing this PR in favor of #97

@mutianf mutianf closed this Oct 16, 2025
