Added Optional cell version limiting #91
Conversation
Summary of Changes
Hello @badesra, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new optional configuration for the Spark Bigtable connector, allowing users to limit the number of cell versions returned per column during reads. This enhancement directly addresses issues with oversized rows in Bigtable, which can lead to FAILED_PRECONDITION errors, by providing a mechanism to fetch only the most relevant data.
Highlights
- New Configuration Option: Introduced the spark.bigtable.read.max.versions configuration option to control the number of cell versions retrieved per column.
- Problem Solved: Mitigates FAILED_PRECONDITION errors and performance issues caused by reading excessively large Bigtable rows with many historical cell versions.
- Implementation Detail: Implemented using Bigtable's FILTERS.limit().cellsPerColumn(N) for efficient server-side filtering of cell versions (see the sketch after this list).
- Validation: Added validation to ensure the max.versions configuration value is a positive integer.
- Documentation Update: Updated the README.md with clear usage examples for both Scala and PySpark, demonstrating how to apply the new configuration.
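To make the new option concrete, here is a minimal read sketch. The catalog structure, project/instance IDs, and table name are illustrative placeholders; spark.bigtable.read.max.versions is the option added by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("max-versions-example").getOrCreate()

// Catalog JSON mapping Bigtable columns to DataFrame fields (structure illustrative).
val catalog =
  """{
    |  "table": {"name": "my-table"},
    |  "rowkey": "key",
    |  "columns": {
    |    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
    |    "value": {"cf": "cf1",    "col": "value", "type": "string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .format("bigtable")
  .option("catalog", catalog)
  .option("spark.bigtable.project.id", "my-project")   // illustrative
  .option("spark.bigtable.instance.id", "my-instance") // illustrative
  .option("spark.bigtable.read.max.versions", "1")     // option added by this PR
  .load()

df.show()
```

The positive-integer validation mentioned above might look roughly like this in the connector's configuration handling; the constant and function names here are hypothetical:

```scala
// Hypothetical constant name; the key string itself comes from this PR.
val BIGTABLE_READ_MAX_VERSIONS = "spark.bigtable.read.max.versions"

def parseMaxVersions(parameters: Map[String, String]): Option[Int] =
  parameters.get(BIGTABLE_READ_MAX_VERSIONS).map { value =>
    val parsed =
      try value.toInt
      catch {
        case _: NumberFormatException =>
          throw new IllegalArgumentException(
            s"$BIGTABLE_READ_MAX_VERSIONS must be an integer, got: $value")
      }
    require(parsed > 0, s"$BIGTABLE_READ_MAX_VERSIONS must be a positive integer")
    parsed
  }
```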
Code Review
This pull request introduces a valuable feature to limit the number of cell versions returned from Bigtable, which helps prevent oversized row errors. The implementation is solid, with good documentation in the README, proper configuration handling, and dedicated unit tests. My main feedback is to ensure the new maxVersions setting is correctly applied in the RDD API, as it currently seems to be ignored.
mutianf left a comment
Thanks for the PR! Left some nits, but otherwise LGTM!
Review threads (resolved, marked outdated) on:
- spark-bigtable-core/src/main/scala/com/google/cloud/spark/bigtable/BigtableDefaultSource.scala (2 threads)
- ...able-core/src/main/scala/com/google/cloud/spark/bigtable/datasources/BigtableSparkConf.scala
Hi, thanks!
igorbernstein2 left a comment
Waiting on an answer about the use case.
Sorry, I was away for the last few days. We would like to add support for DataFrames only, as we are not using RDDs. Use case: our current Bigtable implementation has accumulated many historical cell versions over time. Without server-side filtering, rows exceed Bigtable's size limits, causing errors:
I should've mentioned this earlier, but first of all, thank you for contributing, and I apologize for the feature gap and suboptimal state of the connector. I'm not a Spark expert, so please let me know if this is misguided, but I think it would be better to decouple this into 2-3 features:
My main hesitation with the current approach is that it seems like it's re-creating a subset of the Bigtable filter API as a flattened bag of strings. Looking forward to hearing your thoughts.
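For context on the filter-API point: in the Bigtable Java client, filters are composable objects rather than independent string settings, so one-off per-feature options can't easily express combinations. A minimal sketch of that composability (the specific filter choices are illustrative):

```scala
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS

// Filters compose natively: restrict to one column family AND keep only
// the newest cell per column, as a single chained filter.
val composed = FILTERS
  .chain()
  .filter(FILTERS.family().exactMatch("cf1"))
  .filter(FILTERS.limit().cellsPerColumn(1))
```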
Closing this PR in favor of #97
Adds a new read option, spark.bigtable.read.max.versions, to limit the number of cell versions returned per column (equivalent to cbt ... cells-per-column=N). This prevents oversized rows when many historical versions exist.
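The summary above names FILTERS.limit().cellsPerColumn(N) as the mechanism, so the server-side behavior is roughly equivalent to this standalone sketch against the Bigtable Java client (table, project, and instance IDs are illustrative):

```scala
import com.google.cloud.bigtable.data.v2.BigtableDataClient
import com.google.cloud.bigtable.data.v2.models.Filters.FILTERS
import com.google.cloud.bigtable.data.v2.models.Query

val maxVersions = 1 // value carried by spark.bigtable.read.max.versions

// Ask Bigtable to return only the newest `maxVersions` cells per column,
// so rows bloated with historical versions are trimmed server-side.
val query = Query
  .create("my-table") // illustrative table id
  .filter(FILTERS.limit().cellsPerColumn(maxVersions))

val client = BigtableDataClient.create("my-project", "my-instance") // illustrative
try {
  client.readRows(query).forEach(row => println(row.getKey.toStringUtf8))
} finally {
  client.close()
}
```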
Fixes this error: