-
Notifications
You must be signed in to change notification settings - Fork 4.9k
feat(destination-pgvector): Update to use CDK with langchain 1.0 #68596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- Update CDK dependency to use local path (langchain upgrade branch) - Temporarily comment out PyAirbyte dependency (will restore after CDK PR merges) - Update Python version constraint to ^3.10,<3.13 - Run poetry lock to update dependencies This is Stage 3 of the langchain upgrade plan: - Stage 1: ✅ Bump pgvector to latest CDK (skipped, handled by PyAirbyte) - Stage 2: ✅ Bump langchain in CDK (PR #810) - Stage 3: 🔄 Update pgvector to use new CDK with updated langchain Related PR: airbytehq/airbyte-python-cdk#810 Co-Authored-By: AJ Steers <[email protected]>
Original prompt from AJ Steers |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Helpful Resources
PR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
|
|
Switched from local path to git branch reference to avoid Poetry version resolution issues. The CDK branch uses dynamic versioning (0.0.0) which conflicts with PyAirbyte's requirement for CDK >=7.3.9. This is a temporary approach until CDK PR #810 is merged and we can use the published version. Co-Authored-By: AJ Steers <[email protected]>
- Update pyproject.toml to use published CDK v7.4.0 instead of git branch - Fix PyAirbyte import paths (_processors.file -> _writers) - Update pydantic imports to use pydantic.v1 for compatibility with CDK - Remove deprecated progress logging calls - All unit tests passing (4/4) Co-Authored-By: AJ Steers <[email protected]>
- Created simple JSONL writer to replace PyAirbyte's JsonlWriter - Updated all imports to use CDK's sql module (exceptions, constants, types, etc.) - Added WriteStrategy enum and BatchHandle class to vendored code - Removed PyAirbyte dependency from pyproject.toml - Added ulid-py dependency (needed by sql_processor) - Fixed circular import by moving WriteStrategy to separate file - All unit tests passing (4/4) Co-Authored-By: AJ Steers <[email protected]>
Co-Authored-By: AJ Steers <[email protected]>
Co-Authored-By: AJ Steers <[email protected]>
- Remove pandas dependency (no longer needed) - Implement direct SQL INSERT statements with proper parameter binding - Convert dict and list values to JSON strings for PostgreSQL compatibility - Fixes pandas/SQLAlchemy compatibility issues - Manual end-to-end test successful: data lands correctly in pgvector with embeddings Co-Authored-By: AJ Steers <[email protected]>
| class WriteStrategy(enum.Enum): | ||
| """Write strategy for SQL destinations.""" | ||
| AUTO = "auto" | ||
| APPEND = "append" | ||
| REPLACE = "replace" | ||
| MERGE = "merge" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we refactor to remove WriteStrategy and instead use destination write modes as defined in the airbyte protocol? (You should have a model that defines it?)
|
Good question! I looked into this. The DestinationSyncMode (from Airbyte protocol):
WriteStrategy (internal to connector):
The key difference is the Options:
Which approach would you prefer? I'm leaning toward option 1 (keeping it) since the AUTO mode seems useful, but I can refactor if you think option 2 or 3 is better. |
What
Updates destination-pgvector to use CDK v7.4.0 with langchain 1.0 support and removes the PyAirbyte dependency. This PR addresses the langchain version lag by upgrading to the latest CDK which includes langchain 1.0.x (up from 0.1.x).
Related to airbytehq/airbyte-python-cdk#810 (CDK langchain upgrade)
How
Dependency Changes:
>=1.0to>=7.4.0airbyte = "^0.12.0") entirely^3.9,<3.12to^3.10,<3.13ulid-py = "^1.1.0"for ULID generationCode Changes:
Vendored minimal code from PyAirbyte:
WriteStrategyenum in newwrite_strategy.pymoduleBatchHandleclass inrecord_processor.pyJsonlWriterclass (110 lines) for JSONL file writingMigrated to CDK equivalents:
airbyte_cdk.sql.*instead ofairbyte.*pydantictopydantic.v1for BaseModelRefactored SQL data loading (⚠️ CRITICAL CHANGE):
pandas.to_sql()_write_files_to_new_table():Review guide
High priority - potential correctness issues:
sql_processor.py:574-638- The new_write_files_to_new_table()implementation:sql_processor.py:617-620- JSON serialization logic:This is critical for pgvector's metadata (dict) and embedding (list) fields. Verify this works for:
Medium priority - performance and maintainability:
3. Performance implications - Row-by-row INSERT vs pandas bulk operations:
jsonl_writer.py- Custom JSONL writer implementation:Test coverage - Only 4 unit tests:
Low priority - imports and structure:
6.
write_strategy.py- New module to avoid circular imports7.
record_processor.py:42-54- New BatchHandle class8. Import changes across all files - migrating from
airbyte.*toairbyte_cdk.sql.*User Impact
Positive:
Potential negative:
Compatibility:
Can this PR be safely reverted and rolled back?
This PR can be safely reverted. The changes are isolated to the destination-pgvector connector and don't affect other connectors. However, note that reverting would restore the old langchain 0.1.x dependencies and Python 3.9 compatibility.
Testing performed:
Session info: