-
Notifications
You must be signed in to change notification settings - Fork 22
Add pokemon data source #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
WalkthroughThis update introduces a new data source, Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Spark
participant PokemonDataSource
participant PokemonDataReader
participant PokeAPI
User->>Spark: Load DataFrame using PokemonDataSource
Spark->>PokemonDataSource: Instantiate and configure
PokemonDataSource->>PokemonDataReader: Create reader with options
PokemonDataReader->>PokeAPI: Fetch list of resources (e.g., /pokemon)
loop For each resource
PokemonDataReader->>PokeAPI: Fetch resource details (e.g., /pokemon/{id})
PokeAPI-->>PokemonDataReader: Return resource data
end
PokemonDataReader-->>Spark: Yield rows as per schema
Spark-->>User: DataFrame with Pokémon data
Poem
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (9)
docs/index.md (1)
39-39: The spelling of "Pokémon" should include the accent mark.For consistency with official branding, consider updating the spelling to include the accent mark on the "e" in all instances of "Pokémon" and "PokéAPI" throughout the documentation.
-| [PokemonDataSource](./datasources/pokemon.md) | `pokemon` | Read Pokemon data from the PokeAPI | None | +| [PokémonDataSource](./datasources/pokemon.md) | `pokemon` | Read Pokémon data from the PokéAPI | None |🧰 Tools
🪛 LanguageTool
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ... None | | PokemonDataSource |pokemon| Read P...(POKEMON)
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ...](./datasources/pokemon.md) |pokemon| Read Pokemon data from the Poke...(POKEMON)
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ...n.md) |pokemon| Read Pokemon data from the PokeAPI | None...(POKEMON)
docs/datasources/pokemon.md (1)
1-5: Documentation is concise but could benefit from consistency in branding.The documentation provides sufficient information but should maintain consistent branding with the official Pokémon name.
-# PokemonDataSource +# PokémonDataSource -> Uses the public [PokeAPI](https://pokeapi.co/) to retrieve Pokemon data. No API key required. +> Uses the public [PokéAPI](https://pokeapi.co/) to retrieve Pokémon data. No API key required. -::: pyspark_datasources.pokemon.PokemonDataSource +::: pyspark_datasources.pokemon.PokemonDataSource🧰 Tools
🪛 LanguageTool
[grammar] ~3-~3: The name of this game or TV show is spelled with an accent.
Context: ...keAPI](https://pokeapi.co/) to retrieve Pokemon data. No API key required. ::: pyspark...(POKEMON)
[grammar] ~5-~5: The name of this game or TV show is spelled with an accent.
Context: ... key required. ::: pyspark_datasources.pokemon.PokemonDataSource(POKEMON)
tests/test_pokemon.py (2)
1-6: Remove unused import.The
jsonmodule is imported but never used in this test file.import unittest from unittest.mock import patch, MagicMock -import json from pyspark_datasources.pokemon import PokemonDataSource, PokemonDataReader, PokemonPartition🧰 Tools
🪛 Ruff (0.8.2)
3-3:
jsonimported but unusedRemove unused import:
json(F401)
31-83: Good unit test with proper mocking, but consider testing more endpoints.The test effectively mocks the API responses and verifies the correct data parsing for the "pokemon" endpoint. However, it would be beneficial to also add tests for the "type" endpoint and the generic endpoint handler.
Consider adding additional test methods like:
@patch('pyspark_datasources.pokemon.requests.Session') def test_read_type(self, mock_session): # Mock response for the list endpoint mock_response_list = MagicMock() mock_response_list.json.return_value = { "results": [ {"name": "grass", "url": "https://pokeapi.co/api/v2/type/12/"} ] } # Mock response for the individual type mock_response_type = MagicMock() mock_response_type.json.return_value = { "id": 12, "name": "grass", "pokemon": [ {"pokemon": {"name": "bulbasaur"}}, {"pokemon": {"name": "ivysaur"}} ] } # Set up the session mock mock_session_instance = mock_session.return_value mock_session_instance.get.side_effect = [mock_response_list, mock_response_type] # Create reader and partition schema = None # Not used in the test reader = PokemonDataReader(schema, {"endpoint": "type", "limit": 1}) partition = PokemonPartition("type", 1, 0) # Get results results = list(reader.read(partition)) # Verify the results self.assertEqual(len(results), 1) type_data = results[0] self.assertEqual(type_data[0], 12) # id self.assertEqual(type_data[1], "grass") # name self.assertEqual(type_data[2], ["bulbasaur", "ivysaur"]) # pokemon # Verify the correct URLs were called mock_session_instance.get.assert_any_call( "https://pokeapi.co/api/v2/type?limit=1&offset=0" ) mock_session_instance.get.assert_any_call( "https://pokeapi.co/api/v2/type/12/" )pyspark_datasources/pokemon.py (5)
1-6: Clean up unused imports.Several imported types from
pyspark.sql.typesare not directly used in the code:
- StructField
- StringType
- IntegerType
- ArrayType
While they are referenced conceptually in the schema strings, they're not used as actual type objects.
import json import requests from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition -from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType +from pyspark.sql.types import StructType🧰 Tools
🪛 Ruff (0.8.2)
5-5:
pyspark.sql.types.StructFieldimported but unusedRemove unused import
(F401)
5-5:
pyspark.sql.types.StringTypeimported but unusedRemove unused import
(F401)
5-5:
pyspark.sql.types.IntegerTypeimported but unusedRemove unused import
(F401)
5-5:
pyspark.sql.types.ArrayTypeimported but unusedRemove unused import
(F401)
80-88: Consider adding input validation for limit and offset.The code converts limit and offset to integers but doesn't validate that they are positive values. Adding validation would help prevent potential API errors or unexpected behavior.
def __init__(self, schema: StructType, options: dict): self.schema = schema self.options = options self.endpoint = options.get("endpoint", "pokemon") - self.limit = int(options.get("limit", 20)) - self.offset = int(options.get("offset", 0)) + self.limit = max(1, int(options.get("limit", 20))) + self.offset = max(0, int(options.get("offset", 0))) self.base_url = "https://pokeapi.co/api/v2"
138-144: Consider adding retry logic for transient API failures.API calls can sometimes fail temporarily due to network issues or rate limiting. Adding retry logic would make the data source more robust.
@staticmethod def _fetch_pokemon(url, session): """Fetch detailed Pokemon data""" - response = session.get(url) - response.raise_for_status() - return response.json() + for attempt in range(3): # Try up to 3 times + try: + response = session.get(url) + response.raise_for_status() + return response.json() + except requests.RequestException as e: + if attempt == 2: # Last attempt + raise + # Wait a bit before retrying + import time + time.sleep(1)
85-87: Add exception handling for numeric conversion.The code assumes that "limit" and "offset" parameters can be converted to integers. If a user provides non-numeric values, this will raise a ValueError. Consider adding try/except blocks to handle this gracefully.
- self.limit = int(options.get("limit", 20)) - self.offset = int(options.get("offset", 0)) + try: + self.limit = int(options.get("limit", 20)) + except (ValueError, TypeError): + self.limit = 20 + try: + self.offset = int(options.get("offset", 0)) + except (ValueError, TypeError): + self.offset = 0
96-101: Consider caching or pagination for large result sets.The current implementation fetches all results in a single request and processes them in memory. For large result sets, this could lead to memory issues or timeouts. Consider implementing pagination for better resource management.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/datasources/pokemon.md(1 hunks)docs/index.md(1 hunks)pyspark_datasources/__init__.py(1 hunks)pyspark_datasources/pokemon.py(1 hunks)tests/test_pokemon.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
pyspark_datasources/__init__.py (1)
pyspark_datasources/pokemon.py (1)
PokemonDataSource(8-70)
tests/test_pokemon.py (1)
pyspark_datasources/pokemon.py (7)
PokemonDataSource(8-70)PokemonDataReader(80-150)PokemonPartition(73-77)name(54-55)schema(57-67)reader(69-70)read(93-136)
🪛 LanguageTool
docs/index.md
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ... None | | PokemonDataSource | pokemon | Read P...
(POKEMON)
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ...](./datasources/pokemon.md) | pokemon | Read Pokemon data from the Poke...
(POKEMON)
[grammar] ~39-~39: The name of this game or TV show is spelled with an accent.
Context: ...n.md) | pokemon | Read Pokemon data from the PokeAPI | None...
(POKEMON)
docs/datasources/pokemon.md
[grammar] ~3-~3: The name of this game or TV show is spelled with an accent.
Context: ...keAPI](https://pokeapi.co/) to retrieve Pokemon data. No API key required. ::: pyspark...
(POKEMON)
[grammar] ~5-~5: The name of this game or TV show is spelled with an accent.
Context: ... key required. ::: pyspark_datasources.pokemon.PokemonDataSource
(POKEMON)
🪛 Ruff (0.8.2)
pyspark_datasources/__init__.py
7-7: .pokemon.PokemonDataSource imported but unused; consider removing, adding to __all__, or using a redundant alias
(F401)
pyspark_datasources/pokemon.py
5-5: pyspark.sql.types.StructField imported but unused
Remove unused import
(F401)
5-5: pyspark.sql.types.StringType imported but unused
Remove unused import
(F401)
5-5: pyspark.sql.types.IntegerType imported but unused
Remove unused import
(F401)
5-5: pyspark.sql.types.ArrayType imported but unused
Remove unused import
(F401)
tests/test_pokemon.py
3-3: json imported but unused
Remove unused import: json
(F401)
🔇 Additional comments (6)
pyspark_datasources/__init__.py (1)
7-7:❓ Verification inconclusive
LGTM, a proper import for the new data source class.
The import is correctly added, making the PokemonDataSource available from the package namespace, following the same pattern as other data source classes.
🏁 Script executed:
#!/bin/bash # Let's check if __all__ is defined anywhere in the file to confirm this is the correct pattern grep -n "__all__" pyspark_datasources/__init__.pyLength of output: 49
PokemonDataSource Import Verified – Please Confirm all Usage
The import statement inpyspark_datasources/__init__.pycorrectly exposesPokemonDataSourcefrom the package namespace, following the established pattern for data source classes. Note that the grep check for__all__returned no output, which appears to be consistent with the current file design—but please manually verify that omitting an__all__definition here is intentional and aligns with our project conventions.🧰 Tools
🪛 Ruff (0.8.2)
7-7:
.pokemon.PokemonDataSourceimported but unused; consider removing, adding to__all__, or using a redundant alias(F401)
tests/test_pokemon.py (2)
8-29: LGTM! Good coverage of schema generation for different endpoints.The test class thoroughly verifies the schema generation functionality for various endpoints, ensuring the correct schema is returned in each case.
85-86: LGTM!Standard unittest main block for running the tests directly.
pyspark_datasources/pokemon.py (3)
8-71: The data source implementation looks good!The
PokemonDataSourceclass correctly implements the required methods for a PySpark data source. The documentation is excellent with clear examples and explanations of all options. The schema method handles different endpoints appropriately.
73-78: LGTM - Simple and effective partition implementation.The
PokemonPartitionclass correctly implements theInputPartitioninterface with the necessary fields for endpoint, limit, and offset.
89-92: LGTM - Simple partitioning strategy.The single partition approach is appropriate for this API-based data source. It keeps the implementation straightforward while still adhering to the PySpark data source interface.
| def read(self, partition: PokemonPartition): | ||
| session = requests.Session() | ||
|
|
||
| # Fetch list of resources | ||
| url = f"{self.base_url}/{partition.endpoint}?limit={partition.limit}&offset={partition.offset}" | ||
| response = session.get(url) | ||
| response.raise_for_status() | ||
| results = response.json()["results"] | ||
|
|
||
| # Process based on endpoint type | ||
| if partition.endpoint == "pokemon": | ||
| for result in results: | ||
| pokemon_data = self._fetch_pokemon(result["url"], session) | ||
| abilities = [ability["ability"]["name"] for ability in pokemon_data["abilities"]] | ||
|
|
||
| yield ( | ||
| pokemon_data["id"], | ||
| pokemon_data["name"], | ||
| pokemon_data["height"], | ||
| pokemon_data["weight"], | ||
| abilities | ||
| ) | ||
|
|
||
| elif partition.endpoint == "type": | ||
| for result in results: | ||
| type_data = self._fetch_resource(result["url"], session) | ||
| pokemon_names = [pokemon["pokemon"]["name"] for pokemon in type_data["pokemon"]] | ||
|
|
||
| yield ( | ||
| type_data["id"], | ||
| type_data["name"], | ||
| pokemon_names | ||
| ) | ||
|
|
||
| else: | ||
| # Generic handler for other endpoints | ||
| for result in results: | ||
| resource_data = self._fetch_resource(result["url"], session) | ||
|
|
||
| yield ( | ||
| resource_data.get("id", 0), | ||
| resource_data.get("name", ""), | ||
| json.dumps(resource_data) | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add session closing and improve error handling.
There are two aspects that could be improved in the read method:
- The requests session should be properly closed at the end to avoid resource leaks
- Consider adding better error handling for API request failures
def read(self, partition: PokemonPartition):
session = requests.Session()
- # Fetch list of resources
- url = f"{self.base_url}/{partition.endpoint}?limit={partition.limit}&offset={partition.offset}"
- response = session.get(url)
- response.raise_for_status()
- results = response.json()["results"]
-
- # Process based on endpoint type
- if partition.endpoint == "pokemon":
- for result in results:
- pokemon_data = self._fetch_pokemon(result["url"], session)
- abilities = [ability["ability"]["name"] for ability in pokemon_data["abilities"]]
-
- yield (
- pokemon_data["id"],
- pokemon_data["name"],
- pokemon_data["height"],
- pokemon_data["weight"],
- abilities
- )
-
- elif partition.endpoint == "type":
- for result in results:
- type_data = self._fetch_resource(result["url"], session)
- pokemon_names = [pokemon["pokemon"]["name"] for pokemon in type_data["pokemon"]]
-
- yield (
- type_data["id"],
- type_data["name"],
- pokemon_names
- )
-
- else:
- # Generic handler for other endpoints
- for result in results:
- resource_data = self._fetch_resource(result["url"], session)
-
- yield (
- resource_data.get("id", 0),
- resource_data.get("name", ""),
- json.dumps(resource_data)
- )
+ try:
+ # Fetch list of resources
+ url = f"{self.base_url}/{partition.endpoint}?limit={partition.limit}&offset={partition.offset}"
+ response = session.get(url)
+ response.raise_for_status()
+ results = response.json()["results"]
+
+ # Process based on endpoint type
+ if partition.endpoint == "pokemon":
+ for result in results:
+ pokemon_data = self._fetch_pokemon(result["url"], session)
+ abilities = [ability["ability"]["name"] for ability in pokemon_data["abilities"]]
+
+ yield (
+ pokemon_data["id"],
+ pokemon_data["name"],
+ pokemon_data["height"],
+ pokemon_data["weight"],
+ abilities
+ )
+
+ elif partition.endpoint == "type":
+ for result in results:
+ type_data = self._fetch_resource(result["url"], session)
+ pokemon_names = [pokemon["pokemon"]["name"] for pokemon in type_data["pokemon"]]
+
+ yield (
+ type_data["id"],
+ type_data["name"],
+ pokemon_names
+ )
+
+ else:
+ # Generic handler for other endpoints
+ for result in results:
+ resource_data = self._fetch_resource(result["url"], session)
+
+ yield (
+ resource_data.get("id", 0),
+ resource_data.get("name", ""),
+ json.dumps(resource_data)
+ )
+ except requests.RequestException as e:
+ raise RuntimeError(f"Error fetching data from PokeAPI: {str(e)}")
+ finally:
+ session.close()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def read(self, partition: PokemonPartition): | |
| session = requests.Session() | |
| # Fetch list of resources | |
| url = f"{self.base_url}/{partition.endpoint}?limit={partition.limit}&offset={partition.offset}" | |
| response = session.get(url) | |
| response.raise_for_status() | |
| results = response.json()["results"] | |
| # Process based on endpoint type | |
| if partition.endpoint == "pokemon": | |
| for result in results: | |
| pokemon_data = self._fetch_pokemon(result["url"], session) | |
| abilities = [ability["ability"]["name"] for ability in pokemon_data["abilities"]] | |
| yield ( | |
| pokemon_data["id"], | |
| pokemon_data["name"], | |
| pokemon_data["height"], | |
| pokemon_data["weight"], | |
| abilities | |
| ) | |
| elif partition.endpoint == "type": | |
| for result in results: | |
| type_data = self._fetch_resource(result["url"], session) | |
| pokemon_names = [pokemon["pokemon"]["name"] for pokemon in type_data["pokemon"]] | |
| yield ( | |
| type_data["id"], | |
| type_data["name"], | |
| pokemon_names | |
| ) | |
| else: | |
| # Generic handler for other endpoints | |
| for result in results: | |
| resource_data = self._fetch_resource(result["url"], session) | |
| yield ( | |
| resource_data.get("id", 0), | |
| resource_data.get("name", ""), | |
| json.dumps(resource_data) | |
| ) | |
| def read(self, partition: PokemonPartition): | |
| session = requests.Session() | |
| try: | |
| # Fetch list of resources | |
| url = f"{self.base_url}/{partition.endpoint}?limit={partition.limit}&offset={partition.offset}" | |
| response = session.get(url) | |
| response.raise_for_status() | |
| results = response.json()["results"] | |
| # Process based on endpoint type | |
| if partition.endpoint == "pokemon": | |
| for result in results: | |
| pokemon_data = self._fetch_pokemon(result["url"], session) | |
| abilities = [ability["ability"]["name"] for ability in pokemon_data["abilities"]] | |
| yield ( | |
| pokemon_data["id"], | |
| pokemon_data["name"], | |
| pokemon_data["height"], | |
| pokemon_data["weight"], | |
| abilities | |
| ) | |
| elif partition.endpoint == "type": | |
| for result in results: | |
| type_data = self._fetch_resource(result["url"], session) | |
| pokemon_names = [pokemon["pokemon"]["name"] for pokemon in type_data["pokemon"]] | |
| yield ( | |
| type_data["id"], | |
| type_data["name"], | |
| pokemon_names | |
| ) | |
| else: | |
| # Generic handler for other endpoints | |
| for result in results: | |
| resource_data = self._fetch_resource(result["url"], session) | |
| yield ( | |
| resource_data.get("id", 0), | |
| resource_data.get("name", ""), | |
| json.dumps(resource_data) | |
| ) | |
| except requests.RequestException as e: | |
| raise RuntimeError(f"Error fetching data from PokeAPI: {str(e)}") | |
| finally: | |
| session.close() |
| @staticmethod | ||
| def _fetch_resource(url, session): | ||
| """Fetch any resource data""" | ||
| response = session.get(url) | ||
| response.raise_for_status() | ||
| return response.json() No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
DRY: Combine fetch methods to eliminate code duplication.
The _fetch_pokemon and _fetch_resource methods are identical. Consider combining them into a single helper method to avoid code duplication.
- @staticmethod
- def _fetch_pokemon(url, session):
- """Fetch detailed Pokemon data"""
- response = session.get(url)
- response.raise_for_status()
- return response.json()
-
- @staticmethod
- def _fetch_resource(url, session):
- """Fetch any resource data"""
- response = session.get(url)
- response.raise_for_status()
- return response.json()
+ @staticmethod
+ def _fetch_resource(url, session):
+ """Fetch data from the PokeAPI"""
+ response = session.get(url)
+ response.raise_for_status()
+ return response.json()Then update the read method to use only _fetch_resource:
for result in results:
- pokemon_data = self._fetch_pokemon(result["url"], session)
+ pokemon_data = self._fetch_resource(result["url"], session)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @staticmethod | |
| def _fetch_resource(url, session): | |
| """Fetch any resource data""" | |
| response = session.get(url) | |
| response.raise_for_status() | |
| return response.json() | |
| # Before changes, the file had two methods: | |
| # @staticmethod | |
| # def _fetch_pokemon(url, session): | |
| # """Fetch detailed Pokemon data""" | |
| # response = session.get(url) | |
| # response.raise_for_status() | |
| # return response.json() | |
| # | |
| # @staticmethod | |
| # def _fetch_resource(url, session): | |
| # """Fetch any resource data""" | |
| # response = session.get(url) | |
| # response.raise_for_status() | |
| # return response.json() | |
| # After applying the review comment: | |
| @staticmethod | |
| def _fetch_resource(url, session): | |
| """Fetch data from the PokeAPI""" | |
| response = session.get(url) | |
| response.raise_for_status() | |
| return response.json() | |
| # ... other class methods ... | |
| def read(self): | |
| # ... preceding code ... | |
| results = [...] # Assume results is defined here | |
| for result in results: | |
| - pokemon_data = self._fetch_pokemon(result["url"], session) | |
| + pokemon_data = self._fetch_resource(result["url"], session) | |
| # ... following code ... |
Summary by CodeRabbit
New Features
Documentation
Tests