Skip to content

Conversation

jgyasu
Copy link
Member

@jgyasu jgyasu commented Jul 1, 2025

Proposes dataset collections for simplifying benchmarking experiment reproducibility.

def __len__(self):
return len(self._collection.get_datasets()) + len(self._additional_datasets)

def get_collection_info(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be part of a tag system for the dataset collection?

self.subset = subset
super().__init__()

def get_collection_name(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_collection_name feels like it should be a tag, e.g., "info:name" : "TSC Bake-off 2017

A collection of datasets following the strategy pattern.
"""

_collections = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this design is not very extensible!

Why: for adding a collection, you need to:

  1. add a new object
  2. add it to this register

Generally, designs requiring a registry are less extensible than designs using the strategy pattern only.

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great proposal, I like the high-level ideas and design!

Some issues and comments I have:

  • the DatasetCollection is a kind of register, that also functions as a instance factory (via add). I feel this is a violation of the single responsibility principle and the strategy pattern as well. So I would go for sth different.
  • The use case of adding one more dataset is a nice one, I did not have it on the radar that clearly.
  • the add method belongs to the builder pattern, and violates the dataclass pattern. There is a way to have both: instead of modifying self, the add method returns a newly constructed CustomCollection (or similar), which is a dataclass.

"What if" ideas:

  • how about collections that are not just for datasets, but anything? It could be estimators! Example: the TSC bake-off had a collection of datasets, and a collection of estimators as well.
  • instead of using a registry in DatasetCollection, should we use craft from registry? This is a lookup entry point which gets class by name.

@jgyasu
Copy link
Member Author

jgyasu commented Jul 2, 2025

how about collections that are not just for datasets, but anything? It could be estimators! Example: the TSC bake-off had a collection of datasets, and a collection of estimators as well.

I will adress the other comments, but this is a nice idea. I thought about it too. I think we can just hace a base class for collections, say, BaseCollection and then extend it to different types of collections, EstimatorBaseCollection, DatasetBaseCollection.

If this sounds good then I will modify this proposal? @fkiraly

@fkiraly
Copy link
Contributor

fkiraly commented Jul 2, 2025

Sounds good, although I would only introduce subclasses if something implies they are needed, design-wise.

@jgyasu jgyasu changed the title Dataset Collections Advanced Benchmarking Aug 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: PR in progress
Development

Successfully merging this pull request may close these issues.

3 participants