
Conversation

@fkiraly (Contributor) commented Feb 15, 2025

@benHeid left a comment:

I am very unsure about this proposal. I do not really understand the drawbacks of adding y to predict. In my conceptual model this would be the correct approach, since the past y values are the input of the prediction.

> Foundation models currently come in two different, contradictory forms:
>
> * those that carry out fine-tuning in `fit`
> * those that pass the context in `fit`, e.g., zero-shot models
@benHeid:

Which models do this? As far as I know, in most zero-shot capable scenarios the data passed to fit is completely ignored. Thus, I created the following PR: sktime/sktime#7838 (comment)

@fkiraly (Contributor Author):

Not correct! For instance, see TinyTimeMixerForecaster.

If no y is passed in predict, then the y passed in fit is the context.

Further, if we changed that to your suggestion - y being ignored in fit - we would break API uniformity, because there would be no way to forecast the "normal" way with ttm!

In particular, this is, for me, a prime example that shows that the y in predict is not a good idea:

  • you need to remember _y from fit, because you do not know whether the user will pass one in predict
  • the "pure" design where we do not do this remembering is no longer in a uniform API with all other forecasters!

@benHeid replied Feb 17, 2025:

Ok, I was not precise in my statement above. It has to be: "data passed in y might be completely ignored by the forecaster in case predict receives a new y."

But I now understand better your concerns regarding y in predict.

The difficulty for me is that we now have at least four different kinds of fit, which have to be distinguishable:

  • fit for setting the context.
  • fit for training a model in a global setting (I think this is referred to as pretraining)
  • fit for fine-tuning a model on a single or a few time series (either with full fine-tuning or PEFT approaches)
  • fit for training a model in the broadcasting setting.

(Not sure if there are more kinds of fit when using other forecasters in sktime.) It would be helpful for me to collect the different kinds of fit, how we distinguish between them, and whether we have to.
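
A hedged attempt to make the four kinds concrete; only case 4 runs against current sktime, cases 1-3 are sketched in comments with hypothetical semantics:

```python
import pandas as pd
from sktime.forecasting.naive import NaiveForecaster

# panel data: two series "a" and "b" with a monthly time index
idx = pd.MultiIndex.from_product(
    [["a", "b"], pd.period_range("2020-01", periods=12, freq="M")]
)
y_panel = pd.DataFrame({"value": range(24)}, index=idx)

# 1. context-setting fit: zero-shot model, fit only remembers y
# 2. global-training fit: one model trained across all series in y_panel
# 3. fine-tuning fit: continue training a pretrained model on a few series
# 4. broadcasting fit: current sktime behaviour for local forecasters -
#    one clone of the forecaster is fitted per series in the panel
f = NaiveForecaster()
f.fit(y_panel, fh=[1])  # case 4: broadcasts over the two series
y_pred = f.predict()
```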

@fkiraly (Contributor Author):

Agreed that we need to collect and analyze the different kinds of "fitting". I would start qualitatively, and then try to match them in a "quantitative" way - I mean typing, or API with mathematical formalism.

In my current internal model, from a typing perspective there are two pairs in the above that can be identified:

  • setting the context; training a model in the broadcasting setting (with VectorizedDF); and the current sktime vanilla case
  • training a model in a global setting, and fine-tuning

fine-tuning is a curious case imo: it can be done with different series, but also with the same series that are later predicted. I think this is a kind of reduction - "in fit, pretrain the inner model with the partial series passed, then fit the inner model".

> * fine-tuning
> * pre-training of any other kind
>
> Zero-shot models do not have pre-training, but `fit` needs to be called,
@benHeid:

I think this is confusing, since the call of fit would not be required if we allowed passing values to predict. For me, fit is associated with actually fitting a model, not with just passing context.
Furthermore, it is also possible to fit zero-shot models, e.g., for fine-tuning them on one's own data. Would this mean switching pretraining on before starting the training?

@fkiraly (Contributor Author):

imo we should not merge two issues here:

  • how to map models that "need no fit" onto the current interface
  • how to map fine-tuning

The representation of "need no fit" is out of scope for this PR; imo that is not something we should reopen (but we could). The current API requires:

  • all models to have a fit, and it to be executed before predict.
  • some index connection between data seen in fit and predict, for most estimators (in the consolidated interfaces)

@fkiraly (Contributor Author):

> Furthermore, it is also possible to fit zero-shot models, e.g., for fine-tuning them on one's own data. Would this mean switching pretraining on before starting the training?

In my opinion and my current conceptual model, fine-tuning is pre-training and not fitting. Fitting is done on the same data we later predict on.

Therefore, in the current API - before we add any y to predict in forecasting, and also for all other learning tasks - zero-shot models pass the context in fit.

@fkiraly (Contributor Author):

@benHeid, I would recommend you think of examples other than foundation models before continuing the discussion; I think having a larger example space is important.

Two important examples:

  • recursive reduction with a global regression model that has an update. We fit this on a number of pre-training batches (global per batch), and then for application we update it on the local data set.
    • a special case is the current recursive reducer, namely without pre-training.
  • the naive strategy of predicting the mean over pre-training time series (see the sketch below).
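
A hedged toy version of the second example; the class and its pretrain method are hypothetical, following the proposal's naming:

```python
import numpy as np
import pandas as pd

class PretrainMeanForecaster:
    """Toy model: predicts the mean over the pre-training time series."""

    def pretrain(self, y_pretrain):
        # the actual "fitting" happens here, on the pre-training series
        self.mean_ = float(np.mean(y_pretrain))
        return self

    def fit(self, y, fh=None):
        # context-setting only: remember the local series, no parameter change
        self.y_, self.fh_ = y, fh
        return self

    def predict(self):
        # replay the pre-training mean at each requested horizon step
        return pd.Series([self.mean_] * len(self.fh_), index=self.fh_)

f = PretrainMeanForecaster().pretrain(pd.Series([1.0, 2.0, 3.0]))
y_pred = f.fit(pd.Series([10.0, 11.0]), fh=[1, 2]).predict()
```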

@benHeid:

What does update mean here? Calling the update method? And what exactly would it do, training or just setting the context?

Let me add the following example, which I think is realistic. Imagine you have a large dataset. You use this for pretraining. Afterwards, you have some specific time series. Now there are two options:

  • Finetune the model further on one specific time series
  • Directly forecast on one specific time series.

What would this look like from a user perspective, and what would the interplay between update/pretrain/fit/predict be?

@fkiraly (Contributor Author):

> What does update mean here? Calling the update method? And what exactly would it do, training or just setting the context?

No, I meant the concept and not the API method update. Simply a model update of the inner regressor, for instance, a low-rank update to a regularized linear regression model.

> What would this look like from a user perspective, and what would the interplay between update/pretrain/fit/predict be?

That is precisely the answer that we need to settle on.

One question: is fine-tuning not a form of pre-training a composite model? Namely, applied to, for instance, the LoRA-adapted model architecture?

In vanilla, we could map this to a "pretrain", or to fit. The key question becomes: if we want to fine-tune on a different instance vs the same instance, should this be in different modes, i.e., pretrain vs fit? Or should this be consistently in the same mode? Then it would have to be pretrain, if we do not want to add new arguments to predict.

> Directly forecast on one specific time series.

I think this one, at least, maps clearly to: "fit passes context, predict produces forecast".
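
As a hedged sketch of the two user journeys under the proposed API - MyForecasterCls is the placeholder name from the proposal text, stubbed minimally here so the vignette is self-contained:

```python
import pandas as pd

class MyForecasterCls:
    def pretrain(self, y):      # journey 1: (pre)train or fine-tune
        return self

    def fit(self, y, fh=None):  # "fit passes context"
        self.y_, self.fh_ = y, fh
        return self

    def predict(self):          # "predict produces forecast"
        # dummy naive forecast, just to make the vignette executable
        return pd.Series(self.y_.iloc[-1], index=self.fh_)

f = MyForecasterCls()
f.pretrain(pd.Series(range(100), dtype=float))  # the large dataset
f.fit(pd.Series([1.0, 2.0, 3.0]), fh=[1, 2])    # the specific series = context
y_pred = f.predict()                             # directly forecast it
```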

@fkiraly (Contributor Author) commented Feb 16, 2025

> I am very unsure about this proposal. I do not really understand the drawbacks of adding y to predict.

I think there are major drawbacks:

  • this breaks the unified API, if you look at the example of zero-shot forecaster TinyTimeMixerForecaster and the general case of zero-shot forecasting.
    • This has, in the past (following the design), forced us to make delayed case distinctions, which imo is a bad sign.
  • lack of composability. The case distinctions pull through to compositors and pipelines. In essence, this forces us to write two APIs and dispatch, everywhere.
    • I would recommend trying to write down the pipeline or the grid tuner to see what happens. In particular, how would you do "fit on global training data, then grid-tune on the inference batch (where both the training data and the inference batch are panel)"?
  • if there are N predict-like methods, you add the argument to N methods, as opposed to 1 new fit-like method (or a dispatch of fit). This goes against the "minimizing coupling" principle.
  • it forces some "fitting" to be done in predict, which is - very strictly in the current design - a non-mutating method. See the reduction example. Going with this would hence force us to abandon this long-held principle, which weakens the API imo.

Summarizing, I think the journey of implementing a number of foundation models and global forecasters has revealed some weaknesses, but I would also say that these were not easy to spot at the start.

@felipeangelimvieira (Contributor):
Really nice proposal and discussion! As far as I understand, some of Ben's concerns relate to the semantic meaning of the method names fit and predict. It is counterintuitive to call fit when it does not change the model parameters. It is also somewhat strange that pretrain serves as the actual fitting step in recursive reduction.

Now that I understand the motivations behind the pretrain concept, I see many benefits, such as backward compatibility and easier implementation of compositions. However, I also see the following drawbacks in the current proposal:

  • fit is overloaded, as it can perform multiple functions, requiring users to be aware of the object's state before calling it. Calling fit does not always produce the same result; its behavior depends on the object's state. As a result, the same method can return valid outputs without raising exceptions, yet the object may be operating in unexpected ways.
  • The semantic meaning of fit is diluted if we use the same name for fine-tuning, pretraining, and training in recursive reduction. This can make the user experience more confusing, though it could be mitigated with good documentation.

@felipeangelimvieira (Contributor):

@fkiraly have you considered update?


> * some models need to know at `fit` time whether the data is for pretraining.
>   Examples: global reduction approaches. Broadcasting.
> * as a general design principle, all `predict` methods would need to be changed to
@benHeid:

I do not understand this. Why do predict methods need to get the fit arguments?

@fkiraly (Contributor Author) replied Feb 17, 2025:

This is subject to the current approach of adding the local batch to predict-like methods. A training batch is described by the full set of data-like arguments in fit.

In forecasting, for instance, we add the y to each and every predict-like method.

Also, consider the case of classification, where training happens on pairs of instances, one instance in X pairing with a label in y. To turn time series classifiers into "global models" - or, equivalently, fine-tunable ones - we would need to be able to pass the entire "local" training batch, X and y, to predict and predict_proba. But there already is an X in predict...
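
A hedged illustration of the signature clash; the stubs follow the sklearn/sktime naming convention, and the "global" variant is hypothetical:

```python
class TimeSeriesClassifier:
    def fit(self, X, y): ...   # training batch: instances X, labels y
    def predict(self, X): ...  # X here means: the instances to classify

# making this "global"/fine-tunable via predict would require a second
# (X, y) pair - but the name X is already taken by the inference batch:
class GlobalTimeSeriesClassifier:
    def predict(self, X, X_context=None, y_context=None): ...  # awkward
```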

@fkiraly (Contributor Author) commented Feb 17, 2025

> @fkiraly have you considered update?

Yes - update, as currently used and defined, is a stream update and not a batch update.

I have considered turning update into a method that can do either, depending on whether "the same" instance indices are passed or new ones, but that would result in a highly counterintuitive interface, as users would have to do tedious index operations in a case where they want batch update.

I was also unable to come up with a vignette that makes its use clear, but perhaps you have an idea how to use update, @felipeangelimvieira? One could for instance say, if it is called before fit, it is always a batch update.

That actually might work? But it is an entirely new idea prompted by your question.
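
A self-contained toy of that idea, with hypothetical semantics (not current sktime behaviour):

```python
class BatchUpdatableForecaster:
    def __init__(self):
        self.pretrain_batches_ = []
        self.is_fitted = False

    def update(self, y):
        if not self.is_fitted:
            self.pretrain_batches_.append(y)   # before fit: batch update
        else:
            self.context_ = self.context_ + y  # after fit: stream update
        return self

    def fit(self, y):
        self.context_ = y  # context-setting fit
        self.is_fitted = True
        return self

f = BatchUpdatableForecaster()
f.update([1.0, 2.0, 3.0])  # interpreted as a batch/pretraining update
f.fit([4.0, 5.0])          # the usual fit follows
```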


> f = MyForecasterCls(params)
> f.pretrain()
@benHeid:

I think what makes it confusing for me is the word pretrain. I think it should be just train. This would make it much clearer that the simple fit does not train.

@fkiraly (Contributor Author):

I am not precious about the name - happy to call it train if this is more consistent with, say, torch.

@fkiraly (Contributor Author):

> much clearer that the simple fit does not train.

Though, sometimes it does - and in a very general sense it always does, except if the fit_is_empty tag is True.

@benHeid:

Phew, this is very philosophical :D and probably depends on the perspective.

@fkiraly (Contributor Author):

I do think one of the key secrets in API design is to discard philosophical considerations and purely consider the typing of operations.

At least, considering input/output, parametrization, and the action carried out tends to lead to better APIs than considering the hard-to-quantify "meaning" assigned to it.

Which, I suppose, is again a bit of a philosophical statement...

@benHeid commented Feb 17, 2025

I think I get why y in predict is difficult in sktime, so thanks for the explanations @fkiraly. Probably my deep learning background is the reason why I am so used to this design that everything else seems strange to me.

For me personally, the biggest open points are now how the different kinds of fit are distinguished, and whether they need to be distinguished. At least for the documentation, I think we need to say exactly which types exist and how they can be switched on or off.

@fkiraly (Contributor Author) commented Feb 17, 2025

> For me personally, the biggest open points are now how the different kinds of fit are distinguished, and whether they need to be distinguished.

Conceptually, I consider there to be two different kinds:

  • fitting while knowing that you will apply on different instances, down the line, including:
    • global forecasting, training on the "global set"
    • fine-tuning
  • fitting while knowing that you will apply on different time indices of the same instance, down the line
    • the current vanilla case
    • passing context in zero-shot or few-shot

In general, this does make a difference - for some models it does not, but I feel this is a smaller subset of all examples.

The above conceptual model does not imply that the two top-level bullet points are mapped consistently onto one method, or one specific application of one method - but it does strongly suggest it, to me.

@benHeid commented Feb 17, 2025

> Conceptually, I consider there to be two different kinds:
>
>   • fitting while knowing that you will apply on different instances, down the line, including:
>     • global forecasting, training on the "global set"
>     • fine-tuning
>   • fitting while knowing that you will apply on different time indices of the same instance, down the line
>     • the current vanilla case
>     • passing context in zero-shot or few-shot

I also started a list in this comment: #41 (comment)
So would you consider fit in the broadcasting case as vanilla, or is it an additional case?

When fine-tuning a model, I am not quite sure it is true that you will not apply the model to this instance. I can imagine that fine-tuning is part of both top-level cases.

@felipeangelimvieira (Contributor):

> fitting while knowing that you will apply on different time indices of the same instance, down the line
> the current vanilla case
> passing context in zero-shot or few-shot

Isn't this zero-shot case equivalent/similar to .update(y=y, X=X, update_params=True)?

@fkiraly (Contributor Author) commented Feb 17, 2025

> fitting while knowing that you will apply on different time indices of the same instance, down the line
> the current vanilla case
> passing context in zero-shot or few-shot
>
> Isn't this zero-shot case equivalent/similar to .update(y=y, X=X, update_params=True)?

I do not see that, at least not under the current API. In that API, fit must have been called before update will be valid. And update considers the same time series instances as fit does.

@felipeangelimvieira (Contributor):

> I do not see that, at least not under the current API. In the current implementation, fit must be called before update is valid. Additionally, update considers the same time series instances as fit.

Consider the following scenario: a global model has been fit (pretrained according to the current proposal), and now I want to forecast the same time series but with up-to-date data. In this case, wouldn't the behavior of fit (pretraining=off) and update be the same, i.e., updating the context?

After some thought, I feel that not passing y to predict seems easier to maintain. My only concerns right now are:

  • The overlap between update and fit
  • Overloading fit to perform three different tasks:
    • Fit the model's parameters from scratch
    • Fine-tune the model's parameters from the current state
    • Update the context (e.g., information related to time series instances, cutoff, creating broadcasting model clones if needed)

Are these concerns valid? Am I missing something?

@fkiraly (Contributor Author) commented Feb 21, 2025

> After some thought, I feel that not passing y to predict seems easier to maintain.

Yes, I do agree with that - it is not an immediate insight though and requires looking at internals.

> My only concerns right now are:
>
> The overlap between update and fit

Can you explain that? I do not get this.

> Overloading fit to perform three different tasks:

  • I do agree this is a concern, and that is what I do not like about the first option. We could have a second train call.
  • however, I think there are only two substantially different tasks, in terms of formalism. Updating the context and "first fitting" feel the same to me.

> Are these concerns valid? Am I missing something?

I do think they are and they are some of the most important discussion points. I do need more explanation on update vs fit though.

@felipeangelimvieira (Contributor) commented Feb 24, 2025

> Can you explain that? I do not get this.

I hope I'm not misunderstanding the behaviour of update. But consider the scenario where you have an already trained global model (a regression reduction). You fit the model with y_train, and now you have model parameters that are useful for predicting. On another day, you have more data for the same series, and now you need to make your model aware of the new observations for each of the time series, so your predict takes them into account. Currently, this can be achieved with update(y=y, X=X). According to the proposal, this could also be achieved by calling fit to set the context of the time series that will be used to predict. Isn't it?
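
A runnable sketch of this scenario with a reduction forecaster from sktime; the "global"/pretraining framing is the proposal's, not current sktime API:

```python
from sklearn.linear_model import Ridge
from sktime.datasets import load_airline
from sktime.forecasting.compose import make_reduction

y = load_airline()
y_train, y_new = y[:-12], y[-12:]

f = make_reduction(Ridge(), window_length=12, strategy="recursive")
f.fit(y_train, fh=[1, 2, 3])          # day 1: train the reducer
f.update(y_new, update_params=False)  # day 2: context only, no re-training
y_pred = f.predict()                  # forecasts start after the new cutoff
```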

> Updating the context and "first fitting" feel the same to me.

Fitting from scratch changes the model parameters. Updating the context only makes the model aware of the new observations or new time series. The latter, in the case of local models or transformations and new time series instances, is not useful without re-training the parameters. However, for global models, it is useful for predicting with previously unseen series.

@fkiraly (Contributor Author) commented Feb 24, 2025

> Currently, this can be achieved with update(y=y, X=X).

Yes, this agrees with my understanding.

> According to the proposal, this could also be achieved by calling fit to set the context of the time series that will be used to predict. Isn't it?

Hm, no - I do not see that following from my proposal. Could you perhaps give more details about what you mean by "this"?

> Fitting from scratch changes the model parameters. Updating the context only makes the model aware of the new observations or new time series. The latter, in the case of local models or transformations and new time series instances, is not useful without re-training the parameters.

I still think it is "the same interface", because there is a "is a special case" relationship here.

Fitting (from scratch), for most models, does not only update the model parameters, but also necessarily makes the model aware of the new observations - which can be seen easily from the fact that these new observations are used to fit the model.

Therefore, estimators that only "make the model aware of new observations" in fit but do no fitting are a special case of a more general API where only this is required, and doing something non-trivial with fitted parameters is optional.

> The latter, in the case of local models or transformations and new time series instances, is not useful without re-training the parameters.

It is, for models that do not require fitting, and that is imo the key observation! These "models that require no fitting" are most similar to zero-shot models where a context is passed. Consider, for instance, the naive model that just replays the past, or the model that replays some expert forecasts passed to the model in advance. These models do not "fit from scratch" but remember the data, aka "pass the context", just like zero-shot models.
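
A runnable toy of the "replays the past" example - no fitting at all, fit only remembers the context (illustrative, not a real sktime estimator):

```python
import pandas as pd

class ReplayForecaster:
    def fit(self, y):
        self.y_ = y  # "pass the context": just remember the data
        return self

    def predict(self, fh):
        # replay the last len(fh) observed values at the requested horizons
        last = self.y_.iloc[-len(fh):].to_numpy()
        return pd.Series(last, index=fh)

f = ReplayForecaster().fit(pd.Series([1.0, 2.0, 3.0, 4.0]))
print(f.predict(fh=[1, 2]))  # replays 3.0, 4.0
```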

@felipeangelimvieira self-requested a review on March 25, 2025.