Pre-training, global learning, and fine-tuning API #41
Conversation
I am very unsure about this proposal. I do not really understand the drawbacks of adding `y` to `predict`. In my conceptual model this would be the correct approach, since the past `y` values are the input of the prediction.
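As a sketch of that conceptual model (hypothetical signature, not current sktime API; `forecaster` and `y_past` are placeholder objects):

```python
# hypothetical: the past y values are the input of the prediction,
# passed directly at predict time as the context
y_pred = forecaster.predict(fh=[1, 2, 3], y=y_past)  # y_past: pd.Series of history
```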
> Foundation models currently come in two different, contradictory forms:
>
> * those that carry out fine-tuning in `fit`
> * those that pass the context in `fit`, e.g., zero-shot models
Which models do this? As far as I know, in most zero-shot capable scenarios the data passed to `fit` are completely ignored. Thus, I created the following PR: sktime/sktime#7838 (comment)
Not correct! For instance, see `TinyTimeMixerForecaster`. If no `y` is passed in `predict`, then the `y` passed in `fit` is the context.

Further, if we changed that to your suggestion - `y` being ignored in `fit` - we would break API uniformity, because there would be no way to forecast the "normal" way with TTM.
In particular, this is, for me, a prime example that shows that the `y` in `predict` is not a good idea:

- you need to remember `_y` from `fit`, because you do not know whether the user will pass one in `predict`
- the "pure" design where we do not do this remembering is no longer in a uniform API with all other forecasters!
Ok, I was not exact in my statement above. It has to be: "data passed in `y` might completely be ignored by the forecast in case `predict` receives a new `y`."

But I now understand better your concerns regarding `y` in `predict`.
The difficulty for me is that we now have at least four different kinds of `fit`, which have to be distinguishable:

- `fit` for setting the context
- `fit` for training a model in a global setting (I think this is referred to as `pretraining`)
- `fit` for fine-tuning a model on a single/few time series (either with full fine-tuning or PEFT approaches)
- `fit` for training a model in the `broadcasting` setting

(Not sure if there are more kinds of `fit` when using other forecasters in sktime.) So it would be helpful for me to collect the different kinds of `fit`, how we distinguish between them, and whether we have to.
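To illustrate why distinguishing them is hard: from the call alone, all four kinds currently look identical. A sketch, where `NaiveForecaster` is only a stand-in and `y_single`, `y_many_series`, `y_panel` are hypothetical data objects:

```python
from sktime.forecasting.naive import NaiveForecaster

f = NaiveForecaster()  # stand-in for any forecaster

f.fit(y_single, fh=[1, 2, 3])       # 1. setting the context (zero-shot use)
f.fit(y_many_series, fh=[1, 2, 3])  # 2. training in a global setting ("pretraining")
f.fit(y_single, fh=[1, 2, 3])       # 3. fine-tuning on a single / few series
f.fit(y_panel, fh=[1, 2, 3])        # 4. broadcasting: one clone per series in a panel
```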
Agreed that we need to collect and analyze the different kinds of "fitting". I would start with qualitative, and then try to match them in a "quantitative" way, I mean typing or API with mathematical formalism.

In my current internal model, from a typing perspective there are two pairs in the above that can be identified:

- setting the context, and training a model in the broadcasting setting (with `VectorizedDF`, and the current `sktime` vanilla case)
- training a model in a global setting, and fine-tuning

Fine-tuning is a curious case imo: it can be done with different series, but also with the same series that are later predicted. I think this is a kind of reduction - "in fit, pretrain the inner model with the partial series passed, then fit the inner model".
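A rough sketch of that reduction idea - a hypothetical wrapper, not sktime code, assuming the `pretrain` method from this proposal exists on the inner forecaster:

```python
class FineTuneThenFit:
    """Reduction: in fit, pretrain (fine-tune) the inner model on the series
    passed, then fit the inner model on the same series as context."""

    def __init__(self, inner_forecaster):
        self.inner = inner_forecaster

    def fit(self, y, fh):
        self.inner.pretrain(y)    # hypothetical pretrain/fine-tune step (this proposal)
        self.inner.fit(y, fh=fh)  # then vanilla fit = pass the context
        return self

    def predict(self):
        return self.inner.predict()
```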
> * fine-tuning
> * pre-training of any other kind
>
> Zero-shot models do not have pre-training, but `fit` needs to be called,
I think this is confusing, since the call to `fit` would not be required if we allowed values to be passed to `predict`. For me, `fit` is associated with actually fitting a model and not with just passing context.

Furthermore, it is also possible to fit zero-shot models, e.g., for fine-tuning them on one's own data. Would this mean switching `pretraining` on before starting the training?
imo we should not merge two issues here:

- how to map models that "need no fit" onto the current interface
- how to map fine-tuning

The representation of "need no fit" is out of scope for this PR; imo that is not sth we should reopen (but we could). The current API requires:

- all models to have a `fit`, and it to be executed before `predict`
- some index connection between data seen in `fit` and `predict`, for most estimators (in the consolidated interfaces)
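For reference, the consolidated contract in its vanilla form (standard sktime usage, nothing new):

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()

f = NaiveForecaster(strategy="last")
f.fit(y, fh=[1, 2, 3])  # fit must be called before predict
y_pred = f.predict()    # predict continues the index of the data seen in fit
```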
> Furthermore, it is also possible to fit zero-shot models, e.g., for fine-tuning them on one's own data. Would this mean switching `pretraining` on before starting the training?

In my opinion and my current conceptual model, fine-tuning is pre-training and not fitting. Fitting is done on the same data we later `predict` on.

Therefore, in the current API - before we add any `y` to `predict` in forecasting, and also for all other learning tasks - zero-shot models pass the context in `fit`.
@benHeid, I would recommend you think of examples other than foundation models before continuing the discussion; I think having a larger example space is important.

Two important examples:

- recursive reduction with a global regression model that has an update: we fit this on a number of pre-training batches (global per batch), and then for application we update it on the local data set (see the sketch below)
  - a special case is the current recursive reducer, namely without pre-training
- the naive strategy of predicting the mean over pre-training time series
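A minimal sketch of the first example, illustrative only: `pretraining_series` and `local_series` are assumed lists/arrays of univariate series, and the lag-table construction is simplified compared to sktime's own reducers.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def lag_table(y, lags=3):
    """Turn a univariate series into (X, y) lag pairs for recursive reduction."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([y[i : len(y) - lags + i] for i in range(lags)])
    return X, y[lags:]

reg = SGDRegressor()

# "pre-training": one global update per pre-training batch/series
for series in pretraining_series:
    X, yt = lag_table(series)
    reg.partial_fit(X, yt)

# "application": update the same model on the local data set
X_loc, y_loc = lag_table(local_series)
reg.partial_fit(X_loc, y_loc)

# recursive multi-step forecast on the local series
window = list(np.asarray(local_series, dtype=float)[-3:])
y_pred = []
for _ in range(5):
    y_next = reg.predict(np.array(window[-3:]).reshape(1, -1))[0]
    y_pred.append(y_next)
    window.append(y_next)
```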
What does update mean here? Calling the `update` method? And what exactly would it do - training, or just setting the context?

Let me add the following example, which I think is realistic. Imagine you have a large dataset, which you use for pretraining. Afterwards, you have some specific time series. Now there are two options:

- fine-tune the model further on one specific time series
- directly forecast on one specific time series

How would this look from a user perspective, and what is the interplay between update/pretrain/fit/predict?
> What does update mean here? Calling the `update` method? And what exactly would it do - training, or just setting the context?

No, I meant the concept, not the API method `update` - simply a model update of the inner regressor, for instance a low-rank update to a regularized linear regression model.

> How would this look from a user perspective, and what is the interplay between update/pretrain/fit/predict?

That is precisely the answer that we need to settle on.

One question: is fine-tuning not a form of pre-training a composite model? Namely, applied to, for instance, the LoRA-adapted model architecture?

In vanilla, we could map this to a "pretrain", or to `fit`. The key question becomes: if we want to fine-tune on a different instance vs the same instance, should this be in different modes, i.e., pretrain vs fit? Or should this be consistently in the same mode? Then it would have to be pretrain, if we do not want to add new arguments to `predict`.

> Directly forecast on one specific time series.

I think this one, at least, maps clearly to: "fit passes context, predict produces forecast".
I think there are major drawbacks:
Summarizing, I think the journey of implementing a number of foundation models and global forecasters has revealed some weaknesses, but I would also say that these were not easy to spot at the start.
Really nice proposal and discussion! As far as I understand, some of Ben's concerns relate to the semantic meaning of the method names.

Now that I understand the motivations behind the ...

@fkiraly, have you considered `update`?
> * some models need to know at `fit` time whether the data is for pretraining.
>   Examples: global reduction approaches. Broadcasting.
> * as a general design principle, all `predict` methods would need to be changed to
I do not understand this. Why do `predict` methods need to get the `fit` arguments?
This is subject to the current approach of adding the local batch to `predict`-like methods. A training batch is described by the full set of data-like arguments in `fit`.

In forecasting, for instance, we add the `y` to each and every `predict`-like method.

Also, consider the case of classification, where training happens on pairs of instances, one instance in `X` pairing with a label in `y`. To turn time series classifiers into "global models" - or, equivalently, fine-tunable ones - we would need to be able to pass the entire "local" training batch, `X` and `y`, to `predict` and `predict_proba`. But there already is an `X` in `predict`...
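To spell out the clash for classifiers, a sketch with hypothetical signatures (none of this is current sktime API):

```python
class TSClassifierToday:
    def fit(self, X, y): ...   # training batch: instances X, labels y
    def predict(self, X): ...  # X = new instances to classify

class TSClassifierWithLocalBatch:
    def fit(self, X, y): ...   # global / pre-training batch
    # passing the entire "local" training batch at predict time needs extra
    # arguments next to the existing X, e.g. (hypothetical names):
    def predict(self, X, X_local=None, y_local=None): ...
    def predict_proba(self, X, X_local=None, y_local=None): ...
```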
Yes, I have considered turning ... I was also unable to come up with a vignette that makes its use clear, but perhaps you have an idea how to use it. That actually might work? But it is an entirely new idea prompted by your question.
From steps/25_pretrain/step.md:

> f = MyForecasterCls(params)
>
> f.pretrain()
I think what makes it confusing for me is the word `pretrain`. I think it should be just `train`. This would make it much clearer that the simple `fit` does not train.
I am not precious about the name - happy to call it `train` if this is more consistent with, say, `torch`.
> much clearer that the simple `fit` does not train.

Though, sometimes it does, and in a very general sense it always does, except if the `fit_is_empty` tag is `True`.
Puh, this is very philosophical :D and probably depends on the perspective.
I do think one of the key secrets in API design is to discard philosophical considerations and purely consider the typing of operations.
At least, considering input/output and parametrization and the action carried out tends to lead to better APIs than considering hard to quantify "meaning" assigned to it.
Which, I suppose, is again a bit of a philosophical statement...
I think I get why ... For me personally, the biggest open points are now how the different kinds of `fit` ...
Conceptually, I consider there to be two different kinds: ...

In general, this does make a difference - for some models it does not, but I feel this is a smaller subset of all examples. The above conceptual model does not imply that the two top-level bullet points are mapped consistently onto one method or one specific application of one method - but it does strongly suggest it, to me.
I also started a list in this comment: #41 (comment). When fine-tuning a model, I am not quite sure if it is true that you will not apply the model to this instance. I can imagine that fine-tuning is part of both top-level cases.
Isn't this zero-shot case equivalent/similar to `update`?
I do not see that, at least not under the current API. In that, ...
Consider the following scenario: a global model has been fit (pretrained, according to the current proposal), and now I want to forecast the same time series, but with up-to-date data. In this case, wouldn't the behavior of ...

After some thought, I feel that not passing ...

Are these concerns valid? Am I missing something?
Yes, I do agree with that - it is not an immediate insight though, and requires looking at internals.

Can you explain that? I do not get this.

I do think they are, and they are some of the most important discussion points. I do need more explanation on ...
I hope I'm not misunderstanding the behaviour of update. But consider the scenario where you have an already trained global model (a regression reduction). You fit the model with ...
Fitting from scratch changes the model parameters. Updating the context only makes the model aware of the new observations or new time series. The latter, in the case of local models or transformations and new time series instances, is not useful without re-training the parameters. However, for global models, it is useful for predicting with previously unseen series.
Yes, this agrees with my understanding.
Hm, no - I do not see that following from my proposal. Could you perhaps give more details about what you mean by "this"?
I still think it is "the same interface", because there is an "is a special case" relationship here. Fitting (from scratch), for most models, does not only update the model parameters, but also necessarily makes the model aware of the new observations, which can be seen easily from the fact that these new observations are used to fit the model. Therefore, estimators that only "make the model aware of new observations" in ...
It is, for models that do not require fitting, and that is imo the key observation! These "models that require no fitting" are most similar to zero-shot models where a context is passed. Consider, for instance, the naive model that just replays the past, or the model that replays some expert forecasts passed in advance to the model. These models do not "fit from scratch" but remember the data, aka "pass the context", just like zero-shot models.
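As a toy illustration of such a "no fitting required" model - a sketch in the spirit of sktime's NaiveForecaster, not actual library code:

```python
import pandas as pd

class ReplayLastForecaster:
    """Toy forecaster that requires no fitting: fit only remembers the context."""

    def fit(self, y, fh):
        self._last = y.iloc[-1]  # "fitting" = remembering the passed data
        self._fh = list(fh)
        return self

    def predict(self):
        # replay the last observed value for every forecast horizon step
        return pd.Series([self._last] * len(self._fh))
```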
This PR proposes a consolidated design for pre-training, global learning, and fine-tuning.
References: