From 9b2692ae7726324b0e59abecdaa64f50799e4966 Mon Sep 17 00:00:00 2001 From: divyashreepathihalli Date: Fri, 16 May 2025 20:26:33 +0000 Subject: [PATCH 1/5] Update contributing guide --- CONTRIBUTING_MODELS.md | 179 +++++++++++++++++++---------------------- 1 file changed, 85 insertions(+), 94 deletions(-) diff --git a/CONTRIBUTING_MODELS.md b/CONTRIBUTING_MODELS.md index 2a9556fd15..26b60f3c9d 100644 --- a/CONTRIBUTING_MODELS.md +++ b/CONTRIBUTING_MODELS.md @@ -20,25 +20,26 @@ Keep this checklist handy! - [ ] Open an issue or find an issue to contribute a backbone model. -### Step 2: PR #1 - Add XXBackbone +### Step 2: PR #1 - Model folder +- [ ] Create your model folder XX in https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models + +### Step 3: PR #1 - Add XXBackbone - [ ] An `xx/xx_backbone.py` file which has the model graph \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py)\]. - [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone_test.py)\]. - [ ] A Colab notebook link in the PR description which matches the outputs of the implemented backbone model with the original source \[[Example](https://colab.research.google.com/drive/1SeZWJorKWmwWJax8ORSdxKrxE25BfhHa?usp=sharing)\]. -### Step 3: PR #2 - Add XXTokenizer +### Step 4: PR #2 - Data Converter - Add XXTokenizer or XXImageConverter or XXAudioConverter, etc -- [ ] An `xx/xx_tokenizer.py` file which has the tokenizer for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py)\]. 
+ +- [ ] If you are contributing a language model add a `xx/xx_tokenizer.py` file which has the tokenizer for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py)\]. - [ ] An `xx/xx_tokenizer_test.py` file which has unit tests for the model tokenizer \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer_test.py)\]. - [ ] A Colab notebook link in the PR description, demonstrating that the output of the tokenizer matches the original tokenizer \[[Example](https://colab.research.google.com/drive/1MH_rpuFB1Nz_NkKIAvVtVae2HFLjXZDA?usp=sharing)]. +- [ ] If you are contributing an image model add a `xx/xx_image_converter.py` file which has the image transformations for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/clip/clip_image_converter.py)\]. +- [ ] If you are contributing an image model add a `xx/xx_audio_converter.py` file which has the audio transformations for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/moonshine/moonshine_audio_converter.py)\]. -### Step 4: PR #3 - Add XX Presets - -- [ ] An `xx/xx_presets.py` file with links to weights uploaded to a personal GCP bucket/Google Drive \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_presets.py)\]. -- [ ] A `tools/checkpoint_conversion/convert_xx_checkpoints.py` which is reusable script for converting checkpoints \[[Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py)\]. -- [ ] A Colab notebook link in the PR description, showing an end-to-end task such as text classification, etc. The task model can be built using the backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. 
-### Step 5: PR #4 and Beyond - Add XX Tasks and Preprocessors +### Step 5: PR #3 - Add XX Tasks and Preprocessors This PR is optional. @@ -46,6 +47,14 @@ This PR is optional. - [ ] An `xx/xx__preprocessor.py` file which has the preprocessor and can be used to get inputs suitable for the task model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor.py)\]. - [ ] `xx/xx__test.py` file and `xx/xx__preprocessor_test.py` files which have unit tests for the above two modules \[[Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py) and [Example 2](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor_test.py)\]. - [ ] A Colab notebook link in the PR description, demonstrating that the output of the preprocessor matches the output of the original preprocessor \[[Example](https://colab.research.google.com/drive/1GFFC7Y1I_2PtYlWDToqKvzYhHWv1b3nC?usp=sharing)]. +- [ ] Add a Colab notebook to demonstate an end to end demo of the task model, show that teh outputs are matching the original implementation and also add a demo to show finetuning of the model. + +### Step 4: PR #4 and beyond - Add XX Presets, Weights, and End-to-End Validation +- [ ] An `xx/xx_presets.py` file with links to weights uploaded to Kaggle Keras page[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_presets.py)\]. +- [ ] You can test out the model presets and show the demo by staging the model presets to KerasHub org page on [Kaggle](https://www.kaggle.com/organizations/kerashub). Here is the invite [link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5) to join the org page. 
+- [ ] A `tools/checkpoint_conversion/convert_xx_checkpoints.py` which is reusable script for converting checkpoints \[[Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py)\]. +- [ ] A Colab notebook link in the PR description, showing an end-to-end task such as text classification, etc. The task model can be built using the backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics and outputs are matching + ## Detailed Instructions @@ -83,28 +92,9 @@ A model is typically split into three/four sections. We would recommend you to compare this side-by-side with the [`keras_hub.layers.DistilBertBackbone` source code](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py)! -**Inputs to the model** - -Generally, the standard inputs to any text model are: - - `token_ids`: tokenised inputs (An integer representation of the text sequence). - - `padding_mask`: Masks the padding tokens. - -**Embedding layer(s)** - -Standard layers used: `keras.layers.Embedding`, -`keras_hub.layers.PositionEmbedding`, `keras_hub.layers.TokenAndPositionEmbedding`. - -**Encoder layers** - -Standard layers used: `keras_hub.layers.TransformerEncoder`, `keras_hub.layers.FNetEncoder`. - -**Decoder layers (possibly)** - -Standard layers used: `keras_hub.layers.TransformerDecoder`. - -**Other layers which might be used** - -`keras.layers.LayerNorm`, `keras.layers.Dropout`, `keras.layers.Conv1D`, etc. +Implementation: Use Keras' functional API or subclass keras.Model. Refer to existing KerasHub backbones for structure. +Inputs: Define standard inputs (e.g., token_ids, padding_mask for text; pixel_values for vision; audio_features for audio). +Layers: Leverage standard keras.layers and relevant keras_hub_layers where possible. 
Implement custom layers if necessary, ensuring they are well-tested and documented.
@@ -130,6 +120,9 @@ Since the first PR is only to add the model backbone class, you should omit the `from_presets()` function; this will be added at a later stage when you open a PR for adding presets. +Validation Colab: Create a Colab notebook that: Loads weights from the original model source. Manually loads these weights into an instance of your KerasHub backbone. Compares the output of your backbone with the original model's corresponding layer output on sample inputs, ensuring numerical closeness. +Unit Tests (your_model_backbone_test.py): Include tests for forward pass, save/load, and correct output shapes with various configurations. + #### Convert weights from the original source and check output! Before you open a PR for adding the model backbone class, it is essential to check @@ -149,46 +142,54 @@ It is essential to add units tests. These unit tests are basic and mostly check whether the forward pass goes through successfully, whether the model can be saved and loaded correctly, etc. -### Step 3: PR #2 - Add XXTokenizer +### Step 3: PR #2 - Data Converter - Add XXTokenizer or XXImageConverter or XXAudioConverter, etc #### Tokenizer -Most text models nowadays use subword tokenizers such as WordPiece, SentencePiece -and BPE Tokenizer. Since KerasHub has implementations of most of the popular -subword tokenizers, the model tokenizer layer typically inherits from a base -tokenizer class. - -For example, DistilBERT uses the WordPiece tokenizer. So, we can introduce a new -class, `DistilBertTokenizer`, which inherits from `keras_hub.tokenizers.WordPieceTokenizer`. -All the underlying actual tokenization will be taken care of by the superclass. - -The important thing here is adding "special tokens". Most models have -special tokens such as beginning-of-sequence token, end-of-sequence token, -mask token, pad token, etc. 
These have to be -[added as member attributes](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py#L91-L105) -to the tokenizer class. These member attributes are then accessed by the -preprocessor layers. - -For a full list of the tokenizers KerasHub offers, please visit -[this link](https://keras.io/api/keras_nlp/tokenizers/) and make use of the -tokenizer your model uses! +The Data Converter transforms raw data of a specific modality into a numerical format suitable for the preprocessor and backbone. +Implementation: +Text: YourModelTokenizer - Converts raw text into sequences of token IDs. Inherit from a base tokenizer in KerasNLP (e.g., WordPieceTokenizer, SentencePieceTokenizer) or implement a custom one. Define special tokens (e.g., cls_token, pad_token) and handle vocabulary loading. +Image: YourModelImageConverter (or similar name like ImageProcessor) - Handles operations like resizing, rescaling, normalization, and potentially data augmentation strategy application. May utilize keras_cv.layers. +Audio: YourModelAudioConverter (or similar name like AudioFeatureExtractor) - Processes raw audio into features like spectrograms or MFCCs. May utilize Keras or other audio processing libraries. +Assets: Ensure your converter can load necessary assets (e.g., vocabulary files for tokenizers, mean/std deviation values for image normalization). +Validation Colab: Demonstrate that your data converter's output (e.g., token IDs, processed pixel tensors, audio feature tensors) matches the behavior of the original model's data conversion step. +Unit Tests (e.g., your_model_tokenizer_test.py): Test core functionality, asset loading, and output consistency. #### Unit Tests -The last step here is to add unit tests for the tokenizer. A dummy vocabulary is -created, and the output of both these layers is verified including tokenization, -detokenization, etc. 
- -### Step 4: PR #3 - Add XX Presets - -Once the backbone and tokenizer PRs have been merged, you can open a PR for -adding presets. For every model, we have a separate file where we mention our -preset configurations. This preset configuration has model-specific arguments -such as number of layers, number of attention heads; preprocessor-specific -arguments such as whether we want to lowercase the input text; checkpoint and -vocabulary file URLs, etc. In the PR description, you can add -Google Drive/personal GCP bucket links to the checkpoint and the vocabulary -files. These files will then be uploaded to GCP by us! +The last step here is to add unit tests.: Test core functionality, asset loading, and output consistency. +### Step 4: PR #3 and Beyond: Add XXTasks and XXPreprocessors + +This PR builds on the backbone and data converter to create a user-friendly Task Model. +#### Preprocessor (your_model__preprocessor.py) +The Preprocessor takes raw data (text, images, audio paths, etc.) and uses the appropriate Data Converter to transform it into the full format expected by the Backbone. +##### Implementation: +- Create a class (e.g., YourModelCausalLMPreprocessor, YourModelImageClassifierPreprocessor). +- It will use the specific YourModel (e.g., YourModelTokenizer) internally. +- It handles tasks like adding special tokens, padding/truncation for sequences, creating attention masks, batching, and ensuring the output dictionary matches the backbone's expected input names. +##### Inputs: +- Define how it accepts raw data (e.g., strings, file paths, raw tensors). +##### Outputs: +- It should output a dictionary of tensors ready for the Backbone. +##### Validation Colab: +- Show that your preprocessor, given raw input, produces the same tensor inputs (e.g., token_ids, padding_mask, pixel_values) as the original model's complete preprocessing pipeline. 
+##### Unit Tests (your_model__preprocessor_test.py): +- Test with various inputs, ensuring correct output shapes and values. + +#### Task Model (your_model_.py) +The Task Model is the high-level entry point. It combines the Backbone, Preprocessor, and a task-specific head. +##### Implementation: +- Create a class (e.g., YourModelCausalLM, YourModelImageClassifier). +- It should instantiate its Backbone and Preprocessor in its constructor. +- It will include a task-specific head (e.g., a dense layer for classification, a language modeling head, detection heads). +##### API: +- It should offer simple methods like predict(), fit(), generate() (for generative models), detect() (for detection models). +##### Unit Tests (your_model__test.py): +- Test basic usage: instantiation, forward pass with dummy data from the preprocessor, and model compilation. + +### Step 5: PR #4 - Add Presets, Weights, and End-to-End Validation + +Once the above 3 PRs are merged you can open a PR for adding presets. For every model, we have a separate file where we mention our preset configurations. This preset configuration has model-specific arguments such as number of layers, number of attention heads; preprocessor-specific arguments such as whether we want to lowercase the input text; checkpoint and vocabulary file URLs, etc. Please use this [invite link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5) and stage your model presets [here](https://www.kaggle.com/organizations/kerashub/models) After wrapping up the preset configuration file, you need to add the `from_preset` function to all three classes, i.e., `DistilBertBackbone`, @@ -201,39 +202,29 @@ and verify whether the output is correct. For "extra large tests", we loop over all the presets and just check whether the backbone and the tokenizer can be called without any error. -Additionally, a checkpoint conversion script should be added. 
This script -demonstrates that the outputs of our backbone model and outputs of the source -model match. This should be done for all presets. +Checkpoint Conversion Script (tools/checkpoint_conversion/convert_your_model_checkpoints.py) +- Provide a script that converts weights from their original format (e.g., PyTorch .bin, TensorFlow SavedModel) to the Keras H5 format expected by KerasHub. +- This script should be reusable and clearly documented. +- It's crucial for verifying weight conversion accuracy and for future updates. +End-to-End Validation Colab +- This is the most important validation step. +- Create a Colab notebook that demonstrates: + - Loading your Task Model using YourModelTask.from_preset("your_model_preset_name"). + - Running an end-to-end task (e.g., text generation, image classification, object detection) on sample input. + - Comparing the output (e.g., generated text, class probabilities, bounding boxes) with the output of the original model using its original pretrained weights and inference pipeline. Ensure numerical closeness. +Numerics Test: Add at least one unit test (often marked as "large" or "extra_large") that loads a small preset via from_preset(), runs inference on a fixed input, and asserts that the output matches known-good values (obtained from the original model). See existing tests for examples. -### Step 5: PR #4 and Beyond: Add XXTasks and XXPreprocessors +### Step 6: PR #5 and Beyond - Add More Tasks or Advanced Features (Optional) -Once you are finished with Steps 1-4, you can add "task" models and -preprocessors. -### Task model - -Task models are essentially models which have "task heads" on top of the backbone -models. For instance, for the text classification task, you can have a -feedforward layer on top of a backbone model like DistilBERT. 
Task models are -very essential since pretrained models are used extensively for downstream tasks -like text classification, token classification, text summarization, neural -machine translation, etc. - -#### Preprocessor - -The preprocessor class is responsible for making the inputs suitable for -consumption by the model - it packs multiple inputs together, i.e., given -multiple input texts, it will add appropriate special tokens, pad the inputs -and return the dictionary in the form expected by the model. - -The preprocessor class might have a few intricacies depending on the model. For example, -the DeBERTaV3 tokenizer does not have the `[MASK]` in the provided sentencepiece -proto file, and we had to make some modifications [here](https://github.com/keras-team/keras-hub/blob/master/keras_hub/models/deberta_v3/deberta_v3_preprocessor.py). Secondly, we have -a separate preprocessor class for every task. This is because different tasks -might require different input formats. For instance, we have a [separate preprocessor](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_masked_lm_preprocessor.py) -for masked language modeling (MLM) for DistilBERT. +Once the primary Task Model is merged, you can extend its utility: +Additional Task Models: Contribute other task models that use the same YourModelBackbone (e.g., YourModelTokenClassifier if you initially contributed YourModelCausalLM, or YourModelImageSegmentation if you contributed YourModelImageClassifier). Each new task will likely require its own YourModelPreprocessor and YourModel class. +Parameter-Efficient Fine-Tuning (PEFT): Add LoRA support (e.g., backbone.enable_lora()) if applicable. See KerasHub's fine-tuning documentation for guidance. +Quantization (QLoRA): If the model benefits, implement and document QLoRA support. +Model Parallelism: For very large models, provide configurations or guidance for model parallelism. 
## Conclusion - Once all three PRs (and optionally, the fourth PR) have been merged, you have successfully contributed a model to KerasHub. Congratulations! 🔥 + + From daaaf432e7178c1e7b33d8d7604fb06adf4535e8 Mon Sep 17 00:00:00 2001 From: divyashreepathihalli Date: Fri, 16 May 2025 22:43:46 +0000 Subject: [PATCH 2/5] update iteration #2 --- CONTRIBUTING_MODELS.md | 361 ++++++++++++++++++++++------------------- 1 file changed, 198 insertions(+), 163 deletions(-) diff --git a/CONTRIBUTING_MODELS.md b/CONTRIBUTING_MODELS.md index 26b60f3c9d..b5c939f35e 100644 --- a/CONTRIBUTING_MODELS.md +++ b/CONTRIBUTING_MODELS.md @@ -1,66 +1,87 @@ # Model Contribution Guide -KerasHub has a plethora of pre-trained large language models -ranging from BERT to OPT. We are always looking for more models and are always -open to contributions! +KerasHub has a plethora of pre-trained large language models ranging from BERT to OPT. We are always looking for more models and are always open to contributions! -In this guide, we will walk you through the steps one needs to take in order to -contribute a new pre-trained model to KerasHub. For illustration purposes, let's -assume that you want to contribute the DistilBERT model. Before we dive in, we encourage you to go through -[our getting started guide](https://keras.io/guides/keras_nlp/getting_started/) -for an introduction to the library, and our -[contribution guide](https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING.md). +In this guide, we will walk you through the steps needed to contribute a new pre-trained model to KerasHub. For illustration purposes, let's assume that you want to contribute the DistilBERT model. Before we dive in, we encourage you to go through our [Getting Started Guide](https://keras.io/guides/keras_nlp/getting_started/) for an introduction to the library, and our [Contribution Guide](https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING.md). 
+ +--- ## Checklist This to-do list is a brief outline of how a model can be contributed. Keep this checklist handy! -### Step 1: Open an issue/find an issue +### Step 1: Open an Issue or Find an Issue - [ ] Open an issue or find an issue to contribute a backbone model. -### Step 2: PR #1 - Model folder -- [ ] Create your model folder XX in https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models +### Step 2: PR #1 - Model Folder + +- [ ] Create your model folder `xx` in [`keras_hub/src/models`](https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models) + +### Step 3: PR #1 - Add `XXBackbone` + +- [ ] An `xx/xx_backbone.py` file which has the model graph + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py) + +- [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone_test.py) + +- [ ] A Colab notebook link in the PR description that matches the outputs of the implemented backbone model with the original source + [Example](https://colab.research.google.com/drive/1SeZWJorKWmwWJax8ORSdxKrxE25BfhHa?usp=sharing) + +### Step 4: PR #2 - Data Converter - Add `XXTokenizer` or `XXImageConverter` or `XXAudioConverter` + +- [ ] If contributing a language model, add an `xx/xx_tokenizer.py` file + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py) + +- [ ] Add `xx/xx_tokenizer_test.py` file with unit tests + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer_test.py) -### Step 3: PR #1 - Add XXBackbone +- [ ] A Colab notebook link in the PR description demonstrating that the tokenizer output matches the original + [Example](https://colab.research.google.com/drive/1MH_rpuFB1Nz_NkKIAvVtVae2HFLjXZDA?usp=sharing) -- [ 
] An `xx/xx_backbone.py` file which has the model graph \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py)\]. -- [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone_test.py)\]. -- [ ] A Colab notebook link in the PR description which matches the outputs of the implemented backbone model with the original source \[[Example](https://colab.research.google.com/drive/1SeZWJorKWmwWJax8ORSdxKrxE25BfhHa?usp=sharing)\]. +- [ ] For image models: Add `xx/xx_image_converter.py` file with image transformations + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/clip/clip_image_converter.py) -### Step 4: PR #2 - Data Converter - Add XXTokenizer or XXImageConverter or XXAudioConverter, etc +- [ ] For audio models: Add `xx/xx_audio_converter.py` file with audio transformations + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/moonshine/moonshine_audio_converter.py) +### Step 5: PR #3 - Add `XX` Tasks and Preprocessors (Optional) -- [ ] If you are contributing a language model add a `xx/xx_tokenizer.py` file which has the tokenizer for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py)\]. -- [ ] An `xx/xx_tokenizer_test.py` file which has unit tests for the model tokenizer \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer_test.py)\]. -- [ ] A Colab notebook link in the PR description, demonstrating that the output of the tokenizer matches the original tokenizer \[[Example](https://colab.research.google.com/drive/1MH_rpuFB1Nz_NkKIAvVtVae2HFLjXZDA?usp=sharing)]. 
-- [ ] If you are contributing an image model add a `xx/xx_image_converter.py` file which has the image transformations for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/clip/clip_image_converter.py)\]. -- [ ] If you are contributing an image model add a `xx/xx_audio_converter.py` file which has the audio transformations for the model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/moonshine/moonshine_audio_converter.py)\]. +- [ ] Add `xx/xx_.py` for adding task models (e.g., classifier, masked LM) + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier.py) +- [ ] Add `xx/xx__preprocessor.py` for preprocessing inputs to the task model + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor.py) -### Step 5: PR #3 - Add XX Tasks and Preprocessors +- [ ] Add unit tests: `xx/xx__test.py` and `xx/xx__preprocessor_test.py` + [Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py), + [Example 2](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor_test.py) -This PR is optional. +- [ ] Colab notebook link in the PR description to validate that the preprocessor output matches the original + [Example](https://colab.research.google.com/drive/1GFFC7Y1I_2PtYlWDToqKvzYhHWv1b3nC?usp=sharing) -- [ ] An `xx/xx_.py` file for adding a task model like classifier, masked LM, etc. 
\[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier.py)\] -- [ ] An `xx/xx__preprocessor.py` file which has the preprocessor and can be used to get inputs suitable for the task model \[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor.py)\]. -- [ ] `xx/xx__test.py` file and `xx/xx__preprocessor_test.py` files which have unit tests for the above two modules \[[Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py) and [Example 2](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor_test.py)\]. -- [ ] A Colab notebook link in the PR description, demonstrating that the output of the preprocessor matches the output of the original preprocessor \[[Example](https://colab.research.google.com/drive/1GFFC7Y1I_2PtYlWDToqKvzYhHWv1b3nC?usp=sharing)]. -- [ ] Add a Colab notebook to demonstate an end to end demo of the task model, show that teh outputs are matching the original implementation and also add a demo to show finetuning of the model. +- [ ] Add a Colab notebook demonstrating end-to-end usage of the task model, showing matching outputs and a fine-tuning demo -### Step 4: PR #4 and beyond - Add XX Presets, Weights, and End-to-End Validation -- [ ] An `xx/xx_presets.py` file with links to weights uploaded to Kaggle Keras page[[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_presets.py)\]. -- [ ] You can test out the model presets and show the demo by staging the model presets to KerasHub org page on [Kaggle](https://www.kaggle.com/organizations/kerashub). Here is the invite [link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5) to join the org page. 
-- [ ] A `tools/checkpoint_conversion/convert_xx_checkpoints.py` which is reusable script for converting checkpoints \[[Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py)\]. -- [ ] A Colab notebook link in the PR description, showing an end-to-end task such as text classification, etc. The task model can be built using the backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics and outputs are matching +### Step 6: PR #4 and Beyond - Add `XXPresets`, Weights, and End-to-End Validation +- [ ] Add `xx/xx_presets.py` with links to weights uploaded to Kaggle KerasHub + [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_presets.py) + +- [ ] Stage the model presets on KerasHub’s [Kaggle org page](https://www.kaggle.com/organizations/kerashub) using this [invite link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5) + +- [ ] Add `tools/checkpoint_conversion/convert_xx_checkpoints.py`, a reusable script for converting checkpoints + [Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py) + +- [ ] A Colab notebook link in the PR description, showing an end-to-end task such as text classification, etc. The task model can be built using the backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics and outputs are matching + +--- ## Detailed Instructions This section discusses, in details, every necessary step. 
- -### Step 1: Open an issue/Find an open issue +### Step 1: Open an Issue / Find an Open Issue Before getting started with the code, it's important to check if there are any [open issues](https://github.com/keras-team/keras-hub/issues?q=is%3Aissue+is%3Aopen+label%3Amodel-contribution) @@ -81,150 +102,164 @@ contribution! 🙂 #### Add the backbone class -Once you are done identifying all the required layers, you should implement the -model backbone class. - -To keep the code simple and readable, we follow -[Keras' functional style model](https://keras.io/guides/functional_api/) wrapped -around by a class to implement our models. - -A model is typically split into three/four sections. We would recommend you to -compare this side-by-side with the -[`keras_hub.layers.DistilBertBackbone` source code](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py)! - -Implementation: Use Keras' functional API or subclass keras.Model. Refer to existing KerasHub backbones for structure. -Inputs: Define standard inputs (e.g., token_ids, padding_mask for text; pixel_values for vision; audio_features for audio). -Layers: Leverage standard keras.layers and relevant keras_hub_layers where possible. Implement custom layers if necessary, ensuring they are well-tested and documented. - -
- -The standard layers provided in Keras and KerasHub are generally enough for -most of the usecases and it is recommended to do a thorough search -[here](https://keras.io/api/layers/) and [here](https://keras.io/api/keras_nlp/layers/). -However, sometimes, models have small tweaks/paradigm changes in their architecture. -This is when things might slightly get complicated. - -If the model introduces a paradigm shift, such as using relative attention instead -of vanilla attention, the contributor will have to implement complete custom layers. A case -in point is `keras_hub.models.DebertaV3Backbone` where we had to [implement layers -from scratch](https://github.com/keras-team/keras-hub/tree/master/keras_hub/models/deberta_v3). - -On the other hand, if the model has a small tweak, something simpler can be done. -For instance, in the Whisper model, the self-attention and cross-attention mechanism -is exactly the same as vanilla attention, with the exception that the key projection -layer does not have a bias term. In this case, we can inherit the custom layer -from one of the standard layers and make minor modifications. See [this PR](https://github.com/keras-team/keras-hub/pull/801/files#diff-8533ae3a7755c0dbe95ccbb71f85c677297f687bf3884fadefc64f1d0fdce51aR22) for -more details. - -Since the first PR is only to add the model backbone class, you should omit the -`from_presets()` function; this will be added at a later stage when you open a PR -for adding presets. - -Validation Colab: Create a Colab notebook that: Loads weights from the original model source. Manually loads these weights into an instance of your KerasHub backbone. Compares the output of your backbone with the original model's corresponding layer output on sample inputs, ensuring numerical closeness. -Unit Tests (your_model_backbone_test.py): Include tests for forward pass, save/load, and correct output shapes with various configurations. 
-
-#### Convert weights from the original source and check output!
-
-Before you open a PR for adding the model backbone class, it is essential to check
-whether the model has been implemented exactly as the source implementation. This
-also helps in adding model "presets" at a later stage.
-
-The preferred way of doing this is to add a Colab link in the PR description, which
-1) converts the original preset weights to our format, and
-2) checks whether the outputs of the original model and your implemented model are close enough.
-
-It is okay if you demonstrate it for one preset at this stage; you can do the conversion
-for the other presets when you officially add presets to the library at a later stage.
-
-#### Add Unit Tests
-
-It is essential to add units tests. These unit tests are basic and mostly check
-whether the forward pass goes through successfully, whether the model can be saved
-and loaded correctly, etc.
-
-### Step 3: PR #2 - Data Converter - Add XXTokenizer or XXImageConverter or XXAudioConverter, etc
-
-#### Tokenizer
-
-The Data Converter transforms raw data of a specific modality into a numerical format suitable for the preprocessor and backbone.
-Implementation:
-Text: YourModelTokenizer - Converts raw text into sequences of token IDs. Inherit from a base tokenizer in KerasNLP (e.g., WordPieceTokenizer, SentencePieceTokenizer) or implement a custom one. Define special tokens (e.g., cls_token, pad_token) and handle vocabulary loading.
-Image: YourModelImageConverter (or similar name like ImageProcessor) - Handles operations like resizing, rescaling, normalization, and potentially data augmentation strategy application. May utilize keras_cv.layers.
-Audio: YourModelAudioConverter (or similar name like AudioFeatureExtractor) - Processes raw audio into features like spectrograms or MFCCs. May utilize Keras or other audio processing libraries.
-Assets: Ensure your converter can load necessary assets (e.g., vocabulary files for tokenizers, mean/std deviation values for image normalization).
-Validation Colab: Demonstrate that your data converter's output (e.g., token IDs, processed pixel tensors, audio feature tensors) matches the behavior of the original model's data conversion step.
-Unit Tests (e.g., your_model_tokenizer_test.py): Test core functionality, asset loading, and output consistency.
-
-#### Unit Tests
-
-The last step here is to add unit tests.: Test core functionality, asset loading, and output consistency.
-### Step 4: PR #3 and Beyond: Add XXTasks and XXPreprocessors
-
-This PR builds on the backbone and data converter to create a user-friendly Task Model.
-#### Preprocessor (your_model__preprocessor.py)
-The Preprocessor takes raw data (text, images, audio paths, etc.) and uses the appropriate Data Converter to transform it into the full format expected by the Backbone.
-##### Implementation:
-- Create a class (e.g., YourModelCausalLMPreprocessor, YourModelImageClassifierPreprocessor).
-- It will use the specific YourModel (e.g., YourModelTokenizer) internally.
-- It handles tasks like adding special tokens, padding/truncation for sequences, creating attention masks, batching, and ensuring the output dictionary matches the backbone's expected input names.
-##### Inputs:
-- Define how it accepts raw data (e.g., strings, file paths, raw tensors).
-##### Outputs:
-- It should output a dictionary of tensors ready for the Backbone.
-##### Validation Colab:
+Once you've identified the required layers, implement the backbone using [Keras’ functional API](https://keras.io/guides/functional_api/) wrapped in a class.
+
+Compare your code with [`DistilBertBackbone`](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py) for structure.
+
+##### Implementation
+
+- Use standard inputs (`token_ids`, `padding_mask`, `pixel_values`, `audio_features`)
+- Use `keras.layers` and `keras_nlp.layers` when possible
+- For architectural deviations, implement custom layers
+
+Examples:
+- Major changes: [`DebertaV3`](https://github.com/keras-team/keras-hub/tree/master/keras_hub/models/deberta_v3)
+- Minor tweaks: [Whisper attention layer](https://github.com/keras-team/keras-hub/pull/801/files#diff-8533ae3a7755c0dbe95ccbb71f85c677297f687bf3884fadefc64f1d0fdce51aR22)
+
+Do **not** include `from_presets()` in this PR.
+
+##### Validation Colab
+
+- Load original model weights
+- Manually set weights in your KerasHub model
+- Compare outputs on sample input for closeness
+
+##### Unit Tests (`xx_backbone_test.py`)
+
+- Check forward pass, output shapes
+- Ensure model can be saved and loaded correctly
+
+---
+
+### Step 3: PR #2 – Data Converter
+
+#### Tokenizer / ImageConverter / AudioConverter
+
+The converter transforms raw input into numerical tensors suitable for preprocessing.
+
+##### Implementation
+
+- **Text**: `XXTokenizer`, subclassing from KerasHub tokenizers
+- **Image**: `XXImageConverter`, subclassing from KerasHub ImageConverter - for resizing, normalization, augmentation
+- **Audio**: `XXAudioConverter`, subclassing from KerasHub AudioConverter for extracting features like spectrograms
+
+Include asset loading (e.g., vocab files, normalization stats).
+
+##### Validation Colab
+
+- Show that converted output (tokens, pixels, features) match original behavior
+
+##### Unit Tests
+
+- Validate core logic, asset loading, and consistency
+
+---
+
+### Step 4: PR #3 – Tasks and Preprocessors
+
+#### Preprocessor (`xx__preprocessor.py`)
+
+Transforms raw input into model-ready format.
+
+##### Implementation
+
+- Class: `XXPreprocessor`
+- Internally uses the relevant `XX`
+- Handles padding, attention masks, batching, formatting
+
+##### Inputs
+
+- Accept strings, paths, or tensors
+
+##### Outputs
+
+- Dictionary of tensors compatible with the Backbone
+
+##### Validation Colab
+
 - Show that your preprocessor, given raw input, produces the same tensor inputs (e.g., token_ids, padding_mask, pixel_values) as the original model's complete preprocessing pipeline.
-##### Unit Tests (your_model__preprocessor_test.py):
+
+##### Unit Tests
+
 - Test with various inputs, ensuring correct output shapes and values.
-#### Task Model (your_model_.py)
-The Task Model is the high-level entry point. It combines the Backbone, Preprocessor, and a task-specific head.
-##### Implementation:
-- Create a class (e.g., YourModelCausalLM, YourModelImageClassifier).
-- It should instantiate its Backbone and Preprocessor in its constructor.
-- It will include a task-specific head (e.g., a dense layer for classification, a language modeling head, detection heads).
-##### API:
-- It should offer simple methods like predict(), fit(), generate() (for generative models), detect() (for detection models).
-##### Unit Tests (your_model__test.py):
+---
+
+#### Task Model (`xx_.py`)
+
+Wraps the backbone and preprocessor with a task head.
+
+##### Implementation
+
+- Class: `XX`
+- Instantiate backbone and preprocessor
+- Add a task-specific head (e.g., classifier head, LM head)
+
+##### API
+
+- It should offer simple methods like predict(), fit(), generate() (for generative models), detect() (for detection models)
+
+##### Unit Tests
+
 - Test basic usage: instantiation, forward pass with dummy data from the preprocessor, and model compilation.
-### Step 5: PR #4 - Add Presets, Weights, and End-to-End Validation
+---
+
+### Step 5: PR #4 – Presets and End-to-End Validation
+
+After PRs 1–3 are merged, create:
+
+#### Preset Configuration
+
+- Add `xx_presets.py`
+- Include model args, checkpoint URLs, vocabulary paths
+
+Use the [Kaggle org page](https://www.kaggle.com/organizations/kerashub/models) to stage and test.
+
+#### `from_preset()` Functions
-Once the above 3 PRs are merged you can open a PR for adding presets. For every model, we have a separate file where we mention our preset configurations. This preset configuration has model-specific arguments such as number of layers, number of attention heads; preprocessor-specific arguments such as whether we want to lowercase the input text; checkpoint and vocabulary file URLs, etc. Please use this [invite link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5) and stage your model presets [here](https://www.kaggle.com/organizations/kerashub/models)
+Add this to:
+- `XXBackbone`
+- `XXTokenizer`
-After wrapping up the preset configuration file, you need to
-add the `from_preset` function to all three classes, i.e., `DistilBertBackbone`,
-and `DistilBertTokenizer`. Here is an
-[example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py#L187-L189).
+[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py#L187-L189)
-The testing for presets is divided into two: "large" and "extra large".
-For "large" tests, we pick the smallest preset (in terms of number of parameters)
-and verify whether the output is correct. For "extra large tests", we loop over
-all the presets and just check whether the backbone and the tokenizer can
-be called without any error.
+#### Preset Tests
+The testing for presets is divided into two:
+- "Large" tests: validate smallest preset numerics
+- "Extra large" tests: loop over all presets, check for successful load/inference
+
+#### Checkpoint Conversion Script (tools/checkpoint_conversion/convert_your_model_checkpoints.py)
-Checkpoint Conversion Script (tools/checkpoint_conversion/convert_your_model_checkpoints.py)
 - Provide a script that converts weights from their original format (e.g., PyTorch .bin, TensorFlow SavedModel) to the Keras H5 format expected by KerasHub.
 - This script should be reusable and clearly documented.
 - It's crucial for verifying weight conversion accuracy and for future updates.
 End-to-End Validation Colab
 - This is the most important validation step.
-- Create a Colab notebook that demonstrates:
- - Loading your Task Model using YourModelTask.from_preset("your_model_preset_name").
- - Running an end-to-end task (e.g., text generation, image classification, object detection) on sample input.
- - Comparing the output (e.g., generated text, class probabilities, bounding boxes) with the output of the original model using its original pretrained weights and inference pipeline. Ensure numerical closeness.
-Numerics Test: Add at least one unit test (often marked as "large" or "extra_large") that loads a small preset via from_preset(), runs inference on a fixed input, and asserts that the output matches known-good values (obtained from the original model). See existing tests for examples.
-### Step 6: PR #5 and Beyond - Add More Tasks or Advanced Features (Optional)
+#### End-to-End Colab
+- Load task model using `from_preset()`
+- Run task (e.g., classification, generation)
+- Compare output with original model
-Once the primary Task Model is merged, you can extend its utility:
-Additional Task Models: Contribute other task models that use the same YourModelBackbone (e.g., YourModelTokenClassifier if you initially contributed YourModelCausalLM, or YourModelImageSegmentation if you contributed YourModelImageClassifier). Each new task will likely require its own YourModelPreprocessor and YourModel class.
-Parameter-Efficient Fine-Tuning (PEFT): Add LoRA support (e.g., backbone.enable_lora()) if applicable. See KerasHub's fine-tuning documentation for guidance.
-Quantization (QLoRA): If the model benefits, implement and document QLoRA support.
-Model Parallelism: For very large models, provide configurations or guidance for model parallelism.
+#### Numerics Test
-## Conclusion
-Once all three PRs (and optionally, the fourth PR) have been merged, you have
-successfully contributed a model to KerasHub. Congratulations! 🔥
+- Add a test that loads a preset and compares outputs on fixed inputs
+
+---
+
+### Step 6: PR #5 and Beyond – Advanced Features (Optional)
+Extend utility:
+
+- New Task Models (e.g., TokenClassifier, ImageSegmentation)
+- Parameter-Efficient Fine-Tuning (LoRA support)
+- Quantization (QLoRA support)
+- Model Parallelism (for large models)
+
+---
+
+## Conclusion
+Once all three main PRs (and optionally the fourth) are merged, you've successfully contributed a model to KerasHub. Congratulations! 🔥
From 60928dcaa0367e3f0a38d5deafd1fde1fdcc3628 Mon Sep 17 00:00:00 2001
From: divyashreepathihalli
Date: Fri, 16 May 2025 22:57:50 +0000
Subject: [PATCH 3/5] reformat
---
 CONTRIBUTING_MODELS.md | 67 +++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 20 deletions(-)
diff --git a/CONTRIBUTING_MODELS.md b/CONTRIBUTING_MODELS.md
index b5c939f35e..5816f081af 100644
--- a/CONTRIBUTING_MODELS.md
+++ b/CONTRIBUTING_MODELS.md
@@ -1,8 +1,14 @@
 # Model Contribution Guide
-KerasHub has a plethora of pre-trained large language models ranging from BERT to OPT. We are always looking for more models and are always open to contributions!
+KerasHub has a plethora of pre-trained large language models ranging from BERT
+to OPT. We are always looking for more models and are always open to
+contributions!
-In this guide, we will walk you through the steps needed to contribute a new pre-trained model to KerasHub. For illustration purposes, let's assume that you want to contribute the DistilBERT model. Before we dive in, we encourage you to go through our [Getting Started Guide](https://keras.io/guides/keras_nlp/getting_started/) for an introduction to the library, and our [Contribution Guide](https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING.md).
+In this guide, we will walk you through the steps needed to contribute a new
+pre-trained model to KerasHub. For illustration purposes, let's assume that you
+want to contribute the DistilBERT model. Before we dive in, we encourage you to
+go through our [Getting Started Guide](https://keras.io/guides/keras_nlp/getting_started/)
+for an introduction to the library, and our [Contribution Guide](https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING.md).
 ---
@@ -27,7 +33,8 @@ Keep this checklist handy!
 - [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone_test.py)
-- [ ] A Colab notebook link in the PR description that matches the outputs of the implemented backbone model with the original source
+- [ ] A Colab notebook link in the PR description that matches the outputs of
+the implemented backbone model with the original source
 [Example](https://colab.research.google.com/drive/1SeZWJorKWmwWJax8ORSdxKrxE25BfhHa?usp=sharing)
 ### Step 4: PR #2 - Data Converter - Add `XXTokenizer` or `XXImageConverter` or `XXAudioConverter`
@@ -38,13 +45,16 @@ Keep this checklist handy!
 - [ ] Add `xx/xx_tokenizer_test.py` file with unit tests
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer_test.py)
-- [ ] A Colab notebook link in the PR description demonstrating that the tokenizer output matches the original
+- [ ] A Colab notebook link in the PR description demonstrating that the
+ tokenizer output matches the original
 [Example](https://colab.research.google.com/drive/1MH_rpuFB1Nz_NkKIAvVtVae2HFLjXZDA?usp=sharing)
-- [ ] For image models: Add `xx/xx_image_converter.py` file with image transformations
+- [ ] For image models: Add `xx/xx_image_converter.py` file with image
+ transformations
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/clip/clip_image_converter.py)
-- [ ] For audio models: Add `xx/xx_audio_converter.py` file with audio transformations
+- [ ] For audio models: Add `xx/xx_audio_converter.py` file with audio
+ transformations
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/moonshine/moonshine_audio_converter.py)
 ### Step 5: PR #3 - Add `XX` Tasks and Preprocessors (Optional)
@@ -52,17 +62,20 @@ Keep this checklist handy!
 - [ ] Add `xx/xx_.py` for adding task models (e.g., classifier, masked LM)
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier.py)
-- [ ] Add `xx/xx__preprocessor.py` for preprocessing inputs to the task model
+- [ ] Add `xx/xx__preprocessor.py` for preprocessing inputs to the task
+ model
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor.py)
 - [ ] Add unit tests: `xx/xx__test.py` and `xx/xx__preprocessor_test.py`
 [Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py),
 [Example 2](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor_test.py)
-- [ ] Colab notebook link in the PR description to validate that the preprocessor output matches the original
+- [ ] Colab notebook link in the PR description to validate that the
+ preprocessor output matches the original
 [Example](https://colab.research.google.com/drive/1GFFC7Y1I_2PtYlWDToqKvzYhHWv1b3nC?usp=sharing)
-- [ ] Add a Colab notebook demonstrating end-to-end usage of the task model, showing matching outputs and a fine-tuning demo
+- [ ] Add a Colab notebook demonstrating end-to-end usage of the task model,
+ showing matching outputs and a fine-tuning demo
 ### Step 6: PR #4 and Beyond - Add `XXPresets`, Weights, and End-to-End Validation
@@ -71,10 +84,14 @@ Keep this checklist handy!
 - [ ] Stage the model presets on KerasHub’s [Kaggle org page](https://www.kaggle.com/organizations/kerashub) using this [invite link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5)
-- [ ] Add `tools/checkpoint_conversion/convert_xx_checkpoints.py`, a reusable script for converting checkpoints
+- [ ] Add `tools/checkpoint_conversion/convert_xx_checkpoints.py`, a reusable
+ script for converting checkpoints
 [Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py)
-- [ ] A Colab notebook link in the PR description, showing an end-to-end task such as text classification, etc. The task model can be built using the backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics and outputs are matching
+- [ ] A Colab notebook link in the PR description, showing an end-to-end task
+ such as text classification, etc. The task model can be built using the
+ backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics
+ and outputs are matching
 ---
@@ -135,13 +152,16 @@ Do **not** include `from_presets()` in this PR.
 #### Tokenizer / ImageConverter / AudioConverter
-The converter transforms raw input into numerical tensors suitable for preprocessing.
+The converter transforms raw input into numerical tensors suitable for
+preprocessing.
 ##### Implementation
 - **Text**: `XXTokenizer`, subclassing from KerasHub tokenizers
-- **Image**: `XXImageConverter`, subclassing from KerasHub ImageConverter - for resizing, normalization, augmentation
-- **Audio**: `XXAudioConverter`, subclassing from KerasHub AudioConverter for extracting features like spectrograms
+- **Image**: `XXImageConverter`, subclassing from KerasHub ImageConverter - for
+ resizing, normalization, augmentation
+- **Audio**: `XXAudioConverter`, subclassing from KerasHub AudioConverter for
+ extracting features like spectrograms
 Include asset loading (e.g., vocab files, normalization stats).
@@ -177,7 +197,9 @@ Transforms raw input into model-ready format.
 ##### Validation Colab
-- Show that your preprocessor, given raw input, produces the same tensor inputs (e.g., token_ids, padding_mask, pixel_values) as the original model's complete preprocessing pipeline.
+- Show that your preprocessor, given raw input, produces the same tensor inputs
+(e.g., token_ids, padding_mask, pixel_values) as the original model's complete
+preprocessing pipeline.
 ##### Unit Tests
@@ -197,11 +219,13 @@ Wraps the backbone and preprocessor with a task head.
 ##### API
-- It should offer simple methods like predict(), fit(), generate() (for generative models), detect() (for detection models)
+- It should offer simple methods like predict(), fit(), generate() (for
+ generative models), detect() (for detection models)
 ##### Unit Tests
-- Test basic usage: instantiation, forward pass with dummy data from the preprocessor, and model compilation.
+- Test basic usage: instantiation, forward pass with dummy data from the
+ preprocessor, and model compilation.
 ---
@@ -214,7 +238,8 @@ After PRs 1–3 are merged, create:
 - Add `xx_presets.py`
 - Include model args, checkpoint URLs, vocabulary paths
-Use the [Kaggle org page](https://www.kaggle.com/organizations/kerashub/models) to stage and test.
+Use the [Kaggle org page](https://www.kaggle.com/organizations/kerashub/models)
+to stage and test.
 #### `from_preset()` Functions
@@ -231,7 +256,8 @@ The testing for presets is divided into two:
 #### Checkpoint Conversion Script (tools/checkpoint_conversion/convert_your_model_checkpoints.py)
-- Provide a script that converts weights from their original format (e.g., PyTorch .bin, TensorFlow SavedModel) to the Keras H5 format expected by KerasHub.
+- Provide a script that converts weights from their original format (e.g.,
+PyTorch .bin, TensorFlow SavedModel) to the Keras H5 format expected by KerasHub.
 - This script should be reusable and clearly documented.
 - It's crucial for verifying weight conversion accuracy and for future updates.
 End-to-End Validation Colab
@@ -262,4 +288,5 @@ Extend utility:
 ## Conclusion
-Once all three main PRs (and optionally the fourth) are merged, you've successfully contributed a model to KerasHub. Congratulations! 🔥
+Once all three main PRs (and optionally the fourth) are merged, you've
+successfully contributed a model to KerasHub. Congratulations! 🔥
From c24af3128c5e4497a656a1bf7827683685d2b2b3 Mon Sep 17 00:00:00 2001
From: divyashreepathihalli
Date: Sat, 17 May 2025 02:39:29 +0000
Subject: [PATCH 4/5] nit
---
 CONTRIBUTING_MODELS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/CONTRIBUTING_MODELS.md b/CONTRIBUTING_MODELS.md
index 5816f081af..8a7cbfb347 100644
--- a/CONTRIBUTING_MODELS.md
+++ b/CONTRIBUTING_MODELS.md
@@ -288,5 +288,5 @@ Extend utility:
 ## Conclusion
-Once all three main PRs (and optionally the fourth) are merged, you've
+Once all four main PRs (and optionally the fifth) are merged, you've
 successfully contributed a model to KerasHub. Congratulations! 🔥
From 4687b06d6d872efa10475a52b288a4d3da6f8be4 Mon Sep 17 00:00:00 2001
From: divyashreepathihalli
Date: Sun, 25 May 2025 07:20:52 +0000
Subject: [PATCH 5/5] remove split PR and invite link
---
 CONTRIBUTING_MODELS.md | 117 +++++++++++++++++++----------------------
 1 file changed, 54 insertions(+), 63 deletions(-)
diff --git a/CONTRIBUTING_MODELS.md b/CONTRIBUTING_MODELS.md
index 8a7cbfb347..6d9abecc21 100644
--- a/CONTRIBUTING_MODELS.md
+++ b/CONTRIBUTING_MODELS.md
@@ -1,13 +1,13 @@
 # Model Contribution Guide
-KerasHub has a plethora of pre-trained large language models ranging from BERT
-to OPT. We are always looking for more models and are always open to
+KerasHub has a plethora of pre-trained large language models ranging from BERT
+to OPT. We are always looking for more models and are always open to
 contributions!
-In this guide, we will walk you through the steps needed to contribute a new
-pre-trained model to KerasHub. For illustration purposes, let's assume that you
-want to contribute the DistilBERT model. Before we dive in, we encourage you to
-go through our [Getting Started Guide](https://keras.io/guides/keras_nlp/getting_started/)
+In this guide, we will walk you through the steps needed to contribute a new
+pre-trained model to KerasHub. For illustration purposes, let's assume that you
+want to contribute the DistilBERT model. Before we dive in, we encourage you to
+go through our [Getting Started Guide](https://keras.io/guides/keras_nlp/getting_started/)
 for an introduction to the library, and our [Contribution Guide](https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING.md).
 ---
@@ -21,76 +21,76 @@ Keep this checklist handy!
 - [ ] Open an issue or find an issue to contribute a backbone model.
-### Step 2: PR #1 - Model Folder
+### Step 2: Model Folder
 - [ ] Create your model folder `xx` in [`keras_hub/src/models`](https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models)
-### Step 3: PR #1 - Add `XXBackbone`
+### Step 3: Add `XXBackbone`
-- [ ] An `xx/xx_backbone.py` file which has the model graph
+- [ ] An `xx/xx_backbone.py` file which has the model graph
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py)
-- [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone
+- [ ] An `xx/xx_backbone_test.py` file which has unit tests for the backbone
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone_test.py)
-- [ ] A Colab notebook link in the PR description that matches the outputs of
-the implemented backbone model with the original source
- [Example](https://colab.research.google.com/drive/1SeZWJorKWmwWJax8ORSdxKrxE25BfhHa?usp=sharing)
+- [ ] A Colab notebook link in the PR description that matches the outputs of
+the implemented backbone model with the original source
+ [Example](https://colab.sandbox.google.com/drive/1R99yFJCbxTEpcxFHa2RtlwQWahUIPCJC?usp=sharing)
-### Step 4: PR #2 - Data Converter - Add `XXTokenizer` or `XXImageConverter` or `XXAudioConverter`
+### Step 4: Data Converter - Add `XXTokenizer` or `XXImageConverter` or `XXAudioConverter`
-- [ ] If contributing a language model, add an `xx/xx_tokenizer.py` file
+- [ ] If contributing a language model, add an `xx/xx_tokenizer.py` file
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer.py)
-- [ ] Add `xx/xx_tokenizer_test.py` file with unit tests
+- [ ] Add `xx/xx_tokenizer_test.py` file with unit tests
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_tokenizer_test.py)
-- [ ] A Colab notebook link in the PR description demonstrating that the
- tokenizer output matches the original
+- [ ] A Colab notebook link in the PR description demonstrating that the
+ tokenizer output matches the original
 [Example](https://colab.research.google.com/drive/1MH_rpuFB1Nz_NkKIAvVtVae2HFLjXZDA?usp=sharing)
-- [ ] For image models: Add `xx/xx_image_converter.py` file with image
- transformations
+- [ ] For image models: Add `xx/xx_image_converter.py` file with image
+ transformations
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/clip/clip_image_converter.py)
-- [ ] For audio models: Add `xx/xx_audio_converter.py` file with audio
- transformations
+- [ ] For audio models: Add `xx/xx_audio_converter.py` file with audio
+ transformations
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/moonshine/moonshine_audio_converter.py)
-### Step 5: PR #3 - Add `XX` Tasks and Preprocessors (Optional)
+### Step 5: Add `XX` Tasks and Preprocessors (Optional)
-- [ ] Add `xx/xx_.py` for adding task models (e.g., classifier, masked LM)
+- [ ] Add `xx/xx_.py` for adding task models (e.g., classifier, masked LM)
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier.py)
-- [ ] Add `xx/xx__preprocessor.py` for preprocessing inputs to the task
- model
+- [ ] Add `xx/xx__preprocessor.py` for preprocessing inputs to the task
+ model
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor.py)
-- [ ] Add unit tests: `xx/xx__test.py` and `xx/xx__preprocessor_test.py`
- [Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py),
+- [ ] Add unit tests: `xx/xx__test.py` and `xx/xx__preprocessor_test.py`
+ [Example 1](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_classifier_test.py),
 [Example 2](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_preprocessor_test.py)
-- [ ] Colab notebook link in the PR description to validate that the
- preprocessor output matches the original
+- [ ] Colab notebook link in the PR description to validate that the
+ preprocessor output matches the original
 [Example](https://colab.research.google.com/drive/1GFFC7Y1I_2PtYlWDToqKvzYhHWv1b3nC?usp=sharing)
-- [ ] Add a Colab notebook demonstrating end-to-end usage of the task model,
+- [ ] Add a Colab notebook demonstrating end-to-end usage of the task model,
 showing matching outputs and a fine-tuning demo
-### Step 6: PR #4 and Beyond - Add `XXPresets`, Weights, and End-to-End Validation
+### Step 6: Add `XXPresets`, Weights, and End-to-End Validation
-- [ ] Add `xx/xx_presets.py` with links to weights uploaded to Kaggle KerasHub
+- [ ] Add `xx/xx_presets.py` with links to weights uploaded to Kaggle KerasHub
 [Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_presets.py)
-- [ ] Stage the model presets on KerasHub’s [Kaggle org page](https://www.kaggle.com/organizations/kerashub) using this [invite link](https://kaggle.com/organizations/kerashub/invite/c4b8baa532b8436e8df8f1ed641b9cb5)
+- [ ] Stage the model presets on your Kaggle account
-- [ ] Add `tools/checkpoint_conversion/convert_xx_checkpoints.py`, a reusable
- script for converting checkpoints
+- [ ] Add `tools/checkpoint_conversion/convert_xx_checkpoints.py`, a reusable
+ script for converting checkpoints
 [Example](https://github.com/keras-team/keras-hub/blob/master/tools/checkpoint_conversion/convert_distilbert_checkpoints.py)
-- [ ] A Colab notebook link in the PR description, showing an end-to-end task
- such as text classification, etc. The task model can be built using the
- backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics
+- [ ] A Colab notebook link in the PR description, showing an end-to-end task
+ such as text classification, etc. The task model can be built using the
+ backbone model, with the task head on top \[[Example](https://gist.github.com/mattdangerw/bf0ca07fb66b6738150c8b56ee5bab4e)\]. Show that the numerics
 and outputs are matching
 ---
@@ -115,7 +115,7 @@ workings of the model at the time of opening the issue. But it is appreciated
 if you can furnish as much detail as possible to enable us to help you with the contribution! 🙂
-### Step 2: PR #1 - Add XXBackbone
+### Step 2: Add XXBackbone
 #### Add the backbone class
@@ -148,19 +148,19 @@ Do **not** include `from_presets()` in this PR.
 ---
-### Step 3: PR #2 – Data Converter
+### Step 3: Data Converter
 #### Tokenizer / ImageConverter / AudioConverter
-The converter transforms raw input into numerical tensors suitable for
+The converter transforms raw input into numerical tensors suitable for
 preprocessing.
 ##### Implementation
 - **Text**: `XXTokenizer`, subclassing from KerasHub tokenizers
-- **Image**: `XXImageConverter`, subclassing from KerasHub ImageConverter - for
+- **Image**: `XXImageConverter`, subclassing from KerasHub ImageConverter - for
 resizing, normalization, augmentation
-- **Audio**: `XXAudioConverter`, subclassing from KerasHub AudioConverter for
+- **Audio**: `XXAudioConverter`, subclassing from KerasHub AudioConverter for
 extracting features like spectrograms
 Include asset loading (e.g., vocab files, normalization stats).
@@ -175,7 +175,7 @@ Include asset loading (e.g., vocab files, normalization stats).
 ---
-### Step 4: PR #3 – Tasks and Preprocessors
+### Step 4: Tasks and Preprocessors
 #### Preprocessor (`xx__preprocessor.py`)
@@ -197,8 +197,8 @@ Transforms raw input into model-ready format.

 ##### Validation Colab

-- Show that your preprocessor, given raw input, produces the same tensor inputs
-(e.g., token_ids, padding_mask, pixel_values) as the original model's complete
+- Show that your preprocessor, given raw input, produces the same tensor inputs
+(e.g., token_ids, padding_mask, pixel_values) as the original model's complete
 preprocessing pipeline.

 ##### Unit Tests
@@ -219,36 +219,27 @@ Wraps the backbone and preprocessor with a task head.

 ##### API

-- It should offer simple methods like predict(), fit(), generate() (for
+- It should offer simple methods like predict(), fit(), generate() (for
 generative models), detect() (for detection models)

 ##### Unit Tests

-- Test basic usage: instantiation, forward pass with dummy data from the
+- Test basic usage: instantiation, forward pass with dummy data from the
 preprocessor, and model compilation.

 ---

-### Step 5: PR #4 – Presets and End-to-End Validation
+### Step 5: Presets and End-to-End Validation

-After PRs 1–3 are merged, create:

 #### Preset Configuration

 - Add `xx_presets.py`
 - Include model args, checkpoint URLs, vocabulary paths

-Use the [Kaggle org page](https://www.kaggle.com/organizations/kerashub/models)
+Use the [Kaggle org page](https://www.kaggle.com/organizations/keras/models)
 to stage and test.

-#### `from_preset()` Functions
-
-Add this to:
-- `XXBackbone`
-- `XXTokenizer`
-
-[Example](https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/distil_bert/distil_bert_backbone.py#L187-L189)
-
 #### Preset Tests
 The testing for presets is divided into two:
 - "Large" tests: validate smallest preset numerics
@@ -256,7 +247,7 @@ The testing for presets is divided into two:

 #### Checkpoint Conversion Script (tools/checkpoint_conversion/convert_your_model_checkpoints.py)

-- Provide a script that converts weights from their original format (e.g.,
+- Provide a script that converts weights from their original format (e.g.,
 PyTorch .bin, TensorFlow SavedModel) to the Keras H5 format expected by KerasHub.
 - This script should be reusable and clearly documented.
 - It's crucial for verifying weight conversion accuracy and for future updates.
@@ -275,7 +266,7 @@ End-to-End Validation Colab

 ---

-### Step 6: PR #5 and Beyond – Advanced Features (Optional)
+### Step 6: Advanced Features (Optional)

 Extend utility:

@@ -288,5 +279,5 @@ Extend utility:

 ## Conclusion

-Once all four main PRs (and optionally the fifth) are merged, you've
+Once your PR is merged, you've
 successfully contributed a model to KerasHub. Congratulations! 🔥