
Conversation

vivekh2000

I have included the LayerNorm layer in the classification head so that it matches the original implementation, which makes the two easier to compare.
There was also an isolated LayerNorm, which I have removed and folded into the MLP head, again to make the structure more intuitive. For reference, you have included a LayerNorm in the self.distill_mlp attribute of the DistillWrapper class in distill.py. I hope these changes are acceptable.

Thank you for the excellent code base.

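To make the proposed change concrete, here is a minimal sketch of the two arrangements. This is illustrative only, not the actual vit_pytorch source; `dim` and `num_classes` are placeholder values.

```python
import torch
from torch import nn

dim, num_classes = 256, 10

# Current arrangement: the transformer ends with its own final LayerNorm,
# and the classification head is a bare linear layer.
final_norm = nn.LayerNorm(dim)
mlp_head = nn.Linear(dim, num_classes)

# Proposed arrangement: the LayerNorm is folded into the classification head,
# mirroring the LayerNorm inside DistillWrapper's distill_mlp in distill.py.
mlp_head_proposed = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, num_classes),
)

x = torch.randn(2, dim)  # pooled token, e.g. the [CLS] token
print(mlp_head(final_norm(x)).shape)   # torch.Size([2, 10])
print(mlp_head_proposed(x).shape)      # torch.Size([2, 10])
```

Both give logits of the same shape; the question in this PR is only where the normalization lives.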
@lucidrains
Owner

@vivekh2000 hey Vivek, thanks for this PR

so i think the layernorm should be kept at its current position, as most transformers these days end with a final norm

will definitely remove the layernorm from the distill wrapper though (or make it an option, in the case that some of the variants do not have the final norm)
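A possible shape for the "make it an option" idea, purely as a sketch: the real DistillWrapper in distill.py does more (distillation token handling, temperature, hard/soft loss), and `use_final_norm` here is a hypothetical flag, not an existing parameter.

```python
import torch
from torch import nn

class DistillHead(nn.Module):
    def __init__(self, dim, num_classes, use_final_norm = True):
        super().__init__()
        # only prepend the LayerNorm when the backbone does not already
        # end with a final norm of its own
        layers = [nn.LayerNorm(dim)] if use_final_norm else []
        layers.append(nn.Linear(dim, num_classes))
        self.distill_mlp = nn.Sequential(*layers)

    def forward(self, distill_tokens):
        return self.distill_mlp(distill_tokens)

# with a ViT variant whose transformer already ends in a final norm,
# the extra LayerNorm can be switched off to avoid normalizing twice
head = DistillHead(dim = 256, num_classes = 10, use_final_norm = False)
logits = head(torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 10])
```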

