
tk_v0.1 #2462


Open

wants to merge 1 commit into develop

Conversation

miao200years
Contributor

U: Tokenizer v0.1
D: 45T, 45Lite, Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen3-32B


paddle-bot bot commented Aug 22, 2025

Thanks for your contribution!

@@ -530,6 +529,7 @@ def get_static_model_on_pdc(remote_path, local_path, timeout, enable_flash_devic
Returns:
str: path to load static model
"""

Collaborator

Noting a TODO: the get_static_model_on_pdc function can be removed later.

@@ -568,6 +566,8 @@ def __init__(self, **kwargs):
self.use_filtered_label_loss = kwargs.pop("use_filtered_label_loss", False)
self.loss_subbatch_seqlen = kwargs.pop("loss_subbatch_seqlen", -1)

from ..quantization.quantization_config import QuantizationConfig
Collaborator

https://github.com/PaddlePaddle/PaddleFormers/blob/develop/paddleformers/quantization/quantization_config.py#L116 The paddle dependency in QuantizationConfig is only used for a check. It would be better to `try: from paddle.nn.quant.quantized_linear import _get_arch_info`, and if it is unavailable, simply skip the GPU-version check.
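A minimal sketch of the guard pattern this comment suggests. The import path comes from the comment itself; the helper name `gpu_arch_check_enabled` is hypothetical, added only to illustrate falling back to "no check" when paddle is absent:

```python
# Import the private paddle helper if available; otherwise record its
# absence instead of failing at import time.
try:
    from paddle.nn.quant.quantized_linear import _get_arch_info
except ImportError:
    _get_arch_info = None


def gpu_arch_check_enabled():
    """Hypothetical helper: whether the GPU-version check can run.

    When paddle (or this private helper) is missing, the caller should
    skip the GPU-arch validation rather than raise an ImportError.
    """
    return _get_arch_info is not None
```

This keeps QuantizationConfig importable in paddle-free environments while preserving the check wherever paddle is installed.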

Adapted from transformers.AutoTokenizer.from_pretrained with modifications:
1. Added get_paddleformers_tokenizer_config() to extend tokenizer_config.json download source
2. Explicitly binds PaddleTokenizerMixin to the tokenizer class before final instantiation
Binds PaddleTokenizerMixin if Paddle is available; otherwise returns the original class
Collaborator

Don't write comments in Chinese.

@@ -205,7 +245,9 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):

if tokenizer_class is None:
raise ValueError(f"Tokenizer class {tokenizer_class_name} is not currently imported.")
tokenizer_class = type(tokenizer_class.__name__, (PaddleTokenizerMixin, tokenizer_class), {})

# Bind PaddleTokenizerMixin
Collaborator

Same as above.

@@ -14,6 +14,9 @@
# limitations under the License.
import transformers as hf

from ..tokenizer_utils import warp_tokenizer
try:
Collaborator

Our goal is to remove the redundant Paddle dependencies inside PaddleTokenizerMixin, not to stop the tokenizers in paddleformers from using PaddleTokenizerMixin.
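To illustrate the direction this comment asks for, here is a minimal sketch that keeps the mixin in place but makes its paddle-specific path optional. The method name `maybe_to_tensor` is hypothetical and not from the PR; it only stands in for whichever mixin methods currently hard-require paddle:

```python
# Guard the paddle import once at module level, so the mixin itself
# stays importable (and usable) without paddle installed.
try:
    import paddle
    _PADDLE_AVAILABLE = True
except ImportError:
    _PADDLE_AVAILABLE = False


class PaddleTokenizerMixin:
    """Sketch: mixin whose paddle-dependent code paths degrade gracefully."""

    def maybe_to_tensor(self, ids):
        # Hypothetical method: convert to a paddle tensor only when paddle
        # is importable; otherwise return the plain Python list unchanged.
        if _PADDLE_AVAILABLE:
            return paddle.to_tensor(ids)
        return ids
```

With this shape, tokenizers keep inheriting from PaddleTokenizerMixin everywhere, and only the paddle-backed behavior is conditional.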
