Huggingface tokenizer fails to load model after upgrading Python to 3.10

Problem description · votes: 0 · answers: 2

I just updated Python to version 3.10.8. Note that I am using JupyterLab.

I had to reinstall a lot of packages, but now I get an error when I try to load the tokenizer for a HuggingFace model.

Here is my code:

# Import libraries
from transformers import pipeline, AutoTokenizer
# Define checkpoint
model_checkpoint = 'deepset/xlm-roberta-large-squad2'
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Note that the transformers version is 4.24.0.

Here is the error I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [3], line 2
      1 # Tokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

File ~/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:637, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    635 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    636 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 637     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    638 else:
    639     if tokenizer_class_py is not None:

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1777, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1774     else:
   1775         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1777 return cls._from_pretrained(
   1778     resolved_vocab_files,
   1779     pretrained_model_name_or_path,
   1780     init_configuration,
   1781     *init_inputs,
   1782     use_auth_token=use_auth_token,
   1783     cache_dir=cache_dir,
   1784     local_files_only=local_files_only,
   1785     _commit_hash=commit_hash,
   1786     **kwargs,
   1787 )

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1932, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
   1930 # Instantiate tokenizer.
   1931 try:
-> 1932     tokenizer = cls(*init_inputs, **init_kwargs)
   1933 except OSError:
   1934     raise OSError(
   1935         "Unable to load vocabulary from file. "
   1936         "Please check that the provided vocabulary is accessible and not corrupted."
   1937     )

File ~/.local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py:155, in XLMRobertaTokenizerFast.__init__(self, vocab_file, tokenizer_file, bos_token, eos_token, sep_token, cls_token, unk_token, pad_token, mask_token, **kwargs)
    139 def __init__(
    140     self,
    141     vocab_file=None,
   (...)
    151 ):
    152     # Mask token behave like a normal word, i.e. include the space before it
    153     mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
--> 155     super().__init__(
    156         vocab_file,
    157         tokenizer_file=tokenizer_file,
    158         bos_token=bos_token,
    159         eos_token=eos_token,
    160         sep_token=sep_token,
    161         cls_token=cls_token,
    162         unk_token=unk_token,
    163         pad_token=pad_token,
    164         mask_token=mask_token,
    165         **kwargs,
    166     )
    168     self.vocab_file = vocab_file
    169     self.can_save_slow_tokenizer = False if not self.vocab_file else True

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:114, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    111     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112 elif slow_tokenizer is not None:
    113     # We need to convert a slow tokenizer to build the backend
--> 114     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    115 elif self.slow_tokenizer_class is not None:
    116     # We need to create and convert a slow tokenizer to build the backend
    117     slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:1162, in convert_slow_tokenizer(transformer_tokenizer)
   1154     raise ValueError(
   1155         f"An instance of tokenizer class {tokenizer_class_name} cannot be converted in a Fast tokenizer instance."
   1156         " No converter was found. Currently available slow->fast convertors:"
   1157         f" {list(SLOW_TO_FAST_CONVERTERS.keys())}"
   1158     )
   1160 converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]
-> 1162 return converter_class(transformer_tokenizer).converted()

File ~/.local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:438, in SpmConverter.__init__(self, *args)
    434 requires_backends(self, "protobuf")
    436 super().__init__(*args)
--> 438 from .utils import sentencepiece_model_pb2 as model_pb2
    440 m = model_pb2.ModelProto()
    441 with open(self.original_tokenizer.vocab_file, "rb") as f:

File ~/.local/lib/python3.10/site-packages/transformers/utils/sentencepiece_model_pb2.py:20
     18 from google.protobuf import descriptor as _descriptor
     19 from google.protobuf import message as _message
---> 20 from google.protobuf import reflection as _reflection
     21 from google.protobuf import symbol_database as _symbol_database
     24 # @@protoc_insertion_point(imports)

File /usr/lib/python3/dist-packages/google/protobuf/reflection.py:58
     56   from google.protobuf.pyext import cpp_message as message_impl
     57 else:
---> 58   from google.protobuf.internal import python_message as message_impl
     60 # The type of all Message classes.
     61 # Part of the public interface, but normally only used by message factories.
     62 GeneratedProtocolMessageType = message_impl.GeneratedProtocolMessageType

File /usr/lib/python3/dist-packages/google/protobuf/internal/python_message.py:69
     66   import copyreg as copyreg
     68 # We use "as" to avoid name collisions with variables.
---> 69 from google.protobuf.internal import containers
     70 from google.protobuf.internal import decoder
     71 from google.protobuf.internal import encoder

File /usr/lib/python3/dist-packages/google/protobuf/internal/containers.py:182
    177   collections.MutableMapping.register(MutableMapping)
    179 else:
    180   # In Python 3 we can just use MutableMapping directly, because it defines
    181   # __slots__.
--> 182   MutableMapping = collections.MutableMapping
    185 class BaseContainer(object):
    187   """Base container class."""

AttributeError: module 'collections' has no attribute 'MutableMapping'

I tried several solutions (e.g., this and this), but none of them seem to work.

According to this link, I should change collections.Mapping to collections.abc.Mapping, but I don't know where to make that change.
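
For context: Python 3.10 removed deprecated aliases such as collections.MutableMapping (they have lived in collections.abc since Python 3.3), which is exactly what the system-installed protobuf still relies on. One stopgap is to restore the aliases yourself before anything imports protobuf. A monkey-patch sketch (a temporary workaround, not a proper fix):

# Workaround sketch: restore the ABC aliases that Python 3.10 removed from
# `collections`, so the old protobuf build can import again. This must run
# BEFORE importing transformers (or anything else that pulls in protobuf).
import collections
import collections.abc

for _name in ("Mapping", "MutableMapping", "Sequence", "Set"):
    if not hasattr(collections, _name):
        setattr(collections, _name, getattr(collections.abc, _name))

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('deepset/xlm-roberta-large-squad2')

Updating or reinstalling protobuf itself (see the answers below) is the cleaner solution.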

Another possible solution would be to downgrade Python to 3.9, but I would like to keep that as a last resort.

How can I fix this?

python-3.x collections jupyter-notebook python-3.10 huggingface-tokenizers
2 Answers

0 votes

It turned out to be a problem related to the protobuf module. I updated it to the latest version available at the time (4.21.9).

That changed the error to:

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
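
Incidentally, workaround 2 from that message can be applied from inside the notebook itself, as long as the variable is set before protobuf is imported for the first time; a minimal sketch:

# Force the pure-Python protobuf implementation (slower, per the message
# above, but it skips the generated-code descriptor check). Must run before
# the first protobuf import, so restart the kernel first.
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('deepset/xlm-roberta-large-squad2')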

So I downgraded protobuf to version 3.20.0, and that fixed it.

For more details, see here.
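
A quick way to confirm the downgrade took effect (this assumes the kernel was restarted so the old module is not still cached):

# After running: pip install "protobuf==3.20.0"
# restart the kernel, then check the version transformers will actually see:
import google.protobuf
print(google.protobuf.__version__)  # should print 3.20.0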


0 votes

I ran into the same error on January 5, 2024. I was using the latest version of protobuf (5.27.0) and had to downgrade it back to 3.20.x so that I could load my tokenizer.
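
Since fresh environments keep resolving to protobuf 4.x/5.x, a small guard at the top of the notebook can fail fast with an actionable message. A hypothetical sketch using the packaging library (not part of either answer):

# Hypothetical guard: stop early if the installed protobuf is too new
# for this tokenizer workflow.
from packaging import version
import google.protobuf

if version.parse(google.protobuf.__version__) >= version.parse("4.0.0"):
    raise RuntimeError(
        f"protobuf {google.protobuf.__version__} is too new; "
        "run: pip install 'protobuf>=3.20,<4' and restart the kernel"
    )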
