Purpose:
Use the google-t5/t5-base pretrained model offline to run several natural language processing tasks, mainly translation. Unfortunately it does not support East Asian languages: T5-base (my Project-22.Ai-1.T5-base) only translates between English, French, Romanian, and German. The code is very simple and barely scratches the surface of using a model locally/offline. Even running a model this small strained my laptop, so the hardware needs an upgrade, and the wallet is already empty.
1. Download the google-t5/t5-base model
a. Download link:
git clone [email protected]:google-t5/t5-base
Note: if you have upgraded or reinstalled git, you need to install git-lfs first.
# Run once after installing Git LFS
git lfs install  # only needs to be run once
git-lfs will not be mentioned again below; assume it stays in sync with my environment.
b. Download the model with the command:
CMD: git clone [email protected]:google-t5/t5-base
[/share/Download/AI] # git lfs install
Updated Git hooks.
Git LFS initialized.
[/share/Download/AI] # git clone [email protected]:google-t5/t5-base
Cloning into 't5-base'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 78 (delta 8), reused 3 (delta 3), pack-reused 65 (from 1)
Receiving objects: 100% (78/78), 972.74 KiB | 1.42 MiB/s, done.
Resolving deltas: 100% (34/34), done.
Downloading flax_model.msgpack (892 MB)
Error downloading object: flax_model.msgpack (d96ab4b): Smudge error: Error downloading flax_model.msgpack (d96ab4b2e2ac1743c32e80669ec37905151c78d8136ff0ce4ba6566bde6e932f): batch response: Authentication required: Password authentication in git is no longer supported. You must use a user access token or an SSH key instead. See https://huggingface.co/blog/password-git-deprecation
Errors logged to '/share/CACHEDEV1_DATA/Download/Software/ALL-AI/t5-base/.git/lfs/logs/20241112T182252.368534759.log'.
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: flax_model.msgpack: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
"Authentication required: Password authentication in git is no longer supported. You must use a user access token or an SSH key instead."
So an access token is required.
2. Log in with a Hugging Face PAT
Get a Hugging Face personal access token (PAT):
- Log in to your Hugging Face account.
- Click your avatar in the top-right corner and choose "Settings".
- Choose "Access Tokens" in the left-hand menu.
Save the token to a file first; it will be used later.
3. Install Python 3.12.2 on the QNAP NAS
a. Check the Python version
I don't remember when this Python 2.7.18 was installed; most likely it shipped with the system.
[~] # python -V
Python 2.7.18
[~] # python3 -V
-sh: python3: command not found
b. Install Python 3.12 in order to install the huggingface_hub library/package
After installing, the reported version was still not the one I wanted, because of the search path:
[/share/CACHEDEV1_DATA/.qpkg/container-station/bin] # which python
/usr/local/bin/python
Open the QPython312 app; it lists the install path /opt/QPython312/bin, i.e. the interpreter is /opt/QPython312/bin/python3.
c. Add python3 to the global $PATH
Location of the profile file on QNAP QTS-5.2:
[~] # vi /opt/etc/profile
Add at the end:
export PATH=/opt/QPython312/bin:/opt/sbin:/your/custom/path:$PATH
Log in again and check the version:
[~] # python3 -V
Python 3.12.2
4. Install the huggingface_hub package
a. Upgrade pip3 first
[~] # pip3 install --upgrade pip
Requirement already satisfied: pip in /opt/QPython312/lib/python3.12/site-packages (24.3.1)
b. Install the HF hub
[~] # pip3 install huggingface_hub
Collecting huggingface_hub
Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: filelock in /opt/QPython312/lib/python3.12/site-packages (from huggingface_hub) (3.13.1)
...
Downloading fsspec-2024.10.0-py3-none-any.whl (179 kB)
Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 1.7 MB/s eta 0:00:00
Downloading tqdm-4.67.0-py3-none-any.whl (78 kB)
...
Successfully installed certifi-2024.8.30 charset-normalizer-3.4.0 fsspec-2024.10.0 huggingface_hub-0.26.2 idna-3.10 pyyaml-6.0.2 requests-2.32.3 tqdm-4.67.0 typing-extensions-4.12.2 urllib3-2.2.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
c. Log in with the PAT (personal access token)
Use the token string saved in step 2: at the "Enter your token" prompt, right-click to paste it.
[~] # huggingface-cli login
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
The token `HFT` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.
git config --global credential.helper store
Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `HFT`
final: Clone the t5 model with git again, successfully this time (here `hf` is an SSH host alias for huggingface.co configured in ~/.ssh/config).
[/share/Download/AI] # ssh -T hf
Hi DaveNian, welcome to Hugging Face.
[/share/Download/AI] # git clone hf:google-t5/t5-base
fatal: destination path 't5-base' already exists and is not an empty directory.
[/share/Download/AI] # rm -fR t5-base/
[/share/Download/AI] # git clone hf:google-t5/t5-base
Cloning into 't5-base'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 78 (delta 8), reused 3 (delta 3), pack-reused 65 (from 1)
Receiving objects: 100% (78/78), 972.59 KiB | 1.46 MiB/s, done.
Resolving deltas: 100% (34/34), done.
Filtering content: 100% (5/5), 4.15 GiB | 6.27 MiB/s, done.
[/share/Download/AI] #
Check the files after the download.
Practice:
Originally I wanted to write up how to use this model, but it does not support Chinese, so instead here are some notes for my future self.
Google: the T5-base model
An overview of the files inside the model directory:
C:\2024-MyProgramFiles\22.ai\t5-base>dir
Volume in drive C has no label.
Volume Serial Number is 5CA1-5BDC
Directory of C:\2024-MyProgramFiles\22.ai\t5-base
11/12/2024 08:16 PM <DIR> .
11/13/2024 01:00 AM <DIR> ..
11/12/2024 07:26 PM 1,208 config.json
11/12/2024 07:37 PM 891,625,348 flax_model.msgpack
11/12/2024 07:26 PM 147 generation_config.json
11/12/2024 07:37 PM 891,646,390 model.safetensors
11/12/2024 07:37 PM 891,691,430 pytorch_model.bin
11/12/2024 07:26 PM 8,477 README.md
11/12/2024 07:37 PM 891,679,884 rust_model.ot
11/12/2024 07:26 PM 791,656 spiece.model
11/12/2024 07:37 PM 892,146,080 tf_model.h5
11/12/2024 07:26 PM 1,389,353 tokenizer.json
10 File(s) 4,460,979,973 bytes
2 Dir(s) 85,797,109,760 bytes free
Weight files:
flax_model.msgpack, model.safetensors, pytorch_model.bin, rust_model.ot, tf_model.h5
These are the same weights, just in different formats:
No. | File name | Format | Use | Remarks |
1 | flax_model.msgpack | JAX/Flax framework format | Well suited to TPU workloads | Serialized with MessagePack |
2 | model.safetensors | Hugging Face safetensors format | Usable with PyTorch | Safer and faster to load |
3 | pytorch_model.bin | Native PyTorch format | The most commonly used format | Serialized with pickle |
4 | rust_model.ot | LibTorch (tch) weights for the Rust rust-bert library | Suited to production deployment | Often mislabeled as ONNX; it is the tch-rs format |
5 | tf_model.h5 | TensorFlow format | Stored as an HDF5 file | Used by Google's TensorFlow framework |
As the table shows, downloading just one of them is enough.
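The "safer" claim about model.safetensors comes from its layout: an 8-byte little-endian header length, a JSON header, then raw tensor bytes, with no pickled code that could execute on load. A toy sketch of that layout, built and parsed by hand (the tensor name and shape here are made up for illustration, not taken from t5-base):

```python
import json
import struct

# Build a tiny safetensors-style blob by hand: 8-byte header length,
# JSON header describing each tensor, then the raw tensor bytes.
header = {
    "shared.weight": {"dtype": "F32", "shape": [2, 2],
                      "data_offsets": [0, 16]},
}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 16 raw float32 bytes

blob = struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes

# Reading it back is just length + JSON parsing; nothing is executed.
(n,) = struct.unpack("<Q", blob[:8])
parsed = json.loads(blob[8:8 + n])
print(parsed["shared.weight"]["shape"])  # → [2, 2]
```

By contrast, pytorch_model.bin is a pickle archive, which can in principle run arbitrary code when loaded, which is why safetensors is the preferred download.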
The other main files:
tokenizer.json: tokenizer configuration containing the vocabulary and tokenization rules
spiece.model: SentencePiece tokenizer model file, used for text preprocessing
generation_config.json: text-generation parameters; settings such as the number of beams can be adjusted here
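For example, generation_config.json is plain JSON, so the beam count can be tweaked with nothing but the standard library. This is a sketch; the field values below are assumptions for illustration, not copied from the real t5-base file:

```python
import json

# A minimal generation config (assumed values, for illustration only).
config = {
    "decoder_start_token_id": 0,
    "eos_token_id": 1,
    "pad_token_id": 0,
}

# Beam search usually improves translation quality at the cost of speed.
config["num_beams"] = 4
config["max_length"] = 64

text = json.dumps(config, indent=2)
print(text)
```

Writing `text` back to generation_config.json in the model directory would make these defaults apply to every `model.generate()` call.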
How transformers loads the model (using T5ForConditionalGeneration to load the weight file pytorch_model.bin):
T5 weight loading relies on the conventions and automation built into Hugging Face's transformers library; the following files must be present:
t5-base/
├── config.json             # model configuration
├── pytorch_model.bin       # model weights
├── tokenizer_config.json   # tokenizer configuration
├── tokenizer.json          # detailed tokenizer configuration
└── spiece.model            # tokenizer vocabulary file (when SentencePiece is used)
As long as the code points at that directory, the model loads (`./t5-base` below is an assumed path; use wherever you cloned the model):
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = "./t5-base"  # path to the cloned model directory
tokenizer = T5Tokenizer.from_pretrained(model_path, local_files_only=True)
model = T5ForConditionalGeneration.from_pretrained(
    model_path,
    local_files_only=True,
)
The transformers library will:
- first read the config.json file to get the model architecture and parameters
- then, based on that configuration, locate and load the weights from pytorch_model.bin
Exercise:
An English-to-French translation tool
Closing remarks:
There are still plenty of good third-party apps for QNAP; I have added several third-party app repositories to my NAS.
There used to be qnapclub, but it is no longer available.
Tags: google, t5, huggingface, 2024, NAS, base, git, model. From: https://blog.csdn.net/davenian/article/details/143728876