diff --git "a/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" "b/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" index 3b92d8a340285b70b9302b9b879d454c1f508bf5..158d9f340b50c8a51feb6f4de8749669bcf6fe58 100644 --- "a/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" +++ "b/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" @@ -161,34 +161,7 @@ True # 常见问题 -## Q1:libmpi_cxx.so.40: cannot open shared object file: No such file or directory - -```shell -(vllm_musa) dev@localhost:~/model$ -(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" -Traceback (most recent call last): - File "", line 1, in - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 236, in - _load_global_deps() - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 195, in _load_global_deps - raise err - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 176, in _load_global_deps - ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL) - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/ctypes/__init__.py", line 374, in __init__ - self._handle = _dlopen(self._name, mode) -OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory -``` - -**解决方案** - -证明mpi so缺失,请下载以下依赖: - -```shell -sudo apt update - -``` - -## Q2:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x +## Q1:NumPy 2.2.6 as it may crash. 
To support both 1.x and 2.x ```python A module that was compiled using NumPy 1.x cannot be run in @@ -224,7 +197,7 @@ conda activate vllm_musa pip3 install numpy==1.26.4 ``` -## Q3:ImportError: Please try running Python from a different directory! +## Q2:ImportError: Please try running Python from a different directory! ```python (vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" @@ -258,13 +231,13 @@ pip3 uninstall torch torch_musa torchaudio torchvision **注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次** -## Q4:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory +## Q3:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory **解决方案** 请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次 -## Q5: MUSA driver initialization failed +## Q4: MUSA driver initialization failed ```shell Traceback (most recent call last): @@ -330,8 +303,8 @@ conda activate vllm_musa cd ${packagedirectory} pip3 install -r requirements.txt -pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl -pip3 install vllm-0.9.2.dev257+g4747b491f-cp310-cp310-linux_aarch64.whl +pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl --force-reinstall +pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl ``` @@ -354,13 +327,42 @@ Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 对于量化模型,我们提供以下加速版模型: -| 模型名称 | 魔塔地址 | git克隆地址 | -| -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git | -| gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | 
https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git | -| gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git | -| gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git | -| Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git | +| 类别 | 模型名称 | 魔塔地址 | git克隆地址 | +| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| DeepSeek | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git | +| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git | +| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git | +| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git | +| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git | +| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git | +| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git | +| | gptq-Qwen2.5-7B-Instruct-v2 | 
https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | 
https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
 
 **PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换工具后期会开放
 
diff --git "a/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" 
"b/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" new file mode 100644 index 0000000000000000000000000000000000000000..22915d011e17dce9b626185bdd7c06c697c7d87f --- /dev/null +++ "b/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" @@ -0,0 +1,877 @@ +

+logo
+
+摩尔线程-AIBOOK
+
+基于vllm-musa安装模型指引
+
+lastAuthor:Roger.ye
+
+lastDate:2026-03-11
+
+ + + + + + + + +# 准备工作 + +## 工作目录 + +为了方便后续操作,我们先约定一个工作目录。在接下来的指引中,我们将用 **`${WorkDir}`** 表示这个目录的路径。请您在开始前先**创建好您的工作目录**,并确保后续所有操作都在此目录下进行。 + +## 系统环境要求 + +| 组件 | 版本 | +| -------------- | ----------- | +| MUSA Driver | 3.1.3-AB100 | +| MUSA SDK | 4.1.2-rc2 | +| AIOS(操作系统) | 1.3.3 | + +## 查看musa环境 + +```shell +# 查看驱动 +dpkg -l|grep musa + +# 输出 +ii musa 1.3.3-AB100 arm64 Moore Threads MUSA driver [7f92281de] +ii musa-sdk 4.1.2-rc2 arm64 Moore Threads MTGPU Software Development Kit + +# 查看环境 +ll /usr/local/ |grep musa* + +# 输出 +lrwxrwxrwx 1 mt mt 10 12月 31 14:20 musa -> musa-4.1.2/ +drwxr-xr-x 13 mt mt 4096 3月 10 12:30 musa-4.1.2/ +``` + +或 + +```shell +# 执行 +musaInfo + +# 输出正常则环境配置正确 +compiler: mcc +-------------------------------------------------------------------------------- +device# 0 +Name: M1000 +pciBusID: 0x0 +pciDeviceID: 0x0 +pciDomainID: 0x0 +multiProcessorCount: 8 +maxThreadsPerMultiProcessor: 6144 +isMultiGpuBoard: 1 +clockRate: 0 Mhz +memoryClockRate: 0 Mhz +memoryBusWidth: 384 +totalGlobalMem: 31.05 GB +sharedMemPerMultiprocessor: 72.00 KB +totalConstMem: 8192 +sharedMemPerBlock: 72.00 KB +canMapHostMemory: 1 +regsPerBlock: 262144 +warpSize: 128 +l2CacheSize: 0 +computeMode: 0 +maxThreadsPerBlock: 1024 +maxThreadsDim.x: 1024 +maxThreadsDim.y: 1024 +maxThreadsDim.z: 1024 +maxGridSize.x: 2147483647 +maxGridSize.y: 2147483647 +maxGridSize.z: 2147483647 +major: 2 +minor: 2 +concurrentKernels: 1 +cooperativeLaunch: 0 +cooperativeMultiDeviceLaunch: 0 +isIntegrated: 1 +maxTexture1D: 32768 +maxTexture2D.width: 32768 +maxTexture2D.height: 32768 +maxTexture3D.width: 16384 +maxTexture3D.height: 16384 +maxTexture3D.depth: 2048 +peers: +non-peers: device#0 + +memInfo.total: 31.05 GB +memInfo.free: 19.02 GB (61%) + +``` + +## 升级系统(若不满足) + +AIBOOK系统升级参考以下文档 + +``` +https://ab-web.aibook.net.cn/WEB/pdf/爱簿智能AIBOOK%20OTA升级指导手册.pdf +``` + +**注:**系统升级包含了musa的版本更新,升级完成后无需再另行安装 + +# 安装miniconda + +```shell +curl -O 
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh +chmod +x Miniconda3-latest-Linux-aarch64.sh && ./Miniconda3-latest-Linux-aarch64.sh +# 一路回车到底 +``` + +# 配置conda环境变量 + +```shell +export PATH=/home/kevin/miniconda3/bin:$PATH # 配置时/home/kevin/miniconda3/bin该路径请根据自身情况填写 +source ~/.bashrc +conda init +source ~/.bashrc +``` + +# 创建conda环境 + +```shell +conda create -n vllm_musa python=3.10 +``` + +# 激活conda环境 + +```shell +conda activate vllm_musa +``` + +# 配置PIP国内镜像源 + +```shell +pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ +``` + +# 安装依赖 + +```shell +sudo apt update + +sudo apt install -y python3-pip git cmake wget build-essential g++ libstdc++-12-dev libnuma-dev openmpi-bin openmpi-common libopenblas-dev curl +``` + +# 下载安装包合集 + +```shell +# 进入工作目录 +cd ${WorkDir} + +# 下载 +wget https://mt-ai-data.tos-cn-shanghai.volces.com/vllm_musa/v1.3.1/release_1.3.3/20260302/AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1.tar.gz + +# 解压 +tar zxvf AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1.tar.gz +cd AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1 +``` + +# 安装 torch_musa依赖 + +```shell +# 进入依赖包目录 +cd ${packagedirectory} + +# 确认安装环境 +conda activate vllm_musa + +pip3 install torch-2.5.0-cp310-cp310-linux_aarch64.whl +pip3 install torch_musa-2.1.1-cp310-cp310-linux_aarch64.whl +pip3 install torchaudio-2.5.0a0+56bc006-cp310-cp310-linux_aarch64.whl +pip3 install torchvision-0.20.0a0+afc54f7-cp310-cp310-linux_aarch64.whl + +# 此处需要保证numpy处于低版本 +pip3 install numpy==1.26.4 +``` + +# 验证torch_musa + +输出 true 证明torch_musa环境安装正确 + +```shell +# 命令 +python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" + +# 期望结果 +Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 +True +``` + +# 常见问题 + +## Q1:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x + +```python +A module that was compiled using NumPy 1.x cannot be run in +NumPy 2.2.6 as it may crash. 
To support both 1.x and 2.x +versions of NumPy, modules must be compiled with NumPy 2.0. +Some module may need to rebuild instead e.g. with 'pybind11>=2.12'. + +If you are a user of the module, the easiest solution will be to +downgrade to 'numpy<2' or try to upgrade the affected module. +We expect that some modules will need time to support NumPy 2. + +Traceback (most recent call last): File "", line 1, in + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 1471, in + from .functional import * # noqa: F403 + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/functional.py", line 9, in + import torch.nn.functional as F + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in + from .modules import * # noqa: F403 + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in + from .transformer import TransformerEncoder, TransformerDecoder, \ + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in + device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), +/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.) + device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), +``` + +**解决方案** + +```shell +# 确保正确python环境 +conda activate vllm_musa + +pip3 install numpy==1.26.4 +``` + +## Q2:ImportError: Please try running Python from a different directory! 
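出现这类导入错误时,可以先确认 torch 相关包实际安装的版本是否与下载的 whl 包一致(以下为排查用的示意脚本,仅使用 Python 标准库 importlib.metadata,并非官方工具):

```python
# 示意脚本:列出 torch 相关包当前安装的版本,便于与 whl 包名中的版本核对
from importlib import metadata

versions = {}
for pkg in ("torch", "torch_musa", "torchaudio", "torchvision"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = None  # 该包未安装
    print(pkg, versions[pkg] or "未安装")
```

若输出的版本与安装时使用的 whl 包版本不一致,可按下述方案卸载后重装。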
+
+```python
+(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())"
+Traceback (most recent call last):
+  File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/__init__.py", line 41, in <module>
+    import torch_musa._MUSAC
+ImportError: /home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/lib/libmusa_python.so.2: undefined symbol: _ZN5torch12py_symbolizeERSt6vectorIPNS_17CapturedTracebackESaIS2_EE
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "<string>", line 1, in <module>
+  File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/__init__.py", line 43, in <module>
+    raise ImportError("Please try running Python from a different directory!") from err
+ImportError: Please try running Python from a different directory!
+```
+
+**解决方案**
+
+这是 **`torch_musa` 与当前 `torch` 版本不兼容**的典型表现,可能是安装过程中包的版本有误,请删除已安装的包后重新安装:
+
+```shell
+# 确保正确python环境
+conda activate vllm_musa
+
+# 删除
+pip3 uninstall torch torch_musa torchaudio torchvision
+```
+
+此处跳转安装torch_musa依赖:[安装 torch_musa依赖](#安装 torch_musa依赖)
+
+**注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次**
+
+# 安装 VLLM 与 VLLM-MUSA
+
+```shell
+# 确保正确python环境
+conda activate vllm_musa
+
+# 进入依赖包目录(已进入忽略)
+cd ${packagedirectory}
+
+pip3 install -r requirements.txt
+pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl
+pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl
+pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl
+```
+
+# 验证 VLLM-MUSA
+
+```shell
+python3 -c "from vllm_musa import _musa_custom_ops;_musa_custom_ops.decode_mla"
+
+# 正常输出如下:
+(vllm_musa) dev@localhost:/home$ python3 -c "from vllm_musa import _musa_custom_ops;_musa_custom_ops.decode_mla"
+Error in cpuinfo: prctl(PR_SVE_GET_VL) failed  # 此处正常
+
+```
+
+# 模型下载与管理
+
+## 支持的量化模型
+
+当前版本对齐 vllm 社区 v0.9.2,可以在 https://huggingface.co/ 
直接下载开源模型。国内源可在魔塔上进行下载:https://modelscope.cn/home
+
+对于量化模型,我们提供以下加速版模型:
+
+| 类别 | 模型名称 | 魔塔地址 | git克隆地址 |
+| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| DeepSeek | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
+| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git |
+| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git |
+| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git |
+| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git |
+| | gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | 
MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
+
+**PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换工具后期会开放
+
+## 下载模型
+
+> 📌 提示:模型较大(7B约11GB,30B约32GB),请预留足够磁盘空间。
+
+```shell
+sudo apt update
+
+sudo apt install git-lfs
+
+git lfs install
+
+cd ${WorkDir}
+
+# 克隆
+git clone https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git
+```
+
+# 启用性能模式
+
+界面右上角: 电源 -> 性能
+
+# 启动模型服务
+
+> **PS:启动命令在32G运行内存环境下运行**
+>
+> - 若在32G以下运行,需要自行调整上下文的长度
+>
+> - 若调整参数后仍然出现OOM情况,请点击[扩充虚拟内存](#扩充虚拟内存)
+
+## 清理缓存
+
+建议启动服务前先清除缓存:
+
+```shell
+export TRITON_CACHE_DIR="/tmp/triton" && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
+```
+
+## 通用启动命令
+
+> --num-gpu-blocks-override 1024 --max-model-len 16384 适用场景为单并发16k上下文
+
+```shell
+export TRITON_CACHE_DIR="/tmp/triton" && \
+sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \
+vllm serve 
gptq-DeepSeek-R1-Distill-Qwen-7B \ + --served_model_name gptq-DeepSeek-R1-Distill-Qwen-7B \ + -tp 1 \ + --gpu-memory-utilization 0.7 \ + --quantization gptq \ + --num-gpu-blocks-override 1024 \ + --max-model-len 16384 \ + --swap-space 0 \ + --block-size 32 + +# vllm serve gptq-DeepSeek-R1-Distill-Qwen-7Bq +# - gptq-DeepSeek-R1-Distill-Qwen-7B 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/gptq-DeepSeek-R1-Distill-Qwen-7B 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +## Qwen3-30B 启动命令 + +```shell +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +**默认使用 v0 engine, 想使用 v1 engine 需要指定 VLLM_USE_V1=1,如:** + +```shell +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +VLLM_USE_V1=1 \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +**取消思考模式** + +```shell +wget https://qwen.readthedocs.io/en/latest/_downloads/c101120b5bebcc2f12ec504fc93a965e/qwen3_nonthinking.jinja + +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +VLLM_USE_V1=1 \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 
0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 \ + --chat-template ./qwen3_nonthinking.jinja + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +## VLLM参数说明 + +- **`model`**:模型路径 +- **`served_model_name`**:设置启动后模型的名称,默认使用model的路径命名 +- **`device`**:仅支持设置为`musa` +- **`tensor-parallel-size`**:目前仅支持tp=1 +- **`dtype`**: 支持默认值`auto,float16,bfloat16` +- **`kv-cache-dtype`**:仅支持默认值`auto` +- **`pipeline-parallel-size`**:仅支持默认值`1` +- **`max_num_batched_tokens`**,**`max_model_len`** :需要根据运行的序列长度进行配置,如果出现OOM可减小这两个参数值,仍然出现OOM情况,请点击[扩充虚拟内存](#扩充虚拟内存) +- **`enforce-eager`** : 表示立即执行,不启用musaGraph + +## 常见问题 + +### Q1:There appear to be 1 leaked semaphore objects to clean up at shutdow + +```python +INFO 07-31 10:06:10 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 16.00x +Traceback (most recent call last): + File "/home/dev/miniconda3/envs/vllm_musa/bin/vllm", line 8, in + sys.exit(main()) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main + args.dispatch_function(args) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 34, in cmd + uvloop.run(run_server(args)) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run + return loop.run_until_complete(wrapper()) + File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper + return await main + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server + async with 
build_async_engine_client(args) as engine_client: + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/contextlib.py", line 199, in __aenter__ + return await anext(self.gen) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client + async with build_async_engine_client_from_engine_args( + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/contextlib.py", line 199, in __aenter__ + return await anext(self.gen) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args + raise RuntimeError( +RuntimeError: Engine process failed to start. See stack trace for the root cause. +(vllm_musa) dev@localhost:~/model$ /home/dev/miniconda3/envs/vllm_musa/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown + warnings.warn('resource_tracker: There appear to be %d ' +``` + +**解决方案** + +- 可尝试重启设备 + +- 尝试清除缓存: + + ```shell + export TRITON_CACHE_DIR="/tmp/triton" && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" + ``` + +- 尝试将以下配置的参数调低: + + ```shell + --num-gpu-blocks-override + --max-model-len + ``` + +- 扩充虚拟内存,请点击 [扩充虚拟内存](#扩充虚拟内存) + +# 模型服务调用与测试 + +## 查看模型列表 + +命令: + +```shell +curl http://localhost:8000/v1/models +``` + +输出: + +```json +{ + "object": "list", + "data": [ + { + "id": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "object": "model", + "created": 1755164590, + "owned_by": "vllm", + "root": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "parent": null, + "max_model_len": 16384, + "permission": [ + { + "id": "modelperm-d260b34830fe43b4a2b4dc2bff6adee5", + "object": "model_permission", + "created": 1755164590, + "allow_create_engine": false, + "allow_sampling": true, + "allow_logprobs": true, + "allow_search_indices": false, + "allow_view": true, + "allow_fine_tuning": 
false, + "organization": "*", + "group": null, + "is_blocking": false + } + ] + } + ] +} +``` + +## 发起对话请求 + +另开⼀个窗⼝调⽤,需要替换为本地的模型路径 + +```shell +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "temperature": 0.7, + "top_p": 0.8, + "top_k": 20, + "repetition_penalty":1.05, + "max_tokens": 1000, + "messages": [{"role": "user", "content": "介绍一下北京"}] + }' + +# 若运行时qwen3模型,该模型默认会进行思考,可在请求体中添加 "chat_template_kwargs": {"enable_thinking": false} 来取消思考 +# 结构如下: +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "某个Qwen3模型", + "temperature": 0.7, + "top_p": 0.8, + "top_k": 20, + "repetition_penalty":1.05, + "max_tokens": 1000, + "messages": [{"role": "user", "content": "介绍一下北京"}], + "chat_template_kwargs": {"enable_thinking": false} + }' +``` + +输出: + +```json +{ + "id": "chatcmpl-9b2d6b19177e46aea986aafa520bf2ee", + "object": "chat.completion", + "created": 1755164707, + "model": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "reasoning_content": null, + "content": "嗯,我现在要介绍一下北京。首先,我得想想北京有哪些方面可以写。用户给的介绍挺详细的,有地理位置、历史、文化、景点、美食等等。我应该按照这个结构来,确保每个部分都涵盖到。\n\n先从地理位置开始,北京位于华北,地理位置很重要,因为它周围有很多自然景观和历史古迹。然后是历史,北京作为古都,有很多历史遗迹,比如故宫、天坛这些地方。还有历史人物,比如爱新觉罗·博克,他可能在1927年访问过北京,这显示了北京在历史上的重要性。\n\n接下来是文化,北京有很多特色,比如四合院、胡同,还有北京的 dialects 和 food。可能需要提到一些著名的小吃,比如炸酱面、烤鸭这些,这样读者能感受到当地的美食。\n\n然后是景点,我得列出一些著名景点,比如故宫、天坛、鸟巢、水立方这些。每个景点的特点要简单说明一下,让用户知道去那里有什么可以看、做。\n\n文化生活方面,北京有很多艺术展览,比如790艺术区,还有电影学院。体育方面,国家体育馆和奥林匹克公园都是亮点,应该提到它们的功能和意义。\n\n现代发展也不能少,北京的城市建设,比如 602地块,还有天安门广场的现代化升级。科技方面,比如 5G 网络和, beijing app,这些现代基础设施让北京更便利。\n\n美食方面,除了炸酱面和烤鸭,还有其他特色菜,比如烤鸭、涮羊肉这些,可以推荐一些餐厅,但要注意不要太详细,保持简洁。\n\n交通方面,地铁、公交和出租车都是主要的出行方式,再加上一些著名景点的路线,帮助游客规划行程。\n\n最后,总结一下北京的魅力,它不仅是一个古老的城市,也是一个充满活力和创新的地方,适合不同的人前来生活。\n\n现在,检查一下有没有遗漏的部分,比如 maybe 
提到一些其他的景点或者特色,或者更详细的描述。不过,保持每个部分简短,重点突出,避免过于冗长。\n\n可能还需要考虑一下,用户的需求是什么。他们可能想知道北京的历史文化、美食,还是现代发展?根据之前的介绍,已经涵盖了这些方面,所以我觉得结构已经很全面了。\n\n另外,语言风格要口语化,避免使用任何markdown格式,用简单的中文表达,让读者容易理解。\n\n总的来说,我需要按照地理位置、历史、文化、美食、景点、现代发展这几个方面来组织内容,每个部分简明扼要,突出重点,让介绍既全面又易于阅读。\n\n\n北京,这座历史悠久的城市,位于中国华北,地理环境优越,拥有丰富的历史文化和现代化发展。以下是关于北京的详细介绍:\n\n### 地理与历史\n北京位于华北平原,地理位置优越,东距渤海约100公里,北距Inner Mongolia自治区西南部约100公里。历史悠久的古都,曾是历史上 Activity 的政治、经济和文化中心。作为古都,北京拥有众多历史遗迹,如故宫、天坛、北海 and 天干支.\n\n### 文化与历史\n北京是中国历史文化名城,拥有众多历史遗迹,如故宫、天坛、北海 and 天干支. 历史人物如爱新觉罗·博克曾在此访问,显示其重要性. 北京是中华文明的重要发源地,孕育了无数文化名人.\n\n### 文化与美食\n北京以其独特的 dialect 和美食闻名,如炸酱面、烤鸭和涮羊肉. 拥有众多特色餐厅,适合品尝当地美食. 历史与现代结合,使其成为美食爱好者的天堂.\n\n### 景点与活动\n北京拥有众多著名景点,如故宫、鸟巢、水立方和奥林匹克公园. 它们不仅是旅游胜地,也是举办各种活动的理想场所. 国家体育场等大型设施展示了现代化城市的精神.\n\n### 现代发展\n北京在现代化进程中不断进步,拥有现代化的基础设施和先进的科技. 例如,602地块的混合式地块计划,提升了城市面貌. 作为国际交往的中心,北京在现代生活中扮演重要角色.\n\n### 经济与交通\n北京作为北方重要的交通枢纽,拥有发达的经济基础. 市区拥有6条主轴线,成为东西向的经济带. 交通系统完善,地铁、公交和出租车等多种出行方式方便游客.\n\n### 结语\n北京以其悠久的历史和现代化发展,展现出独特魅力. 无论是历史爱好者还是美食探索者,都能在北京市找到满足需求的地方. 无论是古代文化还是现代科技,北京都以其独特的方式吸引着每一位访客.", + "tool_calls": [] + }, + "logprobs": null, + "finish_reason": "stop", + "stop_reason": null + } + ], + "usage": { + "prompt_tokens": 9, + "total_tokens": 989, + "completion_tokens": 980, + "prompt_tokens_details": null + }, + "prompt_logprobs": null +} +``` + +## python调用 + +### 流式输出 + +```python +from openai import OpenAI + +# Modify OpenAI's API key and API base to use vLLM's API server. +openai_api_key = "EMPTY" +openai_api_base = "http://localhost:8000/v1" + +client = OpenAI( + # defaults to os.environ.get("OPENAI_API_KEY") + api_key=openai_api_key, + base_url=openai_api_base, +) + +models = client.models.list() +model = models.data[0].id + +chat_completion = client.chat.completions.create( + messages=[{ + "role": "system", + "content": "You are a helpful assistant." + }, { + "role": "user", + "content": "北京有哪些名胜古迹?" 
+    }],
+    model=model,
+    temperature=0.7,
+    top_p=0.8,
+    extra_body={
+        'top_k': 20,
+        # penalize repetition; vLLM does not pick this up from the model's generation_config by default, so set it here.
+        # See: https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct/file/view/master?fileName=generation_config.json&status=1#L9
+        'repetition_penalty': 1.05,
+        # "chat_template_kwargs": {"enable_thinking": False},  # disable thinking for Qwen3 models
+    },
+    max_tokens=512,
+    stream=True,  # enable streaming output
+)
+
+# consume the streamed response chunk by chunk
+print("Chat response (streaming):")
+for chunk in chat_completion:
+    if chunk.choices:
+        delta = chunk.choices[0].delta
+        content = delta.content
+        if content:
+            print(content, end='', flush=True)
+print("\n - Chat response (end) -\n")
+```
+
+### Non-streaming output
+
+```python
+from openai import OpenAI
+
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+
+client = OpenAI(
+    # defaults to os.environ.get("OPENAI_API_KEY")
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+
+models = client.models.list()
+model = models.data[0].id
+
+chat_completion = client.chat.completions.create(
+    messages=[{
+        "role": "system",
+        "content": "You are a helpful assistant."
+    }, {
+        "role": "user",
+        "content": "北京有哪些名胜古迹?"
+    }],
+    model=model,
+    temperature=0.7,
+    top_p=0.8,
+    extra_body={
+        'top_k': 20,
+        'repetition_penalty': 1.05,
+        # "chat_template_kwargs": {"enable_thinking": False},  # disable thinking for Qwen3 models
+    },
+    max_tokens=512,
+    stream=False,
+)
+
+print("Chat completion results:")
+print(chat_completion)
+```
+
+# Performance testing
+
+> Start the model service in one terminal and run the benchmark script in another. Performance mode is recommended while testing: [enable performance mode](#启用性能模式)
+
+## vLLM benchmark
+
+```shell
+# enter the working directory
+cd ${WorkDir}
+
+# make sure the correct python environment is active
+conda activate vllm_musa
+
+git clone https://github.com/vllm-project/vllm.git
+
+cd vllm
+
+git checkout v0.9.2
+
+cd benchmarks/
+
+# --model must match the id of the model being served
+# replace ${modelDir} with the directory where the model is stored
+python3 benchmark_serving.py \
+    --base-url http://127.0.0.1:8000 \
+    --model gptq-DeepSeek-R1-Distill-Qwen-7B \
+    --tokenizer ${modelDir}/models/gptq-DeepSeek-R1-Distill-Qwen-7B \
+    --dataset_name random \
+    --random_input_len 128 \
+    --random_output_len 128 \
+    --num-prompts 1 \
+    --trust-remote-code \
+    --ignore-eos
+```
+
+### FAQ
+
+#### Q1: `GLIBCXX_3.4.30' not found
+
+```shell
+Traceback (most recent call last):
+  File "/home/skysi/04-vllm-musa/vllm/benchmarks/benchmark_serving.py", line 37, in
+    from backend_request_func import (ASYNC_REQUEST_FUNCS,
+  File "/home/skysi/04-vllm-musa/vllm/benchmarks/backend_request_func.py", line 15, in
+    from transformers import (AutoTokenizer, PreTrainedTokenizer,
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/__init__.py", line 27, in
+    from . 
import dependency_versions_check
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 16, in
+    from .utils.versions import require_version, require_version_core
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/__init__.py", line 24, in
+    from .args_doc import (
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/args_doc.py", line 30, in
+    from .generic import ModelOutput
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/generic.py", line 46, in
+    import torch  # noqa: F401
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 368, in
+    from torch._C import *  # noqa: F403
+ImportError: /home/skysi/miniconda3/envs/vllm_musa/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
+```
+
+**Fix**
+
+```shell
+# make sure the correct python environment is active
+conda activate vllm_musa
+
+conda install -c conda-forge libstdcxx-ng=12.1.0
+```
+
+## EvalScope benchmark
+
+> EvalScope is the **model evaluation and performance benchmarking framework** from the ModelScope community
+
+```shell
+# create a virtual environment
+conda create -n evalscope python=3.10 -y
+
+# activate the virtual environment
+conda activate evalscope
+
+# install EvalScope
+pip install -U 'evalscope[perf]' plotly gradio wandb
+
+# 10 requests with concurrency 1
+evalscope perf \
+  --url "http://localhost:8000/v1/chat/completions" \
+  --api-key "" \
+  --model gptq-DeepSeek-R1-Distill-Qwen-7B \
+  --number 10 \
+  --parallel 1 \
+  --api openai \
+  --dataset openqa \
+  --stream
+```
+
+# Expanding virtual memory
+
+PS: a 16GB box needs this step configured before a 7B model can run with an 8k context
+
+## Configure a swap partition
+
+```shell
+# turn off all existing swap areas first
+swapoff -a
+# create the file that will back the swap area; this step takes a while
+dd if=/dev/zero of=/var/swapfile bs=1G count=8
+# build the swap filesystem on the file
+mkswap /var/swapfile
+# enable the swap area
+swapon /var/swapfile
+# check the current memory layout; the swap total should now read 8xxx
+free -m
+# confirm the swap file is active
+sudo swapon --show
+# enable it at every boot via /etc/fstab
+echo '/var/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
+```
+
+## Configure the GPU memory page size
+
+```shell
+# 1. run:
+sudo su
+# 2. change the contents of /etc/modprobe.d/mtgpu.conf to:
+options mtgpu mtgpu_drm_major=2 GeneralSVMHeapPageSize=0x1000
+# 3. run the following commands:
+update-initramfs -u  # cryptsetup-related ERROR/WARNING output here is expected
+reboot
+```
diff --git "a/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" "b/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
index 35a89fa82c833a051b4124d0ff95d7dc0cb8a152..2c8bd97f90d591923db1499ae779ea1ad4b8005d 100644
--- "a/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
+++ "b/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
@@ -163,34 +163,7 @@ True
 
 # 常见问题
 
-## Q1:libmpi_cxx.so.40: cannot open shared object file: No such file or directory
-
-```shell
-(vllm_musa) dev@localhost:~/model$
-(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())"
-Traceback (most recent call last):
- File "", line 1, in
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 236, in
- _load_global_deps()
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 195, in _load_global_deps
- raise err
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 176, in _load_global_deps
- ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/ctypes/__init__.py", line 374, in __init__
- self._handle = _dlopen(self._name, mode)
-OSError: libmpi_cxx.so.40: cannot open shared 
object file: No such file or directory -``` - -**解决方案** - -证明mpi so缺失,请下载以下依赖: - -```shell -sudo apt update - -``` - -## Q2:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x +## Q1:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x ```python A module that was compiled using NumPy 1.x cannot be run in @@ -226,7 +199,7 @@ conda activate vllm_musa pip3 install numpy==1.26.4 ``` -## Q3:ImportError: Please try running Python from a different directory! +## Q2:ImportError: Please try running Python from a different directory! ```python (vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" @@ -260,13 +233,13 @@ pip3 uninstall torch torch_musa torchaudio torchvision **注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次** -## Q4:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory +## Q3:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory **解决方案** 请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次 -## Q5: MUSA driver initialization failed +## Q4: MUSA driver initialization failed ```shell Traceback (most recent call last): @@ -332,8 +305,8 @@ conda activate vllm_musa cd ${packagedirectory} pip3 install -r requirements.txt -pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl -pip3 install vllm-0.9.2.dev257+g4747b491f-cp310-cp310-linux_aarch64.whl +pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl --force-reinstall +pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl ``` @@ -356,13 +329,42 @@ Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 对于量化模型,我们提供以下加速版模型: -| 模型名称 | 魔塔地址 | git克隆地址 | -| -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| gptq-DeepSeek-R1-Distill-Qwen-7B | 
https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
-| gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
-| gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
-| gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
-| Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| Category | Model | ModelScope URL | git clone URL |
+| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| **DeepSeek** | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
+| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git |
+| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git |
+| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git |
+| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git |
+| | gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| **OpenGVLab** | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
 
 **PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换⼯具后期会开放
 
@@ -1009,109 +1011,3 @@ 
update-initramfs -u # 会看到 cryptsetup 相关的 ERROR WARNING,是正常 reboot ``` -# 性能参考 - -> 以下是T035开发板在性能模式下,配备32GB运行内存性能测试结果。 - -## gptq-DeepSeek-R1-Distill-Qwen-7B - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.1439 | 0.0930 | 0.1011 | 51.8112 | 17 | 512 | 9.4239 | 9.5926 | -| 25% | 0.1570 | 0.0974 | 0.1020 | 54.4006 | 20 | 530 | 9.4354 | 9.7942 | -| 50% | 0.1604 | 0.1045 | 0.1028 | 61.1391 | 26 | 597 | 9.7261 | 10.1899 | -| 66% | 0.1658 | 0.1100 | 0.1034 | 70.4440 | 28 | 681 | 9.7425 | 10.2138 | -| 75% | 0.1666 | 0.1139 | 0.1058 | 102.6116 | 29 | 967 | 9.7646 | 10.2680 | -| 80% | 0.4193 | 0.1167 | 0.1060 | 108.1039 | 34 | 1020 | 9.8820 | 10.3675 | -| 90% | 0.4246 | 0.1233 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 95% | 0.4246 | 0.1281 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 98% | 0.4246 | 0.1338 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 99% | 0.4246 | 0.1381 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | - -## gptq-Qwen2.5-7B-Instruct-v2 - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.3876 | 0.0838 | 0.0935 | 16.0308 | 41 | 169 | 10.2386 | 11.8103 | -| 25% | 0.3910 | 0.0876 | 0.0942 | 21.1318 | 44 | 215 | 10.3771 | 12.0867 | -| 50% | 0.4023 | 0.0918 | 0.0945 | 26.9652 | 
50 | 283 | 10.4653 | 12.7522 | -| 66% | 0.4243 | 0.0958 | 0.0945 | 27.1373 | 52 | 284 | 10.4950 | 12.9189 | -| 75% | 0.4272 | 0.1009 | 0.0950 | 27.8147 | 53 | 294 | 10.5140 | 13.2493 | -| 80% | 0.4291 | 0.1032 | 0.0953 | 35.3929 | 58 | 368 | 10.5422 | 13.5364 | -| 90% | 0.4336 | 0.1080 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 95% | 0.4336 | 0.1149 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 98% | 0.4336 | 0.1172 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 99% | 0.4336 | 0.1183 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | - -## gptq-Qwen3-8B - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.2059 | 0.1117 | 0.1357 | 163.6918 | 20 | 1202 | 6.9136 | 7.0131 | -| 25% | 0.2197 | 0.1248 | 0.1359 | 164.0903 | 23 | 1209 | 7.0599 | 7.2642 | -| 50% | 0.2248 | 0.1406 | 0.1390 | 191.3759 | 29 | 1375 | 7.3108 | 7.4726 | -| 66% | 0.2374 | 0.1503 | 0.1400 | 192.6090 | 31 | 1376 | 7.3431 | 7.4985 | -| 75% | 0.2386 | 0.1557 | 0.1414 | 200.7116 | 32 | 1417 | 7.3526 | 7.5299 | -| 80% | 0.4726 | 0.1579 | 0.1446 | 231.1392 | 37 | 1598 | 7.3642 | 7.5446 | -| 90% | 0.4842 | 0.1668 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 95% | 0.4842 | 0.1738 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 98% | 0.4842 | 0.1821 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 99% | 0.4842 | 0.1872 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | - -## gptq-Qwen2.5-14B-Instruct - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | 
**总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.8191 | 0.1756 | 0.1882 | 44.1347 | 41 | 230 | 5.1026 | 5.7233 | -| 25% | 0.8244 | 0.1817 | 0.1889 | 44.2783 | 44 | 231 | 5.1032 | 5.7562 | -| 50% | 0.8650 | 0.1914 | 0.1919 | 62.6467 | 50 | 323 | 5.1758 | 5.9859 | -| 66% | 0.8704 | 0.1962 | 0.1935 | 74.4713 | 52 | 380 | 5.2113 | 6.1430 | -| 75% | 0.8746 | 0.1994 | 0.1942 | 76.0923 | 53 | 388 | 5.2170 | 6.3552 | -| 80% | 0.8829 | 0.2017 | 0.1943 | 76.6646 | 58 | 393 | 5.2316 | 6.5066 | -| 90% | 0.8848 | 0.2094 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 95% | 0.8848 | 0.2168 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 98% | 0.8848 | 0.2248 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 99% | 0.8848 | 0.2305 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | - -## Qwen3-30B-A3B-GPTQ-Int4 - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.4120 | 0.0712 | 0.0721 | 77.0906 | 20 | 1040 | 10.7279 | 11.0010 | -| 25% | 0.4657 | 0.0714 | 0.0727 | 85.1089 | 23 | 1160 | 13.4448 | 13.6327 | -| 50% | 0.5578 | 0.0722 | 0.0731 | 103.3427 | 29 | 1350 | 13.6296 | 13.9616 | -| 66% | 0.5616 | 0.0733 | 0.0734 | 106.3631 | 31 | 1369 | 13.6809 | 13.9950 | -| 75% | 0.7475 | 0.0747 | 0.0741 | 117.1708 | 32 | 1403 | 13.6890 | 14.0225 | -| 80% | 0.7949 | 0.0757 | 0.0927 | 122.4263 | 37 | 1456 | 13.7975 | 14.0497 | -| 90% | 0.8312 | 0.0840 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 95% | 0.8312 | 0.0982 
| 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 98% | 0.8312 | 0.1420 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 99% | 0.8312 | 0.2052 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | - -## Qwen3-30B-A3B-GPTQ-Int4: v1 engine - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.3906 | 0.0699 | 0.0713 | 75.1009 | 20 | 1040 | 13.8222 | 14.1051 | -| 25% | 0.4237 | 0.0702 | 0.0714 | 83.3085 | 23 | 1160 | 13.8480 | 14.1867 | -| 50% | 0.5412 | 0.0707 | 0.0715 | 97.1227 | 29 | 1350 | 13.9020 | 14.2191 | -| 66% | 0.5843 | 0.0711 | 0.0716 | 98.4750 | 31 | 1369 | 13.9186 | 14.2378 | -| 75% | 0.7587 | 0.0714 | 0.0717 | 101.5034 | 32 | 1403 | 13.9241 | 14.2482 | -| 80% | 0.7930 | 0.0719 | 0.0718 | 104.6081 | 37 | 1456 | 13.9780 | 14.3940 | -| 90% | 0.8313 | 0.0742 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 95% | 0.8313 | 0.0786 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 98% | 0.8313 | 0.0821 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 99% | 0.8313 | 0.0834 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -